kdd - Githubissues

dichika commented 9 years ago

0.841504:old
0.8428461
0.851 rm filter
0.8529791 source
0.8519965 source count
0.8630154 start yearmonth
0.8688813 start date
0.8693542 interval
0.87072 unique obj
0.8727769 browser_flag
0.8726645 obj_category
0.8726432 wday
0.8725883 year
0.8732568 first last
0.8739745 session
0.873695 -keep_median, -keep_mean
0.8738611 session
0.87535 bug fix
0.8749137 bug fix
0.8738832 -year
0.8749045 day>1 -browser, -server
0.874532 -day>1 -browser, -server
0.8742906 day>=1 -browser, -server
0.8743647 -first
0.8741762 day>2
0.8744576 add session_first_last
0.8741655 scaled_session_first_last
0.874472 object top20
0.8747894 last action
0.8750612 -all -browser
0.8750528 -browser_page_close, -all_0, -browser, -server, -all, -all_lastday
0.8752374 categoryprop

dichika commented 9 years ago

Heuristically corrected the predictions for users who would obviously end up as 0 → the score went down.

dichika commented 9 years ago

Check the data from the training results.

dichika commented 9 years ago

Article on feature hashing in R: http://amunategui.github.io/feature-hashing/

dichika commented 9 years ago

https://github.com/owenzhang/kaggle-avazu
http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
This one looks especially useful:
https://github.com/diefimov/amazon_employee_access_2013

dichika commented 9 years ago

https://github.com/wush978/FeatureHashing
The article below is useful, but note that its code is for FeatureHashing 0.8 and includes things like as.dgCmatrix: http://amunategui.github.io/feature-hashing/
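
To make the idea behind feature hashing concrete, here is a minimal Python sketch of the hashing trick. It is not the FeatureHashing package's implementation: the function name `hash_features`, the bucket count, and the use of `zlib.crc32` as the hash are all assumptions chosen to keep the example dependency-free (real libraries typically use a stronger hash).

```python
import zlib


def hash_features(row, n_buckets=16):
    """Map a dict of categorical features to a sparse {index: value} vector.

    Hypothetical helper for illustration; crc32 stands in for the hash
    function a real feature-hashing library would use.
    """
    vec = {}
    for name, value in row.items():
        token = f"{name}={value}".encode("utf-8")
        # The bucket index comes from the hash of "name=value".
        index = zlib.crc32(token) % n_buckets
        # A second hash picks a sign so that collisions tend to cancel out.
        sign = 1.0 if (zlib.crc32(b"sign:" + token) & 1) == 0 else -1.0
        vec[index] = vec.get(index, 0.0) + sign
    return vec


if __name__ == "__main__":
    print(hash_features({"browser": "chrome", "course": "c1"}, n_buckets=8))
```

The output width is fixed by `n_buckets` no matter how many distinct category levels appear, which is what makes the approach attractive for high-cardinality features.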

dichika commented 9 years ago

Rcpp quick reference: http://cran.r-project.org/web/packages/Rcpp/vignettes/Rcpp-quickref.pdf

dichika commented 9 years ago

Session start time

dichika commented 9 years ago

When the score drops, heuristically correct only the users whose predictions changed → the gain is only about 0.0002, so it is not worth doing.

dichika commented 9 years ago

In addition to feature hashing, this also covers Per-Coordinate FTRL-Proximal: http://cran.r-project.org/web/packages/FeatureHashing/README.html

dichika commented 9 years ago

FTRL is described as follows:

FTRL-Proximal is equivalent to Online (Stochastic) Gradient Descent when no regularization is used [1] http://courses.cs.washington.edu/courses/cse599s/14sp/kdd_2013_talk.pdf
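
As a rough illustration of what a per-coordinate FTRL-Proximal learner looks like, here is a minimal Python sketch for logistic regression on sparse inputs. It follows the commonly published per-coordinate update (accumulators `z` and `n` per feature); the class name `FTRLProximal`, the default hyperparameters, and the dict-based sparse format are assumptions for illustration, not code from any of the linked packages.

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression (illustrative sketch)."""

    def __init__(self, alpha=0.1, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-coordinate adjusted gradient sums
        self.n = {}  # per-coordinate sums of squared gradients

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # the L1 term keeps this coordinate at exactly zero
        n = self.n.get(i, 0.0)
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, x):
        """x is a sparse example given as {feature_index: value}."""
        s = sum(self._weight(i) * v for i, v in x.items())
        s = max(min(s, 35.0), -35.0)  # clip to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, x, y):
        """One online step on example x with label y in {0, 1}."""
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v  # gradient of log loss for coordinate i
            n = self.n.get(i, 0.0)
            w = self._weight(i)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * w
            self.n[i] = n + g * g
        return p


if __name__ == "__main__":
    model = FTRLProximal()
    for _ in range(50):
        model.update({0: 1.0}, 1)
        model.update({1: 1.0}, 0)
    print(model.predict({0: 1.0}), model.predict({1: 1.0}))
```

Each coordinate gets its own effective learning rate through `n`, and a nonzero `l1` produces exactly-zero weights for coordinates whose accumulated `z` stays small.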

dichika commented 9 years ago

A very clear explanation of GBDT, including overfitting: http://nbviewer.ipython.org/urls/s3.amazonaws.com/datarobotblog/notebooks/gbm-tutorial.ipynb

dichika commented 9 years ago

With ntree 300 and interaction 15, gbm scores below glm. The gbm results need to be optimized.

dichika commented 9 years ago

Predicting dropout from MOOCs: https://oerknowledgecloud.org/sites/oerknowledgecloud.org/files/In_depth_37_1%20(1).pdf

dichika commented 9 years ago

Build several models with GLM and GBM and compare how the results change.

dichika commented 9 years ago

MOOC prediction
http://web.stanford.edu/~halawa/emoocs_slides_final.pdf
http://educationaldatamining.org/EDM2014/uploads/procs2014/short%20papers/273_EDM-2014-Short.pdf

dichika commented 9 years ago

Churn prediction in a past competition
http://sucrose.hatenablog.com/entry/2013/04/19/001748
http://d.hatena.ne.jp/repose/20130419/1366375616
http://d.hatena.ne.jp/tks23/20130427/1367078299

dichika commented 9 years ago

xgb parameters: https://gist.github.com/nagadomi/f348ce2a68bac967c3c3

dichika commented 9 years ago

If caret fails to compute AUC, update pROC.
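
When a tool silently fails to report AUC, an independent check can help. The following is a minimal Python sketch that computes AUC directly from its pairwise definition (the fraction of positive/negative pairs ranked correctly, with ties counted as half). The function name `auc` is hypothetical, and the quadratic loop is only meant for small sanity checks, not for full competition data.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```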

dichika commented 9 years ago

Example of xgboost in use: http://sssslide.com/www.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit

dichika commented 9 years ago

feature engineering http://ufal.mff.cuni.cz/~zabokrtsky/courses/npfl104/html/feature_engineering.pdf

dichika commented 9 years ago

If building a sparse.matrix fails with an error about fnames == names(mf), a column name may start with a digit.

dichika commented 9 years ago

The concepts behind xgboost: http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
Why xgboost is so good: https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14054/why-do-you-think-neural-networks-is-not-comparable-with-xgboost-for-this/77129

dichika commented 9 years ago

practical approach http://blog.david-andrzejewski.com/machine-learning/practical-machine-learning-tricks-from-the-kdd-2011-best-industry-paper/

dichika commented 9 years ago

avazu: https://www.kaggle.com/c/avazu-ctr-prediction/forums/t/12460/congrats-to-the-winners/63909#post63909
otto: https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/13252/i-am-stuck-after-getting-a-score-of-0-54-using-random-forests-and-svm/69822

dichika commented 9 years ago

avazu https://github.com/infinitezxc/kaggle-avazu/blob/master/doc.pdf

dichika commented 9 years ago

discretize in infotheo: http://cran.r-project.org/web/packages/infotheo/infotheo.pdf

dichika commented 9 years ago

3idiots https://www.kaggle.com/c/criteo-display-ad-challenge/forums/t/10555/3-idiots-solution-libffm

dichika commented 9 years ago

kdd2014 https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/forums/t/9774/congrats-to-straya-nccu-and-adamaconguli

dichika commented 9 years ago

h2o ensemble code; it probably will not run as-is: https://github.com/h2oai/h2o-2/tree/master/R/ensemble

dichika commented 9 years ago

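Various tips here, for example that if you increase the number of hidden units you also have to increase the epochs or training will not converge: https://www.kaggle.com/c/afsis-soil-properties/forums/t/10568/ensemble-deep-learning-from-r-with-h2o-starter-kit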

dichika commented 9 years ago

On blending/stacking, from the results of the Otto contest: http://blog.kaggle.com/2015/06/09/otto-product-classification-winners-interview-2nd-place-alexander-guschin/
This one is comprehensive and more detailed: http://mlwave.com/kaggle-ensembling-guide/

dichika commented 9 years ago

Calibration: https://medium.com/@chris_bour/6-tricks-i-learned-from-the-otto-kaggle-challenge-a9299378cd61
The article uses scikit-learn, but caret also implements it in the form of a calibration plot. Platt scaling seems to mean the same thing; it is a method that has long been used as a matter of course with SVMs: https://en.wikipedia.org/wiki/Platt_scaling

http://stats.stackexchange.com/questions/5196/why-use-platts-scaling
http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression/
http://stackoverflow.com/questions/27927420/calibration-and-liftchart-with-caret-r-package
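
For reference, the core of Platt scaling is fitting a one-dimensional logistic curve that maps raw classifier scores to probabilities. Below is a simplified Python sketch of that idea. It uses plain gradient descent on log loss and omits the target smoothing from Platt's original method; the function names `fit_platt` and `calibrate`, the learning rate, and the iteration count are assumptions for illustration. In practice the curve should be fitted on held-out data, not on the data the classifier was trained on.

```python
import math


def fit_platt(scores, labels, lr=0.1, n_iter=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    m = len(scores)
    for _ in range(n_iter):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / m
        b -= lr * grad_b / m
    return a, b


def calibrate(score, a, b):
    """Map a raw score to a calibrated probability with the fitted curve."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


if __name__ == "__main__":
    raw = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]   # made-up held-out scores
    y = [0, 0, 1, 0, 1, 1]
    a, b = fit_platt(raw, y)
    print([round(calibrate(s, a, b), 3) for s in raw])
```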

dichika commented 9 years ago

The lasagne tuning and the thinking behind it look like a useful reference: https://github.com/christophebourguignat/notebooks/blob/master/Tuning%20Neural%20Networks.ipynb

dichika commented 9 years ago

I probably will not use Bayesian optimization, but this material was easy to follow: https://tech.d-itlab.co.jp/kuto/2013/07/26/%E8%AB%96%E6%96%87%E7%B4%B9%E4%BB%8Bpractical-bayesian-optimization-of-machine-learning-algorithmsnips2012/

dichika commented 9 years ago

This explanation of Gaussian processes was also good: http://heartruptcy.blog.fc2.com/blog-entry-142.html

dichika commented 9 years ago

For calibration, Python has scikit-learn; in R it is perhaps the CORElearn package: http://cran.r-project.org/web/packages/CORElearn/index.html

DAISUKEICHIKAWA commented 9 years ago

data leakage https://www.kaggle.com/wiki/Leakage

dichika commented 9 years ago

caretEnsemble http://cran.r-project.org/web/packages/caretEnsemble/vignettes/caretEnsemble-intro.html

dichika commented 9 years ago

A concise, easy-to-follow explanation of Feature-Weighted Linear Stacking: http://d.hatena.ne.jp/jetbead/20150514/1431612867
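
For quick reference, the FWLS model can be written compactly as follows. The notation is an assumption chosen for this note: $g_i(x)$ are the predictions of the individual models, $f_j(x)$ are meta-features, and $v_{ij}$ are the coefficients to be learned.

```latex
\[
b(x) = \sum_{i} w_i(x)\, g_i(x), \qquad
w_i(x) = \sum_{j} v_{ij}\, f_j(x)
\quad\Longrightarrow\quad
b(x) = \sum_{i,j} v_{ij}\, f_j(x)\, g_i(x)
\]
\[
\min_{v} \; \sum_{x} \Bigl( \sum_{i,j} v_{ij}\, f_j(x)\, g_i(x) - y(x) \Bigr)^{2}
\]
```

Because the blend is linear in $v_{ij}$, fitting it reduces to ordinary (usually regularized) linear regression on the products $f_j(x)\, g_i(x)$. With a single constant meta-feature $f_1(x) = 1$ it collapses to plain linear stacking.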

dichika commented 9 years ago

Japanese-language paper on FWLS: http://www.anlp.jp/proceedings/annual_meeting/2014/pdf_dir/C1-4.pdf

dichika commented 9 years ago

BellKor team paper; check the part on Global Effects: http://www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf

dichika commented 9 years ago

Good reads that convey the excitement around the Netflix Prize:
http://steps.dodgson.org/bn/2008/08/14/
http://www.gravityrd.jp/netflix%E8%B3%9E%E7%89%A9%E8%AA%9E
https://web.archive.org/web/20100609225146/http://sciencereview.berkeley.edu/articles.php?issue=14&article=briefs_2

dichika commented 9 years ago

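Ensemble learning using the submissions from a CrowdSolving competition. Note that when blending, part of the competition's test data is used as validation data to avoid overfitting (the so-called Probe dataset for blending). https://kaigi.org/jsai/webprogram/2014/pdf/265.pdf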
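
The probe-set idea can be sketched in a few lines of Python: the blend weight is chosen on held-out data that none of the base models were trained on. This is only an illustration under simple assumptions (two models, a one-dimensional grid search, mean squared error); the function names and the toy numbers are hypothetical and do not come from the linked paper.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def best_blend_weight(pred_a, pred_b, y_probe, steps=100):
    """Grid-search w in [0, 1] for the blend w * pred_a + (1 - w) * pred_b.

    pred_a and pred_b are the two models' predictions on the probe set,
    and y_probe holds the true targets for that same probe set.
    """
    best_w, best_loss = 0.0, float("inf")
    for k in range(steps + 1):
        w = k / steps
        blended = [w * a + (1.0 - w) * b for a, b in zip(pred_a, pred_b)]
        loss = mse(blended, y_probe)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


if __name__ == "__main__":
    y = [1, 0, 1, 0]                 # made-up probe labels
    model_a = [0.9, 0.4, 0.6, 0.1]   # made-up predictions
    model_b = [0.6, 0.1, 0.9, 0.4]
    print(best_blend_weight(model_a, model_b, y))
```

The weight found on the probe set is then applied unchanged to the models' predictions on the remaining test data.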

dichika commented 9 years ago

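An example where simple averaging does not work well. It does say that you should average results that have low correlation with each other. http://www.isif.org/fusion/proceedings/fusion99CD/C-169.pdf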
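
A short calculation shows why low correlation matters for averaging. Assume, for illustration only, two predictors whose errors $e_1, e_2$ have zero mean, standard deviations $\sigma_1, \sigma_2$, and correlation $\rho$ (this notation is not from the linked paper):

```latex
\[
\operatorname{Var}\!\left(\frac{e_1 + e_2}{2}\right)
= \frac{\sigma_1^{2} + \sigma_2^{2} + 2\rho\,\sigma_1\sigma_2}{4}
\]
\[
\sigma_1 = \sigma_2 = \sigma
\;\Longrightarrow\;
\operatorname{Var}\!\left(\frac{e_1 + e_2}{2}\right) = \frac{1 + \rho}{2}\,\sigma^{2}
\]
```

With $\rho = 1$ the average is no better than either predictor alone, while with $\rho = 0$ the error variance is halved. The calculation also shows when plain averaging can hurt: if one predictor is much noisier than the other ($\sigma_2 \gg \sigma_1$), the average can have higher variance than the better predictor by itself.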

dichika commented 9 years ago

A short article from Bell Labs: http://www2.research.att.com/~volinsky/papers/chance.pdf

dichika commented 9 years ago

Ensemble learning in its purest form: http://pslcdatashop.org/KDDCup/workshop/papers/kdd2010ntu.pdf

dichika commented 9 years ago

I would like to try mlr, but xgboost support has been removed from it on the grounds that xgboost was removed from CRAN. https://github.com/mlr-org/mlr/issues/263

dichika commented 9 years ago

Was the KDD Cup site built by Kaggle staff? http://www.reddit.com/r/MachineLearning/comments/34uvub/kdd_cup_2015_mooc_dropout_prediction/

dichika / memo

kdd #2