criteo / CriteoDisplayCTR-TFOnSpark

Apache License 2.0

Do you know a code example to run fully locally, without Spark? Kind of online learning? #1

Open Sandy4321 opened 4 years ago

Sandy4321 commented 4 years ago

Do you know a code example to run this fully locally, without Spark? Something like online learning?

trams commented 4 years ago

@Sandy4321, no. There is no way to run it without Spark, but you can use Spark local mode to run it on one machine. That may prove difficult because of the size of the dataset: unless your machine has ~500-1000 GB of memory, the dataset won't fit in memory and performance will degrade dramatically.
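For what it's worth, here is a minimal sketch of what Spark local mode looks like from PySpark; the app name, memory setting, and file path are illustrative placeholders, not taken from this repository:

```python
from pyspark.sql import SparkSession

# Hypothetical sketch: run Spark in local mode on a single machine
# instead of a cluster. Settings below are illustrative only.
spark = (
    SparkSession.builder
    .master("local[*]")                      # use all local cores, no cluster
    .appName("criteo-ctr-local")
    .config("spark.driver.memory", "48g")    # as much heap as the machine allows
    .getOrCreate()
)

# Reading the full Criteo dataset this way will spill to disk heavily
# if it does not fit in memory, which is the performance caveat above.
df = spark.read.csv("day_0.gz", sep="\t", header=False)  # path is a placeholder
```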

Here I am guessing what you want. But if you just want Spark out of the picture, you can follow this Google blog post https://cloud.google.com/blog/products/gcp/using-google-cloud-machine-learning-to-predict-clicks-at-scale to build a similar prediction model using only TensorFlow, without Spark.
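As a rough illustration of that idea (a hedged sketch, not the code from the post): a logistic-regression-style "wide" model over hashed categorical features in plain TensorFlow/Keras. The bucket count and input layout are assumptions, kept small for illustration:

```python
import tensorflow as tf

# Sketch: hash the 26 categorical Criteo columns into a fixed-size space
# and learn plain logistic regression on top of the multi-hot vector.
NUM_HASH_BUCKETS = 2 ** 16

inputs = tf.keras.Input(shape=(26,), dtype=tf.string)        # 26 categorical columns
hashed = tf.keras.layers.Hashing(num_bins=NUM_HASH_BUCKETS)(inputs)
multi_hot = tf.keras.layers.CategoryEncoding(
    num_tokens=NUM_HASH_BUCKETS, output_mode="multi_hot")(hashed)
# A single sigmoid unit over the multi-hot vector is logistic regression.
output = tf.keras.layers.Dense(1, activation="sigmoid")(multi_hot)

model = tf.keras.Model(inputs, output)
model.compile(optimizer="ftrl", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```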

Sandy4321 commented 4 years ago

Thanks for the answer. I mean reading the data row by row, like streaming online data. It seems a widely used package for that is Vowpal Wabbit. Also I found this link: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#criteo_tb so the data may be not so big. Thanks for the link, I will check it. All ideas are very welcome.

Sandy4321 commented 4 years ago

The data at this link is already preprocessed: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#criteo_tb and the size is 370 GB zipped. So it would be great to try something simple, like logistic regression. By the way, the link to the code from this Google post is broken. I still do not know how to work with clouds, so I am looking for code that runs on a local Windows computer, though I only have 50 GB of RAM.
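One possible way to do that without fitting everything in 50 GB of RAM is out-of-core logistic regression. A rough sketch (mine, not from this repo), assuming a LibSVM-format file like the preprocessed data linked above; the file name, chunk size, hashing dimensionality, and label encoding are assumptions:

```python
import itertools
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

# Stream the file in fixed-size chunks and update the model with
# partial_fit, so the whole dataset never has to be in memory at once.
hasher = FeatureHasher(n_features=2 ** 20, input_type="pair")
clf = SGDClassifier(loss="log_loss")  # logistic regression fitted by SGD

def libsvm_rows(path):
    with open(path) as f:
        for line in f:
            tokens = line.split()
            label = int(float(tokens[0]))
            pairs = [(idx, float(val))
                     for idx, val in (tok.split(":") for tok in tokens[1:])]
            yield label, pairs

def chunks(iterable, size=100_000):
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch

for batch in chunks(libsvm_rows("criteo_tb.svm")):   # path is a placeholder
    labels, pairs = zip(*batch)
    X = hasher.transform(pairs)
    clf.partial_fit(X, labels, classes=[0, 1])        # adjust if labels are -1/+1
```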

trams commented 4 years ago

@Sandy4321, if you want some help, please write the exact question you are looking for an answer to. What are you trying to achieve? Are you trying to build a GBDT model on the Criteo 1 TB dataset with, let's say, 100 trees, on one machine, in a reasonable time?

If so, I am afraid I do not know a way to do this aside from sampling the data. In practice, to build a model on a dataset which does not fit in memory, I have used two techniques:

  1. get more machines or more memory so it fits in memory
  2. shrink the dataset (by sampling negatives, for example) so that it fits in memory; a sketch of this follows below
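A hedged sketch of technique 2: keep every positive, keep a random fraction of negatives, and undo the resulting bias in the predicted probabilities with the standard prior-correction formula. The sampling rate below is an arbitrary illustration:

```python
import numpy as np

NEG_KEEP_RATE = 0.05          # keep 5% of negatives (arbitrary)
rng = np.random.default_rng(0)

def downsample(labels, features):
    # Keep all positives and a random NEG_KEEP_RATE fraction of negatives.
    keep = (labels == 1) | (rng.random(len(labels)) < NEG_KEEP_RATE)
    return labels[keep], features[keep]

def calibrate(p):
    # Map a probability predicted on the downsampled data back to the
    # original class distribution (standard prior-correction formula).
    return p / (p + (1.0 - p) / NEG_KEEP_RATE)
```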

While I am sure there are tricks to enable out-of-memory learning of GBDTs, I am not aware of any technique which is efficient enough. As for "online" learning, I do not think xgboost builds GBDTs online. You need at least one pass over the whole dataset to build one tree. So if you want a model with 100 trees, you would be forced to read the dataset 100 times, which will take ages (for a 370 GB dataset) if it is not in RAM.

trams commented 4 years ago

Either way, the purpose of this project is to demonstrate the ability to train a model in a distributed fashion.

Sandy4321 commented 4 years ago

https://stackoverflow.com/questions/38079853/how-can-i-implement-incremental-training-for-xgboost

This link has a discussion of incremental learning. As I understand it, we can take a chunk of data and build a model, then take another chunk of data and build another model, and afterwards join these models together. Technically, we just pass the previous model as a parameter when training the next one. Each model is a set of logical rules, so we can just combine the rules together. Regarding this ticket, it would be great to compare the performance of these two approaches, to find out how much incremental learning loses compared to optimal full-data processing. Thanks for caring, your answers are appreciated.
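A hedged sketch of that incremental scheme with xgboost, passing the previous booster back in via the xgb_model argument of xgb.train so each call adds trees on top of the earlier ones; chunk file names and parameters are placeholders:

```python
import xgboost as xgb
from sklearn.datasets import load_svmlight_file

params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 6}

booster = None
for chunk_path in ["chunk_00.svm", "chunk_01.svm", "chunk_02.svm"]:
    X, y = load_svmlight_file(chunk_path)
    dtrain = xgb.DMatrix(X, label=y)
    # xgb_model=None on the first call trains from scratch; afterwards the
    # new trees are appended to the previously trained booster.
    booster = xgb.train(params, dtrain, num_boost_round=10, xgb_model=booster)
```

Note that this keeps appending trees rather than refitting the earlier ones, so it is not equivalent to training once on the full dataset; comparing the two, as suggested above, would quantify how much the incremental approach loses.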