dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[RFC] 1Tb benchmark #4952

Open trams opened 5 years ago

trams commented 5 years ago

Hello nice people,

I would like to discuss with you building a big distributed benchmark so we can assess GPU/CPU performance. During 2019 Q4 I will have a few days to pursue this topic, and I would like to hear your opinion.

Is it useful? Will you run it from time to time?

Here is the gist of my idea. I want to build a reproducible benchmark which uses the Criteo 1Tb machine learning dataset (more about it here: https://labs.criteo.com/2013/12/download-terabyte-click-logs-2/).

The dataset has 39 features: 13 quantitative and 26 categorical. So for milestone 1 I would just drop all the categorical features and use the rest as a training dataset. I sized it recently and got ~100Gb when stored in memory as a Spark DataFrame, which is big enough to test a distributed scenario but small enough to test on one machine too (especially if downsampled).
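To make the milestone 1 preprocessing concrete, here is a rough sketch of what I have in mind (the paths, column names, and output format are placeholders, not final benchmark code):

```python
# Milestone 1 sketch: load the Criteo Terabyte TSV files and keep only the
# label plus the 13 quantitative features. Paths and the i*/c* column names
# are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("criteo-1tb-milestone1").getOrCreate()

label = [StructField("label", IntegerType())]
int_feats = [StructField(f"i{i}", IntegerType()) for i in range(1, 14)]
cat_feats = [StructField(f"c{i}", StringType()) for i in range(1, 27)]
schema = StructType(label + int_feats + cat_feats)

df = spark.read.csv("hdfs:///criteo/day_*", sep="\t", schema=schema)

# Drop the 26 categorical columns; keep the label and the 13 numeric features.
numeric_only = df.select("label", *[f"i{i}" for i in range(1, 14)])
numeric_only.write.parquet("hdfs:///criteo/numeric_only", mode="overwrite")
```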

As milestone 2 (work in progress), I want to apply a simple version of mean encoding (something like https://towardsdatascience.com/why-you-should-try-mean-encoding-17057262cd0). I have a working prototype. I think I will just apply the mean transformation once and publish the resulting dataset somewhere.

This should give us a bigger version of the benchmark; I expect it to be ~300-400Gb.
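For reference, a minimal sketch of the kind of mean encoding I mean, continuing from the DataFrame in the sketch above (the column name `c1`, the smoothing term, and its strength are illustrative assumptions, not my actual prototype):

```python
# Mean (target) encoding of one categorical column, computed once over the
# training data. `df` is the DataFrame from the previous sketch.
from pyspark.sql import functions as F

global_mean = df.agg(F.avg("label")).first()[0]

# Per-category average label; rare values are pulled toward the global mean
# by a simple smoothing term.
stats = df.groupBy("c1").agg(F.avg("label").alias("c1_mean"),
                             F.count("*").alias("c1_cnt"))
alpha = 100  # smoothing strength, arbitrary choice
stats = stats.withColumn(
    "c1_enc",
    (F.col("c1_mean") * F.col("c1_cnt") + alpha * F.lit(global_mean))
    / (F.col("c1_cnt") + alpha),
)

# Replace the categorical column with its encoded value; unseen categories
# fall back to the global mean.
encoded = (df.join(stats.select("c1", "c1_enc"), on="c1", how="left")
             .withColumn("c1_enc", F.coalesce("c1_enc", F.lit(global_mean)))
             .drop("c1"))
```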

What do you think? Do you want to contribute in any way? Will you rerun it from time to time (from release to release, for example, or every month)?

Do you have any cloud budget for this kind of thing, or should I also try to find a corporate sponsor?

trivialfis commented 5 years ago

@rongou

trivialfis commented 5 years ago

@wbo4958

hcho3 commented 5 years ago

> Do you have any cloud budget for this kind of thing, or should I also try to find a corporate sponsor?

I'm in the process of enabling finer-grained financial control over the CI server. This is also necessary to enable regression tests, to ensure that training performance stays consistent between releases.

trivialfis commented 5 years ago

@hcho3 We are currently trying to scale the GPU algorithm with Spark to tackle this dataset. I don't think it's possible to run it on Jenkins. Maybe we (NVIDIA) can run it on a regular basis.

hcho3 commented 5 years ago

@trivialfis It would be awesome if NVIDIA can host this benchmark.

trivialfis commented 5 years ago

@hcho3 Not really. That's not my decision to make (I can't claim that responsibility), but I can run it when needed.

hcho3 commented 5 years ago

Well, if you are running it using NVIDIA's resources, then from my perspective NVIDIA is "hosting" the benchmark. That's what I intended.

trivialfis commented 5 years ago

Yup. Definitely. :-) Clarified.

trams commented 5 years ago

@trivialfis feel free to contact me (us? Criteo?) if you need any help with preprocessing this dataset. I am preparing some notebooks now.

Sandy4321 commented 4 years ago

Has anybody tried to run it locally? Like https://github.com/rambler-digital-solutions/criteo-1tb-benchmark

Sandy4321 commented 4 years ago

Do you know of a code example to run it fully locally, without Spark? Some kind of online learning?

trams commented 4 years ago

@Sandy4321, what do you mean by "running locally"? Bear in mind that the size of the dataset is 1Tb, and unless the training set fits in memory the performance will be terrible. If you want to attempt to train on one machine (I assume that is what "run fully locally" means), you either need to sample the data (1-5% to fit on a machine with 16-32Gb of memory) or acquire a big fat machine with 512Gb or 1Tb of RAM.
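For the sampling route, here is a rough sketch of what that could look like on a single machine (the file path, the 1% sampling rate, and the column names are placeholders):

```python
# Sample ~1% of one Criteo day file in chunks so it fits in 16-32Gb of RAM,
# then train a numeric-features-only model locally.
import pandas as pd
import xgboost as xgb

cols = ["label"] + [f"i{i}" for i in range(1, 14)] + [f"c{i}" for i in range(1, 27)]

sampled_chunks = []
reader = pd.read_csv("day_0.gz", sep="\t", names=cols, chunksize=1_000_000)
for chunk in reader:
    # Keep roughly 1% of each chunk.
    sampled_chunks.append(chunk.sample(frac=0.01, random_state=0))

sample = pd.concat(sampled_chunks, ignore_index=True)

# Milestone-1 style: numeric features only.
X = sample[[f"i{i}" for i in range(1, 14)]]
y = sample["label"]

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"objective": "binary:logistic", "tree_method": "hist"},
                    dtrain, num_boost_round=100)
```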

Sandy4321 commented 4 years ago

I mean online, row-by-row processing. Also, this data https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#criteo_tb is relatively small, only a 370Gb zipped file. Also, it seems XGBoost has an online mode where data is processed in chunks?

trams commented 4 years ago

@Sandy4321 the point of this ticket is to create a reliable distributed benchmark for the XGBoost GPU backend.

I would not call a 370Gb file small if you want to process it on one machine. With a typical sequential read speed for a spinning drive of 50-100 MB per second, it would take ~1-2 hours just to read this file. If you want to build a model with 100 trees and you do not have enough RAM to load the whole file, you would end up reading it at least 100 times, which gives an estimate of 4-8 days to build a model.
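A quick back-of-envelope check of those numbers:

```python
# Rough read-time estimate for a 370Gb file on a spinning drive
# (illustrative only).
size_gb = 370
for mb_per_s in (50, 100):
    one_pass_h = size_gb * 1024 / mb_per_s / 3600
    print(f"{mb_per_s} MB/s: one pass ~{one_pass_h:.1f} h, "
          f"100 passes ~{one_pass_h * 100 / 24:.1f} days")
```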

What is your use case? I feel like I misunderstood you. Do you want to train a GBDT model on the Criteo 1TB dataset?

You can train a model on one machine if the data fits in memory. As far as I've heard, it is also possible even if it does not fit (keyword: external memory), but I have never tried it and I think it will be VERY slow for a dataset of this size (370Gb).
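For reference, a hedged illustration of that external memory mode, based on the cache-file syntax documented for XGBoost releases around that time (the exact format depends on the version, newer releases use a DataIter-based interface instead, and the file names here are placeholders):

```python
# External memory sketch: stream a libsvm file through an on-disk cache
# instead of loading it fully into RAM.
import xgboost as xgb

# The "#dtrain.cache" suffix asks XGBoost to build an external-memory cache
# for the training file.
dtrain = xgb.DMatrix("criteo_train.libsvm#dtrain.cache")

params = {"objective": "binary:logistic", "tree_method": "hist"}
booster = xgb.train(params, dtrain, num_boost_round=100)
```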

Sandy4321 commented 4 years ago

But somehow RTB streaming data is processed in real time? When the only way is to process it in chunks?

trams commented 4 years ago

I am sorry, I do not know what you mean by "RTB streaming data".

Sandy4321 commented 4 years ago

It is when data comes to the algorithm in portions. In real life, today you do not have the data for tomorrow. Also, today the data from yesterday is no longer relevant. So you can only train a model using data from a relatively small time window. From this perspective, building a model on the whole 1TB of data is a mistake and pseudo-science.