dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Saving out QuantileDMatrix to file #9545

Open dthiagarajan opened 1 year ago

dthiagarajan commented 1 year ago

Hi XGBoost community,

Are there any plans to add support for saving out QuantileDMatrix to file, like DMatrix.save_binary? Creating the QuantileDMatrix has been a RAM bottleneck for me, and I'm hoping to potentially decrease that by loading the QDM from file.
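For reference, here is the DMatrix round trip I'd like an equivalent of (a minimal sketch; the data and file name are illustrative):

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 10).astype(np.float32)
y = np.random.rand(1000).astype(np.float32)

# DMatrix supports a binary round trip today.
dtrain = xgb.DMatrix(X, label=y)
dtrain.save_binary("train.buffer")    # save to file
dtrain = xgb.DMatrix("train.buffer")  # load it back

# QuantileDMatrix has no equivalent save/load yet.
qdm = xgb.QuantileDMatrix(X, label=y)
```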

Thanks in advance!

trivialfis commented 1 year ago

Have you tried to construct it from an iterator to reduce RAM usage? Splitting the data into 3GB per batch can be a good starting point.
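A minimal sketch of the iterator approach (batch contents and sizes are illustrative; see the DataIter documentation for details):

```python
import numpy as np
import xgboost as xgb

class BatchIter(xgb.DataIter):
    """Yield pre-split (X, y) batches to QuantileDMatrix one at a time."""

    def __init__(self, batches):
        self._batches = batches  # list of (X, y) tuples, e.g. ~3GB each
        self._it = 0
        super().__init__()

    def next(self, input_data):
        # Return 0 when the batches are exhausted, 1 otherwise.
        if self._it == len(self._batches):
            return 0
        X, y = self._batches[self._it]
        input_data(data=X, label=y)
        self._it += 1
        return 1

    def reset(self):
        self._it = 0

batches = [(np.random.rand(1000, 10), np.random.rand(1000)) for _ in range(4)]
qdm = xgb.QuantileDMatrix(BatchIter(batches))
```

This way only one batch needs to be materialized in memory at a time during construction.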

dthiagarajan commented 1 year ago

I have tried this, and this does indeed help, but I'm worried that this will slow down the time to build each tree. Is it expected that the time to build each tree slows down significantly when constructing the QDM/DM from an iterator? I've observed this locally on some much smaller examples, so I'm worried that this will cause the time per tree to balloon for my larger datasets that I need to train on.

trivialfis commented 1 year ago

It won't slow down the tree building, but the construction of the QDM might take longer, since the data batches are loaded from external memory (QDM needs to iterate over them multiple times to finish construction).

Having said that, I think it's possible to add support for save_binary, given that we wouldn't promise backward compatibility for the saved format.

dthiagarajan commented 1 year ago

Interesting, so you wouldn't expect the iteration time to be slower if I construct a QDM using a DataIter subclass? Would you expect it to be slower if I construct a DMatrix using a DataIter? And how exactly does the construction happen? Are the data batches all iterated over once at the beginning of training? Or does it happen each iteration?

trivialfis commented 1 year ago

> if I construct a DMatrix using a DataIter

That would be using the external-memory version of XGBoost, which would indeed slow down training significantly, especially before 2.0. For details, please visit the documentation site.
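Roughly, the difference is which class the iterator is fed to; a sketch, assuming a DataIter subclass like the BatchIter sketched earlier:

```python
import xgboost as xgb

# Hypothetical: BatchIter as sketched above. For the external-memory
# DMatrix, the iterator should be constructed with an on-disk cache,
# i.e. super().__init__(cache_prefix="./cache") in __init__.
it = BatchIter(batches)

qdm = xgb.QuantileDMatrix(it)  # quantized in-memory; batches are read
                               # only during construction
dm = xgb.DMatrix(it)           # external memory; batches are re-read
                               # from the cache throughout training
```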

> Are the data batches all iterated over once at the beginning of training?

On CPU, the data is iterated over four times; on GPU, twice.

> Or does it happen each iteration?

No, only at the beginning of training for QDM.

mlqmlq commented 1 month ago

Hey, I was wondering if there are any updates on this? Also, how much memory can QDM save when constructed from a dense NumPy float32 matrix? For example, if my data matrix has 5 million observations and 1000 features (~20GB in NumPy), is there a rule of thumb for estimating the QDM size? Thanks!

trivialfis commented 1 month ago

Not yet, but it should be closer now that we can export the QDM to a SciPy CSR matrix. We still need the import part.

It depends on the number of bins and the number of features, along with the CPU/GPU difference. For fully dense data with no missing values, 256 bins, and float32 input, the GPU QDM will be about a quarter of the input size in the upcoming release; the current release uses more memory. CPU is a bit more complicated; I don't have a simple description yet and will look into it.
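As a back-of-the-envelope check for your example (my arithmetic, not an official formula): with 256 bins, each cell compresses from a 4-byte float32 to a 1-byte bin index:

```python
n_rows, n_cols = 5_000_000, 1000

raw_gb = n_rows * n_cols * 4 / 1024**3  # float32 input: ~18.6 GB
qdm_gb = n_rows * n_cols * 1 / 1024**3  # 1 byte per cell at 256 bins: ~4.7 GB

print(f"raw: {raw_gb:.1f} GB, ~quarter-size QDM: {qdm_gb:.1f} GB")
```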

mlqmlq commented 1 month ago

Thanks a lot for the reply! Really looking forward to the upcoming release that further reduces GPU QDM size! In my experiment (xgb-2.1.1), the GPU QDM currently used roughly 40-50% of the RAM of my float32 NumPy dense matrix.

I wonder why there is a difference between CPU and GPU QDM size? My understanding was that hist transforms a numerical float32 column into integer bin indices in {0, ..., 255}, which is why we'd expect 1/4 the RAM usage.
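To spell out my mental model, a conceptual sketch of the quantization (my assumption, not XGBoost's actual implementation):

```python
import numpy as np

col = np.random.rand(1_000_000).astype(np.float32)  # 4 MB

# Quantile cut points for 256 bins, then map each value to a bin index.
cuts = np.quantile(col, np.linspace(0, 1, 257)[1:-1])
binned = np.digitize(col, cuts).astype(np.uint8)    # 1 MB

assert binned.max() <= 255
```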

trivialfis commented 1 month ago

From an abstract view, it's mostly caused by the need to handle sparse data as well, so we have some extra structures there. We can eliminate them in the future, but the overhead is not particularly significant, so we haven't prioritized it yet.

Your calculation is correct. The GPU implementation prioritizes memory usage over computation performance from time to time: it can compress the data even further when the number of bins is smaller than 256, and it drops the extra structures to save memory. For the CPU, the priority is reversed, as it has relatively more memory available but is slower.