Optimize memory usage with pandas input.

dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

https://xgboost.readthedocs.io/en/stable/

Apache License 2.0

Optimize memory usage with pandas input. #8927

Open trivialfis opened 1 year ago

trivialfis commented 1 year ago

The special `qid` column introduced in https://github.com/dmlc/xgboost/pull/8859 is actually quite expensive, because the pandas `drop` method makes a copy of the data. After some profiling, I found that extracting a dictionary of columns instead saves memory (about 6 GB less for 5-fold CV with istella-s).
We might want to iterate through the columns in C like what we currently do for cuDF.
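
For readers following along, the two approaches contrasted above can be sketched roughly as follows. This is only an illustration, not the code from the linked PR: the frame, the column names `f0` and `f1`, and the variable names are all made up, and whether a given extraction is a view or a copy depends on the pandas version and its internals.

```python
import numpy as np
import pandas as pd

# Hypothetical ranking data: two feature columns plus the special `qid` column.
df = pd.DataFrame(
    {
        "f0": np.arange(6, dtype=np.float32),
        "f1": np.arange(6, dtype=np.float32) * 2,
        "qid": np.array([0, 0, 0, 1, 1, 1], dtype=np.int64),
    }
)

# Approach 1: drop the column. This builds a second DataFrame; without
# Copy-on-Write, pandas copies the remaining feature data into it.
features_dropped = df.drop(columns=["qid"])

# Approach 2: pull the columns out one at a time into a plain dict.
# No intermediate DataFrame is constructed.
qid = df["qid"]
features_by_name = {name: df[name] for name in df.columns if name != "qid"}
```

The dict keeps the same feature columns as the dropped frame, so a consumer that walks columns one by one can use either; the actual saving should be confirmed by profiling on real data.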

s-banach commented 1 year ago

I didn't read the code very carefully, but if you make your qid column a pyarrow-backed pandas Series, can it then be added and dropped without copying the other columns?

trivialfis commented 1 year ago

I can't be sure. That's internal to pandas and Arrow, so I have to assume that even if it's true today, it can change in the future.

phofl commented 1 year ago

I'd recommend giving Copy-on-Write a shot if you are concerned about inefficient memory usage. We removed a lot of deep copies and generally made things more efficient (I wouldn't recommend using it with pandas < 2.0, though).

> I didn't read the code very carefully, but if you make your qid column a pyarrow-backed pandas Series, can it then be added and dropped without copying the other columns?

No. That is independent of the dtype.
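
A minimal sketch of what opting in to Copy-on-Write might look like, assuming pandas >= 2.0 as recommended above. The frame and column names are hypothetical, and the sharing behaviour is checked at runtime rather than taken for granted, since it is internal to pandas.

```python
import numpy as np
import pandas as pd

# Opt in to Copy-on-Write (assumes pandas >= 2.0).
pd.options.mode.copy_on_write = True

# Hypothetical frame with two features and a `qid` column.
df = pd.DataFrame(
    {
        "f0": np.zeros(4, dtype=np.float32),
        "f1": np.ones(4, dtype=np.float32),
        "qid": np.array([0, 0, 1, 1], dtype=np.int64),
    }
)

dropped = df.drop(columns=["qid"])

# Check whether the result still points at the parent's feature buffer.
shares_before = np.shares_memory(
    df["f0"].to_numpy(), dropped["f0"].to_numpy()
)

# Writing to the result is what triggers a copy; the parent is untouched.
dropped.loc[0, "f0"] = 99.0
parent_value = float(df.loc[0, "f0"])
child_value = float(dropped.loc[0, "f0"])
```

With Copy-on-Write enabled, `drop` can defer the copy until one of the two frames is written to, so a read-only consumer may never pay for it.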

s-banach commented 1 year ago

I thought the point of Arrow was that the columns are stored separately, whereas the pandas default is to store columns of the same dtype in a 2D NumPy array, which would obviously need to be reallocated if you add or drop a column.

phofl commented 1 year ago

You can still use views without reallocating the arrays. The problem is a bit different though:

pandas allows in-place modification, i.e. mutating objects in place, so most operations make defensive copies to avoid side effects.