dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Optimize memory usage with pandas input. #8927

Open trivialfis opened 1 year ago

trivialfis commented 1 year ago
s-banach commented 1 year ago

I didn't read the code very carefully, but if you make your qid column a pyarrow-backed pandas Series, can it then be added and dropped without copying the other columns?

trivialfis commented 1 year ago

I can't be sure. That's internal to pandas and Arrow, so I have to assume that even if it's true today, it can change in the future.

trivialfis commented 1 year ago

Related: https://github.com/pandas-dev/pandas/pull/51463 .

phofl commented 1 year ago

I'd recommend giving Copy-on-Write a shot if you are concerned about inefficient memory usage. We removed a lot of deep copies and generally made things more efficient (I wouldn't recommend using it with pandas < 2.0, though).
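For anyone who wants to try this, here is a minimal sketch of opting in to Copy-on-Write and dropping a `qid` column. The frame and its column names (`x`, `y`, `qid`) are made up for illustration and are not taken from the xgboost code. It assumes pandas >= 2.0; on pandas 3.x Copy-on-Write should already be the default, so the option line should be unnecessary there. Whether the buffers end up shared is a pandas implementation detail, so the sketch only prints it and does not rely on it.

```python
import numpy as np
import pandas as pd

# Opt in to Copy-on-Write (assumes pandas >= 2.0).
pd.options.mode.copy_on_write = True

# Hypothetical frame: two feature columns plus a query-id column.
df = pd.DataFrame({"x": [0.1, 0.2, 0.3], "y": [1.0, 2.0, 3.0], "qid": [0, 0, 1]})

# Drop the qid column; with Copy-on-Write the result may reuse the
# buffers of `df` instead of copying them eagerly.
features = df.drop(columns=["qid"])

# Implementation detail, shown only for inspection.
shares = np.shares_memory(df["x"].to_numpy(), features["x"].to_numpy())
print("buffers shared after drop:", shares)

# Writing to the result never changes the original frame; if the data
# was shared, the copy happens at this point.
features.loc[0, "x"] = 99.0
print(df.loc[0, "x"], features.loc[0, "x"])
```

The point of the design is that the copy is deferred until one of the two objects is actually written to, so a read-only consumer never pays for it.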

> I didn't read the code very carefully, but if you make your qid column a pyarrow-backed pandas Series, can it then be added and dropped without copying the other columns?

No. That is independent of the dtype.

s-banach commented 1 year ago

I thought the point of Arrow was that the columns are stored separately, whereas the pandas default is to store columns of the same dtype in a 2D NumPy array, which would obviously need to be reallocated if you add or drop a column.

phofl commented 1 year ago

You can still use views without reallocating the arrays. The problem is a bit different though:

pandas allows in-place modification, i.e. mutating an object in place. Most operations therefore make defensive copies to avoid side effects.
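The trade-off between views and defensive copies can be illustrated with plain NumPy. This is only an analogy under stated assumptions: the `block` array below stands in for a group of same-dtype columns stored together, and it is not how pandas actually manages its internals.

```python
import numpy as np

# Stand-in for a block of same-dtype columns: 3 "columns" of 4 values each.
block = np.arange(12, dtype=np.float64).reshape(3, 4)

# "Dropping" the last column as a view: no new buffer is allocated.
kept_view = block[:2]
print(np.shares_memory(block, kept_view))

# The catch: an in-place write through the view leaks into the original.
kept_view[0, 0] = -1.0
leaked = block[0, 0]

# A defensive copy avoids the side effect at the cost of a new allocation.
kept_copy = block[:2].copy()
kept_copy[0, 1] = -2.0
untouched = block[0, 1]

print(leaked, untouched)
```

So a view is cheap but unsafe when in-place mutation is allowed, and a copy is safe but costs memory.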