Open trivialfis opened 1 year ago
I didn't read the code very carefully, but if you make your qid column a pyarrow-backed pandas Series, can it then be added and dropped without copying the other columns?
I can't be sure. That's internal to pandas and arrow, so I have to assume that even if it's true today, it can change in the future.
I'd recommend giving Copy-on-Write a shot if you are concerned about inefficient memory usage. We removed a lot of deep copies and generally made things more efficient (I wouldn't recommend using it with pandas < 2.0 though).
I didn't read the code very carefully, but if you make your qid column a pyarrow-backed pandas Series, can it then be added and dropped without copying the other columns?
No. That is independent of the dtype.
I thought the point of arrow was that the columns are stored separately, whereas the pandas default is to store columns of the same dtype in a 2d numpy array, which would obviously need to be reallocated if you add or drop a column.
You can still use views without reallocating the arrays. The problem is a bit different though: pandas enables in-place modifications, e.g. mutating objects in place. Most operations perform defensive copies to avoid side effects.
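To make the defensive-copy point and the Copy-on-Write suggestion concrete, here is a minimal sketch assuming pandas >= 2.0. The toy frame, the column names `f0` and `f1`, and the values are made up for illustration. Whether `drop` actually shares memory with the original under the hood is a pandas internal and is not checked here; the sketch only shows that a write to the result does not leak back.

```python
import pandas as pd

# Opt in to Copy-on-Write (assumes pandas >= 2.0; newer releases may already
# enable it by default and warn that the option is deprecated).
pd.set_option("mode.copy_on_write", True)

# Hypothetical toy frame; "qid" plays the role of the query-id column.
df = pd.DataFrame(
    {"qid": [0, 0, 1], "f0": [0.1, 0.2, 0.3], "f1": [1.0, 2.0, 3.0]}
)

# With Copy-on-Write the result of drop may share data with df until one of
# them is written to; that detail is internal to pandas.
features = df.drop(columns="qid")

# Writing to the result must not change the original frame.
features.loc[0, "f0"] = 100.0
```

After the write, `df.loc[0, "f0"]` is still `0.1`, so the side-effect guarantee is the same with or without Copy-on-Write; the difference is only in when the copy happens.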
qid column introduced in https://github.com/dmlc/xgboost/pull/8859 is actually quite expensive, as the pandas drop method makes a data copy. After some profiling, extracting a dictionary of columns actually saves memory (reducing about 6GB for 5-fold cv with istella-s).
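For reference, "extracting a dictionary of columns" could look roughly like the sketch below. This is not the actual XGBoost change, just an illustration of the idea on a made-up frame; the column names `f0` and `f1` are hypothetical. Whether each per-column selection is a view or a copy is a pandas internal, so the memory saving should be confirmed by profiling on real data.

```python
import pandas as pd

# Hypothetical toy frame; "qid" plays the role of the query-id column.
df = pd.DataFrame(
    {"qid": [0, 0, 1], "f0": [0.1, 0.2, 0.3], "f1": [1.0, 2.0, 3.0]}
)

# Pull the query ids out once.
qid = df["qid"].to_numpy()

# Select every remaining column individually instead of calling df.drop,
# which builds a whole new frame.
columns = {name: df[name] for name in df.columns if name != "qid"}
```

The resulting `columns` dict maps each feature name to its Series, and `qid` is handled separately, so no intermediate frame without `qid` is ever materialized.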