Hello, I notice that the memory taken by the QuantileDMatrix (QDM) is much higher when there are some NaN values in the data.

To reproduce:
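The script is along these lines (a simplified sketch: the array shape, the NaN pattern, and the RSS-based memory measurement are stand-ins for what I actually ran):

```python
import time

import numpy as np
import psutil
import xgboost as xgb


def rss_gib() -> float:
    """Resident set size of the current process, in GiB."""
    return psutil.Process().memory_info().rss / 1024**3


def create_qdm(X: np.ndarray, title: str) -> None:
    before = rss_gib()
    start = time.perf_counter()
    qdm = xgb.QuantileDMatrix(X)  # NaN is the default missing value
    print(f"# {title}")
    print(f"Time (s): {time.perf_counter() - start}")
    print(f"QDM Memory (GiB): {rss_gib() - before}")
    del qdm


# Shape chosen to match the dataframe described below; the original
# script's shape isn't shown here.
X = np.zeros((650_000, 5_100), dtype=np.float32)
create_qdm(X, "First creation, data is all 0")

# Artificially replace some entries (10% here) with np.nan.
X[::2, ::5] = np.nan
create_qdm(X, "Second creation, data contains NaN")
```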
The memory usage increases by 2.5x when I artificially replace some entries with np.nan:

```
# First creation, data is all 0
Time (s): 10.207394919998478
QDM Memory (GiB): 6.4759521484375

# Second creation, data contains NaN
Time (s): 122.22072060202481
QDM Memory (GiB): 15.553939819335938
```
The inflation can be worse when I create a QDM from a different, proprietary dataset with a similar size and NaN count. For a 12.5G in-memory dataframe with 650k rows and 5100 columns, all in float32, the memory usage of the QDM drops from 28.3G to 7.9G simply by adding a `.fillna(0)` before creating the QDM.
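In other words, with `df` standing in for the proprietary dataframe (illustrative only; the numbers in the comments are the measurements reported above):

```python
qdm = xgb.QuantileDMatrix(df)            # ~28.3G with the NaNs left in place
qdm = xgb.QuantileDMatrix(df.fillna(0))  # ~7.9G after filling the NaNs
```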
This problem also happens when I use a data iterator. I partition my dataset into 7 parquet files and use a data iterator to pass each chunk to the `input_data` handler sequentially. The QDM takes 44.0G at first, but if I add `.fillna(0)` before passing the chunks to `input_data`, it decreases to 10.0G.
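Concretely, the iterator looks roughly like this (the file names are placeholders, and `ParquetIter` is a simplified stand-in for my actual iterator subclassing `xgb.DataIter`):

```python
from typing import Callable, List

import pandas as pd
import xgboost as xgb


class ParquetIter(xgb.DataIter):
    """Feed parquet partitions to QuantileDMatrix one chunk at a time."""

    def __init__(self, files: List[str], fill_nan: bool = False) -> None:
        self._files = files
        self._fill_nan = fill_nan
        self._it = 0
        super().__init__()

    def next(self, input_data: Callable) -> bool:
        if self._it == len(self._files):
            return False  # no more chunks
        df = pd.read_parquet(self._files[self._it])
        if self._fill_nan:
            df = df.fillna(0)  # the workaround described above
        input_data(data=df)
        self._it += 1
        return True

    def reset(self) -> None:
        self._it = 0


files = [f"part-{i}.parquet" for i in range(7)]  # placeholder paths
qdm = xgb.QuantileDMatrix(ParquetIter(files))                 # 44.0G
qdm = xgb.QuantileDMatrix(ParquetIter(files, fill_nan=True))  # 10.0G
```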
I think it's a bug, because the documentation says that the QDM is designed to save memory.
However, the QDM can be much larger than the data in memory.
I can't find anything related to this in existing issues. Could someone please take a look?