dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/

Memory spikes when creating QuantileDMatrix from data containing NaN #10255

Closed bridgream closed 3 weeks ago

bridgream commented 3 weeks ago

Hello, I noticed that the memory taken by the QuantileDMatrix (QDM) is much higher when the data contains some NaN values.

To reproduce,

from timeit import default_timer

import numpy as np
import psutil
import xgboost

# ~12 GiB dense float32 array (650k rows x 5k columns), initially all zeros
mock_data = np.zeros((650000, 5000), dtype=np.float32)

def create_qdm_and_report_time_memory(df):
    # Measure process RSS before and after constructing the QuantileDMatrix
    memory_start = psutil.Process().memory_info().rss / 2**30
    time_start = default_timer()
    qdm = xgboost.QuantileDMatrix(df)
    time_end = default_timer()
    memory_end = psutil.Process().memory_info().rss / 2**30
    print("Time (s):", time_end - time_start)
    print("QDM Memory (GiB):", memory_end - memory_start)
    del qdm

create_qdm_and_report_time_memory(mock_data)
# Set the first 20 rows to NaN (the column slice exceeds the 5000 columns,
# so it covers every column in those rows)
mock_data[:20, :600000] = np.nan
create_qdm_and_report_time_memory(mock_data)

The memory usage increases by about 2.5x after I artificially replace some entries with np.nan.

# First creation, data is all 0
Time (s): 10.207394919998478
QDM Memory (GiB): 6.4759521484375

# Second creation, data contains NaN
Time (s): 122.22072060202481
QDM Memory (GiB): 15.553939819335938

The inflation can be worse when I create a QDM from a different proprietary dataset of similar size and NaN count. For a 12.5 GiB in-memory dataframe with 650k rows and 5100 columns, all float32, the memory usage of the QDM drops from 28.3 GiB to 7.9 GiB simply by adding a .fillna(0) before calling QuantileDMatrix.
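For reference, a minimal sketch of that workaround, using a small hypothetical DataFrame in place of the proprietary dataset; note that filling with 0 changes the meaning of those cells from "missing" to an actual zero value:

import numpy as np
import pandas as pd
import xgboost

# Hypothetical DataFrame standing in for the proprietary dataset (much smaller here).
df = pd.DataFrame(np.random.rand(1000, 50).astype(np.float32))
df.iloc[:20, :] = np.nan

# Replacing NaN with 0 before construction keeps the data dense,
# but treats the missing values as real zeros.
qdm = xgboost.QuantileDMatrix(df.fillna(0))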

This problem also happens when I use a data iterator. I partition my dataset into 7 Parquet files and use a data iterator to pass each chunk to the input_data handle sequentially. The QDM takes 44.0 GiB at first, but if I add .fillna(0) before passing each chunk to input_data, it decreases to 10.0 GiB.
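A rough sketch of that iterator setup, assuming the partitions are named chunk_0.parquet ... chunk_6.parquet (the file names and the placement of fillna(0) are illustrative, not the exact code used here):

import pandas as pd
import xgboost

class ParquetIter(xgboost.DataIter):
    """Feed Parquet chunks to QuantileDMatrix one at a time."""

    def __init__(self, paths):
        self._paths = paths
        self._it = 0
        super().__init__()

    def next(self, input_data):
        # Return 0 when the iterator is exhausted, 1 otherwise.
        if self._it == len(self._paths):
            return 0
        chunk = pd.read_parquet(self._paths[self._it])
        # Filling NaN with 0 before handing the chunk over avoids the blow-up,
        # at the cost of treating missing values as zeros.
        input_data(data=chunk.fillna(0))
        self._it += 1
        return 1

    def reset(self):
        self._it = 0

# Hypothetical file names for the 7 partitions.
it = ParquetIter([f"chunk_{i}.parquet" for i in range(7)])
qdm = xgboost.QuantileDMatrix(it)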

I think it's a bug because the documentation says

Use the QuantileDMatrix (with iterator if necessary) when you can fit most of your data in memory.

However, the QDM can be much larger than the data in memory.

I can't find anything related to this in existing issues. Could someone please take a look?

trivialfis commented 3 weeks ago

Not surprising: the code paths for dense and sparse data diverge. With dense data, we can compress the histogram bin index down to a single byte.
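A back-of-the-envelope check of the reported numbers, assuming the dense path stores roughly one byte per cell for the bin index (with the default max_bin=256) and the sparse path stores a wider index plus per-entry overhead; these per-entry sizes are assumptions for illustration, not the exact internal layout:

rows, cols = 650_000, 5_000
cells = rows * cols  # 3.25e9 entries

# Dense path: ~1 byte per cell for the compressed bin index (assumption).
dense_gib = cells * 1 / 2**30
# Sparse path: ~4 bytes per entry for the bin index alone (assumption),
# before any column-index/offset overhead.
sparse_gib = cells * 4 / 2**30

print(f"dense ~ {dense_gib:.1f} GiB, sparse ~ {sparse_gib:.1f} GiB")
# dense ~ 3.0 GiB, sparse ~ 12.1 GiB -- the same order of magnitude as the
# 6.5 GiB vs 15.6 GiB observed above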

trivialfis commented 3 weeks ago

Feel free to reopen if there are further questions.