dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Unnecessary memory allocations for `hist` method? #4915

Closed thvasilo closed 4 years ago

thvasilo commented 4 years ago

The hist method makes heavy use of a HistCollection container that is supposed to maintain a collection of histograms for each node (leaf) id.

The backing data structure is a vector of vectors of GradStatHist objects.

Now, whenever we add a new node id to this collection, we resize this vector to hold enough elements so that the highest nid provided is a valid index.

From some print debugging (printing out the nodeid here, i.e. the node id of every node being synced), it seems like node ids are non-contiguous. For example they might be [1,3,5,7,9,11,13]. However, whenever we expand the size of the container, it looks like we allocate memory for all node ids, regardless of whether they represent an actual node or not. Is this the case, or are node ids actually contiguous, making it necessary to always pre-populate every element?
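
To make the structure concrete, here is a minimal sketch of the behavior described above; the names (NodeHistStore, GradStat, AddHistRow) are simplified stand-ins, not the actual XGBoost HistCollection code:

```cpp
// Minimal sketch of the per-node histogram store as described above.
// Names are simplified stand-ins, not the actual XGBoost code.
#include <cstddef>
#include <vector>

struct GradStat {        // one histogram bin: gradient and hessian sums
  double sum_grad = 0.0;
  double sum_hess = 0.0;
};

class NodeHistStore {
 public:
  explicit NodeHistStore(std::size_t nbins) : nbins_(nbins) {}

  // Called with node ids such as 1, 3, 5, 7, ...  Every slot up to `nid`
  // receives a full nbins_-sized row, whether or not that slot
  // corresponds to a real node -- this is the suspected waste.
  void AddHistRow(std::size_t nid) {
    if (nid >= data_.size()) {
      data_.resize(nid + 1, std::vector<GradStat>(nbins_));
    }
  }

  std::vector<GradStat>& operator[](std::size_t nid) { return data_[nid]; }

 private:
  std::size_t nbins_;                        // bins per node
  std::vector<std::vector<GradStat>> data_;  // (max nid + 1) x nbins_
};
```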

As far as I understand, the 2D data vector has dimensions <MAX_NODE_ID> x <NBINS>, where uint32_t nbins = gmat.cut.Ptrs().back();. For example, the rcv1 dataset, which has ~42k features, produces 748,378 bins per node, each bin corresponding, I think, to two float values. So every time we add a redundant item to the _hist vector we waste num_bins * 2 float values, which can be very significant when we have, say, millions of features.

So my impression is that we are currently using twice as much memory as necessary, because we initialize every vector in the 2D vector of histogram values with nbins values, even though half of the node ids are not actually used (?).
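
As a rough back-of-the-envelope check of how much that could add up to (illustrative numbers only, assuming 4-byte floats and a hypothetical depth-8 tree with roughly half of its 511 node slots unused):

```cpp
// Illustrative numbers only: bytes allocated per node slot for rcv1-like
// dimensions, and the waste if roughly half of the slots are unused.
#include <cstdio>

int main() {
  const unsigned long long nbins = 748378ULL;      // bins per node, from gmat.cut.Ptrs().back()
  const unsigned long long bytes_per_bin = 2 * 4;  // grad + hess, assuming 4-byte floats
  const unsigned long long per_node = nbins * bytes_per_bin;  // ~5.7 MiB per node slot
  const unsigned long long unused = 255;           // e.g. ~half of 511 slots for a depth-8 tree
  std::printf("per-node row: %.1f MiB, waste for %llu unused slots: %.2f GiB\n",
              per_node / (1024.0 * 1024.0), unused,
              per_node * unused / (1024.0 * 1024.0 * 1024.0));
  return 0;
}
```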

Is my analysis correct? AFAIK in approx we get around this issue by having a translation from nodeid to a contiguous "working_id" or index. That allows us to pack all histogram values in a contiguous array of doubles of total size <num_features * num_bins * 2> (one for gradients, one for hessians), and then use indexing tricks to get the right values.
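
A minimal sketch of that kind of nodeid-to-working-index packing; the names (PackedHist, AddNode, Row) are hypothetical, not the actual approx code:

```cpp
// Map sparse node ids to contiguous "working" slots so that histograms
// can be packed into one flat buffer, allocating only for real nodes.
#include <cstddef>
#include <unordered_map>
#include <vector>

class PackedHist {
 public:
  explicit PackedHist(std::size_t nbins) : nbins_(nbins) {}

  // Register a (possibly non-contiguous) node id and allocate exactly
  // one histogram row for it: 2 * nbins_ doubles (gradients + hessians).
  void AddNode(int nid) {
    auto it = nid2slot_.find(nid);
    if (it == nid2slot_.end()) {
      std::size_t slot = nid2slot_.size();
      nid2slot_.emplace(nid, slot);
      data_.resize((slot + 1) * nbins_ * 2, 0.0);
    }
  }

  // Pointer to the start of node `nid`'s histogram row in the flat buffer.
  double* Row(int nid) { return data_.data() + nid2slot_.at(nid) * nbins_ * 2; }

 private:
  std::size_t nbins_;
  std::unordered_map<int, std::size_t> nid2slot_;  // nodeid -> working index
  std::vector<double> data_;  // size = (#registered nodes) * nbins_ * 2
};
```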

I'll note that in my distributed experiments the hist method produces OOM errors for datasets that are easily handled by the approx method, and I think this might be one of the reasons.

thvasilo commented 4 years ago

I noticed this as part of looking into #4679. While the second per-node sync seems straightforward to fix (see this WIP) getting rid of the first loop could entail changes to how we build and use hist_ which would require a lot of effort to untangle.

trivialfis commented 4 years ago

I haven't looked into your references in detail yet. I believe the reason for node ids not being contiguous is the use of the pruner. On the GPU side, we don't use the pruner; instead we just don't apply the split when its gain is not sufficient. See #4874. But I will look into your investigation later, thanks for the detailed explanation.

thvasilo commented 4 years ago

Thanks @trivialfis. It's probably the case that something useful is done with every element in the 2D vector, because yesterday I tried populating only the new node id, rather than all ids up to and including it (<= id), and that led to crashes.

Weirdly enough, the single-machine code crashed, but I think I was able to run the distributed training.

I'd say the reason for the excessive memory use of hist is probably not what I described, but there's definitely something inefficient going on, especially for high dimensional data.

hcho3 commented 4 years ago

@thvasilo The generation of the quantized matrix (gmat_) is another source of inefficiency. Basically, we use quantile points to convert the elements of the input matrix into bin indices. Right now, all of the conversion is done in memory, which blows up memory usage. To save memory, we'll need to perform the conversion in an out-of-core fashion (a rough sketch follows the list below):

  1. Load data
  2. Generate quantile sketches
  3. Now load data again, this time performing quantization on the fly.
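
The hypothetical helpers below sketch those two passes; BatchSource, BuildCuts and Quantize are not existing XGBoost interfaces, and the quantile step is crudely stubbed with a bounded sample instead of a real GK / weighted-quantile sketch:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Batch {                        // one streamed chunk of a feature column
  std::vector<float> values;
};

struct BatchSource {                  // re-scannable, streamed data source
  virtual bool Next(Batch* out) = 0;  // fill `out`; return false at end
  virtual void Reset() = 0;           // rewind so the data can be read twice
  virtual ~BatchSource() = default;
};

// Pass 1 (steps 1-2): stream the data once and derive cut points.
std::vector<float> BuildCuts(BatchSource* src, int max_bins) {
  std::vector<float> sample;
  Batch batch;
  while (src->Next(&batch)) {
    for (float v : batch.values) {
      if (sample.size() < 100000) sample.push_back(v);  // bounded memory
    }
  }
  std::vector<float> cuts;
  if (sample.empty()) return cuts;
  std::sort(sample.begin(), sample.end());
  for (int i = 1; i < max_bins; ++i) {
    cuts.push_back(sample[sample.size() * i / max_bins]);
  }
  return cuts;
}

// Pass 2 (step 3): stream the data again and quantize on the fly, so the
// full quantized matrix is never materialized alongside the raw data.
void Quantize(BatchSource* src, const std::vector<float>& cuts,
              std::vector<uint32_t>* out_bins) {
  src->Reset();
  Batch batch;
  while (src->Next(&batch)) {
    for (float v : batch.values) {
      auto it = std::lower_bound(cuts.begin(), cuts.end(), v);
      out_bins->push_back(static_cast<uint32_t>(it - cuts.begin()));
    }
  }
}
```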

Furthermore, column_matrix_ generation is also memory inefficient. This is worse because the memory usage of column_matrix_ is proportional to the number of features!

The code I wrote when I was a grad student came back and bit every one of us :(

Aside: memory consumption of hist is also a big concern in my org (Amazon SageMaker). I'm trying to get resources to address it.

RAMitchell commented 4 years ago

I would like to tackle memory usage issues for GPU and CPU hist algorithms with a common strategy. I think this will go hand in hand with further re-factoring of DMatrix.

hcho3 commented 4 years ago

@RAMitchell Good to know. Anything I can do on my end to make the job easier?

RAMitchell commented 4 years ago

I probably can't work on this for a while. I would start by removing extra layers inside the DMatrix pipeline; it is extremely difficult to reason about and work with. I think the DataSource classes are basically redundant and their functionality should live inside the respective DMatrix classes. I would also like to see a lot of the code constructing a DMatrix live inside actual constructors instead of manually constructing DataSource objects.

I am interested in trying to create a common iterator layer (without memory allocation) over input data sources (e.g. csv, numpy 2d, libsvm). If this is possible we could calculate quantiles directly on external data, or even train directly on external data with DMatrix as a thin wrapper.
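
As an illustration of what such a layer might look like (a hypothetical interface, not an existing XGBoost class):

```cpp
// A common, allocation-free row iterator that format-specific adapters
// (CSV, dense numpy 2-D, LIBSVM, ...) would implement.
#include <cstddef>

struct Entry {            // one non-zero element of a row
  std::size_t column;
  float value;
};

struct RowView {          // non-owning view into the adapter's own buffers
  const Entry* data;
  std::size_t size;
};

class RowIterator {
 public:
  virtual bool Next(RowView* out) = 0;  // advance to the next row; false at end
  virtual void Reset() = 0;             // allow multiple passes, e.g. for quantiles
  virtual ~RowIterator() = default;
};

// Quantile sketching, histogram construction, or a thin DMatrix wrapper
// could then be written once against RowIterator instead of per format.
```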

hcho3 commented 4 years ago

@RAMitchell That sounds awesome. Yes, the DMatrix class can be improved a lot.

hcho3 commented 4 years ago

> The backing data structure is a vector of vectors of GradStatHist objects. Now, whenever we add a new node id to this collection, we resize this vector to hold enough elements so that the highest nid provided is a valid index.

This is fixed in #5156