dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Bring back distributed column training #1832

Closed AbdealiLoKo closed 6 years ago

AbdealiLoKo commented 7 years ago

It would be nice to have a method to train a model in distributed mode. Currently, the only way to distribute the model training is by using colmaker, which loads the full data into memory on every node and does the split computation in a distributed manner.

It would be nice for the distributed colmaker to also load only chunks of the data (columnar chunks). This way, users could have distributed model training without binning.
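
As a rough sketch of the idea in C++ (all names here are hypothetical, not the actual xgboost API): each worker would own only a subset of the feature columns, find its local best split, and an allreduce would pick the globally best one.

#include <utility>
#include <vector>

// Hypothetical types for illustration; not the real xgboost classes.
struct GradientPair { float grad = 0.f, hess = 0.f; };
struct Column {
  int feature_id = -1;
  std::vector<std::pair<float, int>> sorted_entries;  // (value, row index), sorted by value
};
struct SplitCandidate { int feature = -1; float threshold = 0.f; float loss_chg = 0.f; };

// Scan one locally owned column and return its best split.
// The real scan accumulates grad/hess along the sorted values; elided here.
SplitCandidate EnumerateSplits(const Column& col, const std::vector<GradientPair>& gpair) {
  (void)col; (void)gpair;
  return SplitCandidate{};
}

// Stand-in for an allreduce that agrees on the candidate with the largest
// loss_chg across workers; a no-op in a single process.
void AllreduceMaxByLossChg(SplitCandidate* best) { (void)best; }

SplitCandidate FindBestSplitDistributed(const std::vector<Column>& local_columns,
                                        const std::vector<GradientPair>& gpair) {
  SplitCandidate best;
  for (const Column& col : local_columns) {
    SplitCandidate cand = EnumerateSplits(col, gpair);
    if (cand.loss_chg > best.loss_chg) best = cand;
  }
  AllreduceMaxByLossChg(&best);  // every worker ends up with the same winning split
  return best;
}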

tqchen commented 7 years ago

Yes! This is an already supported feature, though it has not been used for a while, so I am not sure if it still works. The synchronization is done with a bitmap.

We deleted the example at some point: https://github.com/dmlc/xgboost/tree/df7c7930d09b707c93c21c3ebadae2e13769d808/multi-node/col-split

The relevant code is here: https://github.com/dmlc/xgboost/blob/master/src/tree/updater_colmaker.cc#L781

It would be nice to revive it :)
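
As a rough illustration of the bitmap synchronization mentioned above (hypothetical names, not the real DistColMaker code): only the worker that owns the split feature knows which rows go to the left child, so it sets bits in a bitmap, and an OR-allreduce broadcasts that decision to every worker.

#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal bitmap for illustration.
class Bitmap {
 public:
  explicit Bitmap(size_t n) : bits_((n + 31) / 32, 0u) {}
  void Set(size_t i) { bits_[i / 32] |= 1u << (i % 32); }
  bool Get(size_t i) const { return (bits_[i / 32] >> (i % 32)) & 1u; }
  std::vector<uint32_t>* Data() { return &bits_; }
 private:
  std::vector<uint32_t> bits_;
};

// Stand-in for an OR-allreduce over all workers; a no-op in a single process.
void AllreduceBitOr(std::vector<uint32_t>* bits) { (void)bits; }

// The worker that owns the split feature marks, per row, whether the row goes
// left; after the OR-allreduce, every worker sees the same row-to-child map.
void SyncRowPositions(const std::vector<float>& owned_feature_column,  // empty if not the owner
                      float split_value, Bitmap* goes_left) {
  for (size_t row = 0; row < owned_feature_column.size(); ++row) {
    if (owned_feature_column[row] < split_value) goes_left->Set(row);
  }
  AllreduceBitOr(goes_left->Data());
}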

AbdealiLoKo commented 7 years ago

So the requirements for this task are:

AbdealiLoKo commented 7 years ago

The DistColMaker part makes sense to me. But after building the first tree, the predictions need to be updated so that the gradients/hessians are calculated using the leaf weights from the first tree.

This shows that PredictRaw() and DoBoost() are called one after the other. Hence, after the first tree is built, PredictRaw() should update preds_ using tree 1.

But, as each node has only a small part of the columns, this cannot happen: PredictRaw() traverses the whole tree against the full column set and would treat the columns not loaded on this machine as simply missing (and hence take the default direction) ...
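
To illustrate the problem, here is a rough sketch (hypothetical structures, not the actual gbtree code) of how a naive traversal goes wrong when a worker holds only part of the columns:

#include <cmath>
#include <vector>

struct Node {
  int feature = -1;            // -1 marks a leaf
  float split_value = 0.f;
  int left = -1, right = -1, default_child = -1;  // default_child handles missing values
  float leaf_value = 0.f;
};

// row_features holds only the columns stored on this worker; absent columns are NaN.
float NaiveLeafValue(const std::vector<Node>& tree, const std::vector<float>& row_features) {
  int nid = 0;
  while (tree[nid].feature >= 0) {
    float v = row_features[tree[nid].feature];
    if (std::isnan(v)) {
      // Wrong under column partitioning: the value exists, just on another worker,
      // yet the traversal treats it as missing and takes the default branch.
      nid = tree[nid].default_child;
    } else {
      nid = (v < tree[nid].split_value) ? tree[nid].left : tree[nid].right;
    }
  }
  return tree[nid].leaf_value;  // can be the wrong leaf, so preds_ drift from the true values
}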

@tqchen How was this being tackled before? I can't seem to find any code which handles this.

tqchen commented 7 years ago

The new leaf value is handled by the GetLeafPosition function.

https://github.com/dmlc/xgboost/blob/6dabdd33e330bb0cda5b91c75516ff73d53ca316/src/gbm/gbtree.cc#L342

I removed this piece of code during the last refactor, but the logic can be added back when necessary. Specifically, GetLeafPosition gives the leaf position of each instance of the training matrix (which is obtained through the distributed sync procedure).
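
A sketch of this approach, with hypothetical glue code around the idea: instead of re-traversing the tree, reuse the per-row leaf positions that the updater already synchronized during construction and add the corresponding leaf values to the prediction buffer.

#include <cstddef>
#include <vector>

// Hypothetical stand-in for a trained tree: leaf values per node id, plus the
// node id each training row ended up in, as synchronized during construction.
struct TrainedTree {
  std::vector<float> leaf_value;
  std::vector<int> leaf_position;
  int GetLeafPosition(size_t row) const { return leaf_position[row]; }
};

// Add each row's leaf contribution to the prediction buffer directly from the
// synchronized positions, so no worker has to re-traverse the tree with its
// partial set of columns.
void UpdatePredsFromLeafPositions(const TrainedTree& tree, float learning_rate,
                                  std::vector<float>* preds) {
  for (size_t row = 0; row < preds->size(); ++row) {
    (*preds)[row] += learning_rate * tree.leaf_value[tree.GetLeafPosition(row)];
  }
}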

AbdealiLoKo commented 7 years ago

@tqchen I seem to need your help in debugging this again (sorry). So, I'm using the 0.6 code base for testing this feature.

Here, the second tree's split feature, split condition and loss_chg are exactly the same as in the first tree. But the cover (sum_hessian) and leaf_value in the second tree are not the same as in the first tree. I found this odd, as the gain (loss_chg) should have changed because the hessians changed!

I printed the preds and the gpair in every iteration and they are updating correctly, so the buffered prediction values logic works. But the split conditions seem to be cached somehow (?). I tried disabling cache_opt, but that did not help.

Any ideas on what I could look at to figure out why this is happening?

tqchen commented 7 years ago

The split condition is not cached. Did you try it on the simple mushroom example? I have not encountered this before. One thing you can do is look at the history of the dist-col example folder and do a bit of binary search over past revisions to check whether there is a version that works, since there used to be one.

AbdealiLoKo commented 7 years ago

Hmm. When trying it on my own dataset (which has float columns), I can see that the second tree's root split condition and gain are the same as tree 1's root. In the case of the agaricus example, only the gain of the root node in the second tree is the same (probably because the split condition does not matter anyway, as it is an indicator variable).

If I run agaricus for depth 4, 2 trees without distribution:

booster[0]:
0:[odor=pungent] yes=2,no=1,gain=4004.4,cover=6513
...
booster[1]:
0:[odor=pungent] yes=2,no=1,gain=3622.8,cover=6497.31

If I run agaricus for depth 4, 2 trees with column distribution:

booster[0]:
0:[odor=pungent] yes=2,no=1,gain=4004.4,cover=6513
...
booster[1]:
0:[odor=pungent] yes=2,no=1,gain=4004.4,cover=6497.31

(All other nodes' leaf values and splits match; I checked with a difftool.) This is really unusual. It seems like:

AbdealiLoKo commented 7 years ago

Figured it out. The snode needs to be cleared in updater_colmaker.cc#L175. The stemp is cleared correctly, but snode is not, so snode[nid].best.Update() was comparing against the previous tree's split condition.
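
A sketch of the kind of fix this describes (member names are modeled loosely on updater_colmaker.cc and are illustrative only): reset the per-node statistics at the start of each tree so a stale best split from the previous tree cannot win the comparison.

#include <cstddef>
#include <vector>

// Illustrative structures, not copied from the real code. The point is that
// the per-node statistics must be reset for each new tree; otherwise
// snode[nid].best still holds the previous tree's winning split and
// best.Update() keeps it.
struct SplitEntry { float loss_chg = 0.f; int split_index = -1; float split_value = 0.f; };
struct NodeEntry {
  double sum_grad = 0.0, sum_hess = 0.0;
  float root_gain = 0.f, weight = 0.f;
  SplitEntry best;               // a stale best split lingers here if snode is not cleared
};

struct BuilderState {
  std::vector<NodeEntry> snode;                  // per-node statistics
  std::vector<std::vector<NodeEntry>> stemp;     // per-thread scratch space

  void InitData(size_t nthread) {
    stemp.assign(nthread, {});   // already cleared in the 0.6 code
    snode.clear();               // the missing reset: drop the previous tree's entries too
  }
};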

tqchen commented 7 years ago

Thanks, do you mind opening a PR for this?

AbdealiLoKo commented 7 years ago

Will do in a bit. I need to clean up lots of stuff first. Thanks for all the help debugging :)

holyglenn commented 7 years ago

I wonder if this work has been integrated yet?

AbdealiLoKo commented 6 years ago

It wasn't =( By the time I had completed it, there were too many changes in the caching of DMatrices and the other pieces needed for distributed columns. I would have to redo all the work, and there are parts of the core I am definitely not comfortable editing.

So I'm still using a version from over a year ago due to this...