AbdealiLoKo closed this issue 6 years ago.
Yes! This is an already supported feature, though it has not been used for a while, so I am not sure if it still works. The synchronization is done with a bitmap.
We deleted the example at some point https://github.com/dmlc/xgboost/tree/df7c7930d09b707c93c21c3ebadae2e13769d808/multi-node/col-split
The relevant code is here: https://github.com/dmlc/xgboost/blob/master/src/tree/updater_colmaker.cc#L781
It would be nice to revive it :)
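To make the bitmap idea above concrete, here is a small hypothetical sketch (names and structure are illustrative, not the actual xgboost internals): each worker owns a disjoint subset of columns and finds its local best split; after the global best is chosen (in the real code via an allreduce), only the worker that owns the winning feature can evaluate it, so it encodes "this row goes left" as a bitmap and shares that with its peers.

```python
def local_best_split(columns, grad, hess, reg_lambda=1.0):
    """Find the best (gain, feature_id, threshold) over this worker's columns.

    `columns` maps a feature id to that feature's values for all rows.
    Gain is the standard second-order split gain with L2 regularization.
    """
    best = (float("-inf"), None, None)
    total_g, total_h = sum(grad), sum(hess)
    parent_score = total_g ** 2 / (total_h + reg_lambda)
    for fid, col in columns.items():
        order = sorted(range(len(col)), key=lambda i: col[i])
        gl = hl = 0.0
        for k in range(len(order) - 1):
            i = order[k]
            gl += grad[i]
            hl += hess[i]
            gr, hr = total_g - gl, total_h - hl
            gain = (gl ** 2 / (hl + reg_lambda)
                    + gr ** 2 / (hr + reg_lambda) - parent_score)
            thresh = (col[order[k]] + col[order[k + 1]]) / 2
            if gain > best[0]:
                best = (gain, fid, thresh)
    return best

def split_bitmap(col, threshold):
    """Bitmap of rows going left; this is what gets broadcast to peers."""
    bits = 0
    for i, v in enumerate(col):
        if v < threshold:
            bits |= 1 << i
    return bits
```

Workers that do not hold the winning column only need the bitmap to route their rows, which is why a bitmap (rather than feature values) is what gets synchronized.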
So the requirements for this task are:
The DistColMaker part makes sense to me. But after making the first tree, the predictions need to be updated so that the gradients/hessians are calculated using the leaf weights from the first tree.
This shows that PredictRaw() and DoBoost() are called one after the other. Hence, after the first tree is built, PredictRaw() should update preds_ using tree 1.
But since each node holds only a small subset of the columns, this cannot work: PredictRaw() traverses the whole tree against the full column data, so it would treat the columns not loaded on this machine as simply missing (and hence take the default direction) ...
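The dependency described above can be shown with a minimal gradient-boosting sketch (not xgboost code; names are illustrative): gradients for round t are computed from the running predictions, so if the predictions are not refreshed after round t-1, every round sees identical gradients and builds an identical tree.

```python
def fit_stumps(x, y, rounds=2, lr=1.0, refresh_preds=True):
    """Boost decision stumps on one feature with squared loss.

    With refresh_preds=False the PredictRaw-style update is skipped,
    so every round fits the same gradients and produces the same stump.
    """
    preds = [0.0] * len(y)
    trees = []
    for _ in range(rounds):
        grad = [p - t for p, t in zip(preds, y)]  # squared-loss gradient
        thr = 0.5                                  # fixed stump threshold
        left = [-g for g, xi in zip(grad, x) if xi < thr]
        right = [-g for g, xi in zip(grad, x) if xi >= thr]
        tree = (thr,
                lr * sum(left) / max(len(left), 1),
                lr * sum(right) / max(len(right), 1))
        trees.append(tree)
        if refresh_preds:  # the step PredictRaw() performs between rounds
            preds = [p + (tree[1] if xi < thr else tree[2])
                     for p, xi in zip(preds, x)]
    return trees
```

With the refresh enabled, the second stump's leaves shrink toward zero because the residuals have been absorbed; with it disabled, both stumps are identical, which is exactly the failure mode when predictions cannot be updated on column-partitioned workers.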
@tqchen How was this tackled before? I can't seem to find any code that handles this.
The new leaf value is handled by the GetLeafPosition function.
https://github.com/dmlc/xgboost/blob/6dabdd33e330bb0cda5b91c75516ff73d53ca316/src/gbm/gbtree.cc#L342
I removed this piece of code during the last refactor, but the logic can be added back when necessary. Specifically, GetLeafPosition gives the leaf position of each instance of the training matrix (which is obtained through the distributed sync procedure).
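The GetLeafPosition idea described above sidesteps the missing-columns problem: during training, each worker already knows (after the distributed sync) which leaf every training row landed in, so predictions can be updated from those positions directly, with no need to re-traverse the tree against columns this worker does not hold. A hypothetical sketch of that update step (illustrative names, not the xgboost API):

```python
def update_preds_from_leaf_positions(preds, leaf_positions, leaf_values,
                                     learning_rate=1.0):
    """Add each row's leaf value to its running prediction, in place.

    `leaf_positions[i]` is the node id of the leaf that training row i
    reached (known from training), and `leaf_values` maps node id to the
    leaf weight of the freshly built tree.
    """
    for i, nid in enumerate(leaf_positions):
        preds[i] += learning_rate * leaf_values[nid]
    return preds
```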
@tqchen I seem to need your help in debugging this again (sorry). So, I'm using the 0.6 code base for testing this feature.
Here, the second tree's split feature, split condition, and loss_chg are exactly the same as the first tree's. But the cover (sum_hessian) and leaf_value in the second tree are not the same as in the first tree. I found this odd, since the gain (loss_chg) should have changed when the hessians changed!
I printed the preds and the gpair in every iteration and they are updating correctly, so the buffered pred values logic works. But the split conditions look like they are being cached somewhere (?). I tried disabling cache_opt, but that did not help.
Any ideas on what I could look at to figure out why this is happening?
The split condition is not cached. Did you try it on the simple mushroom example? I have not encountered this before. One thing you can do is look at the history of the dist-col example folder and do a bit of binary search over the revision history to check whether there is a version that works, since there used to be one.
Hmm. When trying it on my own dataset (which has float columns), I can see that the second tree's root split condition and gain are the same as tree 1's root. In the agaricus example, only the gain of the root node in the second tree is the same (probably because the split condition does not matter anyway, since it's an indicator variable).
If I run agaricus for depth 4, 2 trees without distribution:
booster[0]:
0:[odor=pungent] yes=2,no=1,gain=4004.4,cover=6513
...
booster[1]:
0:[odor=pungent] yes=2,no=1,gain=3622.8,cover=6497.31
If I run agaricus for depth 4, 2 trees with column distribution:
booster[0]:
0:[odor=pungent] yes=2,no=1,gain=4004.4,cover=6513
...
booster[1]:
0:[odor=pungent] yes=2,no=1,gain=4004.4,cover=6497.31
(All other nodes' leaf values and splits match; I checked with a diff tool.) This is really unusual. It seems like:
Figured it out.
The snode needs to be cleared in updater_colmaker.cc#L175.
The stemp is cleared correctly, but snode is not being cleared. Because of that, snode[nid].best.Update() was comparing against the last tree's split condition.
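The stale-state bug described above can be reproduced in miniature (an illustrative sketch, not the actual xgboost structs): the per-node statistics (snode) keep a running "best split", and best.Update() only replaces it when the new candidate's gain is strictly larger. If snode is not re-initialized between trees, the second tree's candidates are compared against the first tree's best and lose.

```python
class NodeEntry:
    """Per-node statistics holding the best split found so far."""
    def __init__(self):
        self.best_gain = float("-inf")
        self.best_split = None

    def update(self, gain, split):
        # Only replace the stored best when the candidate gain is larger,
        # mirroring snode[nid].best.Update().
        if gain > self.best_gain:
            self.best_gain, self.best_split = gain, split

def find_root_split(candidates, snode, reset=True):
    if reset:  # the fix: clear snode at the start of each new tree
        snode[0] = NodeEntry()
    for gain, split in candidates:
        snode[0].update(gain, split)
    return snode[0].best_split

snode = {0: NodeEntry()}
tree1 = find_root_split([(4004.4, "odor=pungent")], snode)
# Without the reset, tree 2's genuinely best candidate (gain 3622.8) loses
# to the stale 4004.4 entry left over from tree 1, so tree 2's root split
# comes out identical to tree 1's root split.
tree2_stale = find_root_split([(3622.8, "odor=pungent@t2")], snode, reset=False)
tree2_fixed = find_root_split([(3622.8, "odor=pungent@t2")], snode, reset=True)
```

This matches the symptom in the dumps above: the root gain 4004.4 from booster[0] reappearing in booster[1] even though the cover had changed.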
Thanks, do you mind opening a PR for this?
Will do in a bit. I need to clean up lots of stuff first. Thanks for all the help debugging :)
I wonder if this work has been integrated yet?
It wasn't =( By the time I had completed it, there were too many changes in the DMatrix caching and related code that distributed columns depend on. I would have had to redo all the work, and there are definitely parts of the core I am not comfortable editing.
So I'm still using a version from over a year ago because of this...
It would be nice to have a method to train a model in distributed mode. Currently, the only way to distribute the training is by using colmaker, which loads the full data into memory on every node and only distributes the split computation.
It would be nice for the distributed colmaker to also load only chunks of the data (columnar chunks). This way, users could train models in a distributed fashion without binning.
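A small sketch of the columnar chunking being requested (an assumed layout, not an existing xgboost API): each worker would load only a contiguous slice of the columns, so no single node needs the full matrix in memory.

```python
def column_chunks(num_columns, num_workers):
    """Assign each worker a contiguous, near-even range of column ids.

    The first (num_columns % num_workers) workers get one extra column,
    so the assignment covers every column exactly once.
    """
    base, extra = divmod(num_columns, num_workers)
    chunks, start = [], 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks
```

Each worker would then load only its own range of columns from storage, and the column-split updater would operate on that slice while synchronizing splits with bitmaps as described earlier in the thread.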