Support outputting leaf indices for all the `predict*` functions

fredrikluo commented 3 years ago

Thanks to the popular paper https://research.fb.com/wp-content/uploads/2016/11/practical-lessons-from-predicting-clicks-on-ads-at-facebook.pdf

Many people use GBDT to extract features from a dataset instead of predicting results directly. The extracted features are the indices of the leaves that each estimator ends up in when it makes its decision.
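To make the idea concrete, here is a minimal, self-contained sketch of what "leaf indices as features" means. The `node` type, the two toy trees, and the helper names are all hypothetical and are not taken from leaves or LightGBM; real libraries store trees differently.

```go
package main

import "fmt"

// node is a hypothetical, minimal decision-tree node used only for illustration.
type node struct {
	feature   int     // feature to test (unused for leaves)
	threshold float64 // go left when fvals[feature] <= threshold
	left      int     // position of the left child in the tree slice
	right     int     // position of the right child in the tree slice
	leafIndex int     // >= 0 for leaves, -1 for internal nodes
}

// leafOf walks one tree from the root (position 0) and returns the
// index of the leaf that the sample lands in.
func leafOf(tree []node, fvals []float64) int {
	i := 0
	for tree[i].leafIndex < 0 {
		if fvals[tree[i].feature] <= tree[i].threshold {
			i = tree[i].left
		} else {
			i = tree[i].right
		}
	}
	return tree[i].leafIndex
}

// leafFeatures returns one leaf index per tree; this vector is the
// extracted feature representation of the sample.
func leafFeatures(trees [][]node, fvals []float64) []int {
	out := make([]int, len(trees))
	for t, tree := range trees {
		out[t] = leafOf(tree, fvals)
	}
	return out
}

// exampleTrees builds two toy trees with 2 and 3 leaves.
func exampleTrees() [][]node {
	treeA := []node{
		{0, 0.5, 1, 2, -1},
		{leafIndex: 0},
		{leafIndex: 1},
	}
	treeB := []node{
		{1, 3.0, 1, 2, -1},
		{leafIndex: 0},
		{0, 0.1, 3, 4, -1},
		{leafIndex: 1},
		{leafIndex: 2},
	}
	return [][]node{treeA, treeB}
}

func main() {
	fmt.Println(leafFeatures(exampleTrees(), []float64{0.2, 5.0})) // prints [0 2]
}
```

The sample lands in leaf 0 of the first tree and leaf 2 of the second, so its feature vector is `[0 2]` rather than a single predicted score.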

This is achieved in LightGBM by setting `pred_leaf`: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.predict

I am adding the same support in this pull request. Basically, the user can set this parameter, and the predict functions will then return the leaf indices.

The test data are generated by lightGBM to make sure that the indices are generated in the same way.

coveralls commented 3 years ago

Pull Request Test Coverage Report for Build 168

63 of 129 (48.84%) changed or added relevant lines in 9 files are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage decreased (-0.5%) to 68.18%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| misc.go | 0 | 6 | 0.0% |
| xgblinear.go | 0 | 6 | 0.0% |
| xgensemble_io.go | 10 | 16 | 62.5% |
| lgensemble.go | 13 | 20 | 65.0% |
| transformation/leaf_index.go | 6 | 16 | 37.5% |
| leaves.go | 19 | 33 | 57.58% |
| xgensemble.go | 9 | 26 | 34.62% |
| Total: | 63 | 129 | 48.84% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| xgensemble.go | 1 | 58.49% |
| Total: | 1 | |

Totals
Change from base Build 159:	-0.5%
Covered Lines:	1862
Relevant Lines:	2731

💛 - Coveralls

nikolaydubina commented 3 years ago

cc: @dmitryikh

dmitryikh commented 3 years ago

Hi! Thanks for your contribution and good job!

This feature should definitely land in leaves, but I have a few concerns:

1. leaves could change the node order for better cache locality (more frequently used children could be stored right after their parent nodes in memory). I need to check this to be sure that we return the original leaf ids.
2. I don't like the idea of extending the function signatures with predLeaf everywhere:
```
func (e *Ensemble) Predict(fvals []float64, nEstimators int, predictions []float64, predleaf bool) ([][]uint32, error) {
```
The signature becomes error-prone, and the rule "don't pay for what you don't use" is violated here.
I suggest treating leaf indices like raw output from the model. Then we can apply additional transformations on top of it (sum for GBDT, average for random forest, softmax, etc.), so the signature could stay as the original:
```
func (e *Ensemble) Predict(fvals []float64, nEstimators int, predictions []float64) error {
```
But when transformation = LeafIndexes, the predictions array is populated with leaf indices. I know there is a type problem: an index is not a float64, but float64 seems like a necessary compromise here. User code can convert it back to int32 without any loss of precision.

What do you think about it?
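A rough sketch of the proposal above, assuming a hypothetical `transform` interface and hypothetical `sumTransform` / `leafIndexTransform` types (these are illustrative names, not the actual leaves code). The prediction function keeps a single `[]float64` output, and the chosen transformation decides what goes into it. Storing indices as float64 is lossless because float64 represents every integer up to 2^53 exactly.

```go
package main

import "fmt"

// transform is a hypothetical interface: it turns the per-tree raw
// results into whatever the caller asked for.
type transform interface {
	apply(leafValues []float64, leafIndices []uint32, out []float64)
}

// sumTransform mimics the usual GBDT output: the sum of leaf values.
type sumTransform struct{}

func (sumTransform) apply(leafValues []float64, _ []uint32, out []float64) {
	out[0] = 0
	for _, v := range leafValues {
		out[0] += v
	}
}

// leafIndexTransform writes one leaf index per tree, stored as float64.
type leafIndexTransform struct{}

func (leafIndexTransform) apply(_ []float64, leafIndices []uint32, out []float64) {
	for i, idx := range leafIndices {
		out[i] = float64(idx)
	}
}

// predict has the same shape for every transformation; no extra
// boolean flag or second return value is needed.
func predict(t transform, leafValues []float64, leafIndices []uint32, predictions []float64) {
	t.apply(leafValues, leafIndices, predictions)
}

func main() {
	values := []float64{0.5, -0.25, 1.0} // leaf values reached in 3 trees
	indices := []uint32{3, 0, 7}         // leaf indices reached in 3 trees

	sum := make([]float64, 1)
	predict(sumTransform{}, values, indices, sum)

	leaves := make([]float64, len(indices))
	predict(leafIndexTransform{}, values, indices, leaves)

	// User code converts the float64 values back to integers.
	ids := make([]int32, len(leaves))
	for i, v := range leaves {
		ids[i] = int32(v)
	}
	fmt.Println(sum, ids)
}
```

Callers who only want scores never touch the leaf-index path, which is the "don't pay for what you don't use" point made above.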

fredrikluo commented 3 years ago

Yes, using a transformation seems to be a better idea.

The only thing the client code needs to be aware of is that the lengths of predictions and predictLeafIndices differ: the latter needs to be dimension * nEstimator. But we can check that in our code, so it shouldn't be a big deal.
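The length check mentioned here could look roughly like the sketch below. `requiredLen`, `checkPredictions`, and `errShortBuffer` are hypothetical names for illustration only; the sizing rule (dimension for ordinary predictions, dimension * nEstimators for leaf indices) is the one described in this comment.

```go
package main

import (
	"errors"
	"fmt"
)

// errShortBuffer is a hypothetical sentinel error for undersized buffers.
var errShortBuffer = errors.New("predictions slice is too short")

// requiredLen returns how many float64 values one sample needs:
// one per output group normally, or one per output group per
// estimator when leaf indices are requested.
func requiredLen(nOutputGroups, nEstimators int, leafIndices bool) int {
	if leafIndices {
		return nOutputGroups * nEstimators
	}
	return nOutputGroups
}

// checkPredictions reports an error when the caller's buffer cannot
// hold the output for a single sample.
func checkPredictions(predictions []float64, nOutputGroups, nEstimators int, leafIndices bool) error {
	need := requiredLen(nOutputGroups, nEstimators, leafIndices)
	if len(predictions) < need {
		return fmt.Errorf("%w: need %d, got %d", errShortBuffer, need, len(predictions))
	}
	return nil
}

func main() {
	buf := make([]float64, 3) // enough for 3 classes, not for 3 classes * 100 trees
	fmt.Println(checkPredictions(buf, 3, 100, false))
	fmt.Println(checkPredictions(buf, 3, 100, true))
}
```

Doing the check inside the library means a caller who reuses a buffer sized for ordinary predictions gets an error instead of an out-of-range panic.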

For 1, I think I have checked this; however, it has been a while, so I can double-check.

dmitryikh commented 3 years ago

@fredrikluo, if you don't mind, I will try to rework this PR based on the points above. In particular, I want to try using the transformation mechanism to support leaf indices.

fredrikluo commented 3 years ago

Absolutely, please go ahead

dmitryikh commented 3 years ago

@fredrikluo, can you please check the checkbox "Allow edits by maintainers" on the right. Thanks!

fredrikluo commented 3 years ago

Checked

dmitryikh commented 3 years ago

@fredrikluo, could you please check that the current implementation is suitable for your case? If it is OK, I will merge.

The entry point for working with leaf indices:

```
// EnsembleWithLeafPredictions returns an ensemble instance with TransformLeafIndex
// (it returns tree leaf indices instead of numerical values).
func (e *Ensemble) EnsembleWithLeafPredictions() *Ensemble {
    // each prediction will produce NRawOutputGroups() * NEstimators() values
    return &Ensemble{e, &transformation.TransformLeafIndex{e.NRawOutputGroups() * e.NEstimators()}}
}
```
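Once leaf indices come back as float64 values, downstream code in the style of the Facebook paper usually one-hot encodes them before feeding a linear model. The sketch below shows one way to do that; `oneHot` and the `leafCounts` argument are hypothetical, and it assumes the caller knows how many leaves each tree has and that the predictions slice holds one index per tree in tree order.

```go
package main

import "fmt"

// oneHot maps per-tree leaf indices (stored as float64) to the positions
// of the active columns in a concatenated one-hot vector.
// leafCounts[t] is the number of leaves in tree t.
func oneHot(predictions []float64, leafCounts []int) ([]int, error) {
	if len(predictions) != len(leafCounts) {
		return nil, fmt.Errorf("got %d predictions for %d trees", len(predictions), len(leafCounts))
	}
	active := make([]int, len(predictions))
	offset := 0
	for t, p := range predictions {
		idx := int(p) // lossless: the value was an integer index to begin with
		if idx < 0 || idx >= leafCounts[t] {
			return nil, fmt.Errorf("tree %d: leaf index %d out of range [0, %d)", t, idx, leafCounts[t])
		}
		active[t] = offset + idx
		offset += leafCounts[t]
	}
	return active, nil
}

func main() {
	// Three trees with 2, 3, and 4 leaves; the sample hit leaves 0, 2, and 1.
	cols, err := oneHot([]float64{0, 2, 1}, []int{2, 3, 4})
	fmt.Println(cols, err) // prints [0 4 6] <nil>
}
```

Returning only the active column positions keeps the representation sparse, since exactly one column per tree is set.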

fredrikluo commented 3 years ago

Yes, I tested in my code base, everything works fine, this looks awesome!

dmitryikh commented 3 years ago

Thank you for your contribution! Merged.

dmitryikh / leaves

Support outputting leaf indices for all the `predict*` functions #76
