dmitryikh / leaves

Pure Go implementation of the prediction part of GBRT (Gradient Boosting Regression Trees) models from popular frameworks
MIT License

Understanding the output of Predict #52

Closed NikEyX closed 5 years ago

NikEyX commented 5 years ago

Hi,

I'm not sure I fully understand the output of the Predict() methods.

I have a fully trained model with 9 classes and 100 estimators. I then run:

predictions := make([]float64, 9) // one entry per class
err = model.Predict(values, 100, predictions) // use all 100 estimators
util.SigmoidFloat64SliceInplace(predictions)
log.Infof("Prediction for %v:\n %v", values, predictions)

That yields:

Prediction for [110 0 12 0 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]: 
[0.2276 0.1822 0.2664 0.0594 0.0682 0.9859 0.1283 0.6349 0.0706]

I understand those are the probabilities for EACH of the 9 classes being the right one. However, how am I able to get the actual value of the class? In Python, if I do y_pred = model.predict(values), it correctly shows me the expected class values. For example, my class values look like this: 1242, 1152, 1552, 6662, etc. How can I map the prediction output above to the class values? I haven't provided any specific class ordering to the model.

NikEyX commented 5 years ago

I should note that in Python I can use model.classes_ to get the class values. I guess my question boils down to: how can I do this in your library (which seems awesome, btw)?

dmitryikh commented 5 years ago

@NikEyX, thanks for your interest in leaves, and sorry for the late response.

Unfortunately, you can't obtain this information from xgboost's binary model file, because it simply isn't stored there. Let me explain in detail:

  1. When you use XGBClassifier.fit in Python, it performs label encoding on y: say, labels 1242, 1152, 1552, 1242 are mapped to 0, 1, 2, 0 using sklearn's LabelEncoder. Only encoded labels like 0, 1, 2, ... reach the xgboost core library, so the resulting model can operate only on those labels.
  2. The warnings in XGBClassifier.save_model/load_model point at this too:
        The model is saved in an XGBoost internal binary format which is
        universal among the various XGBoost interfaces. Auxiliary attributes of
        the Python Booster object (such as feature names) will not be loaded.
        Label encodings (text labels to numeric labels) will be also lost.
        If you are using only the Python interface, we recommend pickling the
        model object for best results.

    So the Python xgboost bindings also lose the original class labels after save_model -> load_model operations.
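A possible workaround (this workflow and the file name classes.json are my assumptions, not part of the leaves API): export the class labels yourself from Python, e.g. json.dump(model.classes_.tolist(), open("classes.json", "w")), then load them in Go and map the argmax of the prediction vector back to the original label. A minimal sketch:

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// argmax returns the index of the largest value in preds.
func argmax(preds []float64) int {
	best := 0
	for i, v := range preds {
		if v > preds[best] {
			best = i
		}
	}
	return best
}

func main() {
	// classes.json is a hypothetical file exported from Python (see above);
	// its order must match the order of model.classes_.
	data, err := os.ReadFile("classes.json")
	if err != nil {
		panic(err)
	}
	var classes []int // e.g. [1152, 1242, 1552, ...]
	if err := json.Unmarshal(data, &classes); err != nil {
		panic(err)
	}

	// predictions as filled by model.Predict (after the softmax transform);
	// index i corresponds to classes[i].
	predictions := []float64{0.05, 0.10, 0.60, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03}
	fmt.Println("predicted class:", classes[argmax(predictions)])
}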

dmitryikh commented 5 years ago

btw, util.SigmoidFloat64SliceInplace is not what you want to use in the case of multiclass classification. There you would apply the softmax transformation to the raw tree values in order to obtain class probabilities. The sum of all class probabilities should then be 1.0 (a property of the softmax function).
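For reference, a minimal softmax sketch in Go (softmaxInplace is a hypothetical helper, not a leaves function), applied to the raw per-class tree sums:

package main

import (
	"fmt"
	"math"
)

// softmaxInplace replaces raw scores with probabilities that sum to 1.0.
// Subtracting the maximum first keeps math.Exp from overflowing.
func softmaxInplace(scores []float64) {
	maxVal := scores[0]
	for _, v := range scores[1:] {
		if v > maxVal {
			maxVal = v
		}
	}
	sum := 0.0
	for i, v := range scores {
		scores[i] = math.Exp(v - maxVal)
		sum += scores[i]
	}
	for i := range scores {
		scores[i] /= sum
	}
}

func main() {
	raw := []float64{-1.2, 0.3, 2.1} // raw tree outputs, one per class
	softmaxInplace(raw)
	fmt.Println(raw) // probabilities summing to 1.0
}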

Currently I'm developing an update for leaves that makes it possible to apply transformations to the raw tree results (sigmoid for binary classification, softmax for multiclass classification, LambdaRank for ranking problems, and so on). Stay tuned!

NikEyX commented 5 years ago

good to know, thanks for the updates! Love your work, keep it up!