dmitryikh / leaves

pure Go implementation of prediction part for GBRT (Gradient Boosting Regression Trees) models from popular frameworks
MIT License

Completely incorrect predictions when a model trained with Python xgboost is loaded and used for prediction with leaves #69

Open luowencai opened 4 years ago

luowencai commented 4 years ago

We use Spark to generate a libsvm file, then load it in Python with sklearn, train a model with xgboost, and save it. Finally, we load the model with leaves and predict. The predictions from the Python demo and from Go are completely different. We would like to know whether leaves does not support this case or whether we are using leaves incorrectly. The Python code looks like this:

from sklearn.datasets import load_svmlight_file
from xgboost import XGBClassifier

my_workpath = 'D:\\project\\py\\train_demo\\'
X_train, y_train = load_svmlight_file(my_workpath + 'train')
X_test, y_test = load_svmlight_file(my_workpath + 'validation')
bst = XGBClassifier()
bst.fit(X_train, y_train)
bst.save_model(my_workpath + "train_model")
# Keep only the probability of the positive class (column 1) for each row
train_preds = [x[1] for x in bst.predict_proba(X_train)]
test_preds = [x[1] for x in bst.predict_proba(X_test)]

The Go code looks like this:

// Load the XGBoost model together with its output transformation
model, e := leaves.XGEnsembleFromFile(model_path, true)
if e != nil {
    log.Fatal(e)
}
if model.Transformation().Type() != transformation.Logistic {
    log.Fatalf("expected TransformType = Logistic (got %s)", model.Transformation().Name())
}
// Read the validation set in libsvm format (no row limit, labels present)
csr, err := mat.CSRMatFromLibsvmFile(validate_path, 0, true)
if err != nil {
    log.Fatal(err)
}
predictions := make([]float64, csr.Rows()*model.NOutputGroups())
e = model.PredictCSR(csr.RowHeaders, csr.ColIndexes, csr.Values, predictions, 50, 5)
if e != nil {
    log.Fatal(e)
}
fmt.Printf("Prediction for %v\n", predictions)
dmitryikh commented 4 years ago

Hello! Thanks for your report.

e = model.PredictCSR(csr.RowHeaders, csr.ColIndexes, csr.Values, predictions, 50, 5)

Why do you use only 50 trees to predict? Try using all trees in the ensemble, as in the Python script.

Also, if you can provide your train and test files, I can check the case precisely.
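
To illustrate why the number of trees matters, here is a minimal toy sketch in plain Python. It is not the leaves or xgboost implementation: the function `predict_proba`, the `base_margin` parameter, and the leaf values are all hypothetical and chosen only for illustration. It assumes a binary logistic model whose raw score is the sum of one leaf value per tree, passed through a sigmoid.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function that maps a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_proba(leaf_values, n_estimators=0, base_margin=0.0):
    """Toy boosted-tree prediction for a single sample.

    leaf_values[i] is the (hypothetical) leaf value the sample reaches in tree i.
    n_estimators <= 0, or larger than the ensemble, means "use all trees".
    """
    if n_estimators <= 0 or n_estimators > len(leaf_values):
        n_estimators = len(leaf_values)
    margin = base_margin + sum(leaf_values[:n_estimators])
    return sigmoid(margin)


# Made-up leaf values for one sample in a 100-tree ensemble (illustration only)
leaf_values = [0.05] * 100

p_all = predict_proba(leaf_values)      # sums all 100 trees
p_50 = predict_proba(leaf_values, 50)   # sums only the first 50 trees
print(p_all, p_50)
```

In this toy setup the two probabilities differ noticeably, so a model trained with more than 50 trees and predicted with only 50 would not be expected to match the Python output.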

luowencai commented 4 years ago

Sorry, the Python code I posted was wrong. Here is the Python code we actually use:

from sklearn.datasets import load_svmlight_file
from xgboost import XGBClassifier

class train_classifier:
    bst = XGBClassifier(max_depth=8, n_estimators=50, learning_rate=0.1, silent=False, objective='binary:logistic',
                        min_child_weight=3, gamma=0, scale_pos_weight=45.1193405554875, subsample=0.9,
                        colsample_bytree=0.6, reg_alpha=3, reg_lambda=3, verbose=False)
    my_workpath = 'D:\\project\\py\\train_demo\\'

    def __init__(self):
        self.bst.load_model(self.my_workpath + "train_model")

    def train(self, train_path='train'):
        X_train, y_train = load_svmlight_file(self.my_workpath + train_path)
        self.bst.fit(X_train, y_train)
        self.bst.save_model(self.my_workpath + "train_model")

    def test_predict(self, test_file='validation'):
        X_test, y_test = load_svmlight_file(self.my_workpath + test_file)
        return [x[1] for x in self.bst.predict_proba(X_test)]

Here are the prediction results from running the Python predict and the Go predict_csr: predict_result.zip
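
One way to quantify how far the two sets of predictions diverge is an element-wise comparison with a tolerance. The sketch below is only an assumption about how such a check could look: the helper `compare_predictions`, the tolerance `atol`, and the sample numbers are hypothetical and are not taken from predict_result.zip.

```python
def compare_predictions(py_preds, go_preds, atol=1e-6):
    """Return (max absolute difference, indexes of rows that differ by more than atol)."""
    if len(py_preds) != len(go_preds):
        raise ValueError(
            f"length mismatch: {len(py_preds)} Python vs {len(go_preds)} Go predictions"
        )
    diffs = [abs(a - b) for a, b in zip(py_preds, go_preds)]
    mismatches = [i for i, d in enumerate(diffs) if d > atol]
    return max(diffs, default=0.0), mismatches


# Made-up values for illustration only, not real results
py_preds = [0.10, 0.90, 0.35]
go_preds = [0.10, 0.50, 0.35]

max_diff, bad_rows = compare_predictions(py_preds, go_preds)
print(f"max abs diff = {max_diff:.6f}, mismatching rows = {bad_rows}")
```

Reporting the indexes of mismatching rows, rather than only a pass/fail flag, makes it easier to check whether every row is off or only rows with particular features.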