getml / getml-community

Fast, high-quality forecasts on relational and multivariate time-series data powered by new feature learning algorithms and automated ML.
https://getml.com

XGBoostClassifier multiclass objective #9

Open paulbir opened 1 month ago

paulbir commented 1 month ago

Currently "objective" parameter for the XGBoostClassifier is limited to "reg:squarederror", "reg:tweedie", "reg:linear", "reg:logistic", "binary:logistic", "binary:logitraw". And these values are even forced with validation code:

        if kkey == "objective":
            if not isinstance(parameters["objective"], str):
                raise TypeError("'objective' must be of type str")
            if parameters["objective"] not in [
                "reg:squarederror",
                "reg:tweedie",
                "reg:linear",
                "reg:logistic",
                "binary:logistic",
                "binary:logitraw",
            ]:
                raise ValueError(
                    """'objective' must either be 'reg:squarederror', """
                    """'reg:tweedie', 'reg:linear', 'reg:logistic', """
                    """'binary:logistic', or 'binary:logitraw'"""
                )

This code was clearly added at a time when XGBoost already supported multiclass classification. So why can't I use an objective like "multi:softmax"? Or is there some workaround for multiclass classification?

liuzicheng1987 commented 1 month ago

Hi @paulbir,

the issue is not so much XGBoost itself, but the feature learning algorithms: it can be very tricky to build features for very high-dimensional targets.

You can use getml.data.make_target_columns(…), as demonstrated in the CORA notebook:

https://nbviewer.org/github/getml/getml-demo/blob/master/cora.ipynb
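
For reference, here is a minimal sketch of what that looks like. The CSV file, the data frame name, and the column name "label" are purely illustrative:

```python
import getml

# assume the categorical column "label" holds the class of each row
df = getml.data.DataFrame.from_csv("population.csv", name="population")

# make_target_columns splits the categorical column into one binary
# target column per class, which enables the 1-vs-all setup
df_multi = getml.data.make_target_columns(df, "label")
```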

paulbir commented 1 month ago

Hi @liuzicheng1987 , thanks for your reply.

I have a target with 3 classes. This is a multiclass problem, so the objective should be multi:softmax or multi:softprob, but only binary targets are allowed.

srnnkls commented 1 month ago

Thanks for your question, @paulbir.

You can just materialize your features using transform, use the native XGBoost Python API, and construct the DMatrix from the NumPy arrays (or pandas DataFrames) returned by transform. This way, you can bring any ML algorithm you want.
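
A minimal sketch of that workflow, assuming `pipe` is an already fitted getml pipeline, `container` holds the train/test split, and `y_train`/`y_test` are the integer-encoded class labels (0, 1, 2):

```python
import xgboost as xgb

# materialize the learned features as flat arrays
X_train = pipe.transform(container.train)
X_test = pipe.transform(container.test)

# build DMatrices from the materialized features and the class labels
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# train with a multiclass objective on the getml features
params = {"objective": "multi:softmax", "num_class": 3}
booster = xgb.train(params, dtrain, num_boost_round=100)

preds = booster.predict(dtest)  # predicted class indices
```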

paulbir commented 1 month ago

Hi @srnnkls. My goal is to create new features using the Relboost method. In all the notebook examples I can see the pipeline being created like this:

pipe = getml.pipeline.Pipeline(
    data_model=time_series.data_model,
    tags=["memory=15", "logistic regression"],
    feature_learners=[feature_learner],
    predictors=[predictor],
)

The docs do not explicitly state whether the predictor is necessary for the feature engineering itself, so I usually set it to getml.predictors.XGBoostClassifier, assuming it is needed in some intermediate step internally.

But what I am wondering now is: do I really need to set the predictors parameter if all I want is feature engineering?

liuzicheng1987 commented 1 month ago

@paulbir , no, if all you are interested in are the features, you don't really need predictors. It's nice to have predictors for things such as calculating feature importances, but they are not necessary for the feature engineering.
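
For example, a pipeline used for feature engineering only could look roughly like this (reusing the time_series and container names from the snippet above; the Relboost settings are illustrative):

```python
import getml

# a feature learner only – no predictors are needed for feature engineering
feature_learner = getml.feature_learning.Relboost(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
)

pipe = getml.pipeline.Pipeline(
    data_model=time_series.data_model,
    feature_learners=[feature_learner],
    # predictors are omitted – only the features are of interest
)

pipe.fit(container.train)

# materialize the learned features
features = pipe.transform(container.train)
```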

paulbir commented 1 month ago

@liuzicheng1987 thanks. So I have no issues with predictors anymore then.

Jogala commented 1 month ago

Let us define the number of class labels as L.

@paulbir just to clarify:

You can do multiclass classification using getml and getml.predictors.XGBoostClassifier(objective="binary:logistic"). What getml does is build a set of features for each class label and train a separate predictor for each of them, i.e. it performs a 1-vs-all approach for each class label. If you call v = pipe.predict(container.test) on such a pipeline, then v_ij is the probability that the i-th row belongs to label j and not to any other label. This procedure requires declaring multiple targets, as @liuzicheng1987 mentioned in his answer, using the make_target_columns function.
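
To turn those per-class probabilities into a single predicted label, you can take the argmax over the columns; a minimal sketch, assuming `pipe` and `container` as above:

```python
import numpy as np

# one probability column per class label (1-vs-all)
v = pipe.predict(container.test)        # shape: (n_rows, L)

# pick the most probable class for each row
predicted_class = np.argmax(v, axis=1)
```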

You can also follow @srnnkls's approach: use the getml pipeline to construct the features and then call pipe.transform(container.test) to generate the flat population table containing the combined L feature sets. Following that, you can train an XGBoost predictor using `"objective": "multi:softmax"`. Note, however, that you cannot use a new split! You have to use the train/test partition that you used for generating the features in the first place; otherwise you will leak information from the training data into the test data.

An example can be found here: https://github.com/Jogala/cora under scripts/ml_all.py

Note that using the L-times 1-vs-all approach, I achieved slightly better results on that example. Overall, we outperform the best predictor in this ranking: https://paperswithcode.com/sota/node-classification-on-cora. Using the split of the leading paper (accuracy = 90.16%), we reach 91%.

If you have further questions, you can also drop me an email.