dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Parameters might not be used. But aren't used #7419

Closed · gminghes closed 2 years ago

gminghes commented 2 years ago

I'm trying to do a grid search with cross-validation using xgboost and sklearn. When I start the cross-validation, XGBoost issues a warning saying that some parameters might not be used, but when I print the parameters used in the search, the flagged parameters are not there.

I'm using Python 3.9.1 and XGBoost 1.5.0.

code:

import xgboost as xgb
from scipy.stats import beta, expon, loguniform, randint, truncnorm, uniform
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline

# TF-IDF features -> variance-based feature selection -> one XGBoost classifier per label
pipe = Pipeline(
    [
        ("vectorizer", TfidfVectorizer()),
        ("feature_selection", VarianceThreshold()),
        (
            "classifier",
            BinaryRelevance(
                classifier=xgb.XGBClassifier(
                    n_jobs=-1, objective="binary:logistic", eval_metric="logloss"
                )
            ),
        ),
    ]
)

param_grid = [
    {
        "classifier__classifier__booster": ["gbtree"],
        "classifier__classifier__eta": uniform(0, 1),
        "classifier__classifier__gamma": expon(),
        "classifier__classifier__max_depth": randint(1, 15),
        "classifier__classifier__max_delta_step": expon(),
        "classifier__classifier__subsample": uniform(1e-2, 1),
        "classifier__classifier__sapling_method": ["uniform", "gradient_based"],
        "classifier__classifier__colsample_by_*": beta(5, 0.5),
        "classifier__classifier__lambda": beta(5, 0.5),
        "classifier__classifier__alpha": beta(5, 0.5),
        "classifier__classifier__tree_method": ["auto"],
        "classifier__classifier__scale_pos_weight": truncnorm(0, 10, loc=1, scale=1),
        "feature_selection__threshold": uniform(0, threshold),
    },
    {
        "classifier__classifier__booster": ["dart"],
        "classifier__classifier__eta": uniform(0, 1),
        "classifier__classifier__gamma": expon(),
        "classifier__classifier__max_depth": randint(1, 15),
        "classifier__classifier__max_delta_step": expon(),
        "classifier__classifier__subsample": uniform(1e-2, 1),
        "classifier__classifier__lambda": beta(5, 0.5),
        "classifier__classifier__alpha": beta(5, 0.5),
        "classifier__classifier__tree_method": ["auto"],
        "classifier__classifier__scale_pos_weight": truncnorm(0, 10, loc=1, scale=1),
        "classifier__classifier__sample_type": ["uniform", "weighted"],
        "classifier__classifier__normalize_type": ["tree", "forest"],
        "classifier__classifier__rate_drop": loguniform(0.001, 1),
        "classifier__classifier__skip_drop": loguniform(0.001, 0.5),
        "feature_selection__threshold": uniform(0, threshold),
    },
    {
        "classifier__classifier__booster": ["gblinear"],
        "classifier__classifier__lambda": beta(5, 0.5),
        "classifier__classifier__alpha": beta(5, 0.5),
        "feature_selection__threshold": uniform(0, threshold),
    }
]

from pprint import pprint

from skmultilearn.model_selection import IterativeStratification
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import ParameterSampler
from sklearn.preprocessing import MultiLabelBinarizer

n_folds = 3
n_iter = 1
k_fold = IterativeStratification(n_splits=n_folds, order=2)
X = train_data.preprocessed_sent
binarizer = MultiLabelBinarizer().fit(train_data.labels.apply(eval))
y = binarizer.transform(train_data.labels.apply(eval))
parameters_results = []

parameters_grid = ParameterSampler(param_distributions=param_grid, n_iter=n_iter)

for iteration_number, parameters in enumerate(parameters_grid):
    pprint(parameters)
    pipe.set_params(**parameters)
    f1s = []

    # stratified multilabel train/test splits
    for train, test in k_fold.split(X, y):
        X_train, y_train = X.iloc[train], y[train]
        X_eval, y_eval = X.iloc[test], y[test]

        pipe.fit(X=X_train, y=y_train)
        result = pipe.predict(X_eval)

        precision, recall, f_beta, _ = precision_recall_fscore_support(
            y_eval, result, average="samples"
        )
        parameters_results.append(
            {
                **parameters,
                "precision": precision,
                "recall": recall,
                "F-1": f_beta,
                "param_set": iteration_number,
            }
        )

Warning example:

{'classifier__classifier__alpha': 0.9999570798095335,
 'classifier__classifier__booster': 'gbtree',
 'classifier__classifier__colsample_by_*': 0.9136728483790303,
 'classifier__classifier__eta': 0.5128329125807101,
 'classifier__classifier__gamma': 0.09585631506731496,
 'classifier__classifier__lambda': 0.9966972829758989,
 'classifier__classifier__max_delta_step': 0.021031782976528482,
 'classifier__classifier__max_depth': 5,
 'classifier__classifier__sapling_method': 'uniform',
 'classifier__classifier__scale_pos_weight': 1.1004394756331664,
 'classifier__classifier__subsample': 0.09274633032403766,
 'classifier__classifier__tree_method': 'auto',
 'feature_selection__threshold': 0.0006301128939330729}

[16:46:07] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.0/src/learner.cc:576: 
Parameters: { "colsample_by_*", "sapling_method" } might not be used.
  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.

As you can see, the parameters flagged by the warning are not present in the grid search. What could cause this behaviour?
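For reference, this kind of warning can be reproduced outside the pipeline by passing an unrecognised keyword argument to XGBClassifier: the sklearn wrapper forwards unknown kwargs to the XGBoost core learner, which flags them at fit time. A minimal sketch on synthetic data (the toy dataset and the deliberate misspelling below are only for illustration):

    import numpy as np
    import xgboost as xgb

    # Toy data, only to make fit() run; not the original corpus.
    X = np.random.rand(32, 4)
    y = np.random.randint(0, 2, size=32)

    # "sapling_method" is deliberately misspelled: the wrapper does not recognise it,
    # so it is forwarded to the core learner, which prints
    # 'Parameters: { "sapling_method" } might not be used.' during fit().
    clf = xgb.XGBClassifier(eval_metric="logloss", sapling_method="uniform")
    clf.fit(X, y)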

trivialfis commented 2 years ago

Hi, could you please share a self-contained, executable script?

gminghes commented 2 years ago

I can't provide the data I'm working on. Trying the same script on Google Colab with the fetch_20newsgroups dataset from sklearn, I don't get the warning.

trivialfis commented 2 years ago

I saw that you have something like this in your code:

        "classifier__classifier__sapling_method": ["uniform", "gradient_based"],
        "classifier__classifier__colsample_by_*": beta(5, 0.5),

Could you please verify that's the cause of the warning?
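For comparison, a sketch of how those two entries would probably need to be spelled, assuming the intent was XGBoost's sampling_method parameter and the concrete column-subsampling parameters (the documentation shorthand colsample_by* stands for colsample_bytree, colsample_bylevel and colsample_bynode):

        "classifier__classifier__sampling_method": ["uniform", "gradient_based"],
        "classifier__classifier__colsample_bytree": beta(5, 0.5),
        "classifier__classifier__colsample_bylevel": beta(5, 0.5),
        "classifier__classifier__colsample_bynode": beta(5, 0.5),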

trivialfis commented 2 years ago

Also, gradient_based is only available for gpu_hist tree method at the moment.
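In other words, if gradient-based sampling is wanted, the gbtree entry of the grid would also have to pin the GPU tree method; a hedged sketch of that combination:

    # sampling_method="gradient_based" is only supported together with
    # tree_method="gpu_hist" (as of XGBoost 1.5); the docs note it also
    # works with much smaller subsample values than uniform sampling.
    clf = xgb.XGBClassifier(
        tree_method="gpu_hist",
        sampling_method="gradient_based",
        subsample=0.1,
        objective="binary:logistic",
        eval_metric="logloss",
    )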

trivialfis commented 2 years ago

I believe the cause is found.