dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

using test set as validation set #9235

Closed aresreact closed 1 year ago

aresreact commented 1 year ago

While doing walk-forward validation to assess the performance of my model, I specified the test set as the validation set in the following way:

    booster = xgb.train(
        params,
        dmat_train,
        evals=[(dmat_train, "train"), (dmat_test, "test")],
        verbose_eval=False,
        num_boost_round=num_boost_round,
    )
    preds = booster.predict(dmat_test)

thinking that it would have no effect on the fitted model and would only be used for metric monitoring during training.
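(Aside for readers unfamiliar with the setup: walk-forward validation repeatedly trains on an expanding window of past observations and evaluates on the slice that follows. A minimal sketch of such a loop using scikit-learn's TimeSeriesSplit, with synthetic stand-in data since the poster's actual data and parameters are not shown:)

    import numpy as np
    import xgboost as xgb
    from sklearn.model_selection import TimeSeriesSplit

    # Synthetic stand-in data; shapes and params here are hypothetical.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 8))
    y = rng.normal(size=1000)
    params = {"tree_method": "hist"}
    num_boost_round = 50

    # Each split trains on all data up to a cut-off and tests on the next slice.
    for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
        dmat_train = xgb.DMatrix(X[train_idx], y[train_idx])
        dmat_test = xgb.DMatrix(X[test_idx], y[test_idx])
        booster = xgb.train(params, dmat_train, num_boost_round=num_boost_round)
        preds = booster.predict(dmat_test)  # scored per fold downstream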

However, when I leave the test set out of `evals`:

    booster = xgb.train(
        params,
        dmat_train,
        evals=[(dmat_train, "train")],
        verbose_eval=False,
        num_boost_round=num_boost_round,
    )
    preds = booster.predict(dmat_test)

the performance is significantly worse.

I guess the second way is the right one. But why? How does specifying the test set as a validation set actually lead to overfitting in XGBoost? Note that I do not do any hyper-parameter optimization.
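(For context on the mechanism in question: on its own, the `evals` list only drives metric logging and the optional `evals_result` dict; it influences training only when `early_stopping_rounds` is also passed, in which case XGBoost stops boosting based on the last dataset in `evals`. A hedged sketch of that one interacting case, reusing `params`, `dmat_train`, and `dmat_test` from the snippets above:)

    # With early_stopping_rounds set, the last entry in evals ("test" here)
    # decides when boosting stops -- this is the one way the evals list can
    # change the fitted model.
    booster = xgb.train(
        params,
        dmat_train,
        evals=[(dmat_train, "train"), (dmat_test, "test")],
        early_stopping_rounds=10,
        verbose_eval=False,
        num_boost_round=num_boost_round,
    )
    # Use only the trees up to the best round when predicting.
    preds = booster.predict(
        dmat_test, iteration_range=(0, booster.best_iteration + 1)
    )

(Stopping on the test set and then reporting test metrics would indeed leak information, but only when early stopping is actually enabled, which is not the case in the snippets above.)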

trivialfis commented 1 year ago

Could you please share the parameters that you are using?

trivialfis commented 1 year ago

I couldn't reproduce this using master or 1.7.5. Here is my simple script:

import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def main() -> None:
    X, y = make_regression(n_samples=4096, n_features=32, random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    Xy_train = xgb.DMatrix(X_train, y_train)
    Xy_test = xgb.DMatrix(X_test, y_test)
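    # First run: include the test set in evals so its metric is logged each round.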
    booster = xgb.train(
        {"tree_method": "hist"},
        Xy_train,
        num_boost_round=10,
        evals=[(Xy_train, "Train"), (Xy_test, "Valid")],
    )
    predt = booster.predict(Xy_test)
    error = mean_squared_error(y_test, predt)
    print("error:", error)

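    # Second run: identical setup without the test set in evals.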
    booster = xgb.train(
        {"tree_method": "hist"},
        Xy_train,
        num_boost_round=10,
        evals=[(Xy_train, "Train")],
    )
    predt = booster.predict(Xy_test)
    error = mean_squared_error(y_test, predt)
    print("error:", error)

if __name__ == "__main__":
    main()
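(A sanity check beyond the script above, not part of the original reply: when early stopping is not requested, training twice on identical data, with and without an evals list, should produce the same trees. A minimal sketch:)

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=4096, n_features=32, random_state=1)
Xy = xgb.DMatrix(X, y)

# Same data and params; the only difference is the evals list.
with_evals = xgb.train(
    {"tree_method": "hist"}, Xy, num_boost_round=10,
    evals=[(Xy, "Train")], verbose_eval=False,
)
without_evals = xgb.train({"tree_method": "hist"}, Xy, num_boost_round=10)

# Without early stopping, the evals list is logging-only, so predictions
# from the two boosters should match.
assert np.allclose(with_evals.predict(Xy), without_evals.predict(Xy))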