Could you please share the parameters that you are using?
I couldn't reproduce using master or 1.7.5. Here is my simple script:
```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def main() -> None:
    X, y = make_regression(n_samples=4096, n_features=32, random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    Xy_train = xgb.DMatrix(X_train, y_train)
    Xy_test = xgb.DMatrix(X_test, y_test)

    # First run: the test set is included in `evals` for monitoring.
    booster = xgb.train(
        {"tree_method": "hist"},
        Xy_train,
        num_boost_round=10,
        evals=[(Xy_train, "Train"), (Xy_test, "Valid")],
    )
    predt = booster.predict(Xy_test)
    error = mean_squared_error(y_test, predt)
    print("error:", error)

    # Second run: only the training set is monitored.
    booster = xgb.train(
        {"tree_method": "hist"},
        Xy_train,
        num_boost_round=10,
        evals=[(Xy_train, "Train")],
    )
    predt = booster.predict(Xy_test)
    error = mean_squared_error(y_test, predt)
    print("error:", error)


if __name__ == "__main__":
    main()
```
While doing walk-forward validation to assess the performance of my model, I specified the test set as the validation set, thinking that it would have no effect and would only be used for metric monitoring during training.
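Per fold, the training call looked roughly like the sketch below (a minimal sketch only: the synthetic data and the bare `{"tree_method": "hist"}` parameters are placeholders for my actual folds and settings):

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Placeholder data standing in for one walk-forward fold.
X, y = make_regression(n_samples=4096, n_features=32, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# The test fold is passed to `evals` as the validation set.
booster = xgb.train(
    {"tree_method": "hist"},  # placeholder parameters
    dtrain,
    num_boost_round=10,
    evals=[(dtrain, "Train"), (dtest, "Valid")],
)
predt = booster.predict(dtest)
```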
However, when I do not do it, i.e. when only the training set is passed to `evals`, performance is significantly worse.
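That variant, continuing the sketch above, is simply:

```python
# Same fold as above, but the test set is not passed to `evals`;
# only the training set is monitored during boosting.
booster = xgb.train(
    {"tree_method": "hist"},  # placeholder parameters
    dtrain,
    num_boost_round=10,
    evals=[(dtrain, "Train")],
)
predt = booster.predict(dtest)
```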
I guess the second way of doing it is the right one. But why? How does specifying the test set as the validation set actually lead to overfitting in XGBoost? Note that I do not do any hyper-parameter optimization.