dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

training error bigger than testing (validation) error #5367

Closed haenphe closed 4 years ago

haenphe commented 4 years ago

I'm building an XGBRegressor() model to do time series forecasting with 96 rows of data. After tuning the model with grid search, the testing (validation) error rarely exceeds the training error, i.e. the training error is usually the larger of the two. The evaluation metric I use is RMSE. Can anyone tell me what I did wrong with my model and what I should do?

This is my code:

import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.2, 0.4, 0.6, 0.8],
    'colsample_bytree': [0.2, 0.4, 0.6, 0.8],
    'n_estimators': [1, 10, 100],
    'max_depth': [3, 4, 5, 6, 7, 8],
    'gamma': [0, 0.1, 0.2],
    'reg_alpha': [0, 0.001, 0.002],
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(objective="reg:squarederror")

# timeseries CV
tscv = TimeSeriesSplit(n_splits=4)

# Perform grid search: grid_mse
grid_mse = GridSearchCV(estimator=gbm, param_grid=gbm_param_grid,
                        scoring='neg_mean_squared_error', cv=tscv, verbose=1)
grid_mse.fit(X_train, y_train)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))
trivialfis commented 4 years ago

Probably nothing is wrong.

trivialfis commented 4 years ago

Closing; feel free to re-open if there's something more concrete we can work with. If you are suspicious about the model, you can dump it out with Booster.get_dump.
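For example, a minimal sketch of that suggestion, assuming grid_mse is the fitted GridSearchCV from the post above:

# Pull the underlying Booster out of the tuned sklearn wrapper and
# dump each tree as text for manual inspection.
booster = grid_mse.best_estimator_.get_booster()
for i, tree in enumerate(booster.get_dump()):
    print("booster[{}]".format(i))
    print(tree)

get_dump() returns one text string per tree, so an unexpectedly deep or large ensemble can be spotted at a glance.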