NGYB / Stocks

Programs for stock prediction and evaluation
Apache License 2.0

XGBoost: Shouldn't data source of model.predict is from past data? #4

Closed viper7882 closed 3 years ago

viper7882 commented 4 years ago

Hi @yibinng,

I love reading your articles and your publication of time series prediction codes. All of them are wonderful and I enjoy playing with your projects very much.

While I was experimenting with XGBoost Stock Prediction, I noticed that in the Final Model section, model.predict takes X_sample_scaled as its input. If you trace the assignment of X_sample_scaled (= test_scaled[features]), it is the feature set of the Test dataframe, whose associated label is exactly what we intend to predict.

Since test_scaled[features] is derived from lag data, it is generated from the Test dataframe, i.e. data that lies in the future relative to the dev/cv dataframe. This would explain why the predicted series lags the actual series by a few cycles.

One way to confirm this issue is to remove X_sample_scaled from the equation and rely solely on y_sample (= test[target]) as the golden result.

The fix is simple: in the final model, replace X_sample_scaled with X_cv_scaled, where X_cv_scaled is the present dataframe that exists before the Test dataframe would be created in a real scenario.

With this change, I see a prediction that is similar to the expected y_sample, but with larger RMSE and MAPE values.

Please let me know if you think otherwise.

NGYB commented 4 years ago

Hi @viper7882,

Thanks for your comments. They are very encouraging and I am happy that you found the projects fun.

I read through your comments carefully, and I believe you are referring to this code in the Final model section:

rmse, mape, est = train_pred_eval_model(
    X_train_cv_scaled, y_train_cv_scaled,
    X_sample_scaled, y_sample,
    test['adj_close_mean'], test['adj_close_std'],
    seed=model_seed,
    n_estimators=n_estimators_opt,
    max_depth=max_depth_opt,
    learning_rate=learning_rate_opt,
    min_child_weight=min_child_weight_opt,
    subsample=subsample_opt,
    colsample_bytree=colsample_bytree_opt,
    colsample_bylevel=colsample_bylevel_opt,
    gamma=gamma_opt)

And you suggested that I replace X_sample_scaled with X_cv_scaled. I don't agree that this should be the case.

Firstly, please note that the model is intended to predict the value for day t (note that we are doing a one-day prediction, not a multi-day prediction), based on values from days t-1, t-2, and so on. These lag features for days t-1, t-2, ... are contained in X_sample_scaled (= test_scaled[features]). So we are passing the right thing in.
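To make this concrete, here is a minimal toy sketch of how lag features work (the column names and the number of lags are made up for illustration; the notebook builds its own feature set). The point is that the feature row for day t only contains prices from days t-1, t-2, ..., so predicting on the test rows uses only values that are already known at prediction time:

```python
import pandas as pd

# Hypothetical price series; the repo uses adj_close from real data.
df = pd.DataFrame({'adj_close': [10.0, 11.0, 12.0, 13.0, 14.0]})

N = 2  # number of lags (illustrative only)
for lag in range(1, N + 1):
    # lag_k on row t holds the price from day t-k
    df[f'lag_{lag}'] = df['adj_close'].shift(lag)

features = [f'lag_{lag}' for lag in range(1, N + 1)]
# The last row (think of it as a test-set day t) contains only past prices:
print(df[features].iloc[-1].tolist())  # [13.0, 12.0]
```

So even though these rows live in the Test dataframe, the feature values themselves come from earlier days.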

Secondly, if you pass in X_cv_scaled, you are doing model.predict(X_cv_scaled) and obtaining predictions for the dev/cv period. But we are not interested in predictions for the dev/cv period in the 'Final model' section; we want predictions for X_sample_scaled, i.e. for the test period.
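A tiny sketch of the alignment issue, using a toy stand-in for the trained model (not the repo's XGBoost estimator): each feature matrix has one row per date in its period, so the output of predict is aligned with whatever period you feed in.

```python
import numpy as np

# Toy "model": predicts day t's price as the lag-1 feature in column 0.
def predict(X):
    return X[:, 0]

# Hypothetical lag-1 features, one row per date in each period.
X_cv_scaled = np.array([[100.0], [101.0]])      # rows = dev/cv dates
X_sample_scaled = np.array([[102.0], [103.0]])  # rows = test dates

# predict(X_cv_scaled) would yield estimates for the cv dates, which
# cannot be compared against y_sample (= test[target]). To score the
# test period, we must pass the test-period feature rows:
preds = predict(X_sample_scaled)
print(preds.tolist())  # [102.0, 103.0]
```

Swapping in X_cv_scaled would therefore produce a prediction vector for the wrong dates (and possibly the wrong length), not a leak-free version of the same forecast.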

Let me know if the above is clear. Again, thanks for sharing the issue with me.