Garve / mamimo

A package to compute a marketing mix model.

Train/test split does not take into account carryover effects in test set #4

Closed tsitsimis closed 1 year ago

tsitsimis commented 1 year ago

Hi, and thank you for the great package. It is very intuitive and has helped me a lot.

Looking into the README file, in the section Training The Model, the RandomizedSearchCV first splits the initial X and y into train and test sets, and then applies the preprocessing pipeline (carryover, saturation, and the model) to the train set and test set separately. But this misses all the carryover effects caused by media at the end of the train set spilling into the beginning of the test set, which I think decreases the validity and accuracy of the model.

Shouldn't the preprocessing be applied first, and the preprocessed dataset then passed to the grid search?

Thanks!

Garve commented 1 year ago

Hello! Thanks for your kind words, very happy that the library helped you :)

To assess the model quality, you should always split first and then do the preprocessing and training only on the training dataset; otherwise the model (or preprocessor) might be able to cheat if you give it testing data as well.
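As a minimal sketch of this split-first pattern: the whole pipeline, preprocessing included, is fit inside each cross-validation fold on training data only. `StandardScaler` stands in here for the package's carryover and saturation steps (a placeholder, not mamimo's actual API), and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# The preprocessing lives INSIDE the pipeline, so each CV fold fits it
# on that fold's training rows only -- no information leaks from test rows.
pipeline = Pipeline([
    ("scale", StandardScaler()),   # carryover/saturation steps would slot in here
    ("model", LinearRegression()),
])

search = RandomizedSearchCV(
    pipeline,
    param_distributions={"model__fit_intercept": [True, False]},
    cv=TimeSeriesSplit(n_splits=3),  # time-ordered folds: train always precedes test
    n_iter=2,
    random_state=0,
)
search.fit(X, y)
```

`TimeSeriesSplit` keeps the folds chronologically ordered, which is the natural choice for media time series.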

Sure, it does not see the last part of the time series, but that's how it works out in real life too: you train the model, but it has to deal with unseen data tomorrow.

If you really just want to fit on the whole dataset, you could write a CV splitter that outputs not train/test splits but the complete dataset twice. That will lead to overfitting, though, and you can trust the carryovers etc. even less.
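Such a splitter can be sketched in a few lines (the function name is my own; `RandomizedSearchCV` accepts any iterable of `(train, test)` index pairs for its `cv` argument). As noted above, scoring on the same data you fit on deliberately overfits.

```python
import numpy as np

def full_dataset_cv(X, y=None, groups=None):
    """Yield the complete index set as both train and test.

    Every hyperparameter candidate is then scored on the very data it
    was fit on, so this intentionally gives up the held-out evaluation.
    """
    indices = np.arange(len(X))
    yield indices, indices

# Usage sketch:
# search = RandomizedSearchCV(pipeline, params, cv=full_dataset_cv(X))
```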

Best, Robert

tsitsimis commented 1 year ago

> Sure, it does not see the last part of the time series, but that's how it works out in real life too: you train the model, but it has to deal with unseen data tomorrow.

I am getting confused by this part. Even in real life, you know what happened in the past. In this case, for example, you know the carryover strength parameter that applied in the previous days, so why not apply it to the past data to inform current or future predictions as well?

Edit: expanded question

Garve commented 1 year ago

Yes, and the hyperparameter optimization learns from the past and checks how well the same parameters work in the future. Once you have the best hyperparameters, you can predict into the future with them on unseen data.
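In other words, after the search you refit on the entire known history and only then predict forward. A minimal sketch with made-up shapes (a plain `LinearRegression` stands in for the tuned pipeline, i.e. for `search.best_estimator_`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_hist = rng.random((100, 3))     # all known past media data
y_hist = X_hist.sum(axis=1)       # toy target
X_future = rng.random((7, 3))     # e.g. next week's planned media spend

model = LinearRegression()        # stands in for the refit best pipeline
model.fit(X_hist, y_hist)         # learn from the entire past
forecast = model.predict(X_future)  # predict tomorrow's unseen data
```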