Scaling Query

andrewwaites commented 7 years ago

Hi

This is excellent code - many thanks as it aligns with some research I am doing and also helps with my Python.

I have a query regarding the scaling. Am I correct that this code includes the test data in the scaling fit? Should the test set be excluded from that process? That is, should the data not be split into train/test first, with fit_transform applied to the training set and only transform applied to the test set?

dafrie commented 7 years ago

Hi Andrew,

Glad the code is helping you! As this is my first shot at RNNs (so take everything with caution), I also profited massively from other projects and blog posts and thus had to share this...

Regarding the scaling on test data: thanks for raising this point. You are correct: the standardization should be fitted on the training data only, and the estimated parameters should then be applied to the test data as well. Otherwise the model gets an "illegal" glimpse at the test data.
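For anyone reading along, here is a minimal sketch of the split-then-scale order described above. It is not the code from this repo: the helper names `fit_standardizer` and `apply_standardizer`, the 80/20 split, and the numbers in `series` are all made up for illustration, and it uses only the standard library. scikit-learn's `StandardScaler` follows the same pattern with `fit_transform` on the training set and `transform` on the test set.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Estimate the mean and population standard deviation from `values`."""
    return mean(values), pstdev(values)

def apply_standardizer(values, mu, sigma):
    """Standardize `values` using parameters estimated elsewhere."""
    return [(v - mu) / sigma for v in values]

# Hypothetical series in chronological order (illustrative numbers only).
series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 15.0, 13.0, 16.0, 17.0]

# Split first. For a time series the split is chronological, not shuffled.
split = int(len(series) * 0.8)  # assumed 80/20 split
train, test = series[:split], series[split:]

# Fit on the training data only, then reuse the same parameters on the test data.
mu, sigma = fit_standardizer(train)
train_scaled = apply_standardizer(train, mu, sigma)
test_scaled = apply_standardizer(test, mu, sigma)
```

With this order the test values never influence `mu` or `sigma`, so the scaled test set is not guaranteed to have zero mean or unit variance, which is expected.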

I noticed this flaw only after running the generated models (which took many hours...), and as the hand-in date for the paper was fast approaching, I didn't have time to correct it and rerun the models. In the paper I argued that, as the series seems to be stationary (once the seasonality is taken out) and the distributions of the training and test data are similar, the results should not really be affected...
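A rough way to sanity-check that argument is to compare the scaling parameters estimated on the full sample with those estimated on the training portion alone. The sketch below is an illustration under assumptions, not something that was run for the paper: `scaling_params`, the split point, and the numbers in `series` are hypothetical.

```python
from statistics import mean, pstdev

def scaling_params(values):
    """Return the (mean, population std) that a standardizer would estimate."""
    return mean(values), pstdev(values)

# Hypothetical series in chronological order (illustrative numbers only).
series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 15.0, 13.0, 16.0, 17.0]
split = int(len(series) * 0.8)  # assumed 80/20 chronological split
train = series[:split]

mu_full, sigma_full = scaling_params(series)
mu_train, sigma_train = scaling_params(train)

# How far the full-sample mean sits from the train-only mean, in train std units.
mean_shift = (mu_full - mu_train) / sigma_train
# Ratio of the two standard deviations; 1.0 means identical spread.
std_ratio = sigma_full / sigma_train

print(f"mean shift: {mean_shift:.3f} train std, std ratio: {std_ratio:.3f}")
```

If `mean_shift` is close to 0 and `std_ratio` is close to 1, the two scalings are nearly the same and the leak is unlikely to matter much. On this toy series with an upward drift they differ noticeably, which is the situation where fitting on the full sample would be a problem.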

andrewwaites commented 7 years ago

Hi Dafrie

Was more a sanity check than anything critical as I agree the scaling would have been very similar either way.

thanks again

dafrie / lstm-load-forecasting

Scaling Query #1