Rachnog / Deep-Trading

Algorithmic trading with deep learning experiments

Standardization of Test Dataset uses future values (conceptual error) #1

Open JekyllQuant opened 7 years ago

JekyllQuant commented 7 years ago

Hi Alex. Thanks for giving out your code, it's a very good example.

I've read your Medium post, checked your code, and re-coded the data pre-processing functions (and some other parts) myself, so I could try different standardization methods and look for possible explanations of the exceptional results you obtained.

I think you may have committed a conceptual error by writing if scale: timeseries = preprocessing.scale(timeseries) (https://github.com/Rachnog/Deep-Trading/blob/master/simple_forecasting/processing.py#L65).

It's a good idea to standardize each sliding-window sample using all and only the data inside the window (instead of methods like scenario 3 at http://sebastianraschka.com/faq/docs/scale-training-test.html). This is fine for the train dataset but not for the test dataset (your standardization method is, in some ways, similar to scenario 2 at the URL above): for the test set you shouldn't use the full window, only the data for X_test, not for Y_test. This is because Y_test isn't known at test time, so you cannot calculate the mean and std needed to standardize the full sliding-window sample.

Taking that single value out of the mean/std calculation takes away the astonishing results of the network.

I've tried different standardization methods, and the only one that reproduces the exceptional results you obtained is the method you used. The other standardization methods I coded work fine but give results far from yours.

My simple explanation is that "hiding" the information of Y_test(t) inside the mean and std of each window (which are then used to standardize the [X_test(t), Y_test(t)] sample) is enough to give the neural network the information it needs to reconstruct Y_test almost perfectly: X_test_std (standardized with future information) goes in as input, and the inverse standardization is done with a mean and std that were calculated with future information.
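To make the point concrete, here is a minimal, self-contained sketch (the series, window length, and window start are made up for illustration, not taken from the repo) showing that the statistics used to inverse-scale the prediction already contain Y_test(t):

```python
import numpy as np
from sklearn import preprocessing

np.random.seed(0)
data = np.cumsum(np.random.randn(200))   # a made-up price-like series
train, predict = 20, 1                   # same window layout as in the repo loop
i = 50                                   # an arbitrary window start

window = np.array(data[i:i + train + predict])   # 21 values, the last one is Y_test(t)
scaled = preprocessing.scale(window)             # mean/std computed over ALL 21 values

mu, sigma = window.mean(), window.std()          # these statistics already "contain" Y_test(t)
y_true = window[-1]
y_restored = scaled[-1] * sigma + mu             # equals y_true exactly, by construction

# Leak-free statistics would come from the known part of the window only:
mu_past, sigma_past = window[:-1].mean(), window[:-1].std()
```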

Please, let me know your thoughts.

Rachnog commented 7 years ago

Hi @JekyllQuant, thank you for your interest, let's try to understand what the problem is.

This is because Y_test isn't known at test time, so you cannot calculate the mean and std needed to standardize the full sliding-window sample.

Let's say we already have a trained neural network for regression, and we have some live data from the stock market: the close prices of the last 20 days. They are definitely KNOWN: x_i = timeseries[:-1]. We want to forecast the 21st day's price, y_i = timeseries[-1], but the network outputs a scaled result and we want to restore it. My hypothesis is that I can take the mean and std from the already KNOWN 20 days and assume they are a good enough estimate to restore the next day's price. I'm not using anything from the future, just statistical information from the past to try to restore the future.
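A minimal sketch of that idea, assuming a Keras-style model with a predict method and 20 known prices (the function and variable names are mine, not from the repo):

```python
import numpy as np

def forecast_next_price(model, last_20_prices):
    """Scale the known past, predict in scaled space, and invert the
    prediction with the SAME past-only statistics; no future values are used."""
    x = np.asarray(last_20_prices, dtype=float)
    mu, sigma = x.mean(), x.std()
    x_scaled = (x - mu) / sigma

    y_scaled = model.predict(x_scaled.reshape(1, -1))   # network output is in scaled space
    return y_scaled.ravel()[0] * sigma + mu             # restored price for day 21
```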

Please correct me if I am wrong, either in my reasoning or in the implementation.

JekyllQuant commented 7 years ago

Yes, your assumption is totally correct, but in your code you're not doing that. You're doing this:

```python
timeseries = np.array(data[i:i+train+predict])
if scale:
    timeseries = preprocessing.scale(timeseries)
x_i = timeseries[:-1]
y_i = timeseries[-1]
```

i.e. you take the full "timeseries" sliding window, including X_test and Y_test, and you standardize it using the preprocessing.scale() function.

Instead of writing timeseries = np.array(data[i:i+train+predict]) you should have written timeseries = np.array(data[i:i+train]), taken the mean and std from this data, and used them to standardize x_i and y_i. A sketch of that change is below.
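Here is what that could look like, following the names used in the quoted snippet (data, train, predict); it is my rewrite of the idea, not the repo's exact code:

```python
import numpy as np

X, Y = [], []
for i in range(len(data) - train - predict + 1):
    past = np.array(data[i:i + train], dtype=float)         # the known part only
    future = np.array(data[i + train:i + train + predict])  # target(s), never used for the stats

    mu, sigma = past.mean(), past.std()                     # statistics from the past only
    if sigma == 0:                                          # skip flat windows to avoid division by zero
        continue

    x_i = (past - mu) / sigma
    y_i = (future - mu) / sigma                             # scaled with past-only statistics

    X.append(x_i)
    Y.append(y_i)
```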

Rachnog commented 7 years ago

@JekyllQuant seems like you are right, just looking at the picture you can see that our predictions are too late. Does it mean that the regression approach doesn't work at all? What do you think?

[image: forecast plot, with the predictions lagging behind the actual series]

JekyllQuant commented 7 years ago

Sincere apologies for the very late reply Alex, I completely missed your post.

Does it mean that the regression approach doesn't work at all? What do you think?

I cannot say much, but I wouldn't throw everything away. Maybe I would check whether the information the NN gives you can still be useful. For sure, right now it doesn't look good, aesthetically speaking.

creotiv commented 7 years ago

Guys, don't forget that you can't normalize data against different means and stds; it will give incorrect results. So if you calculate the mean and std for your first window, you should use them for all your data, including the predicted data.
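If I read this right, one way to implement that suggestion is to fit a single scaler on the training data and reuse it everywhere; a sketch assuming sklearn and hypothetical train_prices / test_prices arrays (these names are not from the repo):

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler once, on training prices only, and reuse it for everything.
scaler = StandardScaler()
scaler.fit(train_prices.reshape(-1, 1))

train_scaled = scaler.transform(train_prices.reshape(-1, 1)).ravel()
test_scaled = scaler.transform(test_prices.reshape(-1, 1)).ravel()   # same mean/std as the train set

# Predictions made in scaled space are inverted with the same scaler:
# prices_pred = scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()
```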

cn3c3p commented 7 years ago

Hi Alex. Thanks for giving out your code, it's a very good example. But I still can't run volatility.py, because I can't find the "feature_extractor" module. Would you mind telling me where it comes from? Thanks!

Rachnog commented 7 years ago

@cn3c3p added the needed files, but I think they weren't even used in the code :)

markudevelop commented 7 years ago

@Rachnog which is the most effective algo after this fix? RNN?

Thanks

mistborn17 commented 6 years ago

Consider the mean as M and the inputs as x_1, x_2, ..., x_n, and suppose we are trying to predict x_{n+1}.

After t

tomspencer84 commented 6 years ago

Hi, did anyone ever manage to fix this strange behaviour?

GeneveyC commented 6 years ago

Hi, does anyone have a solution for the standardization of the test dataset, with the goal of using this code in the real world?

JoshuaShaw commented 6 years ago

https://medium.com/machine-learning-world/neural-networks-for-algorithmic-trading-1-2-correct-time-series-forecasting-backtesting-9776bfd9e589