Closed ritchieng closed 5 years ago
Regarding your first question, please read section 4.1.2 of the paper. In short, we shift the target time series by one period so that the value to be predicted is not given as input to the kernel, in order to prevent data leakage. I think this also answers your second question. Please let me know if you have any other questions regarding data snooping.
Thanks for the quick reply. Yes, that would hold if you did `df.shift(-1)`, but the code uses `df.shift(1)`, in which case your kernel may have seen the data.
The second part is a separate issue: the prediction is made over the whole target series y, so when predicting, say, a 2018 time stamp, the kernel has memory of 2015-2019 while predicting the shifted version of that whole time series.
`df.shift(periods=1)` is right. We move all values one step forward and add a 0 in front, as described in section 4.1.2 of the paper. You can check this by printing `df_y` and `df_yshift` at lines 17 and 20; you will see the same situation as described in the paper. When predicting the value at time t, the kernel can only access values in `df_yshift` within the window size up to and including time t-1.
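For reference, the effect of the shift can be checked in isolation with a minimal pandas sketch (toy values, not the repository's actual data; using `fill_value` is just one way to get the leading 0):

```python
# Minimal sketch of what df.shift(periods=1) does to the target series:
# every value moves one step forward in time, and the first position
# is filled with 0, so the input at time t is the target at time t-1.
import pandas as pd

df_y = pd.DataFrame({"target": [10.0, 11.0, 12.0, 13.0, 14.0]})
df_yshift = df_y.shift(periods=1, fill_value=0)

print(df_y["target"].tolist())       # [10.0, 11.0, 12.0, 13.0, 14.0]
print(df_yshift["target"].tolist())  # [0.0, 10.0, 11.0, 12.0, 13.0]
```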
I see your second point. However, when the target time series is long and the kernel is not too large, the effect of keeping future values in memory will be minimal. Furthermore, this is part of the deal with sequence learning: unless you have a very large dataset, you will show the data multiple times to train your neural network, i.e. the network will see values multiple times.
For the first part, yes, I saw the code and it is indeed as you described. However, in that case it is predicting the past, one day before, because of `df.shift(1)`. Does predicting the past make sense?
Thanks for acknowledging the second point :) Yes, you will inevitably cycle through the data multiple times unless you have a very large dataset. But that's where validation and test sets come in (for example, a separate time series for each in this case) to show predictive capacity out-of-sample. The problem is that it's not possible to truly show out-of-sample performance here: when running this algorithm on unseen data at inference time to measure test performance, the kernel has knowledge of future data within that inference run.
| x10 | x11 | x12 | x13 | x14 |
| x20 | x21 | x22 | x23 | x24 |
Without shifting, a CNN with kernel size 3 will use x10, x11, x12 and x20, x21, x22 to predict x13. However, we want to take x23 into account when predicting x13 (to allow the discovery of instantaneous effects).
Therefore, we replace the target time series in the dataset with a shifted version of the target time series, resulting in the following dataset:
| 0.0 | x10 | x11 | x12 | x13 |
| x20 | x21 | x22 | x23 | x24 |
Now, the CNN with kernel size 3 will use x10, x11, x12 and x21, x22, x23 to predict x13. This corresponds to our goal: using only the past values of the target, and the past and current values of the other time series. I hope that answers your question.
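The receptive-field claim above can be sketched numerically (a toy illustration, not the TCDF implementation; values encode x_{i,t} as 10*i + t for readability):

```python
import numpy as np

# Row 0 is the shifted target series (0.0, x10..x13), row 1 the other series.
target = np.array([0, 10, 11, 12, 13])
other  = np.array([20, 21, 22, 23, 24])

kernel_size = 3
t = 3  # predicting x13, the value at original time step 3

# A causal convolution predicting position t sees inputs t-k+1 .. t
window = slice(t - kernel_size + 1, t + 1)
print(target[window])  # [10 11 12] -> x10, x11, x12: past target values only
print(other[window])   # [21 22 23] -> x21, x22, x23: up to and including current
```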
Herein lies the issue. To make it easier to see, here's an example.
Take two time series as an example: the S&P 500 as x1i, and the NASDAQ as x2i.
| x10 | x11 | x12 | x13 | x14 |
| x20 | x21 | x22 | x23 | x24 |
You stated yourself that the kernel will use x23. That is equivalent to using the NASDAQ's current value to predict the S&P 500's current value. You would get very high apparent predictive capacity (almost a straight-line equity curve with a high Sharpe ratio) because you are using future information: the current value of other time series (other equity indices) to predict the current value of the S&P 500, for example.
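As a rough illustration of this concern (synthetic data standing in for the two indices, not real market returns), a contemporaneous correlated series "predicts" the target almost perfectly, while its lagged version carries no such signal:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strongly correlated synthetic return series (purely illustrative
# stand-ins for S&P 500 and NASDAQ, driven by a shared common factor).
common = rng.normal(size=1000)
sp500  = common + 0.1 * rng.normal(size=1000)
nasdaq = common + 0.1 * rng.normal(size=1000)

# "Predicting" today's S&P value from today's NASDAQ value:
r_current = np.corrcoef(sp500, nasdaq)[0, 1]
# Predicting it from yesterday's NASDAQ value (no contemporaneous info):
r_lagged = np.corrcoef(sp500[1:], nasdaq[:-1])[0, 1]
print(round(r_current, 2), round(r_lagged, 2))
```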
Since we also want to detect instantaneous effects in our causal graph (this is important, as shown by Hyvarinen et al., http://icml2008.cs.helsinki.fi/papers/160.pdf), it was a design choice to include the current values in the input when making a prediction. This is also done in the Temporal Convolutional Network architecture that TCDF is based on: see section 3.1 of https://arxiv.org/pdf/1803.01271.pdf. As discussed in section 2 of our paper: "In practice, instantaneous effects mostly occur when cause and effect refer to the same time step that cannot be causally ordered a priori, because of a too coarse time scale." An extra benefit of including instantaneous effects is that TCDF can use them to circumstantially discover the presence of hidden confounders.
Thanks for clarifying, I get your experimental setup choice now.
Interesting work. Looking through your repository, I have two questions on possible data snooping:
On this line preparing your data, I see you shift the dataframe for your labels (https://github.com/M-Nauta/TCDF/blob/master/TCDF.py#L19) with `df.shift(periods=1)`. If you use a CNN on the feature space X, where your convolutions carry information across time, then predicting forward price levels beyond what the convolutions on X have seen is fine. But if you predict the whole target series y, you have "forward" information via your convolutions on X, which span t = 0 to t = T, and T - 1 of your labels fall within that span.
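To make the concern concrete, here is a toy sketch (not the TCDF code) of how a convolution whose window is not strictly causal lets the output at time t depend on inputs after t:

```python
import numpy as np

x = np.zeros(10)
x[7] = 1.0  # an "event" at t = 7
kernel = np.ones(3) / 3  # simple averaging kernel, width 3

# Centered window (mode="same"): each output looks one step into the future.
centered = np.convolve(x, kernel, mode="same")
print(centered[6] > 0)  # True -> the output at t=6 already reacts to the t=7 input

# Causal version: pad kernel_size-1 zeros on the left only, so each
# output window ends at time t and never sees inputs after t.
causal = np.convolve(np.concatenate([np.zeros(2), x]), kernel, mode="valid")
print(causal[6] > 0)    # False -> no future information at t=6
```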