Closed ritchieng closed 5 years ago
Regarding your first question, please read section 4.1.2 of the paper. In short, we shift the target time series by one period so that the value to be predicted is not given as input to the kernel, in order to prevent data leakage. I think this also answers your second question. Please let me know if you have any other questions regarding data snooping.
Thanks for the quick reply. Yes, that would hold if you did `df.shift(-1)`, but the code uses `df.shift(1)`, in which case your kernel may have seen the data.
The second part is a separate issue: the prediction is made over the whole target series y, so when predicting, say, a 2018 time stamp, the kernel has memory of 2015-2019 while predicting the shifted version of that whole time series.
`df.shift(periods=1)` is right. We move all values one step forward and add a 0 in front, as described in section 4.1.2 of the paper. You can check this by printing `df_y` and `df_yshift` at lines 17 and 20; you will see the same situation as described in the paper. When predicting the value at time t, the kernel can only access values in `df_yshift` within the window size up to and including time t-1.
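For reference, the effect of the shift can be checked in isolation with a minimal pandas sketch (toy values, not the repository's actual data; using `fill_value` is just one way to get the leading 0):

```python
# Minimal sketch of what df.shift(periods=1) does to the target series:
# every value moves one step forward in time, and the first position
# is filled with 0, so the input at time t is the target at time t-1.
import pandas as pd

df_y = pd.DataFrame({"target": [10.0, 11.0, 12.0, 13.0, 14.0]})
df_yshift = df_y.shift(periods=1, fill_value=0)

print(df_y["target"].tolist())       # [10.0, 11.0, 12.0, 13.0, 14.0]
print(df_yshift["target"].tolist())  # [0.0, 10.0, 11.0, 12.0, 13.0]
```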
I see your second point. However, when the target time series is long and the kernel is not too large, the effect of keeping future values in memory will be minimal. Furthermore, this is part of the deal with sequence learning: unless you have a very large dataset, you will show the data multiple times to train your neural network, i.e. the network will see values multiple times.
For the first part, yes, I saw the code and it is indeed as you described. However, in that case it is predicting the past, one day before, because of `df.shift(1)`. Does predicting the past make sense?
Thanks for acknowledging the second point :) Yes, you will inevitably cycle through the data multiple times unless you have a very large dataset. But that's where validation and test sets come in (for example, a separate time series for each in this case) to show predictive capacity out-of-sample. The problem is that it's not possible to truly show out-of-sample performance here: when running this algorithm on unseen data at inference time to measure test performance, the kernel has knowledge of future data within that inference run.
| x10 | x11 | x12 | x13 | x14 |
| x20 | x21 | x22 | x23 | x24 |
Without shifting, a CNN with kernel size 3 will use x10, x11, x12 and x20, x21, x22 to predict x13. However, we want to take x23 into account when predicting x13 (to allow the discovery of instantaneous effects).
Therefore, we replace the target time series in the dataset with a shifted version of the target time series, resulting in the following dataset:
| 0.0 | x10 | x11 | x12 | x13 |
| x20 | x21 | x22 | x23 | x24 |
Now, the CNN with kernel size 3 will use x10, x11, x12 and x21, x22, x23 to predict x13. This corresponds to our goal: using only the past values of the target, and the past and current values of the other time series. I hope that answers your question.
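The receptive-field claim above can be sketched numerically (a toy illustration, not the TCDF implementation; values encode x_{i,t} as 10*i + t for readability):

```python
import numpy as np

# Row 0 is the shifted target series (0.0, x10..x13), row 1 the other series.
target = np.array([0, 10, 11, 12, 13])
other  = np.array([20, 21, 22, 23, 24])

kernel_size = 3
t = 3  # predicting x13, the value at original time step 3

# A causal convolution predicting position t sees inputs t-k+1 .. t
window = slice(t - kernel_size + 1, t + 1)
print(target[window])  # [10 11 12] -> x10, x11, x12: past target values only
print(other[window])   # [21 22 23] -> x21, x22, x23: up to and including current
```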
Herein lies the issue. To make it easier to see, here's an example.
Take two time series as an example: the S&P 500 as x1i, and the NASDAQ as x2i.
| x10 | x11 | x12 | x13 | x14 |
| x20 | x21 | x22 | x23 | x24 |
You stated yourself that the kernel will use x23. That is equivalent to using the NASDAQ's current value to predict the S&P 500's current value. You would get very high apparent predictive capacity (almost a straight-line equity curve with a high Sharpe ratio) because you are using future information: the current value of other time series (other equity indices) to predict the current value of the S&P 500, for example.
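As a rough illustration of this concern (synthetic data standing in for the two indices, not real market returns), a contemporaneous correlated series "predicts" the target almost perfectly, while its lagged version carries no such signal:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strongly correlated synthetic return series (purely illustrative
# stand-ins for S&P 500 and NASDAQ, driven by a shared common factor).
common = rng.normal(size=1000)
sp500  = common + 0.1 * rng.normal(size=1000)
nasdaq = common + 0.1 * rng.normal(size=1000)

# "Predicting" today's S&P value from today's NASDAQ value:
r_current = np.corrcoef(sp500, nasdaq)[0, 1]
# Predicting it from yesterday's NASDAQ value (no contemporaneous info):
r_lagged = np.corrcoef(sp500[1:], nasdaq[:-1])[0, 1]
print(round(r_current, 2), round(r_lagged, 2))
```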
Since we also want to detect instantaneous effects in our causal graph (this is important, as shown by Hyvarinen et al., http://icml2008.cs.helsinki.fi/papers/160.pdf), it was a design choice to include the current values in the input when making a prediction. This is also done in the Temporal Convolutional Network architecture that TCDF is based on: see section 3.1 of https://arxiv.org/pdf/1803.01271.pdf. As discussed in section 2 of our paper: "In practice, instantaneous effects mostly occur when cause and effect refer to the same time step that cannot be causally ordered a priori, because of a too coarse time scale." An extra benefit of including instantaneous effects is that TCDF can use them to circumstantially discover the presence of hidden confounders.
Thanks for clarifying, I get your experimental setup choice now.
Interesting work. Looking through your repository, I have two questions on possible data snooping:
On this line preparing your data, I see you shift the dataframe for your labels (https://github.com/M-Nauta/TCDF/blob/master/TCDF.py#L19) with `df.shift(periods=1)`. If you use a CNN on the feature space X, where your convolutions carry information across time, then predicting forward price levels beyond what the convolutions on X have seen is fine. But if you predict the whole target series y, you have "forward" information via your convolutions on X, which span t = 0 to t = T, and T - 1 of your labels fall within that span.
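To make the concern concrete, here is a toy sketch (not the TCDF code) of how a convolution whose window is not strictly causal lets the output at time t depend on inputs after t:

```python
import numpy as np

x = np.zeros(10)
x[7] = 1.0  # an "event" at t = 7
kernel = np.ones(3) / 3  # simple averaging kernel, width 3

# Centered window (mode="same"): each output looks one step into the future.
centered = np.convolve(x, kernel, mode="same")
print(centered[6] > 0)  # True -> the output at t=6 already reacts to the t=7 input

# Causal version: pad kernel_size-1 zeros on the left only, so each
# output window ends at time t and never sees inputs after t.
causal = np.convolve(np.concatenate([np.zeros(2), x]), kernel, mode="valid")
print(causal[6] > 0)    # False -> no future information at t=6
```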