Chapter 14 - target-shape in time series prediction with RNN

I'm struggling with understanding the way the RNN is trained in the time series example: You pose the prediction problem as a supervised problem of predicting y given X in the following way, to my understanding: We randomly select an arbitrary number of n_samples of a long time-series (for example 20000) and slice out a time sequence of length n_timesteps(for example 24). So far so good.

One sample would now be starting at a random index of the real time series, for example at the 24st index of the overall time series: X = [d24, d25, d26,....,d47,d48] (length 24) y = [d25, d26, d27, .... , d48, d49] So when we're fitting the RNN to this sample the only forecasted value is actually d49, since it isn't needed as an input to the RNN.

Following up from that we could now feed this sample into the trained RNN to predict the 50th value of the overall timeseries and to generate a multi step forecasting (like you propose in the Creative RNN-part).

In a way, if my goal is to generate an n-step prediction based on my previous values, I would have to predict n times, reinserting the prediction into the network.

In another well known blog (https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/) RNNs and LSTMs are trained in a different manner: The supervised problem is posed as:

X = [d24, d25, d26,....,d47,d48] (length 24) y = [d49] or for multistep: X = [d24, d25, d26,....,d47,d48] (length 24) y = [d49, d50, d51,...] length n

My overall scape is to do a n-step ahead prediction of electricity load in grid, which is a quite autoregressive approach, kind of like forecasting the temperature. I want to select the past values and predict the next value). A naive way to do that would be to just use the last value in the sequence.

If i evaluate the output of my network, the error is quite good for the first predictions and it's predicting nicely, but the last value (e.g. d49 of your given way to train the network) is just a repetition of the d48, which will produce a quite low MSE value.

I'm wondering now, when you're feeding the network with a tensor of shape (n_samples, n_timesteps, n_features) (20000-24, 24,1) is there any dataleakage in one sample? (note that if we randomly sample 24 sequences of the overall sequence we can get 20000-24 independet samples)

So more specifically in a random sample of shape (1,24,1) can the network use the 2nd value X(1,2,1) to predict y(1,1,1) and that's why the last value is bad, because there is no information about that in the sample?

Hi @weberoliver , Thanks for your interesting question, and my apologies for the late response (I was on vacation). I improved this chapter in the second edition, you can check out the early release on O'Reilly's Safari platform (it requires a subscription, but you can benefit from their trial period). Or if you just need the code, you can check out the Jupyter notebook for the RNN chapter in the 2nd edition.

In short, in this new chapter I present several approaches to forecasting multiple timesteps at once:

[d24, d25, d26, ..., d47, d48] => [d49]. This is a sequence-to-vector RNN. It only forecasts one step ahead, so to forecast multiple time steps, you would need to append this predicted value to the input sequence and use the model again to predict d50, and so on (just like in the "creative RNN" section). This can work well to forecast very few time steps ahead (e.g., 2 to 5 time steps).
[d24, d25, d26, ..., d47, d48] => [d25, d26, d27, ..., d48, d49]. A sequence-to-sequence net that only predicts one time step ahead. So just like the previous net, you would need to append d49 to the input sequence and use the model again to forecast d50, and so on. The benefit of this model is that it tends to handle long input sequences better (there are gradients flowing back through the net at each time step, not only from the last time step), and it tends to overfit less.
[d24, d25, d26, ..., d47, d48] => [d49, d50, d51]. Sequence-to-vector RNN. The input shape is [batch size, time steps, input_dims] and the output shape is [batch size, steps_ahead * output_dims]. This is quite simple and often works well.
[d24, d25, d26, ..., d47, d48] => [[d25, d26, d27], [d26, d27, d28], ..., [d49, d50, d51]]. A sequence-to-sequence RNN. Same input shape as the previous model, but the output shape is [batch size, time steps, steps_ahead * output_dims]. Again, this sequence-to-sequence approach may help handle long input sequences, and it have less risk to overfit.

In the 2nd edition, I also showed how to use 1D-convolutional layers for sequence preprocessing, and even do forecasting using 100% convolutional neural nets (e.g., using WaveNet).

Hope this helps.

To answer your last question:

I'm wondering now, when you're feeding the network with a tensor of shape (n_samples, n_timesteps, n_features) (20000-24, 24,1) is there any dataleakage in one sample? (note that if we randomly sample 24 sequences of the overall sequence we can get 20000-24 independet samples)

There should not be. RNN layers are causal: the output at time step t is only based on the inputs at time step t and earlier. It does not look ahead (unless you build a bidirectional RNN). Similarly, a Conv1D layer is causal if you set padding="causal" (it automatically pads zeros before time step t, depending on the kernel size, so again, the output at time step t does not depend on time steps after t). You can also use padding="valid" and crop the targets appropriately (e.g., if the kernel size is 3, then the first output will be based on t, t+1 and t+2, so the targets should start at t+3).

So more specifically in a random sample of shape (1,24,1) can the network use the 2nd value X(1,2,1) to predict y(1,1,1) and that's why the last value is bad, because there is no information about that in the sample?

I think you meant X[0, 1, 0] and y[0, 0, 0], right? If the shape is (1, 24, 1), X[1, 2, 1] would be out of bounds.

In an RNN, X[0, 1, 0] will only be used for the outputs at time step 1 and after. It is causal, as explained above. At time step 0, the RNN's output only depends on X[0, 0, 0], so it has very little information, and thus the prediction will generally not be great. At time step t, the RNN's output depends on X[0, 0, 0], X[0, 1, 0], ..., X[0, t, 0], so it can make a much better prediction.

Hope this helps.

Thank you very much for your clarification on the matters! I will look into the newer editions, but your answers have already clarified a lot and sorry about the indexing mistake, you were correct, that was what I meant.

ageron / handson-ml

Chapter 14 - target-shape in time series prediction with RNN #458