ageron / handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.

[BUG] Chapter 15 naive prediction error with correction #421

Open ChrisChiasson opened 3 years ago

ChrisChiasson commented 3 years ago

The naive prediction in Chapter 15, reproduced below, is incorrect. It also raises a visual red flag because the network's error (about 0.027) was about 10x smaller. Quoting from Chapter 15, https://github.com/ageron/handson-ml2/blob/master/15_processing_sequences_using_rnns_and_cnns.ipynb

Let's compare this performance with some baselines: naive predictions and a simple linear model:

In [31]:
Y_naive_pred = Y_valid[:, -1:]
np.mean(keras.metrics.mean_squared_error(Y_valid, Y_naive_pred))
Out[31]:
0.22278848
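
For readers running the snippets below outside the notebook, here is a minimal sketch of the context they assume (the generator and split approximate the chapter's setup from memory, so the exact coefficients may differ):

import numpy as np
from tensorflow import keras

def generate_time_series(batch_size, n_steps):
    # Each series is the sum of two sine waves with random frequencies and
    # phases, plus a little uniform noise.
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20))
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)
    return series[..., np.newaxis].astype(np.float32)

n_steps = 50
series = generate_time_series(10000, n_steps + 10)
X_valid = series[7000:9000, :n_steps]    # shape (2000, 50, 1)
Y_valid = series[7000:9000, -10:, 0]     # shape (2000, 10)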

An accurate way to calculate the naive prediction error is shown below. The naive error is actually smaller than the RNN's walk-forward error. This makes sense because the naive prediction is very hard to beat, even when walking forward with real data as the notebook did. (For example, achieving 70% of the naive error would be excellent; 10% of the naive error is ludicrous.)

# Naive one-step prediction: for each target step, reuse the previous true
# value (the last input value for step 1, then the preceding target values).
actual_Y_naive = np.array([np.concatenate([x, y])
                           for x, y in zip(X_valid[:, -1], Y_valid[:, :-1])])
print(np.average(np.square(Y_valid - actual_Y_naive)))
np.mean(keras.metrics.mean_squared_error(Y_valid, actual_Y_naive))
0.01524985
0.015249849

The main problem with the code in the existing notebook is that it is not dimensionally consistent. Here are some accurate ways to calculate subsets of the naive error to make it easier to see.
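A quick shape check (assuming Y_valid has shape (2000, 10), as in the chapter) makes the problem visible: the (2000, 1) "prediction" silently broadcasts against all ten target steps.

print(Y_valid.shape)           # (2000, 10): ten future steps per sequence
print(Y_valid[:, -1:].shape)   # (2000, 1): broadcasts against all ten steps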

This just compares the last validation value with the second-to-last validation value.

Y_naive_pred = Y_valid[:, -2]
np.average(np.square(Y_valid[:, -1] - Y_valid[:, -2]))
np.mean(keras.metrics.mean_squared_error(Y_valid[:, -1], Y_naive_pred))
0.0152704455
0.0152704455

This just compares the first validation value with the last value in the input.

Y_naive_pred = X_valid[:, -1, 0]
print(np.average(np.square(Y_valid[:, 0] - X_valid[:, -1, 0])))
np.mean(keras.metrics.mean_squared_error(Y_valid[:, 0], Y_naive_pred))
0.015137789
0.015137789
ageron commented 3 years ago

Hi @ChrisChiasson ,

Great catch, thanks a lot! I believe you are referring to the naive baseline in the "Forecasting Several Steps Ahead" section. I'm not sure why I defined it as Y_naive_pred = Y_valid[:, -1:]. I'm guessing I first copied the naive prediction from the first section (y_pred = X_valid[:, -1], which was correct), but then something went wrong; it really makes no sense at all.

Recall that in this section we are trying to forecast t+1, t+2, ..., t+10 based on t-n, t-n+1, ..., t-1, t. In other words, we are trying to predict 10 values at once, based only on past values. We should not be using any future values to make our predictions, which means we should not be using Y_valid at all (as my code was doing), only X_valid.

The most naive prediction for this use case is simply repeating the last known value 10 times (it's like predicting that the temperature for the next 10 days will be equal to today's temperature). Here's a simple way to implement this:

Y_naive_pred = np.tile(X_valid[:, -1], 10)

This takes the last time step from each sequence (X_valid[:, -1]) and repeats it 10 times (np.tile(..., 10)). Let's check that it works well for the first sequence:

>>> X_valid[0, -1]
array([-0.24104503], dtype=float32)
>>> Y_naive_pred[0]
array([-0.24104503, -0.24104503, -0.24104503, -0.24104503, -0.24104503,
       -0.24104503, -0.24104503, -0.24104503, -0.24104503, -0.24104503],
      dtype=float32)

Looks good! Now we can compute the error:

>>> np.mean(keras.metrics.mean_squared_error(Y_valid, Y_naive_pred))
0.25697407

Luckily, it's not too far from the value I got when using the wrong code! So the neural net really is 10x better than the naive prediction.

Now regarding the first solution you propose:

actual_Y_naive=np.array([np.concatenate([x,y]) for x,y in zip(X_valid[:,-1],Y_valid[:,:-1])])

In this case you are using Y_valid to define the naive predictions (like my code was), but that's incorrect as explained above: it uses future values to predict future values (it's kind of cheating, which explains why the error is so low!). Specifically, it's predicting t, t+1, t+2, ..., t+9.
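
To make the shift concrete, a quick check along these lines (reusing actual_Y_naive from above) shows that every predicted step after the first is a future value:

# actual_Y_naive[i] is [x_t, y_t+1, ..., y_t+9], while Y_valid[i] is
# [y_t+1, ..., y_t+10]: the "predictions" are the targets shifted one step.
print(np.allclose(Y_valid[:, :-1], actual_Y_naive[:, 1:]))  # True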

Regarding the second solution:

Y_naive_pred = Y_valid[:, -2]
np.mean(keras.metrics.mean_squared_error(Y_valid[:,-1], Y_naive_pred))

Again, this code uses Y_valid to define Y_naive_pred, which is not allowed. Moreover, it's only predicting a single value, but in this section we're trying to predict the next 10 values at once.

Lastly, the third solution:

Y_naive_pred=X_valid[:,-1,0]
np.mean(keras.metrics.mean_squared_error(Y_valid[:,0],Y_naive_pred))

This code only predicts a single value, when we want to predict 10 values at once. At least it's not using future values, though.

I'll fix the notebook now.

Thanks again for catching this error! 👍

ChrisChiasson commented 3 years ago

When I used the y values it was intentional, not a mistake: I was showing what a one-step naive prediction would do. In one case, I compared the last and second-to-last y values; in another, I showed that this is similar to comparing the last x value with the first y value. In real life, no one would fail to use the most recently available data for their naive predictions. (A 10-step-ahead naive prediction does not make sense the way you were doing it; only one step does.) If you compare one-step-ahead neural network predictions against the one-step-ahead naive error, the naive error is lower in this case.

If you really want comparable naive predictions 10 steps ahead on periodic data like what you generated, then your MASE should be calculated with respect to the value one period ago (i.e., compare to the same time slot one day, one month, or one year ago, as appropriate).
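
For instance, a seasonal naive MASE could be computed along these lines (a sketch on a single 1-D series; the function name and the `season` parameter are mine, not from the notebook):

def seasonal_naive_mase(y_true, y_pred, season):
    # Scale the forecast's MAE by the MAE of the seasonal naive forecast,
    # which simply repeats the value from one full period earlier.
    naive_errors = np.abs(y_true[season:] - y_true[:-season])
    return np.mean(np.abs(y_true - y_pred)) / np.mean(naive_errors)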

It makes absolutely no sense to compare a constant prediction for 10 time slots on periodic data against the neural network's output the way you are doing. That is why a 10% MASE is ludicrous.

On Thu, May 27, 2021, Aurélien Geron wrote:

Closed #421 (https://github.com/ageron/handson-ml2/issues/421) via 8342ab3 (https://github.com/ageron/handson-ml2/commit/8342ab3fa25e4e855a140f008b6dde064064827f).

ageron commented 3 years ago

Hi Chris,

Thanks for clarifying; I understand your point and I really appreciate the feedback. I agree that my naive approach is too naive when predicting 10 steps ahead: it makes sense when predicting a single time step ahead, and perhaps 2 or 3, but not 10. That's just too basic, and it does not give a very meaningful baseline.

However, it's hard to find a better "naive" approach with this dataset. Indeed, every sequence is a superposition of two sine waves with different frequencies, phases and amplitudes (plus noise). In general, the ratio of the two frequencies will not be a simple fraction, so the period of the sum of the two sine waves will be much larger than the length of each sequence. What period should we use for each sequence when copying past data? I guess we could use the partial autocorrelation function (PACF) to find the "best" lag. Is this what you suggest? Would you have a minute to propose some simple code for a better naive approach? Thanks again! 👍
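
Something like this, perhaps (an untested sketch, assuming statsmodels is installed, picking the strongest lag for one sequence)?

from statsmodels.tsa.stattools import pacf

series0 = X_valid[0, :, 0]                 # one 50-step validation sequence
partial = pacf(series0, nlags=20)
best_lag = int(np.argmax(np.abs(partial[1:]))) + 1   # skip lag 0, always 1.0
print(best_lag)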

ChrisChiasson commented 3 years ago

In a general engineering case (for example, mechanical engineering), where a waveform appears to be (multiply-)periodic but is perhaps of unknown functional form, usually a Fourier transform would be taken to identify the amplitudes and frequencies present. It will also correctly capture the phase shifts if you are careful when you set it up. In the case of your function, there will be two huge spikes in the transform, associated with the two periods of the sinusoids, and the transform will be near-zero everywhere else. Note that the Fourier transform would need to be done on a per-system basis: I remember when I was reading your code that you had it generating several different systems of sinusoid pairs. This should allow you to pull out the amplitudes and frequencies by inspection of the Fourier transform plots.
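
For example, on one of your sequences the peak-finding could look something like this (a sketch; it assumes the X_valid array from earlier, and it is most accurate when the true frequencies line up with the FFT bins):

series0 = X_valid[0, :, 0]                      # one sequence, 50 steps
spectrum = np.fft.rfft(series0 - series0.mean())
freqs = np.fft.rfftfreq(len(series0), d=1.0)    # in cycles per time step
amps = np.abs(spectrum) * 2 / len(series0)
top2 = np.argsort(amps)[-2:]                    # the two dominant spikes
print(freqs[top2], amps[top2])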

If there is some kind of overall growth and decay of a system (I don't think there was in your example), that will usually be modeled by multiplying the periodic part with an exponential function like c * exp(k t). Non-linear regression can be used to back out the values of c & k.

Anyway, in this case, I think the generating function is so basic that once the Fourier transform is complete you'll have essentially a perfect naive representation (aside from the random noise term that you added at the end of your generation function), and it will be impossible for anything to beat it.
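
Concretely, keeping only those two dominant components lets you extrapolate them past the end of the window (a sketch continuing from the spectrum above; again, it degrades when the true frequencies fall between FFT bins):

t_future = np.arange(len(series0), len(series0) + 10)   # the next 10 steps
forecast = np.full(10, series0.mean())
for i in top2:
    amp = np.abs(spectrum[i]) * 2 / len(series0)
    phase = np.angle(spectrum[i])
    # Extend each sinusoid beyond the observed window.
    forecast = forecast + amp * np.cos(2 * np.pi * freqs[i] * t_future + phase)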

If we were talking about a real-world business-case scenario with multiple periods, then as far as I can tell from the last couple of months of reading the literature, the periods would be daily, weekly, monthly, yearly, etc., and to break it down, an analysis like TBATS (see the sktime Python package) or fbProphet would need to be used, where the different seasonal periods are treated as additive. So they would separately regress out daily harmonics, weekly harmonics, and yearly harmonics, and then add them all together.

In ARIMA, I think you would actually need a superposition of two different lags to represent your system, since both sinusoids seem relatively important, but I have never attempted that.

One thing I would say is that usually in a business case problem, one of the periods is going to dominate, and will usually be a direct function of the resolution of your data. For example, the daily business cycle in the case of hourly sales data. In that case, you would just be comparing back to the previous business day at the same hour. (or, sometimes, previous weekend "business day" at the same hour). In those types of cases, it will be clear where the naive "one step ago" prediction will need to come from.

If you go the Fourier transform route, note that there are different versions of the transformation (i.e., multiple continuous transforms and multiple discrete transforms). They can all be related through the choice of Fourier parameters and through the scaling of the time and frequency terms, so be wary of notation if you attempt to learn the theory. (I have done this and written about it on the Mathematica Stack Exchange, but personally I would just use whatever defaults the relevant Python package uses.)

Chris

On Fri, May 28, 2021, Aurélien Geron wrote:

Reopened #421 (https://github.com/ageron/handson-ml2/issues/421).

ageron commented 3 years ago

Awesome response, Chris, thanks so much. I had a sum of sine waves and yet I didn't think of the Fourier transform... silly me! :) I'll give this a try.