cerlymarco / MEDIUM_NoteBook

Repository containing notebooks of my posts on Medium
MIT License

input training for model #3

Closed: mfumagalli68 closed this issue 5 years ago

mfumagalli68 commented 5 years ago

Hi Marco, good post on Medium and good code.

In the paper http://roseyu.com/time-series-workshop/submissions/TSW2017_paper_3.pdf, the authors specify that a training dataset is created by splitting the historical data into sliding windows of input and output variables. I don't see any of that in your code.

Furthermore, I feel like we are doing a sort of "data leaking" here: XX = encoder.predict(X). It seems that we are using the entire available data, both train and test, to produce that array XX.

Thanks

cerlymarco commented 5 years ago

Hi,

With the functions 'gen_sequence' and 'gen_labels' I create sliding windows to feed my LSTM models.
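For reference, a minimal sketch of that kind of sliding-window generation, assuming the data lives in a pandas DataFrame; the function names mirror the ones above, but the bodies are illustrative and not necessarily the notebook's exact code:

```python
import numpy as np
import pandas as pd

def gen_sequence(df, window, feature_cols):
    """Slide a fixed-length window over the rows and stack the results."""
    data = df[feature_cols].values
    return np.stack([data[i:i + window] for i in range(len(data) - window)])

def gen_labels(df, window, label_col):
    """Target aligned with the step right after each window."""
    return df[label_col].values[window:]

df = pd.DataFrame({"price": np.arange(10.0), "volume": np.arange(10.0) * 2})
X = gen_sequence(df, window=3, feature_cols=["price", "volume"])  # shape (7, 3, 2)
y = gen_labels(df, window=3, label_col="price")                   # shape (7,)
```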

The autoencoder is fitted only on the train data; the associated encoder predictions are computed on the entire data... this doesn't generate 'data leaking', because I use the same train data for fitting the autoencoder and the forecaster, and I always kept train separated from test. You can imagine the encoder as a sort of 'transformer', like the StandardScaler or MinMaxScaler of sklearn, which we fit on train data and use to transform test.
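The analogy in code, a minimal sketch with synthetic data (the encoder lines are commented out because they assume a Keras encoder already fitted on the training windows):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 8))
X_test = rng.normal(size=(30, 8))

# Fit the transformer on the training split only...
scaler = StandardScaler().fit(X_train)
# ...then apply it to both splits: no statistics from the test set leak in.
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# The encoder is used the same way: trained (inside the autoencoder) on the
# training data only, then applied to train and test alike, e.g.
# XX_train = encoder.predict(X_train_s)
# XX_test = encoder.predict(X_test_s)
```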

I appreciate comments like yours to improve the quality of these analyses. Thanks, stay tuned on Medium. Ciao

mfumagalli68 commented 5 years ago

OK, thank you. I'm going one step further.

Now I want to make predictions on completely new data. Let's suppose we are at the end of the current month and I want to forecast, on a daily basis, the average price of our avocados for the next 30 days.

How can I build the features derived from the autoencoder? I would need the time series for August, which is obviously not available.

I could build a model without the autoencoder features to predict the August prices, use the autoencoder on those predictions to build the features, and then retrain the model. But I suspect that my features would be completely wrong, since (in my case) the autoencoder features seem to improve my predictions (I have quite a few extreme values which are messing with my metrics), and a model built without them will produce poor forecasts to feed the autoencoder.

Any thoughts from you?

I'm sorry to insist on this, but it's crucial to me.

Thanks for the patience.

cerlymarco commented 5 years ago

Hi, don't worry...

If we are at the 31st of July, we have at our disposal all the data up to July; we use them to generate our features and predict the future (like in every ML task). In the avocado case the future is represented by the day after, so with our features we'll forecast the value of the 1st of August. On the 1st of August we can observe the past and use it to create the features to forecast the value of the 2nd of August, and so on.
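In other words, a recursive one-step-ahead loop. A minimal sketch of the idea, where make_features is a hypothetical helper that turns everything observed so far into the model's input row:

```python
import numpy as np

def recursive_forecast(model, history, horizon, make_features):
    """Forecast `horizon` days one step at a time: predict the next value,
    append it to the history as if it were observed, and repeat."""
    history = list(history)
    preds = []
    for _ in range(horizon):
        x = make_features(history)   # features from the data observed so far
        y_hat = model.predict(x)[0]  # forecast for the next day
        preds.append(y_hat)
        history.append(y_hat)        # the forecast becomes part of the past
    return np.array(preds)
```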

I use the autoencoder because I hope it learns the system dynamics by itself, so I use it as a feature generator BEFORE the forecaster. I don't see added value in using an autoencoder on the final forecasts (with different trained models you can use prediction ensemble techniques, but that is another question).

I don't know your use case, and I'm at your disposal to exchange points of view. Thanks

mfumagalli68 commented 5 years ago

Nope, I think we are not understanding each other.

In machine learning we usually give an observed X to our model in order to make it predict Y. And usually every package in every programming language has a method like model.predict(X).

I need to predict 30 days ahead.

So I need to build an X that has the same number of columns as our X_train. My X contains time-related features and other internal data of my company (mostly related to product promotions).

So let's take August. From 2019-08-01 to 2019-08-30 I can generate the time-related features I want, for example day of month and quarter, so my X will be made up of 2 columns and 30 rows. Then I will add my product promotion features, and that's fine.

Now it comes to the part of the features generated by my autoencoder, that matrix of 128 columns. That's my doubt right now. The autoencoder is just a "compression of my signal data", but in a real scenario I don't have any data to compress. I don't think it's possible to generate a matrix of autoencoded features to concatenate with my X.
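The part that can be built in advance looks roughly like this (a sketch of the setup described above; column names are illustrative):

```python
import pandas as pd

# Calendar features for the 30-day horizon described above.
future = pd.DataFrame({"date": pd.date_range("2019-08-01", periods=30, freq="D")})
future["day_of_month"] = future["date"].dt.day
future["quarter"] = future["date"].dt.quarter

# Promotion features would be joined in from internal company data here.
# The 128 autoencoded columns are the block with no future input to compress,
# which is exactly the doubt raised above.
X_future = future[["day_of_month", "quarter"]]
print(X_future.shape)  # (30, 2)
```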

I hope I'm making myself clear.

cerlymarco commented 5 years ago

If you have n initial features and the autoencoder produces m features (with n < m), it is logically not a compression.

I'm thinking about your initial features... As you anticipated ('I can generate the time-related features I want, for example day of month and quarter', and then 'I will add my product promotion features'), I believe you are using a lot of sparse values (one-hot encoded values); this may not be beneficial for a NN.
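To illustrate the concern, a small sketch of how one-hot encoding a single calendar feature inflates the input into mostly-zero columns:

```python
import pandas as pd

dates = pd.date_range("2019-08-01", periods=30, freq="D")
day = pd.Series(dates.day, name="day_of_month")

# One column per distinct day: each row contains a single 1 and 29 zeros.
one_hot = pd.get_dummies(day, prefix="dom")
print(one_hot.shape)  # (30, 30), roughly 97% zeros
```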