4 questions regarding the structure of LSTM_autoencoder

hellojinwoo commented 5 years ago

Hello, Mr. Ranjan. Thanks for your great article, LSTM Autoencoder for Extreme Rare Event Classification in Keras, and the code on GitHub. While reading your code, however, I came up with four questions.

I decided to ask them here rather than on Medium because I can upload pictures and quote code more accurately here. I hope you are okay with this.

Q1. Why ‘return_sequences=True’ for all the LSTM layers?

Back up explanations

I think an LSTM autoencoder is very similar to a seq2seq model: in an autoencoder, the input data is squeezed into a single latent vector that is shorter than the original input. In a seq2seq model, the same thing happens with an input sequence and a fixed-length vector.

< Figure 1. seq2seq model : Encoding - Decoding model >

In the encoding stage, the model needs to produce a fixed-length vector (a latent vector) that contains all the information and time-wise relationships of the input sequence. In the decoding stage, the model's goal is to create an output that is as close as possible to the original input.
So my guess is that in the encoding stage we do not need the per-timestep outputs shown in Figure 1, since the autoencoder's only goal there is to build a good latent vector. The smaller the MSE between the input data and the output reconstructed from the latent vector in the decoding stage, the better the latent vector is.
Doesn’t that mean we can set ‘return_sequences=False’, which does not emit the per-timestep outputs in the encoding stage?
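For reference, in Keras `return_sequences=True` makes a recurrent layer emit one output per timestep, while `return_sequences=False` emits only the output of the last timestep. A stacked recurrent layer needs a sequence as its input, which is why the earlier layers keep the flag on; the last encoder layer in the snippet quoted under Q2, `LSTM(1)`, uses the default `return_sequences=False`. The sketch below illustrates the two output shapes with a toy tanh recurrent layer written in plain Python. The function `toy_rnn`, its random weights, and the input values are assumptions made up for illustration; it is not an LSTM and not Keras code.

```python
import math
import random


def toy_rnn(inputs, units, return_sequences, seed=0):
    """Toy tanh recurrent layer (NOT an LSTM); only illustrates output shapes."""
    rng = random.Random(seed)
    n_in = len(inputs[0])
    # Hypothetical fixed random weights; a real layer would learn these.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(units)] for _ in range(n_in)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(units)] for _ in range(units)]
    h = [0.0] * units  # state starts at zero
    outputs = []
    for x in inputs:
        h = [
            math.tanh(
                sum(x[i] * w_in[i][j] for i in range(n_in))
                + sum(h[k] * w_rec[k][j] for k in range(units))
            )
            for j in range(units)
        ]
        outputs.append(h)
    # True: one output per timestep. False: only the final output.
    return outputs if return_sequences else h


timesteps, n_features = 5, 3  # made-up sizes
x = [[0.1 * (t + f) for f in range(n_features)] for t in range(timesteps)]

seq = toy_rnn(x, units=4, return_sequences=True)
last = toy_rnn(x, units=4, return_sequences=False)

print(len(seq), len(seq[0]))  # shape (timesteps, units)
print(len(last))              # shape (units,)
print(seq[-1] == last)        # the final step of the sequence is the single output
```

With the flag off, everything except the last row of `seq` is discarded, so a following recurrent layer would have no sequence left to read.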

Q2. What would be the first hidden state (h0, c0) for the decoding stage?

Back up explanations

According to the code, the hidden latent vector is repeated once per timestep, as in lstm_autoencoder.add(RepeatVector(timesteps)). This means that the latent vector is fed to the decoder as its input in the decoding stage. Below is the code snippet.

lstm_autoencoder = Sequential()

# Encoder
lstm_autoencoder.add(LSTM(timesteps, activation='relu', input_shape=(timesteps, n_features), return_sequences=True))
lstm_autoencoder.add(LSTM(16, activation='relu', return_sequences=True))
lstm_autoencoder.add(LSTM(1, activation='relu'))
lstm_autoencoder.add(RepeatVector(timesteps))

# Decoder
lstm_autoencoder.add(LSTM(timesteps, activation='relu', return_sequences=True))
lstm_autoencoder.add(LSTM(16, activation='relu', return_sequences=True))
lstm_autoencoder.add(TimeDistributed(Dense(n_features)))

If the latent vector is used as the input in the decoding stage, what is used as the initial hidden state (h0, c0)? In the seq2seq model (Figure 1) mentioned above, the latent vector is used as the initial hidden state (h0, c0) in the decoding stage. The input in the decoding stage would be a sentence that needs to be translated, for example from English to French.
So I am curious to know what is used as the initial hidden state and cell state (h0, c0) in your code!
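For context, a Keras LSTM layer that is not given an `initial_state` argument (and is not stateful) starts from all-zero h0 and c0. Under that default, the decoder in the quoted snippet would start from zeros and see the latent vector only through its inputs. The sketch below mimics this setup for one sample in plain Python; `repeat_vector`, `zero_state`, and the latent value 0.42 are hypothetical names and data chosen for illustration, and `timesteps = 5` is inferred from the 5-unit layer mentioned under Q3.

```python
def repeat_vector(vector, n):
    """Mimics RepeatVector(n) for one sample: n copies of the same vector."""
    return [list(vector) for _ in range(n)]


def zero_state(units):
    """All-zero (h0, c0) pair, the default when no initial state is supplied."""
    return [0.0] * units, [0.0] * units


timesteps = 5       # assumption: inferred from the thread, not confirmed
latent = [0.42]     # made-up output of the last encoder layer, LSTM(1)

# The decoder receives the same latent vector at every timestep...
decoder_inputs = repeat_vector(latent, timesteps)
# ...while the first decoder layer, LSTM(timesteps), starts from zeros.
h0, c0 = zero_state(units=timesteps)

print(decoder_inputs)
print(h0, c0)
```

This differs from the seq2seq setup described above, where the encoder state is handed to the decoder as (h0, c0) and the decoder inputs are a separate sequence.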

Q3. Why does the output unit size increase from 5 to 16 in the encoding stage?

Back up explanations

From lstm_autoencoder.summary() we can see that the number of output units increases from 5 (in the layer 'lstm_16') to 16 (in the layer 'lstm_17').

< Figure 2. summary of LSTM - Autoencoder model >

Since the output of the previous LSTM layer is the input of the next LSTM layer, I think the output size is equal to the hidden state size.
If the hidden layer's size is greater than the number of inputs, the model can simply learn an 'identity function', which is not desirable. (Source: [What is the intuition behind the sparsity parameter in sparse autoencoders?](https://stats.stackexchange.com/questions/149478/what-is-the-intuition-behind-the-sparsity-parameter-in-sparse-autoencoders))
Layer 'lstm_16' has only 5 units while the next layer 'lstm_17' has 16 units. So I think lstm_17 would just copy lstm_16 (acting like an 'identity matrix'), which makes the layer lstm_17 undesirable.
I am curious to know why the output size (hidden layer size) increases rather than decreases!

Q4. How much is the input data reduced in the latent vector?

Back up explanations

In a fully connected autoencoder, we can see how much the input vector is reduced. For example, in the picture below, a 10-node input vector is reduced to a 4-node latent vector in the middle.

In your code, how much is the 59-node input vector (one input at a given time step; it has 59 features and 1 answer label) reduced in the latent vector?
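One way to read the sizes directly off the snippet quoted under Q2 is to trace the per-sample output shape of each layer. The sketch below does this with small shape-only helpers; the helper functions are hypothetical stand-ins for the Keras layers, the batch dimension is omitted, and `timesteps = 5` and `n_features = 59` are assumptions taken from the numbers mentioned in this thread.

```python
def lstm(shape, units, return_sequences=False):
    """Output shape of an LSTM layer for one sample (batch dimension omitted)."""
    steps, _features = shape
    return (steps, units) if return_sequences else (units,)


def repeat_vector(shape, n):
    """Output shape of RepeatVector(n) applied to a single vector."""
    (units,) = shape
    return (n, units)


def time_distributed_dense(shape, units):
    """Output shape of TimeDistributed(Dense(units))."""
    steps, _features = shape
    return (steps, units)


timesteps, n_features = 5, 59  # assumptions inferred from the thread

layers = [
    ("LSTM(timesteps, return_sequences=True)", lambda s: lstm(s, timesteps, True)),
    ("LSTM(16, return_sequences=True)", lambda s: lstm(s, 16, True)),
    ("LSTM(1)", lambda s: lstm(s, 1)),
    ("RepeatVector(timesteps)", lambda s: repeat_vector(s, timesteps)),
    ("LSTM(timesteps, return_sequences=True)", lambda s: lstm(s, timesteps, True)),
    ("LSTM(16, return_sequences=True)", lambda s: lstm(s, 16, True)),
    ("TimeDistributed(Dense(n_features))", lambda s: time_distributed_dense(s, n_features)),
]

shape = (timesteps, n_features)
trace = [("input", shape)]
for name, layer in layers:
    shape = layer(shape)
    trace.append((name, shape))

for name, s in trace:
    print(f"{name:42s} -> {s}")

values_in = timesteps * n_features  # values in one input window
latent_shape = trace[3][1]          # output of the last encoder layer
print(values_in, latent_shape)
```

Under these assumptions, one window of 5 x 59 = 295 values passes through a latent vector of length 1 before being expanded back to 5 x 59.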

Thanks for this nice post again.

anooptoffy commented 5 years ago

Hello Ranjan, nice write-up. Thanks for the article. I was searching for an architecture for such a scenario. I do have a small question, if you could help me with it.

How did you set the threshold? I am confused about how you were able to select a threshold from the precision/recall plot. If I have a graph like the one shown below, what threshold would you suggest?

cran2367 commented 5 years ago

An ideal threshold is one where precision and recall are highest together, that is, the point where the two curves intersect. If it is hard to identify that point from this plot, I would look at the arrays of precision and recall values.
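Reading the crossover point from the arrays rather than from the plot could look like the sketch below. It is a minimal plain-Python illustration, not the code from the article: the labels and reconstruction errors are made-up toy data, the function names are hypothetical, and treating an empty set of positive predictions as precision 1.0 is a convention chosen here. Predictions use the same strict `e > threshold` rule as the `pred_y` line quoted later in this thread.

```python
def precision_recall(y_true, errors, threshold):
    """Precision and recall when flagging every error above the threshold."""
    pred = [1 if e > threshold else 0 for e in errors]
    tp = sum(1 for p, y in zip(pred, y_true) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, y_true) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, y_true) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention: no positives predicted
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def crossover_threshold(y_true, errors):
    """Candidate threshold where precision and recall are closest to each other."""
    def gap(t):
        p, r = precision_recall(y_true, errors, t)
        return abs(p - r)
    return min(sorted(set(errors)), key=gap)


# Made-up toy data: 1 marks a rare event, errors are reconstruction errors.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
errors = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

best = crossover_threshold(y_true, errors)
p, r = precision_recall(y_true, errors, best)
print(best, p, r)
```

On real data the two curves may cross more than once or only approximately, so the closest-gap point is a starting value to inspect rather than a final answer.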

hellojinwoo commented 5 years ago

Can you answer my other questions as well please...?

cran2367 commented 5 years ago

@hellojinwoo Yes, I am at the moment drafting a post that will answer your questions (at least some of them). Your questions are really good and require a detailed explanation. I also identified a few issues in my lstm network, that I will correct and mention. Please look for my message with the post. I will reply to you as soon as I post it (sometime before the end of this week).

cran2367 commented 5 years ago

@hellojinwoo Please look at this post, https://towardsdatascience.com/step-by-step-understanding-lstm-autoencoder-layers-ffab055b6352 It should answer your questions. I will be making some changes in the LSTM network I have in the LSTM Autoencoder for Extreme Rare Event Classification in Keras, e.g. the size of the first layer and update the post/github. Please let me know if you have questions.

sudhanvasbhat1993 commented 5 years ago

Hi @cran2367, this is such a good post. I just wanted to know whether you will update the LSTM structure soon.

cran2367 commented 5 years ago

Thank you, @sudhanvasbhat1993. I will be making the next post explaining how to optimize a Dense Autoencoder. Thereafter, I will be making a post on LSTM autoencoder tuning. But the next LSTM post may take a few weeks.

DebJRoy commented 5 years ago

Hi, LSTM Autoencoder for Extreme Rare Event Classification in Keras was a great article. I applied the same approach to a vehicle predictive maintenance data set. I have a couple of questions, if you could kindly answer them.

The following code in your script gives the predicted class for each row as 0 or 1 (correct me if I am wrong):

pred_y = [1 if e > threshold_fixed else 0 for e in error_df.Reconstruction_error.values]

In production, I apply the pre-processing steps to a test set without a y value: I temporalize and scale the data, call model.predict, and finally get pred_y. But when I try to attach pred_y to my original df as a predicted column, I get a length error, because the original df has 1020 rows and pred_y has length 1009.

Can you please tell me where I am going wrong and what can be done to resolve this issue?

Thanks a lot in advance.
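A likely cause of such a mismatch is the windowing step: turning rows into lookback windows yields fewer samples than rows, so there are fewer predictions than rows. The sketch below shows the effect and one way to pad the predictions back to the original length. It is a sketch under stated assumptions: `make_windows` and `align_predictions` are hypothetical helpers, not the temporalize function from the script, and the lookback of 12 is chosen only so the toy numbers match the 1020 and 1009 in the question. The actual function may drop a different number of rows, or drop them at the other end.

```python
def make_windows(rows, lookback):
    """Hypothetical sliding-window helper: one window per position where a full window fits."""
    return [rows[i:i + lookback] for i in range(len(rows) - lookback + 1)]


def align_predictions(n_rows, preds, lookback):
    """Pad predictions so prediction k lines up with the last row of window k."""
    aligned = [None] * (lookback - 1) + list(preds)
    if len(aligned) != n_rows:
        raise ValueError(f"expected {n_rows} values, got {len(aligned)}")
    return aligned


rows = list(range(1020))     # stand-in for the 1020 rows of the original df
lookback = 12                # assumption: picked only to reproduce 1009

windows = make_windows(rows, lookback)
preds = [0] * len(windows)   # stand-in for pred_y, one prediction per window

aligned = align_predictions(len(rows), preds, lookback)
print(len(windows))  # fewer windows than rows
print(len(aligned))  # same length as the original rows
```

The padded list has one entry per original row, with `None` for the leading rows that never complete a window, so it can be attached as a column once the offset has been checked against the real windowing function.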

cran2367 / lstm_autoencoder_classifier