TimyadNyda / Variational-Lstm-Autoencoder

Lstm variational auto-encoder for time series anomaly detection and features extraction

Loss based on MSE instead of log-likelihood #8

Open Railore opened 4 years ago

Railore commented 4 years ago

Hello, first of all, thank you for this repo, I learned a lot about how to build TF networks without Keras' premade layers. However, it seems to me that the reconstruction loss is computed with MSE instead of the log-likelihood used in the article "A Multimodal Anomaly Detector for Robot-Assisted Feeding Using an LSTM-based Variational Autoencoder". Is that intentional? If yes, then what is the meaning of the sigma in the network output?

TimyadNyda commented 4 years ago

Hello,

Yes, it is: log loss is used for classification problems, MSE for regression ones (and we are reconstructing time series here). However, both are related (proportional): if you reduce the MSE between your prediction and target, then you are implicitly reducing the distance between two probability distributions, the one your data are sampled from and the one estimated by your model (more on this here).
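For intuition, here is a tiny standalone NumPy sketch (not code from this repo) showing that, with a fixed unit variance, the Gaussian negative log-likelihood is just an affine function of the MSE, so both share the same minimiser:

```python
import numpy as np

rng = np.random.default_rng(0)
x_true = rng.normal(size=100)                        # targets
x_pred = x_true + rng.normal(scale=0.1, size=100)    # reconstructions

# Mean squared error between targets and reconstructions
mse = np.mean((x_true - x_pred) ** 2)

# Gaussian negative log-likelihood with a *fixed* unit variance:
# NLL = 0.5 * (x - mu)^2 + 0.5 * log(2 * pi)
nll = np.mean(0.5 * (x_true - x_pred) ** 2 + 0.5 * np.log(2 * np.pi))

# The NLL is 0.5 * MSE plus a constant, so minimising one minimises the other.
print(mse, 2 * (nll - 0.5 * np.log(2 * np.pi)))      # prints the same value twice
```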

Sigma is a parameter of the Gaussian you are sampling the latent variable z from, which takes mu and sigma as parameters (mean and std). Why are we estimating it? Because we are doing variational inference, by definition of a variational autoencoder. The lower bound of the likelihood to be maximised (equation 1 in the paper) uses the latent variable's parameters (so sigma and mu in the KL term) and a log-loss term (or MSE, which is proportional).
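To make the role of mu and sigma concrete, here is a small NumPy sketch of the reparameterization trick and the closed-form KL term of the lower bound (the numbers are made up, this is not the repo's code):

```python
import numpy as np

# Made-up encoder outputs for a batch of one (not the repo's actual tensors):
mu = np.array([[0.2, -0.5]])          # mean of q(z|x)
log_sigma = np.array([[-0.1, 0.3]])   # log std of q(z|x)
sigma = np.exp(log_sigma)

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
# so the sampling stays differentiable w.r.t. mu and sigma.
eps = np.random.default_rng(0).standard_normal(mu.shape)
z = mu + sigma * eps

# Closed-form KL(q(z|x) || N(0, I)): the part of the lower bound that uses mu and sigma.
kl = 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * log_sigma, axis=-1)
print(z, kl)
```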

Hope this helps.

Railore commented 4 years ago

Thank you for your answer, I really appreciate it, and the link was very interesting. Yet about sigma, I was not clear enough. In fact there are two sigmas in the article: one for the latent variable and one for the reconstruction. Quoting the article: "Then, the randomly sampled z from the posterior p(zt|xt) feeds into the decoder's LSTM. The final outputs are the reconstruction mean μxt and co-variance Σxt."

Then, even if log-loss is proportional to MSE, as stated in the link you gave me, the sigma of the Gaussian probability distribution has a scaling effect. In the article (equation 4), it is Σxt that is used.

In other words, according to what I understood: by outputting Σxt (and using the log-likelihood with Σxt), the article gives the neural network the opportunity to scale its reconstruction loss. This way, the network can tell how sure it is of the reconstruction mean.

Sorry if all this seems unclear, I am new to this kind of exercise.

TimyadNyda commented 4 years ago

Oh I see.

Your understanding of the output's sigma is right. Actually, instead of an MSE, they are directly estimating the log-likelihood of Xt based on the log-probability of a multivariate Gaussian with the output's parameters.

Instead of directly targeting Xt (as could be done with another loss for time series: MAE, MSE, etc.), their model guesses the parameters of the distribution Xt could be sampled from (and so its likelihood). And indeed, the bigger the sigma is, the less sure the algorithm is. Note that the log-likelihood also becomes low when the algorithm seems sure about the mean (low sigma) but the input is far from it (hello again, MSE).

P(X far from the mean | big scale) > P(X far from the mean | small scale), so a small scale with X far from the mean = possible anomaly.
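For the record, with a diagonal covariance that reconstruction term could look roughly like this (a standalone NumPy sketch, not the repo's or the paper's actual code):

```python
import numpy as np

def gaussian_nll(x, mu_x, log_sigma_x):
    """Negative log-likelihood of x under N(mu_x, diag(sigma_x^2)).

    The 0.5 * ((x - mu_x) / sigma_x)^2 term is the squared error rescaled by
    the predicted sigma; the log_sigma_x term keeps the model from claiming
    a huge uncertainty everywhere just to flatten that first term.
    """
    sigma_x = np.exp(log_sigma_x)
    return np.sum(
        0.5 * np.square((x - mu_x) / sigma_x)
        + log_sigma_x
        + 0.5 * np.log(2.0 * np.pi),
        axis=-1,
    )

x = np.array([1.0, 2.0])
mu_x = np.array([1.1, 1.8])
print(gaussian_nll(x, mu_x, np.array([-1.0, -1.0])))  # moderately confident -> low NLL
print(gaussian_nll(x, mu_x, np.array([-3.0, -3.0])))  # very confident, same error -> much higher NLL
```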

Railore commented 4 years ago

Then, would you be interested in a pull request that adds an argument to choose between MSE and log-likelihood? Purely as an illustration of the interface I have in mind (the argument name and the signature are invented here, not the repo's API):
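```python
import numpy as np

def reconstruction_loss(x, mu_x, log_sigma_x, loss="mse"):
    """Toy switch between the two reconstruction losses (names are invented)."""
    if loss == "mse":
        return np.mean(np.square(x - mu_x))
    if loss == "nll":
        sigma_x = np.exp(log_sigma_x)
        return np.mean(
            0.5 * np.square((x - mu_x) / sigma_x)
            + log_sigma_x
            + 0.5 * np.log(2.0 * np.pi)
        )
    raise ValueError(f"unknown loss: {loss}")
```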

TimyadNyda commented 4 years ago

Why not :)