ctallec / world-models

Reimplementation of World Models (Ha and Schmidhuber, 2018) in PyTorch
MIT License

The definition of the GMM linear layer may be wrong? Or have I missed something? #33

Open Reinelieben opened 4 years ago

Reinelieben commented 4 years ago

Hi ctallec,

In the file mdrnn.py, I noticed that the gmm_linear layer has too few output units. Why is the output size defined as (2 * latents + 1) * gaussians + 2? Shouldn't it be 3 * latents * gaussians + 2 (I have also seen this definition in other implementations of the MDN-RNN)? In your definition, you seem to share the pis across all Gaussian components, which is not valid under my understanding of a GMM. My understanding is that each element of the latent vector has its own GMM; that is, with 3 Gaussian components, for each z_i we have 3 mus, 3 sigmas and 3 pis. Or have I misunderstood GMMs?

Best,
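For reference, the two conventions differ only in the size of the final linear layer. A quick sketch of the arithmetic, assuming the names `latents` and `gaussians` follow mdrnn.py, with made-up values, and assuming the `+ 2` covers the reward and terminal predictions:

```python
# Illustrative sizes; `latents` and `gaussians` follow the names used in
# mdrnn.py, but the concrete values here are made up.
latents, gaussians = 32, 5

# This repository's gmm_linear: per-dimension mus and sigmas, a single
# set of mixture weights (pis) shared by all latent dimensions, plus
# 2 extra outputs (presumably reward and terminal):
shared_pi_size = (2 * latents + 1) * gaussians + 2

# Per-dimension mixtures: every z_i gets its own mus, sigmas AND pis:
per_dim_size = 3 * latents * gaussians + 2

print(shared_pi_size, per_dim_size)
```

Either value would then be passed as the output size of the linear layer, e.g. `nn.Linear(hiddens, shared_pi_size)` in PyTorch.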

Reinelieben commented 4 years ago

You could check this implementation for more information: https://github.com/zmonoid/WorldModels/blob/master/model.py (line 62).

AlexGonRo commented 4 years ago

I am a bit rusty with this library but, if I remember correctly, you are right.

This library uses an output in which each mixture component has its own μ and σ for every element of z_{t+1}, but the mixture weights π are shared across elements. I don't know about other implementations, but the original World Models uses the one you propose.

I don't think this approach is wrong, it is just a more restrictive one. If you use it with the proposed environments (CarRacing and ViZDoom: Take Cover), you won't be able to see the difference.

Reinelieben commented 4 years ago

I am a bit rusty with this library but, if I remember correctly, you are right.

This library uses an output such that each Gaussian mixture has a defined μ and σ for each element of the array z_{t+1}. I don't know about other implementations, but the original World Models uses the one you propose.

I don't think this approach is wrong, it is just a more restrictive one. If you use it with the proposed environments (carRacing and ViZDoom: Take Cover) you won't be able to see the difference.

Hi Alex,

Thank you so much for the very quick response! I found the issue while training the RNN: the GMM loss always stayed relatively high, around 0.95, but after I changed the definition and the loss function accordingly, the loss dropped to -0.005, which means the likelihood is near 1. Do you have any experience with what the loss value should be? And could you give me some suggestions for sampling the latent variable from the distribution according to the "temperature"?

Best regards

AlexGonRo commented 4 years ago

I found the issue while training the RNN: the GMM loss always stayed relatively high, around 0.95, but after I changed the definition and the loss function accordingly, the loss dropped to -0.005, which means the likelihood is near 1. Do you have any experience with what the loss value should be?

I have the feeling I'm missing too much information to give a reliable answer to your question (which environment you are using, the size of the latent space, the algorithm for the vision module, etc.). I'll try to give you some general advice:

And could you give me some suggestions for sampling the latent variable from the distribution according to the "temperature"?

Well, this is a completely different question. From my experience, there are a couple of things you need to know:

  1. Using τ<1 never helped. The controller just learnt to trick the simulation more consistently.

  2. Training offline (in a dream) was very, VERY tricky. Models trained with the same configuration and architecture could yield very different results, and this affected the temperature value too. From my experience, values in the range 1.1 < τ < 1.2 worked best.