Open HaopengZhang96 opened 5 years ago
I just found out that someone asked the same question earlier.
Yeah, this is why it's such a good technique. Unlike a GAN, it's not a minimax optimization.
Gradients are propagated directly through the loss function to the encoder network, and the two are optimized jointly.
The experimental setup is based on the idea that the mutual information between an image and a randomly selected image should be zero.
This forces the encoder to learn a latent space where encodings that share mutual information are close in distance, and those that don't are farther apart.
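A minimal numpy sketch of that setup, using the Jensen-Shannon-style MI lower bound that DIM-style training uses (function and variable names here are illustrative, not from this repo): matched (image, encoding) pairs should score high, pairs built with a randomly selected image should score low.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def jsd_mi_loss(pos_scores, neg_scores):
    """Jensen-Shannon-style MI lower-bound loss (DIM-style sketch).

    pos_scores: discriminator scores for matched (image, encoding) pairs.
    neg_scores: scores for pairs built with a randomly selected image.
    Minimizing this pushes matched scores up and mismatched scores down.
    """
    e_pos = -softplus(-pos_scores).mean()  # expectation over matched pairs
    e_neg = softplus(neg_scores).mean()    # expectation over random pairs
    return e_neg - e_pos

# Toy scores: matched pairs already score higher than mismatched pairs.
pos = np.array([2.0, 1.5, 3.0])
neg = np.array([-1.0, -2.0, -0.5])
loss = jsd_mi_loss(pos, neg)
```

Because the loss is a smooth function of the scores, its gradient flows straight back into the encoder that produced the encodings; encoder and score network are trained jointly, with no inner adversarial loop.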
Excuse me, I ran into a problem when using the mutual information loss: its value is negative at the beginning. Is this normal?
Hi @DuaneNielsen, have you checked @HaopengZhang96's question? Is it right that the prior distribution does not need an encoder loss?
> Excuse me, I ran into a problem when using the mutual information loss: its value is negative at the beginning. Is this normal?

Not normal. In my runs the mutual information loss is always positive.
> Hi @DuaneNielsen, have you checked @HaopengZhang96's question? Is it right that the prior distribution does not need an encoder loss?

I read the code for the original paper, and I think I am right: the encoder and discriminator losses should be separated, like in a GAN.
So, did you reimplement the code in this repo or use the official code? How did it work?
I followed DIM's work and applied it to user behavior modeling. In my experiments, the local mutual information performs well for sequence modeling when the downstream task is classification. Actually, the prior loss is not important if you only care about downstream task performance; to some extent it plays a normalizing role.
My paper is under submission, and I haven't cleaned up the relevant code yet.
Looking forward to your work!
Just to put a pin in this one. I think the answer is quite clear from the paper.
All three terms are added together. There is no double backward pass in InfoMax.
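Conceptually, the combined objective is just a weighted sum; the sketch below uses the alpha/beta/gamma naming from the DIM paper, but the default values are illustrative, not the paper's exact settings.

```python
# Sketch of the single combined objective: the global, local, and prior
# terms are summed with weights and then one backward pass updates the
# encoder. alpha/beta/gamma follow the DIM paper's naming; the defaults
# below are illustrative assumptions.
def total_loss(global_term, local_term, prior_term,
               alpha=0.5, beta=1.0, gamma=0.1):
    return alpha * global_term + beta * local_term + gamma * prior_term

loss = total_loss(0.8, 1.2, 0.3)  # in PyTorch, follow with a single loss.backward()
```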
As to the loss becoming negative: this can happen because the PyTorch f-divergence estimates can return "probabilities" greater than 1.0, due to the way f-divergences are calculated in practice from densities rather than probabilities. See this explanation https://github.com/pytorch/pytorch/issues/7637 for how log_prob can return a value greater than one.
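A quick sanity check of that point in plain Python: a probability *density* can exceed 1, so its log can be positive, which in turn can drive a loss built from such terms negative. (The Gaussian here is just an example distribution.)

```python
import math

def normal_pdf(x, mu, sigma):
    # A probability *density*, not a probability: it can exceed 1
    # when the distribution is narrow (small sigma).
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

density = normal_pdf(0.0, 0.0, 0.1)  # roughly 3.99, greater than 1
log_prob = math.log(density)         # positive, so terms built from it
                                     # can push the total loss negative
```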
The following code: `-(term_a + term_b)` is the loss of the discriminator, and `term_b` is the loss of the encoder (similar to the generator in a GAN).
In the code you only call backward on the discriminator's loss (the prior-distribution part); there is no backward pass for the loss that belongs to the encoder in the prior distribution.
I think it could be the following process.
Is my understanding wrong?
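For reference, the GAN-style split described above could be sketched like this. The `term_a`/`term_b` names are taken from the comment; treating `term_a` as the score on prior samples and `term_b` as the score on encoder outputs is an assumption, not confirmed by the repo.

```python
# Hypothetical sketch of the GAN-style split of the prior-matching loss.
def discriminator_loss(term_a, term_b):
    # The discriminator maximizes term_a + term_b, i.e. minimizes the negation.
    return -(term_a + term_b)

def encoder_loss(term_b):
    # The encoder, like a GAN generator, would minimize only term_b,
    # in a separate backward pass.
    return term_b
```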