DuaneNielsen / DeepInfomaxPytorch

Learning deep representations by mutual information estimation and maximization
https://arxiv.org/abs/1808.06670

Why does prior distribution have no encoder loss? #6

Open HaopengZhang96 opened 5 years ago

HaopengZhang96 commented 5 years ago

The following code:

term_a = torch.log(self.prior_d(prior)).mean()
term_b = torch.log(1.0 - self.prior_d(y)).mean()
PRIOR = - (term_a + term_b) * self.gamma

"-(term_a + term_b)" is the Discriminator's loss, and "term_b" is the encoder's loss (analogous to the generator loss in a GAN).

In the code you only backpropagate the Discriminator's loss (the prior part); there is no backward pass for the loss that belongs to the encoder with respect to the prior distribution.

loss.backward()  # loss = global + local + prior, prior = -(term_a + term_b)
optim.step()
loss_optim.step()
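For context, here is a minimal runnable sketch of what that single backward pass does. The networks and shapes are stand-ins (not the repo's real modules): because `y` is not detached, the prior term sends gradients into the encoder in the same pass.

```python
import torch
import torch.nn as nn

# Stand-in networks with made-up shapes, not the repo's real modules.
# Because y is NOT detached, the prior term also back-propagates into
# the encoder in the single backward pass.
torch.manual_seed(0)
encoder = nn.Linear(8, 4)                                # stand-in encoder
prior_d = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())   # prior discriminator

x = torch.randn(16, 8)
y = torch.sigmoid(encoder(x))    # encoded batch, squashed into (0, 1)
prior = torch.rand(16, 4)        # samples from a uniform prior

term_a = torch.log(prior_d(prior)).mean()
term_b = torch.log(1.0 - prior_d(y)).mean()
PRIOR = -(term_a + term_b)
PRIOR.backward()

# The encoder received gradients through term_b in the same pass:
print(encoder.weight.grad is not None)  # True
```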

I think it could instead be the following process:

term_a = torch.log(self.prior_d(prior)).mean()
term_b = torch.log(1.0 - self.prior_d(y.detach())).mean()  # y should be detached
PRIOR = - (term_a + term_b) * self.gamma
encoder_loss_for_p = term_b
.............

loss.backward()  # loss = global + local + prior, prior = -(term_a + term_b)
optim.step()     # update with the gradients from global + local, but not prior
loss_optim.step()

encoder_loss_for_p.backward()  # optimize the encoder adversarially
optim.step()
Is my understanding wrong?
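The detach-based split proposed above can be checked in isolation. A hedged sketch with stand-in networks (hypothetical shapes, not the repo's code): `detach()` blocks the prior gradient from reaching the encoder, which is why a separate non-detached term would then be needed to train the encoder.

```python
import torch
import torch.nn as nn

# Stand-in networks with made-up shapes, not the repo's real modules.
torch.manual_seed(0)
enc = nn.Linear(4, 2)
disc = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())

y = torch.sigmoid(enc(torch.randn(8, 4)))

# Discriminator phase: y is detached, so the encoder gets no gradient here.
d_loss = -torch.log(1.0 - disc(y.detach())).mean()
d_loss.backward()
grad_after_d_step = enc.weight.grad
print(grad_after_d_step)  # None -- gradient blocked by detach()

# Encoder (generator-style) phase: use the non-detached y.
e_loss = torch.log(1.0 - disc(y)).mean()  # term_b with gradients enabled
e_loss.backward()
print(enc.weight.grad is not None)  # True
```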

HaopengZhang96 commented 5 years ago

I just found out that someone asked the same question earlier.

DuaneNielsen commented 4 years ago

Yeah, this is why it's such a good technique. Unlike a GAN, it's not a minimax optimization.

Gradients are propagated directly through the loss function to the encoder network, and all the networks are optimized jointly.

The experimental setup is based on the idea that the mutual information between an image and a randomly selected image should be zero.

This forces the encoder to learn a latent space where encodings that share mutual information are close in distance, while those that don't are farther apart.

tianlili1 commented 4 years ago

Excuse me, I ran into a problem when using mutual information: its loss value is negative at the beginning. Is this normal?

SchafferZhang commented 4 years ago

Hi @DuaneNielsen, have you checked @HaopengZhang96's question? Is it right that the prior distribution does not need an encoder loss?

HaopengZhang96 commented 4 years ago

> Excuse me, I met a problem when I use the mutual information, its loss value is negative at the beginning. Is this normal?

Not normal. In my experience, the mutual information loss is always positive.

HaopengZhang96 commented 4 years ago

> Hi, @DuaneNielsen, have you ever check @HaopengZhang96's questions? Is it right that prior distribution does not need encoder loss?

I read the code for the original paper, and I think I am right. The encoder and discriminator losses should be separated, as in a GAN.

SchafferZhang commented 4 years ago

> Hi, @DuaneNielsen, have you ever check @HaopengZhang96's questions? Is it right that prior distribution does not need encoder loss?
>
> I read the code for the original paper, and I think I am right. The encoder and discriminator losses should be separated, as in a GAN.

So, did you reimplement the code in this repo, or did you use the official code? How does it work?

HaopengZhang96 commented 4 years ago

> Hi, @DuaneNielsen, have you ever check @HaopengZhang96's questions? Is it right that prior distribution does not need encoder loss?
>
> I read the code for the original paper, and I think I am right. The encoder and discriminator losses should be separated, as in a GAN.
>
> So, did you reimplement the code in this repo, or did you use the official code? How does it work?

I followed up on DIM's work and did some work on user behavior modeling. In my experiments, local mutual information performs well for sequence modeling when the downstream task is classification. Actually, the prior loss is not important if you only focus on downstream task performance; it acts as a form of regularization to some extent.

My paper is being submitted, and I haven't sorted out the relevant code yet.

SchafferZhang commented 4 years ago

> Hi, @DuaneNielsen, have you ever check @HaopengZhang96's questions? Is it right that prior distribution does not need encoder loss?
>
> I read the code for the original paper, and I think I am right. The encoder and discriminator losses should be separated, as in a GAN.
>
> So, did you reimplement the code in this repo, or did you use the official code? How does it work?
>
> I followed up on DIM's work and did some work on user behavior modeling. In my experiments, local mutual information performs well for sequence modeling when the downstream task is classification. Actually, the prior loss is not important if you only focus on downstream task performance; it acts as a form of regularization to some extent.
>
> My paper is being submitted, and I haven't sorted out the relevant code yet.

Looking forward to your work!

DuaneNielsen commented 4 years ago

Just to put a pin in this one. I think the answer is quite clear from the paper.

[image: the combined three-term objective from the paper]

All three terms are added. There is no double backward pass in Infomax.
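In other words, the objective is one weighted sum, optimized with a single backward pass. A schematic sketch (the coefficient values here are placeholders, not necessarily the paper's settings):

```python
# Schematic: the DIM objective is one weighted sum of three terms.
# alpha/beta/gamma values below are placeholders for illustration.
alpha, beta, gamma = 0.5, 1.0, 0.1

def total_loss(global_term, local_term, prior_term):
    # One scalar loss; a single .backward() on this sends gradients to the
    # encoder and all discriminators jointly -- no alternating minimax.
    return alpha * global_term + beta * local_term + gamma * prior_term

print(round(total_loss(1.0, 2.0, 3.0), 6))  # 0.5*1 + 1.0*2 + 0.1*3 = 2.8
```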

As to the loss becoming negative: this can happen because the PyTorch distributions used in the f-divergence estimators can return "probabilities" (densities) greater than 1.0, due to the way f-divergences are calculated in practice. See this explanation https://github.com/pytorch/pytorch/issues/7637 for how log_prob can return a positive value.
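The point about "probabilities" greater than 1.0 is just that continuous densities can exceed 1, so their log can be positive. A quick self-contained check (the sigma value is chosen for illustration):

```python
import math

# A probability *density* can exceed 1, so its log ("log_prob") can be
# positive. Example: a Normal with small sigma, evaluated at its mode,
# where the density is 1 / (sigma * sqrt(2*pi)).
sigma = 0.1
density_at_mode = 1.0 / (sigma * math.sqrt(2.0 * math.pi))

print(density_at_mode > 1.0)            # True (about 3.989)
print(math.log(density_at_mode) > 0.0)  # True -- a positive log_prob
```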