2017-fall-DL-training-program / VAE-GAN-and-VAE-GAN

An assignment to learn how to implement three different kinds of generative models

Basic information of Lab3-3 #23


yyrkoon27 commented 6 years ago

Dear TA,

I wonder if you could shed some light on some basic information about Lab3-3.

  1. Number of parameters. I use the following code to count parameters:

    param_count = 0
    for param in netD.parameters():
        param_count += param.numel()   # numel() = number of elements in the tensor
    print('NetD Parameter count = {}'.format(param_count))

    For all three nets, I have:

    NetEnc Parameter count = 76542656
    NetDec Parameter count = 36165984
    NetD Parameter count = 19341665
  2. Order of loss calculation and network update. My code for each epoch looks like:

    # Discriminator
    netD.zero_grad()
    D_real = ...    # D loss for the real photo
    D_real.backward()
    D_fake = ...    # D loss for the noise-generated photo
    D_fake.backward()
    D_vae = ...     # D loss for the VAE-reconstructed photo
    D_vae.backward()
    optimizerD.step()
    # Encoder
    netEnc.zero_grad()
    Enc_KLD = ...   # KLD of mu and logvar, multiplied by beta
    Enc_KLD.backward()
    Enc_MSE = ...   # MSE between features of VAE-reconstructed and real photos
    Enc_MSE.backward()
    optimizerEnc.step()
    # Decoder
    netDec.zero_grad()
    Dec_MSE = ...   # MSE between features of VAE-reconstructed and real photos, multiplied by gamma
    Dec_MSE.backward()
    Dec_fake = ...  # G loss for the noise-generated photo
    Dec_fake.backward()
    Dec_vae = ...   # G loss for the VAE-reconstructed photo
    Dec_vae.backward()
    optimizerDec.step()

    Does it make sense?

  3. Sample images (maybe from epochs 1, 10, 20, 30, 40, 50) to let us know what they should look like if everything is correct.

Thank you :-)

a514514772 commented 6 years ago

I think I know why you can't train it successfully. I will provide a solution later.

a514514772 commented 6 years ago

Hi everyone,

This is going to be a bit long, so I'll skip writing it in English.

First, the biggest difference is probably in how the KL divergence is computed. Below are templates for two versions; their losses differ by a factor of 2048.

If you use the first KL version, then you need to strengthen the rec loss on the encoder side instead:

    prior_loss = 1 + logvar - mean.pow(2) - logvar.exp()
    prior_loss = (-0.5 * torch.sum(prior_loss, 1)).mean()

[screenshot: template for the first KL version]

With the second KL version, just keep everything the same as in the lecture notes:

    prior_loss = 1 + logvar - mean.pow(2) - logvar.exp()
    prior_loss = (-0.5 * torch.sum(prior_loss)) / torch.numel(mean)

[screenshot: template for the second KL version]
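
To see where the factor of 2048 comes from (the latent dimension is assumed to be 2048, inferred from that gap): the first version sums the KL over the latent dimension and then averages over the batch, while the second averages over every element, so the two differ by exactly the latent dimension:

    import torch

    batch, latent_dim = 64, 2048   # latent_dim inferred from the 2048x gap
    mean = torch.randn(batch, latent_dim)
    logvar = torch.randn(batch, latent_dim)

    kl = 1 + logvar - mean.pow(2) - logvar.exp()
    v1 = (-0.5 * torch.sum(kl, 1)).mean()             # version 1: per-sample sum
    v2 = (-0.5 * torch.sum(kl)) / torch.numel(mean)   # version 2: per-element mean
    print((v1 / v2).item())                           # ~= latent_dim == 2048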

Since many of you are not sure about the update order, or when the losses should be recomputed, I'm giving a template directly here, again just for reference; you can train it with other, equivalent approaches.
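
In outline, one equivalent iteration could look like the sketch below. The module and loss names follow this thread, while netD.feature (the layer assumed to return the features used for the reconstruction loss), the label shapes, and the beta/gamma weights are placeholders for illustration, not the exact assignment template:

    import torch
    import torch.nn as nn

    bce = nn.BCELoss()
    mse = nn.MSELoss()

    def train_step(x, netEnc, netDec, netD, optEnc, optDec, optD,
                   beta=1.0, gamma=1.0):          # beta/gamma: placeholder weights
        ones = torch.ones(x.size(0), 1)           # assumes netD outputs (batch, 1)
        zeros = torch.zeros(x.size(0), 1)

        # VAE forward pass with the reparameterization trick
        mean, logvar = netEnc(x)
        z = mean + (0.5 * logvar).exp() * torch.randn_like(mean)
        x_rec = netDec(z)                         # VAE reconstruction
        x_gen = netDec(torch.randn_like(mean))    # sample decoded from the prior

        # 1) Discriminator: real vs. reconstructed vs. generated
        netD.zero_grad()
        errD = (bce(netD(x), ones)
                + bce(netD(x_rec.detach()), zeros)
                + bce(netD(x_gen.detach()), zeros))
        errD.backward()
        optD.step()

        # 2) Encoder: KLD (second version above) + feature reconstruction loss
        netEnc.zero_grad()
        kl = 1 + logvar - mean.pow(2) - logvar.exp()
        prior_loss = (-0.5 * torch.sum(kl)) / torch.numel(mean)
        rec_loss = mse(netD.feature(x_rec), netD.feature(x).detach())
        err_enc = beta * prior_loss + rec_loss    # where beta goes is debated later in this thread
        err_enc.backward()
        optEnc.step()

        # 3) Decoder: decode again so the losses use the *updated* D's features
        netDec.zero_grad()
        x_rec = netDec(z.detach())
        x_gen = netDec(torch.randn_like(mean))
        rec_loss = mse(netD.feature(x_rec), netD.feature(x).detach())
        errDec = (gamma * rec_loss
                  + bce(netD(x_rec), ones)
                  + bce(netD(x_gen), ones))
        errDec.backward()
        optDec.step()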

Result after one epoch (rec): [image: rec_epoch_0.png]

Result after two epochs (rec): [image: rec_epoch_1.png]

Total number of parameters: about 132051844

Thanks to @yyrkoon27 for catching the bugs in my original code. Now it seems only a few epochs are needed to see quite decent faces.

yyrkoon27 commented 6 years ago

Dear TA,

I really appreciate your help.

Two places in your sample code are still not clear to me.

  1. rec_loss is calculated between real features and reconstructed features. In your sample code, the features come not from the updated D but from the old D.
  2. err_enc: beta should multiply prior_loss, not rec_loss.

Thank you for your help again.

a514514772 commented 6 years ago

Hi @yyrkoon27 ,

Thank you for your reminder.

It seems that we have two versions of the template now.

yyrkoon27 commented 6 years ago

Dear TA,

A few further basic questions.

  1. KLD normalization: The Lab3-1 sample code wants the KLD to balance with the pixel MSE, so it normalizes the KLD by batchsize * 28 * 28.

As for Lab3-3, if we apply the same reasoning, the KLD should balance with the feature MSE, so we should normalize the KLD by batchsize * 1024.

However, we are simply not sure: which KLD normalization corresponds to beta = 1 in the original VAE setup?

  2. VAEGAN training, debugging, and profiling: I can imagine there are many ways to train a successful VAEGAN, but there are even more ways to fail. Besides learning directly from your sample code, do you have any advice on how I can diagnose my own VAEGAN? When I get 64 noise-filled images after one epoch, all I can do is try different combinations of network tweaks without any feel for the big picture.

For example: (2-a) Do we need batch normalization in the discriminator? Or any kind of weight normalization? Or any nonlinear activation after the convolution layers? (2-b) Even the original VAEGAN author's sample code applies training tricks that pause the D/G updates, and I can see different tricks in other people's code, too. On the other hand, your code needs no special training schedule, which is good news. Is it possible to understand why they need such tricks while you don't?

Thank you very much!

a514514772 commented 6 years ago

Hi @yyrkoon27 ,

I'm really sorry for leaving you guys muddled.

(1) beta = 1 means it's a normal VAE, which is derived from the math and regularizes each dimension of the latent code to be an "independent" and "standard" Gaussian. So, strictly according to the math, there shouldn't be a beta at all.
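
Concretely, for a diagonal Gaussian posterior q(z|x) = N(mu, sigma^2) against the standard normal prior p(z) = N(0, I), the per-sample KL term that the snippets above implement is

    KL(q || p) = -0.5 * sum_j (1 + log(sigma_j^2) - mu_j^2 - sigma_j^2)

with logvar = log(sigma^2). The sum runs over the latent dimensions j; how you then reduce over the batch (sum per sample vs. mean per element) is exactly where the two KL versions differ.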

(2-a) You do not need any special weight initialization; PyTorch does that for you unless you want something custom. Empirically, I found that the more powerful the discriminator is, the easier the training becomes. I've been trying new settings for several days, and I hope this one helps you train it better: keep the other settings the same as described before and replace your discriminator with this one.

[screenshot: the new discriminator architecture]
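
As a rough reference only, a DCGAN-style discriminator in this spirit, with batch norm and a ReLU-family activation after each convolution (which a later comment calls the game changers), could look like the sketch below; the channel widths and the 64x64 RGB input are assumptions, not the exact architecture in the screenshot:

    import torch.nn as nn

    # Illustrative DCGAN-style discriminator for 64x64 RGB inputs.
    class Discriminator(nn.Module):
        def __init__(self, nc=3, ndf=64):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(nc, ndf, 4, 2, 1),                                    # 64 -> 32
                nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(ndf, ndf * 2, 4, 2, 1), nn.BatchNorm2d(ndf * 2),      # 32 -> 16
                nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1), nn.BatchNorm2d(ndf * 4),  # 16 -> 8
                nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1), nn.BatchNorm2d(ndf * 8),  # 8 -> 4
                nn.LeakyReLU(0.2, inplace=True),
            )
            self.classifier = nn.Sequential(
                nn.Conv2d(ndf * 8, 1, 4, 1, 0), nn.Sigmoid()                    # 4 -> 1
            )

        def forward(self, x):
            return self.classifier(self.features(x)).view(-1, 1)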

(2-b) It's hard to say whether those tricks are necessary. As you said, different people train their models with different tricks. Sometimes the tricks are essential; without them, the model simply can't be trained successfully. Other times, the tricks just make training a bit better. So it's hard to say in general why they are needed.

Thanks

yyrkoon27 commented 6 years ago

Dear TA,

Personally, I think BN and ReLU are the game changers. So it would be really cool if someone could train the VAEGAN (i.e., survive more than 50 epochs or so) without them in the discriminator.

As for KLD normalization: (a) in the Lab3-1 sample code, the KLD is normalized by batchsize * 28 * 28 (to balance with the pixel MSE); (b) in the Lab3-3 sample code (in this discussion thread), the KLD is normalized by batchsize in the 1st version and by batchsize * 2048 in the 2nd version.

My question: what exactly should we normalize by to match the original VAE paper (Kingma)?

Thank you very much!

a514514772 commented 6 years ago

Hi @yyrkoon27 ,

I think it's the second one: normalizing over each element matches the original paper.

Thanks

jessejchuang commented 6 years ago

Hi TA,

Besides the KL normalization coefficient, I noticed that your L2-norm equation is not the same as what most students use. We simply use the PyTorch function, so our loss seems to differ from yours by a factor of 1024.

[screenshot: L2-norm equation]
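
In code, the gap looks like this (the 1024-d feature dimension is an assumption inferred from the factor): nn.MSELoss averages over every element, so mimicking a per-sample sum over the features means scaling by 1024.

    import torch
    import torch.nn as nn

    feat_real = torch.randn(64, 1024)   # 64 samples, 1024-d features (assumed)
    feat_fake = torch.randn(64, 1024)

    mse = nn.MSELoss()                             # mean over all 64 * 1024 elements
    rec_loss = mse(feat_fake, feat_real) * 1024    # mimics a per-sample sum over features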

I tried multiplying by 1024 after nn.MSELoss to mimic your equation. The L2 norm converges in the 1st epoch and the reconstructed images look OK. However, the L2 norm becomes unstable and diverges from the 2nd epoch on.

a514514772 commented 6 years ago

Hi,

That's normal when you're dealing with a GAN, unless the images stop improving for several epochs. Check the images by eye; the rec loss is just for reference.

jessejchuang commented 6 years ago

Hi TA,

I'll give the reconstruction images a try with more epochs.

BTW, the generated images are not good and seem to show mode collapse (64 different noise vectors, but the 64 generated images are almost the same). I guess the current gamma parameter favors reconstruction over generation from noise. There is a similar discussion below; is that correct?

https://github.com/JeremyCCHsu/tf-vaegan#discussion "The gamma parameter in Eq. (9) is a trade-off between style and content as mentioned in the paper. In my experiment, if gamma is set too small (such as 1e-5), the content could be lost, thus unable to reconstruct the input. However, random samples directly generated from the latent space were realistic in this case. Setting gamma to a larger value (say, 0.1), we ended up with a good reconstruction, but the random samples could be less reasonable to the eyes."

a514514772 commented 6 years ago

Hi @jessejchuang ,

Yes, exactly.

You can try gamma * rec_loss + KLD to see the difference.

jessejchuang commented 6 years ago

Hi TA,

Many thanks for the feedback. I estimate I've spent several thousand NTD on M$ Azure in the past week, so I'm going to stop Lab3 here and get back to the other work from my boss.

BTW, your sample code is good, just a little different from ours. Everyone has their own coding style, which is fine, but I would like to ask whether the following coding ideas are OK.

[1] Saving an unnecessary backward pass when training the discriminator.

Do you think adding detach() is a better way? [screenshot: discriminator training with detach()]
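
For example, something like this sketch (the names and label shapes are illustrative):

    import torch
    import torch.nn as nn

    bce = nn.BCELoss()

    # detach() cuts the graph at the generator's output, so the discriminator's
    # backward pass skips computing (unused) gradients for the decoder.
    def update_D(x_real, x_fake, netD, optD):
        netD.zero_grad()
        ones = torch.ones(x_real.size(0), 1)
        zeros = torch.zeros(x_fake.size(0), 1)
        errD = bce(netD(x_real), ones) + bce(netD(x_fake.detach()), zeros)
        errD.backward()    # no backward pass through the decoder
        optD.step()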

[2] When training the decoder, can errD_real be removed?

[screenshot: decoder training losses]

[3] Is it better for the discriminator to return both the sigmoid output and the L2-norm features together, instead of forward-passing the same network twice?

[screenshot: current two-pass version]

Return the sigmoid and the L2-norm features together; remove the similarity function.

[screenshot: single-pass version]
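
For example, a sketch of that interface (layer sizes are assumptions):

    import torch
    import torch.nn as nn

    # One forward pass returns both the sigmoid score (for the GAN loss) and
    # the flattened features (for the L2 reconstruction loss).
    class D(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),    # 64x64 -> 32x32
                nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),  # 32x32 -> 16x16
            )
            self.fc = nn.Linear(128 * 16 * 16, 1)

        def forward(self, x):
            f = self.features(x).flatten(1)
            return torch.sigmoid(self.fc(f)), f   # (score, features) in one pass

    score, feat = D()(torch.randn(8, 3, 64, 64))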

a514514772 commented 6 years ago

Hi @jessejchuang ,

In terms of coding and experience, I have a lot to learn from you.

For [1] and [3], yes.

For [2], you can do that, too. The reason I keep errD_real is to watch the min-max loss of the GAN: the sum of the discriminator's and the generator's losses should be zero, so I want to check what happens to the loss after the discriminator is updated.

Thanks

jessejchuang commented 6 years ago

Hi TA,

Thanks for the prompt reply. I understand your original design idea now.