Closed: alfjesus3 closed this issue 3 years ago
Current preliminary results using the disentanglement metric proposed by Kim et al. (2018). The accuracy is around 0.61 after 66000 iterations.
=> loaded checkpoint 'checkpoints/tmp/33000 (iter 33000)'
66000it [00:20, 1647.08it/s]
The factors are <class 'torch.Tensor'> torch.Size([737280, 5]) with classes 5
66000it [00:40, 1647.08it/s]
The empirical mean for kl dimensions-wise:
[[ 0.16251403]
[ 0.04704665]
[-0.13326351]
[-0.49411842]
[-0.49725255]
[ 0.11181379]
[ 0.09036198]
[-0.4975004 ]
[-0.49553087]
[-0.4970772 ]]
Useful dimensions: [0 1 5 6] - Total: 4
Empirical Scales: [[[1.1460223]]
[[1.0301305]]
[[1.1061238]]
[[1.0858247]]]
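For reference, the dimension selection above can be sketched as follows. This is a hedged reconstruction, not the repository's actual code: it assumes that dimensions whose per-dimension statistic sits near the collapsed value (~-0.497 in the printout) are discarded, and that a threshold of 0 separates the two groups.

```python
import numpy as np

# Per-dimension values printed above; the threshold of 0 is an assumption,
# chosen because collapsed dimensions cluster around -0.497.
kl_dims = np.array([0.16251403, 0.04704665, -0.13326351, -0.49411842,
                    -0.49725255, 0.11181379, 0.09036198, -0.4975004,
                    -0.49553087, -0.4970772])

useful = np.where(kl_dims > 0)[0]  # hypothetical selection rule
print("Useful dimensions:", useful, "- Total:", len(useful))
```

With the values above this reproduces `Useful dimensions: [0 1 5 6] - Total: 4` from the log.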
Votes:
[[ 20. 20. 0. 0. 160.]
[ 1. 40. 65. 0. 0.]
[100. 40. 95. 0. 0.]
[ 39. 60. 0. 160. 0.]]
The accuracy is 0.60625
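The reported accuracy follows directly from the vote matrix: under the majority-vote classifier of Kim et al. (2018), each latent dimension predicts its most-voted factor, so the correctly classified votes are the row-wise maxima. A minimal sketch using the matrix above:

```python
import numpy as np

# Rows: the 4 useful latent dimensions; columns: the 5 ground-truth factors.
votes = np.array([[ 20.,  20.,  0.,   0., 160.],
                  [  1.,  40., 65.,   0.,   0.],
                  [100.,  40., 95.,   0.,   0.],
                  [ 39.,  60.,  0., 160.,   0.]])

# Majority-vote classifier: each dimension is assigned its most-voted factor,
# so accuracy is the sum of row-wise maxima over the total vote count.
accuracy = votes.max(axis=1).sum() / votes.sum()
print(accuracy)  # -> 0.60625
```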
[Update] Training curves of the total correlation and the reconstruction loss over the first 50000 iterations. It may be better to average over 100-iteration chunks to get less spiky curves.
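The chunked averaging suggested above could look like the following sketch; the chunk size and the synthetic loss curve are placeholders, not the repository's actual logging code.

```python
import numpy as np

def chunk_average(values, chunk=100):
    """Average a 1-D curve over non-overlapping chunks to smooth spikes."""
    values = np.asarray(values, dtype=float)
    n = len(values) // chunk * chunk          # drop the trailing partial chunk
    return values[:n].reshape(-1, chunk).mean(axis=1)

# e.g. smoothing a noisy per-iteration reconstruction loss
noisy = np.random.randn(50000) + 10.0
smooth = chunk_average(noisy, chunk=100)      # 500 points instead of 50000
```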
The current experimental setup is:
Update on the disentanglement metric's spiking behaviour: the vanilla FactorVAE is more stable when plotting the disentanglement metric, so the issue likely lies in how the attention disentanglement loss (L_AD) is computed.
There was an error in the computation of the disentanglement metric, which has been fixed.
The current experimental setup is:
The following preliminary results show a faster decrease in the reconstruction loss when using the attention disentanglement loss.