drboog / Lafite

Code for paper LAFITE: Towards Language-Free Training for Text-to-Image Generation (CVPR 2022)
MIT License
180 stars 25 forks

Reproduce the experimental results of the paper #5

Closed Cwj1212 closed 2 years ago

Cwj1212 commented 2 years ago

First of all, thank you very much for your paper; it has been a huge help to me. The project you uploaded has also greatly helped my research. I want to ask you a few questions.

1. Are the results shown in the paper based on "sim = torch.exp(sim/temp), itd=10, itc=20"? And what is the result with "sim = sim/temp, itd=5, itc=10"? Under the "sim = sim/temp" setting, is "itd=5, itc=10" optimal?

2. I am using 4 Nvidia 1080s for training, and it takes me 15 days to run a 25000 kimg experiment. I would like to know what hardware you used and how long one training run takes.

drboog commented 2 years ago
  1. The results are mostly based on "sim = torch.exp(sim/temp), itd=5, itc=10" (but I remember I used a much smaller value for Multi-modal CelebA-HQ). I don't think "sim = sim/temp, itd=5, itc=10" or "sim = sim/temp, itd=10, itc=20" is optimal, because after removing the exponential operation you need to re-tune the temperature of the contrastive loss (see the sketch below). Also, --gamma is very important and can be sensitive to different datasets; I suggest searching the range [0.1, 10]. 10 is used for COCO, but CUB and Multi-modal CelebA-HQ used a much smaller gamma.
  2. I used 4 Nvidia V100s to train. 25000 kimg may need 3~4 days.
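
To make the two variants concrete, here is a minimal, illustrative sketch of an image-text contrastive loss with and without the extra exp(). The function name, the single-direction cross-entropy form, and the default temp are assumptions for illustration only, not the exact loss used in this repo:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_feats, txt_feats, temp=0.5, use_exp=True):
    # Cosine similarity between every image/text pair in the batch.
    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(txt_feats, dim=-1)
    sim = img_feats @ txt_feats.t()                      # (B, B)
    # The two variants discussed above.
    logits = torch.exp(sim / temp) if use_exp else sim / temp
    # Matching pairs sit on the diagonal, so index i is the positive for row i.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(logits, labels)               # image-to-text direction only
```

With use_exp=True the logits are exponentiated again inside the softmax of cross_entropy, so removing the exp() changes the effective sharpness and the temperature has to be re-tuned, as noted above.
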
Cwj1212 commented 2 years ago

Thanks for your answer. I did a little experimentation: on the CUB dataset, "sim = sim/temp, itd=5, itc=10", without modifying any other hyperparameters, did get the results in your paper (although training is not completely finished, it is already close to the paper's result: FID=11.16 at 2000 kimg).

  1. So the results in the paper come from a hyperparameter search under the sim = torch.exp(sim/temp) setting, but there has been no hyperparameter search under the sim = sim/temp setting yet?
  2. Modifying the hyperparameter settings may still yield better results. Is that so?

I am a beginner in deep learning, so my questions may seem unprofessional. Thank you very much for answering them.

drboog commented 2 years ago
  1. Yes, with the implementation of sim = torch.exp(sim/temp).
  2. Yes, I believe better results can be obtained with more careful hyper-parameter searching (for both implementations), because what I did was a rough search over several fixed constants (such as 1, 5, 10, 20); you can see there are still lots of potential combinations.

By the way, please post your final CUB results here after you finish training, so other people can know what the results will be under that setting. Thanks a lot.

drboog commented 2 years ago

I think you don't need to run 25000 kimg on CUB, because CUB has fewer than 9000 images.

Cwj1212 commented 2 years ago

I experimented with the setting "CUB dataset, sim = sim/temp, itd=5, itc=10, 2200 kimg" and got FID=10.89 (continued training does not lead to further improvement in FID), which is close to the 10.48 reported in the paper. If I later find that a certain hyperparameter setting gives better results, I'll report back in this issue. Thanks again for your work and your answers.

drboog commented 2 years ago

Which inception model do you use, pre-trained on ImageNet or fine-tuned on CUB?

https://github.com/drboog/Lafite/blob/a69b91622a12d5a0d226443d9cd84eb5e4f850d2/metrics/frechet_inception_distance.py#L15

drboog commented 2 years ago

> The inception model pre-trained on ImageNet is not usually used directly on CUB, e.g. in AttnGAN, StackGAN, and DF-GAN. Hence, I'm wondering whether the CUB result in the paper is appropriate? Anyway, this work is great and impressive.

Yes, I think StackGAN, AttnGAN, DM-GAN, DF-GAN used the fine-tuned inception model to calculate the IS.

To calculate FID, DF-GAN and DM-GAN used the pre-trained model directly from torchvision.models.inception_v3. No FID is reported in StackGAN and AttnGAN. So I think FID results are OK.

As for IS results, unfortunately, I don't have access to the original fine-tuned model used in StackGAN, and I think fine-tuning a model by myself may not lead to fair results. Considering that others and future works may want to compare with our method, I chose to use the pre-trained inception model, so everyone can quickly get results under a fair comparison.

But it will be interesting to see the IS results with the original fine-tuned inception model from StackGAN. I hope someone who has that fine-tuned model on their machine can test it later.
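
For reference, a minimal sketch of loading the generic ImageNet-pretrained Inception-v3 from torchvision (the model the DF-GAN/DM-GAN-style FID comparison above refers to, as opposed to a CUB-fine-tuned model). This is illustrative only and is not the feature extractor used in this repo's metrics code:

```python
import torchvision.models as models

# ImageNet-pretrained Inception-v3 (not fine-tuned on CUB).
# Older torchvision versions use pretrained=True instead of weights=...
inception = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
inception.eval()
```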

drboog commented 2 years ago

> The fine-tuned model is available in DF-GAN and DM-GAN. Note that they used different inception models on CUB and COCO. The fine-tuned model was first provided by StackGAN. According to my experience, the IS of this work will not be more than 4.5 (tested with the fine-tuned inception model). I highly recommend the authors update this metric in the paper.

DM-GAN and DF-GAN did not upload the fine-tuned inception model... they provided links, which basically lead to https://github.com/hanzhanggit/StackGAN-inception-model. However, the model is no longer available there. If you have the fine-tuned model, can you send it to me, or upload it to Google Drive?

Can you explain and elaborate on "According to my experience, the IS of this work will not be more than 4.5"? I see that DF-GAN and DM-GAN have already obtained IS of 4.75 and 5.10 with the fine-tuned model. According to your experience, what convinces you that the IS of this work will be worse than theirs and less than 4.5? Thanks.

senmaoy commented 2 years ago

Sorry, I'm not familiar with text-to-image; I just guessed that based on subjective judgement. After my experiments, the results accord well with the paper. Thank you for your great work and careful reply. I apologize for my crude comments. Best wishes, sincerely.

drboog commented 2 years ago

It's OK, all discussions are welcome here :)

Cwj1212 commented 2 years ago

I have some doubts from reading your code. I know this part of the code comes from StyleGAN, but if you are familiar with it, I hope you can answer them. The code uses DDP for distributed training; each batch is split into multiple rounds, and gradient accumulation is used across the rounds.

  1. If batch size=64, gpu num=4, and batch_gpu=8, then rounds=2. Therefore, when calculating the contrastive loss, if the GPUs within one round are gathered to obtain negative samples, the number of negative samples is 32, not 64 (the batch size). Is that so?
  2. For gradient accumulation, StyleGAN only puts the forward pass inside model.no_sync(), not backward(). Is such gradient accumulation effective? https://github.com/drboog/Lafite/blob/a79c66a407dd7996052b6c7c9d77a338380506b4/training/loss.py#L81-L82 https://github.com/drboog/Lafite/blob/a79c66a407dd7996052b6c7c9d77a338380506b4/training/loss.py#L268 I thought the all-reduce for gradient synchronization happens during backward(), but backward() is not inside model.no_sync(); will unnecessary all-reduces still be prevented in this case? (See the no_sync sketch below.)

These two doubts are not about your paper itself but come from my limited knowledge; thank you very much for answering them. As a beginner, I'm not sure whether I have expressed my doubts clearly.
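
For context on question 2, below is the usual PyTorch DDP gradient-accumulation pattern, in which backward() itself is kept inside no_sync() so the gradient all-reduce is skipped on all but the last round. This is a generic sketch with hypothetical names (accumulate_rounds, loss_fn), not the StyleGAN2-ADA training loop:

```python
import contextlib

def accumulate_rounds(ddp_model, optimizer, round_batches, loss_fn):
    # Generic DDP gradient accumulation: gradients are summed locally over
    # all rounds and only all-reduced once, on the final backward().
    optimizer.zero_grad(set_to_none=True)
    for i, (x, y) in enumerate(round_batches):
        last = (i == len(round_batches) - 1)
        ctx = contextlib.nullcontext() if last else ddp_model.no_sync()
        with ctx:
            loss = loss_fn(ddp_model(x), y) / len(round_batches)
            loss.backward()  # all-reduce only fires here, on the last round
    optimizer.step()
```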

drboog commented 2 years ago

In my implementation, I manually set "batch size = 16*gpus" (so rounds will be 1), and the contrastive loss is computed per GPU, i.e. using 16 samples instead of 64. If you want to calculate the contrastive loss across GPUs using 64 samples, you can add "--gather=True", but then you have to tune the related hyper-parameters (itd, itc, temp), see https://github.com/drboog/Lafite/blob/a79c66a407dd7996052b6c7c9d77a338380506b4/training/loss.py#L215 In your "batch size=64, gpu num=4, batch_gpu=8, round=2" example, the contrastive loss will be calculated on 8 samples with "--gather=False", and on 32 samples with "--gather=True".
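
For illustration, a minimal sketch of what gathering features across GPUs for a contrastive loss can look like; this is generic PyTorch distributed code with a hypothetical helper name (gather_features), not the exact --gather implementation in this repo:

```python
import torch
import torch.distributed as dist

def gather_features(local_feats):
    # Collect the per-GPU feature tensors from every process.
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_feats) for _ in range(world_size)]
    dist.all_gather(gathered, local_feats)
    # all_gather does not backpropagate into the remote copies, so splice the
    # local (grad-tracking) tensor back into its own slot before concatenating.
    gathered[dist.get_rank()] = local_feats
    return torch.cat(gathered, dim=0)  # batch_gpu * world_size samples
```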

I'm not sure about the second question.

Cwj1212 commented 2 years ago

Thank you very much for always answering my doubts promptly, even though some questions are not related to the paper. I really appreciate it!

drboog commented 2 years ago

You are welcome :)