drboog / Shifted_Diffusion

Code for Shifted Diffusion for Text-to-image Generation (CVPR 2023)
Creative Commons Zero v1.0 Universal

The cosine similarity of generated and gt image embeddings is not increasing during training #8

Closed: ilovecv closed this issue 1 year ago

ilovecv commented 1 year ago

Hi,

I computed the cosine similarity between the ground-truth (gt) image embedding and the image embedding generated by the prior model during training (https://github.com/drboog/Shifted_Diffusion/blob/main/sft_test.py#L155), and found that the similarity score was not increasing.
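
For context, here is a minimal sketch of that kind of similarity check (illustrative only; the function and tensor names are assumptions, not the actual code in sft_test.py):

```python
import torch
import torch.nn.functional as F

def mean_cosine_similarity(pred_emb: torch.Tensor, gt_emb: torch.Tensor) -> float:
    """Average cosine similarity between predicted and ground-truth embeddings.

    Both tensors are assumed to have shape (N, D), e.g. image embeddings
    predicted by the prior vs. CLIP embeddings of the real images.
    """
    pred = F.normalize(pred_emb, dim=-1)  # unit-normalize each row
    gt = F.normalize(gt_emb, dim=-1)
    # Row-wise dot product of unit vectors equals the per-sample cosine similarity.
    return (pred * gt).sum(dim=-1).mean().item()
```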

Also, I observed a few spikes in the training loss:

[Screenshot (2023-05-30): training loss curve showing occasional spikes]

Did you observe a similar phenomenon during training? Thank you very much!

drboog commented 1 year ago

What is your batch size, what dataset are you using, and what is the value range of the similarity you obtained? Please provide more detailed information.

ilovecv commented 1 year ago

Thanks for your quick reply.

I am using a batch size of 4096, following the setting from the paper. The learning rate, learning-rate schedule, and other hyperparameters follow the defaults here: https://github.com/drboog/Shifted_Diffusion/blob/main/train.py#L206

I am using a filtered LAION dataset containing around 220M images. I also trained the unCLIP prior model on this dataset; its loss looked normal and its similarity score was increasing during training.

I used 100 images from a validation dataset to compute the similarity score. After 5000 steps, the score is 0.81, which is much higher than the unCLIP prior model's score of only 0.5 after 5000 steps.

Thank you very much for your help.

drboog commented 1 year ago

Sorry, I'm a little confused. By unCLIP prior, do you mean diffusion based on a standard Gaussian or shifted diffusion? Which model produced the 0.81? When you feed the generated image embeddings to our fine-tuned SD 2, does it generate text-aligned target images? And do you mean the similarity stops increasing after the first 5k steps (or maybe tens of thousands of steps) in your training?

ilovecv commented 1 year ago

Sorry for the confusion.

By the unCLIP prior model, I mean the one based on a standard Gaussian.

The 0.81 is obtained from the shifted diffusion model. It achieves a high similarity score after 5000 steps (checkpoints are saved every 5000 steps, so this is the earliest checkpoint I can test), but the score does not increase with longer training. For example, after 300K steps, the similarity score was still 0.815. The model can generate text-aligned images, but overall I feel the quality is not very good.

One thing that puzzles me: the loss is decreasing (although with some spikes), so why is the score not increasing? By the way, the paper says the model was trained for 500K steps. How did you decide on the number of training steps?

drboog commented 1 year ago

I see. CLIP similarity can be useful, but it may not be indicative enough. The reason is that only 100 samples are used in evaluation; they could be too "easy" or too "representative", so the model can already make accurate predictions for them at an early stage of training. This doesn't mean the training after 10k steps is not useful. We should not evaluate the model by similarity alone. In the end, the shifted diffusion model will be applied to downstream tasks; CLIP similarity can help us check whether the code is working, but it cannot tell us how the model will perform on downstream tasks. I think a similarity of ~0.8 is acceptable; what really matters is the actual performance on downstream tasks.

There is no specific reason why 500k iterations are used; we just chose a number similar to DALL-E 2, which uses 600k. Since your dataset size is 220M, I think you can use fewer iterations.

ilovecv commented 1 year ago

Thanks.

One last question: did you observe small spikes in the loss curve during training? Do you think they will be a problem?

drboog commented 1 year ago

I didn't monitor the training very often, so I didn't observe those. If you are worried about them, I think you can try a slightly smaller learning rate.
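
As an illustration of that suggestion (a sketch only; the module and the concrete values below are placeholders, not the actual optimizer setup in train.py):

```python
import torch

# Placeholder module standing in for the actual prior network.
prior_model = torch.nn.Linear(768, 768)

# Re-create the optimizer with a slightly smaller learning rate than before,
# which typically makes occasional loss spikes less pronounced.
optimizer = torch.optim.AdamW(prior_model.parameters(), lr=5e-5, weight_decay=1e-2)
```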

ilovecv commented 1 year ago

Thanks!