beichenzbc / Long-CLIP

[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"

The hyperparameter to train Long-CLIP-L and Long-CLIP-B #78

Closed · veroveroxie closed this 1 month ago

veroveroxie commented 1 month ago

Hi, thanks for your great work! I have two questions regarding the method.

  1. Your train.md includes a single set of hyperparameters, but you release two different models. Are the two models trained with the same hyperparameters and the same data?

  2. About the impressive plug-and-play performance: you have fine-tuned the CLIP text and image encoders on different data. Do you have any thoughts on why the fine-tuned CLIP can still be used in Stable Diffusion? The feature space of the text embeddings could be totally different after fine-tuning. Any thoughts would be appreciated.

Thanks!

beichenzbc commented 1 month ago

Hello, thanks for your recognition.

  1. The two models are trained with the same hyperparameters and the same data.

  2. Preserving the feature space is indeed a key issue, so we made several attempts. First, we keep the first 20 positional embeddings, which are vital for the final representation. Second, we add a short-text matching task to align with the original CLIP's training objective. Finally, the learning rate is relatively small.
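A minimal sketch of the positional-embedding stretching (assuming the text encoder's [77, dim] positional embedding is stretched to 248 tokens; the function below is illustrative, not the released code):

```python
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_emb: torch.Tensor,
                                 keep: int = 20,
                                 new_len: int = 248) -> torch.Tensor:
    """Keep the first `keep` positional embeddings untouched and linearly
    interpolate the remaining ones so the total length becomes `new_len`.

    pos_emb: [old_len, dim] positional embedding of the CLIP text encoder.
    """
    kept = pos_emb[:keep]                      # first 20 positions stay fixed
    rest = pos_emb[keep:].t().unsqueeze(0)     # [1, dim, old_len - keep]
    rest = F.interpolate(rest, size=new_len - keep,
                         mode="linear", align_corners=True)
    rest = rest.squeeze(0).t()                 # [new_len - keep, dim]
    return torch.cat([kept, rest], dim=0)      # [new_len, dim]
```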

veroveroxie commented 1 month ago

Thanks for your quick response. I tried to load the data and train. I am using 8 GPUs with 40 GB of memory each, and I can only set batch_size=32 instead of 128; otherwise, it reports an OOM error.

  1. So, I am guessing that you are using 16 A100-80G GPUs?

  2. About the speed: I have to use gradient accumulation to maintain total_bs=2048 (see the sketch after this comment), and I still need around 5 hours to train CLIP-L for 6 epochs. May I ask how long it takes on your machine? In your README, you said 0.5 hours with 8 GPUs; may I ask which model that configuration refers to? Is it Long-CLIP-B?

Thank you so much! I just want to make sure my setup is correct.
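For reference, the accumulation arithmetic works out to 2048 / (32 × 8) = 8 micro-steps per optimizer update. A minimal sketch, with `model`, `loader`, and `optimizer` as placeholders rather than the repo's training script:

```python
# Effective batch size 2048 with per-GPU batch_size=32 on 8 GPUs.
accum_steps = 2048 // (32 * 8)  # = 8

optimizer.zero_grad()
for step, (images, texts) in enumerate(loader):
    loss = model(images, texts)        # contrastive loss for this micro-batch
    (loss / accum_steps).backward()    # scale so accumulated gradients average
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```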

beichenzbc commented 1 month ago

  1. Yeah, we use 16 A100-80G GPUs.

  2. It takes 0.5 hours with 8 GPUs for Long-CLIP-B, and about 1 hour for Long-CLIP-L. The total training takes only 1 epoch; we observe overfitting as early as the second epoch.

veroveroxie commented 1 month ago

Thanks! For Long-CLIP-L, you also train only 1 epoch? Could you please tell me more about the overfitting? By overfitting, do you mean a performance drop in long-caption retrieval, short text-to-image retrieval, or classification?

beichenzbc commented 1 month ago

Yes, we train only 1 epoch. We find that the accuracy on the training set reaches about 100%, but the performance on both Urban-1k and COCO decreases in the second epoch. However, a different batch size or different hardware may lead to different results, so you'd better track your own experiments.
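For example, a minimal per-epoch checkpointing loop like the following can catch the drop; `train_one_epoch` and `evaluate` are placeholders for your own training step and Urban-1k/COCO retrieval evaluation:

```python
import torch

# Evaluate after every epoch and keep only the best checkpoint,
# since the second epoch may already overfit.
best_recall = 0.0
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)
    recall = evaluate(model, urban1k_loader)   # e.g. text-to-image R@1
    if recall > best_recall:
        best_recall = recall
        torch.save(model.state_dict(), "best.pt")
```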