Hello, thanks for your recognition.
1. The two models are trained with the same hyper-parameters and the same data.
2. Keeping the feature space is indeed a key issue, so we made several attempts. First, we keep the first 20 positional embeddings, which are vital for the final representation. Second, we add a short-text matching task to align with the original CLIP's training objective. Finally, the learning rate is relatively small.
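(A minimal sketch of what "keeping the first 20 positional embeddings" while stretching the rest might look like; the helper name, the linear interpolation, and the 77 → 248 lengths are assumptions for illustration, not the repo's exact code.)

```python
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_emb: torch.Tensor,
                                 keep: int = 20,
                                 new_len: int = 248) -> torch.Tensor:
    """Keep the first `keep` positions unchanged and interpolate the
    remaining ones so the table reaches `new_len` total positions.

    pos_emb: [old_len, dim] positional embedding of the CLIP text encoder.
    """
    old_len, dim = pos_emb.shape
    kept = pos_emb[:keep]                        # preserved exactly as-is
    rest = pos_emb[keep:]                        # [old_len - keep, dim]
    rest = rest.t().unsqueeze(0)                 # [1, dim, old_len - keep]
    rest = F.interpolate(rest, size=new_len - keep,
                         mode="linear", align_corners=True)
    rest = rest.squeeze(0).t()                   # [new_len - keep, dim]
    return torch.cat([kept, rest], dim=0)        # [new_len, dim]

# Example: stretch CLIP's 77 text positions to 248 while keeping the first 20.
old = torch.randn(77, 768)
new = stretch_positional_embedding(old, keep=20, new_len=248)
assert torch.allclose(new[:20], old[:20])
```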
Thanks for your quick response. I tried to load the data and train. I am using 8 GPUs with 40 GB memory each; I can only set batch_size=32 instead of 128, otherwise it reports OOM.
So I am guessing that you are using 16 A100-80G GPUs?
About the speed: I have to use gradient accumulation to maintain total_bs=2048, and I still need around 5 hours to train the whole model for 6 epochs with CLIP-L. May I ask how long it takes on your machine? In your readme you said 0.5 hours with 8 GPUs; may I ask what the configuration of that model is? Is it Long-CLIP-B?
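(For reference, a minimal runnable sketch of the gradient-accumulation pattern that keeps an effective batch size of 2048 with a per-GPU batch of 32 on 8 GPUs; the model, loss, and loader below are toy stand-ins, not the repo's actual training objects.)

```python
import torch
import torch.nn as nn

# Toy stand-ins so the accumulation pattern itself runs; in the real script
# these would be the CLIP model, the contrastive loss, and the dataloader.
model     = nn.Linear(512, 512)
loss_fn   = nn.MSELoss()
loader    = [(torch.randn(32, 512), torch.randn(32, 512)) for _ in range(16)]
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

per_gpu_bs  = 32                                      # what fits in 40 GB
num_gpus    = 8
target_bs   = 2048
accum_steps = target_bs // (per_gpu_bs * num_gpus)    # = 8 micro-batches per step

optimizer.zero_grad()
for step, (images, texts) in enumerate(loader):
    loss = loss_fn(model(images), texts)
    (loss / accum_steps).backward()                   # scale so gradients match a full 2048 batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```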
Thank you so much! I just want to make sure my setup is correct.
Thanks! For Long-CLIP-L, do you also train only 1 epoch? Could you please tell me more about the overfitting? By overfitting, do you mean the performance drop in long-caption retrieval, short text-to-image retrieval, or classification?
Yes, we only train 1 epoch. We find that the training accuracy on the training dataset is about 100%, but the performance on both Urban-1k and COCO decreases in the second epoch. However, a different batch size or hardware may lead to different results, so you had better track your own experiments.
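(One way to track this is to record retrieval recall after every epoch and keep the best checkpoint; a minimal sketch below, where `train_one_epoch` and `evaluate_recall` are hypothetical placeholders for your real training step and an R@1 evaluation on Urban-1k / COCO.)

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(512, 512)          # toy stand-in for the CLIP model

def train_one_epoch(model):          # placeholder for the real training loop
    pass

def evaluate_recall(model) -> float: # placeholder: return R@1 on the val set
    return torch.rand(1).item()

best_recall, best_state, max_epochs = -1.0, None, 6
for epoch in range(max_epochs):
    train_one_epoch(model)
    recall = evaluate_recall(model)
    print(f"epoch {epoch}: R@1 = {recall:.3f}")
    if recall > best_recall:
        best_recall = recall
        best_state = copy.deepcopy(model.state_dict())
    else:
        break                        # stop at the first drop, as the authors observed
model.load_state_dict(best_state)
```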
Hi, thanks for your great work! I have two questions regarding the method.
1. Your train.md includes one set of hyper-parameters, while you have two different models. Are the two models trained with the same hyper-parameters and the same data?
2. The impressive plug-and-play performance. You have fine-tuned the CLIP text and image encoders on different data. Do you have any thoughts on why the improved CLIP can still be used in Stable Diffusion? I mean that the feature space of the text embeddings could be totally different after fine-tuning. Any thoughts would be appreciated.
Thanks!