Closed mikelee-dev closed 1 month ago
Specifically, I was curious whether you trained this layer of the vision encoder during the continued pretraining: `vision.vision_model.embeddings.class_embedding`.
Training this layer redefines CLIP's classifications in embedding space.
Thanks for your question. Yes, all layers of CLIP ViT L/14 336 were trained during our continued pretraining.
Hi! Very nice repository and paper. I have a question about the continued pretraining. Were all 428M parameters of OpenAI's CLIP ViT L/14 336 trainable during your continued pretraining, or were some of them frozen?
Based on the paper, it seems like a batch size of 32 image/text pairs fits into 80 GB of GPU memory, so I wanted to double-check how many parameters of the original backbone were frozen versus trainable. Thanks!
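For anyone else checking the frozen-vs-trainable distinction discussed above, here is a minimal PyTorch sketch. The `ToyVisionEncoder` is a hypothetical stand-in (not the real 428M-parameter CLIP ViT L/14 336) that just mirrors the layout of CLIP's vision tower, including a learned class embedding like `vision_model.embeddings.class_embedding`:

```python
import torch
import torch.nn as nn

# Hypothetical toy stand-in for a CLIP-style vision tower; the real model is
# far larger, but the freezing mechanics are identical.
class ToyVisionEncoder(nn.Module):
    def __init__(self, dim: int = 8):
        super().__init__()
        # Learned class (CLS) token, analogous to
        # vision_model.embeddings.class_embedding in CLIP.
        self.class_embedding = nn.Parameter(torch.zeros(dim))
        # Stand-in for the transformer blocks.
        self.proj = nn.Linear(dim, dim)

def count_trainable(model: nn.Module) -> int:
    """Number of parameters that will receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model = ToyVisionEncoder()

# Full continued pretraining, as described in the answer above: every
# parameter stays trainable (PyTorch's default).
assert count_trainable(model) == 8 + (8 * 8 + 8)  # class_embedding + proj

# The alternative the question asks about: freeze the whole backbone.
for p in model.parameters():
    p.requires_grad = False
print(count_trainable(model))  # 0 — everything frozen
```

The same `requires_grad` toggle works on the real checkpoint, e.g. freezing everything and then re-enabling only selected submodules before building the optimizer.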