Great work!
I am also very interested in your work. Recently, I tried to reproduce the video modality alignment work. I initialized with OpenAI's pre-trained ViT-B/32, and the visual encoder uses temporal attention to model temporal relationships. During training, the text encoder is frozen; only the weights of the embedding layer and the temporal-attention part of the visual encoder are updated. With this setup, the training loss only drops from 5.9 to 5.2. If both the visual encoder and the text encoder are fully fine-tuned, the loss can be reduced to about 0.3. So when only some parameters of the visual encoder are fine-tuned, the loss converges poorly. Did you encounter this during training? What should I pay attention to when using this fine-tuning approach?
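For reference, my selective-freezing setup looks roughly like the sketch below (a minimal example; the `temporal_attn` name and the embedding-layer parameter names are placeholders that depend on how the temporal modules are actually added):

```python
import torch
import clip  # OpenAI CLIP package, used here to load the pre-trained ViT-B/32

# Load the pre-trained ViT-B/32 backbone.
model, _ = clip.load("ViT-B/32", device="cpu")

# Freeze everything first, then re-enable only the parts being trained.
for p in model.parameters():
    p.requires_grad = False

# Placeholder name patterns for the trainable parts: the visual embedding
# layer and the newly added temporal-attention blocks.
trainable_keywords = ("visual.conv1", "visual.positional_embedding", "temporal_attn")
for name, p in model.named_parameters():
    if any(k in name for k in trainable_keywords):
        p.requires_grad = True

# Only the trainable parameters are passed to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)
```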