Thunderbeee / ZSCL

Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models

Text embeddings in distillation loss #1

Open gonzachiar opened 1 year ago

gonzachiar commented 1 year ago

In the distillation loss of continual-CLIP:

https://github.com/Thunderbeee/ZSCL/blob/main/cil/continual_clip/models.py#LL260C4-L260C4

Shouldn't you also do the opposite comparison? That is, compare the current model's embeddings of the ref_texts with the original model's embeddings of the ref_images.

Also, if the method is "LwF", shouldn't logits_current be computed between the current model's embeddings of the ref_images and the ref_texts, instead of between the current model's embeddings of the ref_images and the ref model's embeddings of the ref_texts?
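For concreteness, here is a minimal sketch of the two options I mean, with illustrative variable names rather than the exact code in cil/continual_clip/models.py (a CLIP-style model with encode_image / encode_text is assumed):

```python
import torch
import torch.nn.functional as F

def distillation_term(model, ref_model, ref_images, ref_texts):
    """Sketch of the image-to-text distillation term under discussion.

    model      -- the CLIP model being fine-tuned (encode_image / encode_text)
    ref_model  -- the frozen original CLIP used as reference
    ref_images -- reference image batch; ref_texts -- tokenized reference texts
    """
    with torch.no_grad():
        ref_img = F.normalize(ref_model.encode_image(ref_images), dim=-1)
        ref_txt = F.normalize(ref_model.encode_text(ref_texts), dim=-1)
        logits_ref = ref_img @ ref_txt.t()  # teacher logits, both sides frozen

    cur_img = F.normalize(model.encode_image(ref_images), dim=-1)

    # As I read the current code: current image embeddings vs. REF-model text embeddings.
    logits_current = cur_img @ ref_txt.t()

    # LwF-style alternative I am asking about: both sides from the CURRENT model.
    cur_txt = F.normalize(model.encode_text(ref_texts), dim=-1)
    logits_current_lwf = cur_img @ cur_txt.t()

    # Distillation: pull the current distribution toward the frozen one.
    loss = F.kl_div(F.log_softmax(logits_current, dim=-1),
                    F.softmax(logits_ref, dim=-1), reduction="batchmean")
    return loss, logits_current_lwf
```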

(Two screenshots of the relevant code attached, dated 2023-06-16.)

If that isn't the case, there is no possibility of fine-tuning the text encoder only. Why is this discarded for continual CLIP?

Sorry if these questions are pretty basic.

Thunderbeee commented 1 year ago

Thanks so much for your comments! Regarding your questions:

“Shouldn't you also do the opposite comparison?” --- Because LwF had not been applied to contrastive-learning approaches before (this is the first time LwF is adopted to handle the forgetting issue on CLIP), what we do in our experiments is our design choice. Because we are comparing continual-learning (CL) methods, the experiments are controlled as long as all CL methods use the exact same assignment of (ref_model, ref_image, ref_text, zero shot, target_image, target_text). As the final experiment tables on arXiv show, our experiments are sufficient to demonstrate that our method (ZSCL) outperforms those SOTA methods. Of course, your suggestion is helpful and could be explored in future ablation experiments on CL of VL models!
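For reference, the "opposite" direction you describe would look roughly like the sketch below (illustrative names only, assuming CLIP-style encoders; this is not code from this repo):

```python
import torch
import torch.nn.functional as F

def opposite_distillation_term(model, ref_model, ref_images, ref_texts):
    """Text-to-image direction suggested above: current-model embeddings of
    ref_texts compared against reference-model embeddings of ref_images."""
    with torch.no_grad():
        ref_img = F.normalize(ref_model.encode_image(ref_images), dim=-1)
        ref_txt = F.normalize(ref_model.encode_text(ref_texts), dim=-1)
        logits_ref_t2i = ref_txt @ ref_img.t()  # teacher, text -> image

    cur_txt = F.normalize(model.encode_text(ref_texts), dim=-1)
    logits_cur_t2i = cur_txt @ ref_img.t()      # student, text -> image

    return F.kl_div(F.log_softmax(logits_cur_t2i, dim=-1),
                    F.softmax(logits_ref_t2i, dim=-1), reduction="batchmean")
```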

“Also, if the method is "LwF", shouldn't logits_current be computed between the current model's embeddings of the ref_images and the ref_texts, instead of between the current model's embeddings of the ref_images and the ref model's embeddings of the ref_texts?” --- I think this question is similar to the last one; this is our design choice. With two encoders, we let the ref_texts embeddings come from the reference model (the original CLIP) to better ensure that the image encoder being trained stays aligned with the text encoder of the original CLIP. In other words, when distilling we control variables by ensuring that one encoder always comes from the original CLIP.

“If that isn't the case, there is no possibility of fine-tuning the text encoder only. Why is this discarded for continual CLIP?” --- It is feasible to train only the text encoder, but we train uniformly here to control variables: since our goal is to compare the differences between methods, this is our design choice for the experiments.
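As a rough illustration, restricting training to the text encoder would just mean freezing the visual tower before building the optimizer (a sketch assuming an open_clip-style model where the image encoder is `model.visual`; the learning rate is illustrative, not the setting we use):

```python
import torch

def text_encoder_only_optimizer(model):
    """Sketch: fine-tune only the text encoder by freezing the image encoder."""
    # Freeze the visual tower so only text-side parameters receive gradients.
    for p in model.visual.parameters():
        p.requires_grad = False

    # Build the optimizer over the remaining trainable (text-side) parameters.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-6, weight_decay=0.1)
```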

Again, thanks so much for your constructive comments! Many ablation studies could be done in the future to explore the continual learning of vision-language models!