facebookresearch / CiT

Code for the paper titled "CiT: Curation in Training for Effective Vision-Language Data".

a consideration about unfreezing the image tower #4

Open fabiozappo opened 1 year ago

fabiozappo commented 1 year ago

Hi,

I really liked the idea of selecting training data online, thank you for publishing the code! I would like to apply this idea to my training code using a non-frozen image tower, and I am here to ask for a hint.

I saw all your experiments are done with an approach similar to Google's LiT paper. My intuition for why you're doing this is that it keeps the text model more stable over time, and as a consequence the number of newly curated training samples slowly decreases. Do you think unfreezing the image tower could lead to a collapse, resulting in the inclusion of the whole pool of image-text pairs? Have you tried running side experiments with all parameters trainable? What behavior would you expect from that?
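For context, the LiT-style setup discussed above (frozen image tower, trainable text tower) can be sketched roughly as follows. This is a minimal PyTorch illustration with placeholder linear layers standing in for the actual encoders; it is not the repo's real training code.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for a pretrained image encoder and a text encoder.
image_tower = nn.Linear(8, 4)
text_tower = nn.Linear(8, 4)

# LiT-style: freeze the image tower so its features act as fixed targets.
for p in image_tower.parameters():
    p.requires_grad = False
image_tower.eval()

# Only the text tower's parameters are optimized.
optimizer = torch.optim.SGD(text_tower.parameters(), lr=0.1)

images = torch.randn(2, 8)
texts = torch.randn(2, 8)

with torch.no_grad():  # no gradient bookkeeping for the frozen tower
    img_feat = image_tower(images)
txt_feat = text_tower(texts)

# Toy alignment loss pulling text features toward the frozen image features.
loss = (1 - nn.functional.cosine_similarity(img_feat, txt_feat)).mean()
loss.backward()
optimizer.step()
```

After `backward()`, only the text tower accumulates gradients, which is what keeps the image representation stable while the text side adapts.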

howardhsu commented 1 year ago

Good question.

One reason for using a frozen image encoder in CiT is to provide a high-quality soft target for the text encoder. If you check the SLIP / LiT papers, you may see how noisy text supervision (e.g., text irrelevant to the image) can corrupt image self-supervision (fine-grained image details). This high-quality target yields better text representations at the semantic level (see Table 8), which in turn improves curation and training efficiency for this chicken-and-egg problem.
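The curation step that depends on those text representations can be sketched as a simple similarity filter: score each candidate pair's text embedding against a metadata embedding and keep pairs above a threshold. This is a loose, dependency-free illustration of the idea; the function names, embeddings, and threshold are made up for the example, and the real repo scores raw text with the current text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def curate(pair_embeddings, metadata_embedding, threshold=0.5):
    """Return indices of pairs whose text embedding is similar enough
    to the task metadata embedding (a toy stand-in for CiT's curation)."""
    return [i for i, emb in enumerate(pair_embeddings)
            if cosine(emb, metadata_embedding) >= threshold]

# Toy example: two of three candidate texts align with the metadata direction.
meta = [1.0, 0.0]
pairs = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
selected = curate(pairs, meta, threshold=0.6)
print(selected)  # -> [0, 2]
```

As the text encoder improves during training, fewer borderline pairs clear the threshold, which matches the shrinking curation behavior mentioned in the question above.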

Hope this answered your question.