dhg-wei / DeCap

ICLR 2023 DeCap: Decoding CLIP Latents for Zero-shot Captioning

questions about the Paper: "A sentence with a large norm is usually not visual-related." #1

Closed: byougert closed this issue 1 year ago

byougert commented 1 year ago

Hi, thanks for your wonderful work and congratulations on your paper being accepted to ICLR 2023. I have a question about a statement in the Experiments section (i.e., "A sentence with a large norm is usually not visual-related."). Could you provide some explanation or citations for it? Thanks.

dhg-wei commented 1 year ago

Thanks for your careful reading. Using the CLIP text feature norm to filter out visually-unrelated sentences is a simple and effective trick we discovered during our BookCorpus experiments. You can run some simple experiments on MSCOCO captions to verify this: you will find that nonsense captions, e.g., 'There is no image here to provide a caption for.', consistently have a large CLIP feature norm. We conjecture that this happens because, during CLIP's contrastive pre-training, hard examples (texts that do not relate to any image) tend to end up with larger text feature norms. In DeCap we did not provide a proof of this, and as far as we know there is no published research on this phenomenon.
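For readers who want to try the trick described above, a minimal sketch of norm-based filtering is shown below. It assumes the CLIP text features have already been computed (in practice via `clip.encode_text`); random vectors stand in for real embeddings here, and the 0.8-quantile threshold is an illustrative choice, not the value used in the paper.

```python
import numpy as np

# Hypothetical pre-computed CLIP text features; in practice these would
# come from clip.encode_text on a batch of candidate sentences.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 512))
features[3] *= 3.0  # simulate a "nonsense" caption with an unusually large norm

# Compute the feature norm of each sentence.
norms = np.linalg.norm(features, axis=1)

# Keep only sentences whose norm falls below a corpus-level quantile
# threshold; sentences with very large norms are treated as not visual-related.
threshold = np.quantile(norms, 0.8)
keep = norms < threshold
filtered = features[keep]
```

After running this, `keep[3]` is `False`: the inflated-norm "caption" is filtered out while the ordinary ones survive.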

byougert commented 1 year ago

Thanks for your reply.