Closed: byougert closed this issue 1 year ago
Thanks for your careful reading. Using the CLIP text feature norm to filter out visually unrelated sentences is a simple and effective trick we discovered during our BookCorpus experiments. You can run a quick experiment on MSCOCO captions to verify this: some nonsense captions consistently have a large CLIP feature norm, e.g., 'There is no image here to provide a caption for.'. We conjecture that this happens because, during CLIP's contrastive pre-training, hard examples (texts that do not relate to any image) tend to end up with larger text feature norms. We did not provide a proof of this in DeCap, and as far as we know there is no published research on this phenomenon.
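For concreteness, here is a minimal sketch of the norm-based filtering described above. Random vectors stand in for real CLIP text features, and the threshold is an arbitrary value chosen for this toy demo; in practice you would compute features with CLIP's text encoder (before L2 normalization) and pick a threshold empirically on a held-out set.

```python
import numpy as np

def filter_by_clip_norm(captions, text_features, norm_threshold):
    """Keep only captions whose text feature norm is at most the threshold.

    Based on the observation that visually unrelated sentences tend to
    have larger CLIP text feature norms: captions whose (pre-normalization)
    feature norm exceeds `norm_threshold` are dropped.
    """
    norms = np.linalg.norm(text_features, axis=1)
    return [c for c, n in zip(captions, norms) if n <= norm_threshold]

# Toy demo: random vectors stand in for real CLIP text features.
# The second "feature" is scaled up to mimic a nonsense caption's large norm.
rng = np.random.default_rng(0)
captions = [
    "a dog on the grass",
    "There is no image here to provide a caption for.",
]
feats = np.stack([rng.normal(size=512), 3.0 * rng.normal(size=512)])

# 40.0 is an arbitrary cutoff for this synthetic example only.
kept = filter_by_clip_norm(captions, feats, norm_threshold=40.0)
```

With real CLIP features the interesting part is choosing the threshold, e.g., by inspecting the norm distribution of a trusted caption corpus such as MSCOCO.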
Thanks for your reply.
Hi, thanks for your wonderful work, and congratulations on your paper being accepted to ICLR 2023. I have a question about the claim in the Experiments section (i.e., "A sentence with a large norm is usually not visual-related."). Could you provide an explanation or a citation? Thanks.