beichenzbc / Long-CLIP

[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"
Apache License 2.0

Long caption-image retrieval with CLIP tokenizers? #69

Closed ivonajdenkoska closed 2 days ago

ivonajdenkoska commented 2 weeks ago

Hi, thanks again for your cool work!

I was looking into the long caption-image retrieval with Urban1k dataset. The tokenizers used by CLIP models usually tokenize the sentence into 77 tokens. I'm wondering if you modified this behavior to tokenize the full sentence into more than 77 tokens, basically without truncation? Thanks in advance!

beichenzbc commented 2 weeks ago

longclip.tokenize can tokenize up to 248 tokens.

ivonajdenkoska commented 2 weeks ago

Thanks for your answer. Could you briefly explain how you changed the original CLIP tokenizer which tokenizes up to 77 tokens?

beichenzbc commented 2 weeks ago

We changed the default `context_length` in clip.py or longclip.py. You may refer to https://github.com/beichenzbc/Long-CLIP/blob/main/model/longclip.py for further details.
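To make the mechanism concrete, here is a minimal sketch (not the actual Long-CLIP code) of a CLIP-style `tokenize`: every caption is wrapped in start/end tokens and padded or rejected against a fixed `context_length`. Raising that default from 77 to 248 is what allows longer captions through. The `sot`/`eot` ids 49406/49407 are CLIP's BPE start- and end-of-text tokens; the function below takes already-encoded BPE ids rather than raw strings, purely for illustration.

```python
def tokenize(token_ids_per_text, context_length=248, sot=49406, eot=49407):
    """Pad pre-encoded BPE token ids to a fixed context_length.

    Sketch of CLIP-style tokenization: each sequence becomes
    [SOT] + ids + [EOT], zero-padded to context_length. Inputs
    longer than the context length raise an error, mirroring
    CLIP's default (non-truncating) behavior.
    """
    result = []
    for ids in token_ids_per_text:
        tokens = [sot] + list(ids) + [eot]
        if len(tokens) > context_length:
            raise RuntimeError(
                f"Input too long for context length {context_length}"
            )
        # Zero-pad up to the fixed context length.
        result.append(tokens + [0] * (context_length - len(tokens)))
    return result
```

With `context_length=77` (the original CLIP default) a long caption would trip the length check; with 248 it fits, which matches the `longclip.tokenize` behavior described above.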