facebookresearch / SLIP

Code release for SLIP: Self-supervision meets Language-Image Pre-training
MIT License

Possible bug in Tokenizer about max sequence length #18

Closed zhihuacc closed 2 years ago

zhihuacc commented 2 years ago

I found here that the token sequence is truncated to `context_length = 77`. The issue is that the truncation happens after the original tokens are wrapped with [SOT] and [EOT], so the [EOT] token can be cut off if the original token sequence is too long. Since the text transformer uses the embedding at the [EOT] position as the feature representation, I suspect something would go wrong for long text inputs.
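
A minimal sketch of the behavior I mean (the token IDs and caption length below are made up for illustration, not taken from the actual vocab):

```python
# [SOT]/[EOT] wrapping happens before truncation, so a long caption loses its [EOT].
context_length = 77
sot_token, eot_token = 49406, 49407   # illustrative IDs for <|startoftext|> / <|endoftext|>

caption_tokens = list(range(100))     # pretend the caption encodes to 100 BPE tokens
tokens = [sot_token] + caption_tokens + [eot_token]
truncated = tokens[:context_length]   # naive truncation

assert eot_token not in truncated     # [EOT] was cut off, so there is no valid pooling position
```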

Am I understanding this right?

normster commented 2 years ago

Thanks for pointing this out! You're right: we need to insert [EOT] when truncating, like the CLIP authors do here: https://github.com/openai/CLIP/blob/main/clip/clip.py#L232. I'll make a note to fix this soon. Fortunately, the large majority of captions in YFCC15M fall short of 77 tokens, and the text transformer design used here is far from optimal anyway, so I wouldn't expect this to make a huge difference in overall performance.
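
For reference, a sketch of the fix following the CLIP tokenize logic linked above. It assumes a CLIP-style `SimpleTokenizer` with an `encoder` dict and an `encode` method; the helper name `tokenize_with_eot` is hypothetical, not part of this repo:

```python
import torch

def tokenize_with_eot(tokenizer, text, context_length=77):
    """Hypothetical helper: truncate like CLIP's tokenize(truncate=True),
    overwriting the final slot with [EOT] so the pooled feature stays valid."""
    sot_token = tokenizer.encoder["<|startoftext|>"]
    eot_token = tokenizer.encoder["<|endoftext|>"]
    tokens = [sot_token] + tokenizer.encode(text) + [eot_token]

    result = torch.zeros(context_length, dtype=torch.long)
    if len(tokens) > context_length:
        tokens = tokens[:context_length]
        tokens[-1] = eot_token          # re-insert [EOT] at the last position
    result[:len(tokens)] = torch.tensor(tokens)
    return result
```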