Closed. zhihuacc closed this issue 2 years ago.
Thanks for pointing this out! You're right; we need to insert EOT when truncating, like the CLIP authors do here: https://github.com/openai/CLIP/blob/main/clip/clip.py#L232. I'll make a note to fix this soon. Fortunately, the vast majority of captions in YFCC15M fall short of 77 tokens, and the text transformer design used here is far from optimal anyway, so I wouldn't expect this to make a huge difference in overall performance.
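For reference, a minimal sketch of that fix, assuming a CLIP-style `SimpleTokenizer` with an `encoder` dict and an `encode()` method (names and interface follow CLIP's clip.py, not necessarily this repo's actual code):

```python
import torch

def tokenize(texts, tokenizer, context_length=77):
    # tokenizer is assumed to be a CLIP-style SimpleTokenizer:
    # tokenizer.encoder maps special tokens to ids, tokenizer.encode() returns BPE ids.
    sot_token = tokenizer.encoder["<|startoftext|>"]
    eot_token = tokenizer.encoder["<|endoftext|>"]

    all_tokens = [[sot_token] + tokenizer.encode(t) + [eot_token] for t in texts]
    result = torch.zeros(len(all_tokens), context_length, dtype=torch.long)

    for i, tokens in enumerate(all_tokens):
        if len(tokens) > context_length:
            # Naive truncation (tokens[:context_length]) would cut off [EOT].
            # Like the CLIP authors, keep the first context_length tokens but
            # overwrite the last position with [EOT], so the embedding at the
            # [EOT] position (used as the text feature) is always present.
            tokens = tokens[:context_length]
            tokens[-1] = eot_token
        result[i, :len(tokens)] = torch.tensor(tokens)

    return result
```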
I found here that the token sequence is truncated to context_length = 77. The issue is that the truncation is done after wrapping the original tokens with [SOT] and [EOT], so I think the [EOT] token can be cut off if the original token sequence is too long. Since the text transformer uses the embedding at the [EOT] position as the feature representation, I suspect something would go wrong for long text inputs.
Am I understanding this right?