facebookresearch / ImageBind

ImageBind One Embedding Space to Bind Them All

No EOT when long sequence is truncated? #82

Open bakachan19 opened 1 year ago

bakachan19 commented 1 year ago

Hi. I noticed that when the input text sequence is too long, it is truncated to 77 tokens. However, no EOT token is added at the end.

For example, for a short text I get the following tokenization, with EOT = 49407 as the last token before the zero padding.

tensor([[49406,   518,  8809,   631,  5284,   620,   530,  7395, 12188,   267,
           593,   836,  6377,   531,   518,  2184,   537,  3326,   536,   518,
         10223,   539,   518,  1771,   269,   997,   631,   536,  3651,  2581,
          1047,  8626,   530,   518,  2867,   267,   836,  6765,   525,   911,
          8809,  1519,  3326,   631,  2862, 13314,   269,   518,  2117,  7290,
         32231,   530,   518,  5994,   267,  5524,   320, 24894, 10506,   556,
           911, 11251,   269, 49407,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]], device='cuda:0')

But with longer sequences, I do not see any EOT=49407 added.

tensor([[49406,   518,  2867, 15305,   320, 19663,  6368,  1655,   593,  5560,
          1047, 14646,  1630,   320,  3638, 10297,   269,   997,   631,  6470,
          1047,   530,   518,  3562,   267,   836,  2862,  6377,   531,   518,
         35186,   267,  1519,  3326,  9308,   531, 24210,   320,   750, 18949,
          2445,   269,   320,  1876,  3309,   320, 11122, 12726,   525,   518,
         48812,   539,   320, 10297,   267,  2339,   518,  3562,   320,  1499,
           267,  2050,  1139, 10506,   269,   997,   631, 17082, 12033,  9729,
          6721,   267,  5256,   556,  6212,   541, 39306]], device='cuda:0')

Is this something intended? If so, what is the reasoning behind it?

I also noticed that I get the same embedding values for different text sequences that are longer than 77 tokens, even though after tokenization I see different tokens being generated (but no EOT).
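One possible explanation, sketched under an assumption I have not verified in the ImageBind source: if the text feature is pooled at the position of the largest token id, as CLIP's `encode_text` does with `text.argmax(dim=-1)`, then a sequence without EOT would be pooled at the SOT token (49406) instead. The helper `pooling_index` and the abbreviated token lists below are illustrative only, not ImageBind code.

```python
SOT, EOT = 49406, 49407


def pooling_index(token_ids):
    """Return the index of the largest token id (plain-Python stand-in for argmax)."""
    return max(range(len(token_ids)), key=lambda i: token_ids[i])


# Short text: EOT is present and followed by zero padding (abbreviated from the post).
short_ids = [SOT, 518, 8809, 631, EOT, 0, 0, 0]

# Long text: truncated, so no EOT is present (abbreviated from the post).
long_ids = [SOT, 518, 2867, 15305, 48812, 541, 39306]

print(pooling_index(short_ids))  # 4, the EOT position
print(pooling_index(long_ids))   # 0, the SOT position
```

If the text encoder also uses a causal attention mask, the feature at index 0 would depend only on the SOT token, which would be consistent with getting the same embedding for every truncated input.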

Also, from my understanding (please correct me if I am wrong), ImageBind uses CLIP. However, in the CLIP implementation the last token is overwritten with EOT when a long sequence is truncated: https://github.com/openai/CLIP/blob/a1d071733d7111c9c014f024669f959182114e33/clip/clip.py#L239C1-L240C39
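For reference, here is a minimal plain-Python sketch of the behavior at the linked CLIP lines (the `truncate=True` branch): cut the sequence to the context length, then overwrite the final slot with EOT. The function name `pad_or_truncate` is my own, and the sketch uses lists rather than tensors.

```python
SOT, EOT = 49406, 49407
CONTEXT_LENGTH = 77


def pad_or_truncate(body_ids, context_length=CONTEXT_LENGTH):
    """Wrap token ids in SOT/EOT, then truncate or zero-pad to context_length."""
    tokens = [SOT] + list(body_ids) + [EOT]
    if len(tokens) > context_length:
        # CLIP-style truncation: keep the first context_length tokens
        # and force the last one to be EOT.
        tokens = tokens[:context_length]
        tokens[-1] = EOT
    # Zero-pad short sequences up to the context length.
    return tokens + [0] * (context_length - len(tokens))


long_result = pad_or_truncate(range(1, 101))  # 100 dummy body tokens
print(len(long_result), long_result[0], long_result[-1])
```

With this scheme every sequence contains exactly one EOT, so any pooling step that looks for the EOT position still has a valid target after truncation.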

Any idea on what I am doing wrong?

Thanks.