HazyResearch / m2

Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture"
Apache License 2.0

Why is there such a big difference in cosine similarity between embeddings of the same pair when using padding=max_length versus padding=true? #19

Open qianyue76 opened 7 months ago

qianyue76 commented 7 months ago

When I embedded a relevant text pair with the m2-bert-80M-32k-retrieval model, the cosine similarity was 0.7 with padding=max_length but close to 0 with padding=True (which I used to save memory). This makes semantic retrieval completely impossible with padding=True. The same thing happens with the 2k and 8k models. Why is this the case, and is padding=True completely unusable?
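
Roughly what I am doing, as a minimal sketch (assuming the Hugging Face checkpoint togethercomputer/m2-bert-80M-32k-retrieval, its sentence_embedding output, and the bert-base-uncased tokenizer from the model card; the texts are placeholders):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint name; the 2k and 8k variants behave the same way.
model = AutoModelForSequenceClassification.from_pretrained(
    "togethercomputer/m2-bert-80M-32k-retrieval", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", model_max_length=32768)

texts = ["first text of the relevant pair", "second text of the relevant pair"]

def embed(padding):
    # padding="max_length" pads every sequence to 32768 tokens;
    # padding=True pads only to the longest sequence in the batch.
    inputs = tokenizer(texts, return_tensors="pt", padding=padding,
                       truncation=True, max_length=32768)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs["sentence_embedding"]  # assumed output key from the model card

for padding in ("max_length", True):
    emb = embed(padding)
    sim = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
    print(padding, sim.item())  # ~0.7 for max_length, near 0 for True
```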

DanFu09 commented 7 months ago

The bidirectional convolutions in these models use the padding tokens to pass information from layer to layer (like scratch tokens). padding=True pads only to the length of the longest element in the batch, while padding=max_length pads to the tokenizer's maximum length.

We're working on a version that gracefully interpolates between the 32k/8k/2k versions to save compute, but it's still active research, so it may not be live for a while.
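
To make the difference concrete, here is a tokenizer-only sketch (assuming the bert-base-uncased tokenizer these checkpoints use): padding=True leaves the convolutions only a handful of padding tokens to use as scratch space, while padding=max_length gives them the full 32k context.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", model_max_length=32768)
texts = ["a short query", "a somewhat longer passage about the same topic"]

# padding=True: pad only to the longest sequence in this batch.
batch_padded = tokenizer(texts, padding=True, truncation=True, max_length=32768)
# padding="max_length": pad every sequence out to the full 32768-token context.
full_padded = tokenizer(texts, padding="max_length", truncation=True, max_length=32768)

print(len(batch_padded["input_ids"][0]))  # roughly a dozen tokens
print(len(full_padded["input_ids"][0]))   # 32768 tokens
```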


qianyue76 commented 7 months ago

I still don't understand: why does padding to max_length, which just appends (token_id) 0s, make such a big difference to the embedding performance?