UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

encoding in batches with token length leads to different results? #2570

Open achibb opened 6 months ago

achibb commented 6 months ago

Hi! I was just wondering:

https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py#L350

You sort by token length, which is really great for performance. However, I was wondering: can sentences with different token lengths still end up in the same batch and get padded? For some models it has been observed that the pad token changes the mean pooling slightly, so I was wondering whether Sentence Transformers also batches sentences of different token lengths together.
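To illustrate what I mean, here is a rough sketch using the plain `transformers` API (the model name is just an example, and this is not the library's own pooling code): a naive mean over all positions averages in the pad embeddings, while a masked mean excludes them.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # example model, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# One short and one long sentence in the same batch, so the short one gets padded.
batch = tokenizer(
    ["a short sentence", "a much longer sentence that forces padding of the first one"],
    padding=True,
    return_tensors="pt",
)
with torch.no_grad():
    token_embs = model(**batch).last_hidden_state  # (batch, seq_len, hidden_dim)

mask = batch["attention_mask"].unsqueeze(-1).float()
masked_mean = (token_embs * mask).sum(dim=1) / mask.sum(dim=1)  # padding excluded
naive_mean = token_embs.mean(dim=1)                             # padding included

# The naive mean of the short sentence shifts because pad embeddings are averaged in.
print(torch.allclose(masked_mean[0], naive_mean[0]))
```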

Thank you for the great repo! Aaron

achibb commented 6 months ago

Oh, or do you sort by text length? It appears so. For me, sorting by token length gives clean and quick embeddings.
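Roughly, this is the distinction I mean (just a sketch; the model name is an example, and `model.tokenizer` is the underlying Hugging Face tokenizer):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # example model

sentences = ["zyzzyva qwfp xkcd", "this is a simple test", "hi"]

# Current behaviour (roughly): order sentences by character length of the text.
by_text_len = sorted(sentences, key=len)

# What I mean: order by the number of tokens the tokenizer actually produces.
by_token_len = sorted(sentences, key=lambda s: len(model.tokenizer(s)["input_ids"]))

print(by_text_len)
print(by_token_len)  # the orderings can disagree, e.g. for rare words that split into many subword tokens
```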

achibb commented 6 months ago

Here is an approach that first calculates the token length instead of the text length and sorts by that:

https://github.com/UKPLab/sentence-transformers/pull/2571

Feel free to modify it / tell me your opinion.

tomaarsen commented 6 months ago

Hello!

Thanks for reporting and for the PR! Sorting by token size seems a lot more consistent, but I'm a bit wary of the double tokenization. I understand that sorting samples by length allows you to get less padding in a batch, which should process more quickly, but perhaps efficient string sorting + slightly inefficient batches is faster than inefficient tokenization + efficient batches? Perhaps some performance tests are needed to get a good understanding of what's best.
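Something along these lines could serve as a first rough comparison (a sketch only, with an example model and a synthetic corpus): time the current `encode` call and, separately, the extra tokenization pass that token-length sorting would add on top.

```python
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # example model

# Synthetic corpus with mixed lengths; a real dataset would be better for a proper test.
sentences = [("word " * n).strip() for n in range(1, 200)] * 20

start = time.perf_counter()
model.encode(sentences, batch_size=32)  # current path: sorted internally by text length
print(f"encode: {time.perf_counter() - start:.2f}s")

# Rough cost of the extra tokenization pass that token-length sorting would introduce.
start = time.perf_counter()
token_lengths = [len(model.tokenizer(s)["input_ids"]) for s in sentences]
print(f"extra tokenization pass: {time.perf_counter() - start:.2f}s")
```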

As for your original question, my understanding is that the padding tokens are all fully ignored. They should not influence the eventual embedding results. If they do, please share a script for reproducing that, I would be quite interested.
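To make that concrete, a check along these lines is what I have in mind (a sketch, example model): encode a short sentence on its own and again in a batch where it gets padded, then compare the two embeddings.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # example model

short = "a short sentence"
long = "a much longer sentence " * 20

# Encode the short sentence alone (no padding) and together with a long one
# (so it gets padded in the batch). encode() returns results in input order.
alone = model.encode([short])[0]
in_padded_batch = model.encode([short, long], batch_size=2)[0]

# If padding tokens influenced the pooled embedding, this difference would be noticeable.
print(np.abs(alone - in_padded_batch).max())
```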

achibb commented 6 months ago

Hi Tom - sure thing, I will do some tests and share them. (Generally speaking, I experienced a roughly 2x speed-up, but I will do a proper test.) Will follow up.