Open · achibb opened this issue 6 months ago
Hi! I was just wondering:

https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py#L350

You sort by token length, which is really great for performance, but I was wondering: could it still happen that sentences with different token lengths end up in the same batch and get padded? For some models it has been observed that padding tokens change the mean pooling result slightly. Does sentence-transformers also batch sequences of different token lengths together?

Thank you for the great repo! Aaron

Oh, or do you sort by text length? It appears so. For me, sorting by token length gives clean and quick embeddings. I opened a PR with an approach to first calculate token length instead of text length and sort by that, here:

https://github.com/UKPLab/sentence-transformers/pull/2571

Feel free to modify it / tell me your opinion.
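To make the idea concrete, here is a minimal sketch of the two sorting strategies, not the actual PR code; the model name and the sentences are placeholders:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
sentences = [
    "A short one.",
    "A noticeably longer example sentence for this sketch.",
    "Mid-length text.",
]

# Current behaviour (roughly): sort by character length before batching.
# Cheap, but only an approximation of the real sequence lengths.
by_text_length = sorted(sentences, key=len)

# Proposed: sort by token count instead, at the cost of tokenizing the
# whole corpus once up front (the "double tokenization" discussed below).
by_token_length = sorted(
    sentences, key=lambda s: len(model.tokenizer(s)["input_ids"])
)
```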
Hello!
Thanks for reporting and for the PR! Sorting by token count seems a lot more consistent, but I'm a bit wary of the double tokenization. I understand that sorting samples by length results in less padding per batch, which should process more quickly, but perhaps efficient string sorting plus slightly inefficient batches is still faster than inefficient (double) tokenization plus efficient batches? Some performance tests would be needed to get a good understanding of what's best.
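One rough way to quantify the cost side of that trade-off is to time the extra tokenization pass against a plain string sort. This is only a sketch (model and corpus are placeholders) and deliberately ignores the batching gains, which an end-to-end benchmark would have to capture:

```python
import time

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
corpus = ["An example sentence of moderate length."] * 10_000  # placeholder corpus

t0 = time.perf_counter()
sorted(corpus, key=len)  # cheap: string lengths only
t1 = time.perf_counter()
sorted(corpus, key=lambda s: len(model.tokenizer(s)["input_ids"]))  # extra tokenization pass
t2 = time.perf_counter()

print(f"char-length sort: {t1 - t0:.4f}s, token-length sort: {t2 - t1:.4f}s")
```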
As for your original question, my understanding is that the padding tokens are fully ignored: they should not influence the eventual embedding results. If they do, please share a script that reproduces it; I would be quite interested.
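For anyone who wants to check this themselves, a minimal script along those lines could look as follows (model name and sentences are placeholders). If padding were leaking into the pooling, the two embeddings of the short sentence would differ by more than floating-point noise:

```python
import torch

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

short_text = "A short sentence."
long_text = (
    "A much longer sentence that forces the short one to be padded "
    "when the two share a batch."
)

# Encode the short sentence on its own (no padding needed) ...
alone = model.encode([short_text], convert_to_tensor=True)[0]
# ... and together with the long one, so it gets padded inside the batch.
in_batch = model.encode([short_text, long_text], convert_to_tensor=True)[0]

# If padding tokens are properly masked out of the pooling, this difference
# should be on the order of floating-point noise.
print(torch.max(torch.abs(alone - in_batch)).item())
```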
Hi Tom - sure thing, I will do some tests and share them. (Generally speaking, I experienced roughly a 2x speed-up, but I will run a proper test.) Will follow up.