UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

padding error - enforce padding to a multiple of N tokens #2540

Open pszemraj opened 6 months ago

pszemraj commented 6 months ago

While integrating a MEGA-based encoder (BEE-spoke-data/mega-encoder-small-16k-v1) with the sentence-transformers library, I've encountered a RuntimeError caused by padding: it occurs when the input length is between 1024 and 16384 tokens (this model's max length) but is not a multiple of 1024. The error message is:

RuntimeError: shape '[1, 3, 1024, 192]' is invalid for input of size 616128

This error seems to come from the model's chunked attention: it tries to reshape the query into fixed chunks of 1024 tokens, but the input size doesn't divide evenly. In the example above, 616128 = 3209 x 192, which suggests a 3209-token sequence; since 3209 is not a multiple of 1024, it can't be reshaped into [1, 3, 1024, 192] (which would need 589824 elements).

Simple snippet:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BEE-spoke-data/mega-encoder-small-16k-v1')
text = "..." # Long text with token count not a multiple of 1024
emb = model.encode(text)

I honestly can't point to exactly why, but I've used the model in other tasks (text classification, etc.) where I didn't pad to max length and never saw this come up. Is there a way to force padding to a multiple of 1024 or otherwise handle this situation?
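For now, the workaround I'm leaning on is to bypass SentenceTransformer's tokenization, call the HF tokenizer with pad_to_multiple_of=1024, and pool manually. A minimal sketch (the mean pooling here is my assumption about how this model should be pooled):

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "BEE-spoke-data/mega-encoder-small-16k-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModel.from_pretrained(model_name)

text = "..."  # long text with token count not a multiple of 1024
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=16384,
    padding=True,
    pad_to_multiple_of=1024,  # rounds the padded length up to the next multiple of 1024
)

with torch.no_grad():
    out = hf_model(**inputs)

# mean-pool token embeddings, ignoring padded positions
mask = inputs["attention_mask"].unsqueeze(-1).float()
emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)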

Any help is appreciated! Let me know if you need more details.

pszemraj commented 6 months ago

BTW, I might have found a solution in this branch of my transformers fork, but I wanted to check whether this is needed/the right way to go, or whether there is some easier/smaller fix that can be handled in SBERT directly.

The linked implementation is my attempt at replicating what Longformer/BigBird seem to do to pad inputs to a multiple of N; roughly, it's the pattern sketched below.
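Simplified sketch of that pattern (not the exact code from the fork; the function name is mine):

import torch.nn.functional as F

def pad_to_multiple(input_ids, attention_mask, multiple, pad_token_id):
    """Right-pad (batch, seq_len) tensors so seq_len becomes a multiple of `multiple`,
    in the spirit of Longformer/BigBird's internal pad-to-window-size helpers."""
    seq_len = input_ids.shape[1]
    padding_len = (multiple - seq_len % multiple) % multiple
    if padding_len == 0:
        return input_ids, attention_mask
    # padded positions get pad_token_id and attention 0, so they are masked out downstream
    input_ids = F.pad(input_ids, (0, padding_len), value=pad_token_id)
    attention_mask = F.pad(attention_mask, (0, padding_len), value=0)
    return input_ids, attention_mask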

tomaarsen commented 5 months ago

Hello!

I think for the most part this is fairly niche & specific to that architecture, so I'm not sure if it makes sense to support this in Sentence Transformers. That said, more control over the tokenizer would be nice, e.g. being able to specify the padding strategy.
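In the meantime, a user-side stopgap along those lines might look like this. It's only a sketch, not an existing Sentence Transformers option: it assumes model[0] is the Transformer module and that its tokenized features contain "input_ids" / "attention_mask".

import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BEE-spoke-data/mega-encoder-small-16k-v1")
transformer = model[0]  # the Transformer module wrapping the HF model + tokenizer
original_tokenize = transformer.tokenize
pad_token_id = transformer.tokenizer.pad_token_id or 0
MULTIPLE = 1024

def tokenize_padded(texts):
    # run the normal tokenization, then right-pad every sequence tensor
    # up to the next multiple of MULTIPLE
    features = original_tokenize(texts)
    seq_len = features["input_ids"].shape[1]
    padding_len = (MULTIPLE - seq_len % MULTIPLE) % MULTIPLE
    if padding_len:
        for key, pad_value in (("input_ids", pad_token_id), ("attention_mask", 0), ("token_type_ids", 0)):
            if key in features:
                features[key] = F.pad(features[key], (0, padding_len), value=pad_value)
    return features

transformer.tokenize = tokenize_padded  # encode() now pads up to the next multiple of 1024
emb = model.encode("...")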

pszemraj commented 5 months ago

Thanks for the feedback! I'm about 95% sure the reason this issue is so niche in practice is that other models with the same constraint have helper methods in their modeling class(es) that automatically pad inputs to the relevant window/chunk size; see LED and BigBird for examples.

Are we aligned that it makes more sense to implement this in the MEGA modeling code in transformers itself? If so, I can look into opening an issue/PR for it (based on what I already have).