MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

Max Sequence Length #161

Closed Hossein-1991 closed 1 year ago

Hossein-1991 commented 1 year ago

Hi,

The Max Sequence Length for the all-MiniLM-L6-v2 model is 256. What does that mean? Does it mean the total number of tokens must be at most 256? If that is the case, then the kind of tokenizer we use will be important, am I right?

MaartenGr commented 1 year ago

It means that the model can handle at most 256 tokens; any text beyond that will be truncated. A tokenizer is already integrated within all-MiniLM-L6-v2, and it performs the tokenization for you when creating the embeddings. The tokenizer that is passed to KeyBERT, on the other hand, is only used for generating the candidate keywords and keyphrases that are compared against the input document.
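
The counting behaviour described above can be sketched with a toy tokenizer. This is an illustration only: the greedy longest-match split below mimics the idea behind WordPiece (which all-MiniLM-L6-v2 uses internally), but the toy vocabulary and function are made up for this example, not the model's real tokenizer. The point is that the 256-token limit is counted in subword tokens, not in words, so one word can consume several tokens of the budget.

```python
# Illustrative sketch only: a toy greedy longest-match subword tokenizer
# (the idea behind WordPiece). The real all-MiniLM-L6-v2 vocabulary and
# tokenizer differ; this just shows why token count != word count.

MAX_SEQ_LENGTH = 256  # the model's limit, counted in subword tokens


def toy_subword_tokenize(text, vocab):
    """Split each word into the longest known pieces, left to right."""
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            end = len(word)
            # shrink the window until a known piece (or a single char) remains
            while end > start + 1 and word[start:end] not in vocab:
                end -= 1
            piece = word[start:end]
            tokens.append(piece if piece in vocab else "[UNK]")
            start = end
    return tokens


# Hypothetical mini-vocabulary for the demonstration
vocab = {"key", "word", "s", "extract", "ion"}

tokens = toy_subword_tokenize("keywords extraction", vocab)
print(tokens)       # ['key', 'word', 's', 'extract', 'ion']
print(len(tokens))  # 5 tokens from only 2 words

# Truncation: anything past the limit is simply dropped before embedding,
# so only the first MAX_SEQ_LENGTH tokens influence the document embedding.
truncated = tokens[:MAX_SEQ_LENGTH]
```

So whether a long document fits under 256 tokens depends on the model's own subword vocabulary, not on how you would split it into words yourself.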