Closed Hossein-1991 closed 1 year ago

Hi,

The Max Sequence Length for the all-MiniLM-L6-v2 model is 256. What does that mean? Does it mean the total number of tokens must be at most 256? If that is the case, then the kind of tokenizer we use will be important, am I right?

It means that the model can handle at most 256 tokens; any text beyond that is truncated. A tokenizer is already integrated into all-MiniLM-L6-v2 that performs the tokenization for you when creating the embeddings. The tokenizer passed to KeyBERT is used only for generating the candidate keywords and keyphrases that are then compared against the input document.