MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

Some problem about tokenizer #18

Closed svjack closed 3 years ago

svjack commented 3 years ago

I have tried your model, and it is well suited to extracting keywords that capture semantic information.

What I want to ask is this: you first tokenize the document with CountVectorizer. Then, for a candidate keyword with a space inside, such as "learning progress", it seems to be tokenized again inside the encode method of the sentence-transformers model (with the pre_tokenized parameter set to False). So the tokenizers used in these two steps appear to be different: one is the CountVectorizer default, and the other is the tokenizer from the transformers model, e.g. a pretrained tokenizer's tokenize method. Could this mismatch cause any problems?

In my case I work with Chinese documents, so I pre-tokenize the document into phrases and join them with spaces to simulate English-style input. That way both tokenizers process the document as a list of phrases rather than a list of characters, without changing the tokenizer inside the model. This produces reasonable results for me, but for other tasks or domains, could the mismatch be a problem?
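For reference, a minimal sketch of the workaround I describe, assuming jieba as the Chinese segmenter and a multilingual sentence-transformers model (both are just illustrative choices, not something KeyBERT requires):

```python
# Assumed for illustration: jieba for Chinese word segmentation and a
# multilingual sentence-transformers model; neither is mandated by KeyBERT.
import jieba
from keybert import KeyBERT

doc = "自然语言处理是人工智能的一个重要方向"

# Segment into words/phrases and join with spaces so that the
# whitespace-based CountVectorizer sees word-level candidates
# instead of individual characters.
pretokenized_doc = " ".join(jieba.cut(doc))

kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
keywords = kw_model.extract_keywords(pretokenized_doc)
print(keywords)
```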

MaartenGr commented 3 years ago

Sorry for the late response! This should not give any problems. The CountVectorizer is meant to create candidate words/phrases from the text, which are then compared to the BERT embedding of the document. How BERT tokenizes the text under the hood does not conflict with the CountVectorizer, as it is merely a way of creating the embeddings.
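To illustrate, here is a rough sketch of the two independent steps (candidate generation vs. embedding). This is not the exact KeyBERT internals; the model name and n-gram range are just example choices:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

doc = "Supervised learning is the machine learning task of learning a function."

# Step 1: candidate words/phrases come from CountVectorizer's own word-level
# tokenization (whitespace/punctuation based).
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english").fit([doc])
candidates = list(vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn

# Step 2: the sentence-transformers model embeds the document and each
# candidate string; its internal subword tokenizer is only used to build
# these embeddings and never has to match the CountVectorizer.
model = SentenceTransformer("all-MiniLM-L6-v2")  # example model
doc_embedding = model.encode([doc])
candidate_embeddings = model.encode(candidates)

# Step 3: rank candidates by cosine similarity to the document embedding.
similarities = cosine_similarity(doc_embedding, candidate_embeddings)[0]
top_keywords = [candidates[i] for i in similarities.argsort()[::-1][:5]]
print(top_keywords)
```

Because the subword tokenizer only serves to produce embeddings, the two tokenization steps never need to agree on the same units.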