MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

Some problem about tokenizer #18

Closed svjack closed 3 years ago

svjack commented 3 years ago

I have tried your model, and it is well suited to extracting keywords that capture semantic information.

What I want to ask is this: you first tokenize the document with CountVectorizer. Then, for a candidate keyword with a space inside, such as "learning progress", it seems to be tokenized again inside the encode method of the sentence-transformers model (with the pre_tokenized parameter set to False). So the tokenizers used in these two steps appear to be different: one is the CountVectorizer default, and the other is the tokenizer from the transformers model, e.g. a pretrained tokenizer's tokenize method. Could this mismatch cause any problems?

In my case I work with Chinese documents, so I pre-tokenize the document into phrases and join them with spaces to simulate English-style input. That way both tokenizers process the document as a list of phrases rather than a list of characters, without changing the tokenizer inside the model. This produces reasonable results for me, but for other tasks or domains, could the mismatch be a problem?
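For reference, a minimal sketch of the workaround I describe, assuming jieba as the Chinese segmenter and a multilingual sentence-transformers model (both are just illustrative choices, not something KeyBERT requires):

```python
# Assumed for illustration: jieba for Chinese word segmentation and a
# multilingual sentence-transformers model; neither is mandated by KeyBERT.
import jieba
from keybert import KeyBERT

doc = "自然语言处理是人工智能的一个重要方向"

# Segment into words/phrases and join with spaces so that the
# whitespace-based CountVectorizer sees word-level candidates
# instead of individual characters.
pretokenized_doc = " ".join(jieba.cut(doc))

kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
keywords = kw_model.extract_keywords(pretokenized_doc)
print(keywords)
```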

MaartenGr commented 3 years ago

Sorry for the late response! This should not give any problems. The CountVectorizer is meant to create candidate words/phrases from the text, which are then compared to the BERT embedding of the document. How BERT tokenizes the text under the hood does not conflict with the CountVectorizer, as it is merely a way of creating the embeddings.
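To illustrate, here is a rough sketch of the two independent steps (candidate generation vs. embedding). This is not the exact KeyBERT internals; the model name and n-gram range are just example choices:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

doc = "Supervised learning is the machine learning task of learning a function."

# Step 1: candidate words/phrases come from CountVectorizer's own word-level
# tokenization (whitespace/punctuation based).
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english").fit([doc])
candidates = list(vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn

# Step 2: the sentence-transformers model embeds the document and each
# candidate string; its internal subword tokenizer is only used to build
# these embeddings and never has to match the CountVectorizer.
model = SentenceTransformer("all-MiniLM-L6-v2")  # example model
doc_embedding = model.encode([doc])
candidate_embeddings = model.encode(candidates)

# Step 3: rank candidates by cosine similarity to the document embedding.
similarities = cosine_similarity(doc_embedding, candidate_embeddings)[0]
top_keywords = [candidates[i] for i in similarities.argsort()[::-1][:5]]
print(top_keywords)
```

Because the subword tokenizer only serves to produce embeddings, the two tokenization steps never need to agree on the same units.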