Chinese documents and candidates

MaartenGr / KeyBERT

Minimal keyword extraction with BERT

https://MaartenGr.github.io/KeyBERT/

MIT License

Chinese documents and candidates #247

Open bsariturk opened 3 months ago

bsariturk commented 3 months ago

I'm using jieba to tokenize my Chinese documents, as suggested in the issues and in the documentation. The documentation also says that if I use a vectorizer, I cannot use a candidates list. In that case, is there a way to use a candidates list with Chinese documents?

MaartenGr commented 3 months ago

When you pass candidates to KeyBERT, the only thing you are doing is adding them to the CountVectorizer vocabulary. So if you have a custom CountVectorizer, simply pass the list of candidate words to its vocabulary parameter.
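
For anyone landing here later, a minimal sketch of that suggestion might look like the following. It is not taken from the KeyBERT codebase: the helper names `jieba_tokenize`, `build_vectorizer`, and `extract_keywords` are hypothetical, and the multilingual model name is only an assumption about what suits Chinese text.

```python
from sklearn.feature_extraction.text import CountVectorizer


def jieba_tokenize(text):
    """Segment Chinese text into words with jieba (imported lazily)."""
    import jieba

    return jieba.lcut(text)


def build_vectorizer(candidates, tokenizer=jieba_tokenize):
    """Return a CountVectorizer whose vocabulary is the candidate list."""
    # dict.fromkeys drops duplicates while keeping order; scikit-learn
    # rejects a vocabulary that contains the same term twice.
    vocabulary = list(dict.fromkeys(candidates))
    return CountVectorizer(
        tokenizer=tokenizer,
        vocabulary=vocabulary,
        token_pattern=None,  # unused when a custom tokenizer is given
        lowercase=False,
    )


def extract_keywords(doc, candidates):
    """Rank the candidates against doc with KeyBERT (hypothetical helper)."""
    from keybert import KeyBERT

    # Assumption: a multilingual sentence-transformers model is used here.
    kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
    return kw_model.extract_keywords(doc, vectorizer=build_vectorizer(candidates))
```

Calling `extract_keywords(doc, ["机器学习", "模型"])` should then score only the candidates that the tokenizer actually produces for `doc`. A candidate that jieba splits into several tokens will not match a single-word vocabulary entry, so registering it first with `jieba.add_word` may be necessary.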

bsariturk commented 3 months ago

Thank you so much Maarten. I managed to use my candidates list by providing it as vocabulary to a custom vectorizer.