MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

Different output from the Readme #129

Open kamadforge opened 2 years ago

kamadforge commented 2 years ago

When I type (with doc set to the supervised-learning example text from the Readme)

from keybert import KeyBERT

kw_model = KeyBERT()
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)

I get

[('supervised', 0.6676), ('labeled', 0.4896), ('learning', 0.4813), ('training', 0.4134), ('labels', 0.3947)]

which is different from the output in the Readme

[('learning', 0.4604), ('algorithm', 0.4556), ('training', 0.4487), ('class', 0.4086), ('mapping', 0.3700)]

Similarly, with keyphrase_ngram_range=(1, 2) I get

[('supervised learning', 0.6779), ('supervised', 0.6676), ('signal supervised', 0.6152), ('in supervised', 0.6124), ('labeled training', 0.6013)]

which is different from, and less diverse than, the output in the Readme:

[('learning algorithm', 0.6978), ('machine learning', 0.6305), ('supervised learning', 0.5985), ('algorithm analyzes', 0.5860), ('learning function', 0.5850)]

How can I reproduce exactly the same output as the one provided in the Readme?

MaartenGr commented 2 years ago

I am not entirely sure, but the underlying embedding model has been changed a few times over the last year, so the output might differ by now. I believe you can get similar results by using MMR, setting use_mmr=True and diversity=0.5. An embedding model that works quite well, though it is a bit slower, is all-mpnet-base-v2.
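
For reference, a minimal sketch of that suggestion. The doc text below is an abbreviated stand-in for the Readme's supervised-learning example (an assumption, so the exact scores will differ):

from keybert import KeyBERT

# Abbreviated stand-in for the Readme's supervised-learning example text
doc = (
    "Supervised learning is the machine learning task of learning a function "
    "that maps an input to an output based on example input-output pairs."
)

# Pin the embedding model explicitly so results stop drifting when the
# library's default model changes between releases
kw_model = KeyBERT(model="all-mpnet-base-v2")

# MMR trades off relevance against diversity among the returned keyphrases;
# higher diversity values push the keyphrases further apart
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 2),
    stop_words=None,
    use_mmr=True,
    diversity=0.5,
)
print(keywords)

Pinning the model matters more than the MMR settings for reproducibility: two runs with the same pinned model give the same scores, whereas relying on the default model ties your output to whatever version of the library you happen to have installed.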