Closed Pratik--Patel closed 1 year ago
Could you try it without KeyphraseCountVectorizer? Perhaps there is something happening with the tokenizer there.
Thanks for the swift reply. Not using KeyphraseCountVectorizer does seem to fix the issue, but the quality of the key phrases seems to be affected significantly.
Following is the code without KeyphraseCountVectorizer and the corresponding results.
result_1 = model.extract_keywords(docs_1, keyphrase_ngram_range=(1, 3), stop_words='english')
[
[
[
"apple day",
0.83
],
[
"apple day keeps",
0.8135
],
[
"apple",
0.8074
],
[
"day keeps doctor",
0.8011
],
[
"keeps doctor",
0.7826
]
],
[
[
"strawberry good fruit",
0.9904
],
[
"strawberry good",
0.9584
],
[
"strawberry",
0.9463
],
[
"fruit",
0.8886
],
[
"good fruit",
0.8823
]
],
[
[
"microsoft acquired openai",
1.0
],
[
"acquired openai",
0.9409
],
[
"openai",
0.9262
],
[
"microsoft",
0.8121
],
[
"microsoft acquired",
0.8052
]
],
[
[
"openai provides ai",
0.972
],
[
"openai provides",
0.9226
],
[
"openai",
0.9083
],
[
"ai powered tools",
0.855
],
[
"provides ai powered",
0.8526
]
]
]
Many of the key phrases are subsets of a longer key phrase. There are also keywords like "keeps doctor", which are not very meaningful.
Is there a way to include POS features or otherwise improve the quality? Using MMR and Max Sum Distance does not seem to help much. Thanks again for your help!
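One possible stopgap for the sub-phrase overlap described above is to post-process the returned list. The following is only a minimal sketch, not part of KeyBERT: the helper names `is_subphrase` and `drop_subphrases` are hypothetical, and it assumes the results are lists of (phrase, score) pairs as in the output shown earlier.

```python
from typing import List, Tuple

Keyword = Tuple[str, float]


def is_subphrase(short: str, long: str) -> bool:
    """Return True if the words of `short` occur contiguously inside `long`."""
    s, l = short.split(), long.split()
    if len(s) >= len(l):
        return False  # equal-length or longer phrases are never sub-phrases
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))


def drop_subphrases(keywords: List[Keyword]) -> List[Keyword]:
    """Keep a phrase only if no other phrase in the same list contains it."""
    phrases = [phrase for phrase, _ in keywords]
    return [
        (phrase, score)
        for phrase, score in keywords
        if not any(is_subphrase(phrase, other) for other in phrases)
    ]


# Phrases and scores copied from the "strawberry" document in the output above.
strawberry = [
    ("strawberry good fruit", 0.9904),
    ("strawberry good", 0.9584),
    ("strawberry", 0.9463),
    ("fruit", 0.8886),
    ("good fruit", 0.8823),
]
print(drop_subphrases(strawberry))  # [('strawberry good fruit', 0.9904)]
```

This only removes redundancy; it does not address phrases such as "keeps doctor" that are not meaningful on their own, which is where POS-based candidate selection would come in.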
Currently, the only way to include POS features is by customizing the tokenizer in the CountVectorizer, since that is, in a way, where the choice of candidate tokens is made. It is also the same process that KeyphraseCountVectorizer currently follows. It might also just be a bug in KeyphraseCountVectorizer, so posting an issue there might be worthwhile.
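To illustrate the tokenizer approach, here is a rough sketch. The tag lexicon `TOY_POS`, the `KEEP_TAGS` set, and the helpers `pos_tokenizer` and `build_vectorizer` are all hypothetical names invented for illustration; the lexicon stands in for a real POS tagger (for example spaCy or NLTK), which you would need for actual documents.

```python
import re
from typing import Dict, List

# ASSUMPTION: a tiny hand-written tag lexicon standing in for a real POS tagger.
# Words that are not listed are treated as "OTHER" and filtered out.
TOY_POS: Dict[str, str] = {
    "strawberry": "NOUN", "fruit": "NOUN", "good": "ADJ",
    "is": "VERB", "a": "DET",
}
KEEP_TAGS = {"NOUN", "ADJ"}


def pos_tokenizer(text: str) -> List[str]:
    """Tokenize `text` and keep only tokens whose tag is in KEEP_TAGS."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if TOY_POS.get(t, "OTHER") in KEEP_TAGS]


def build_vectorizer():
    """Wire the tokenizer into scikit-learn's CountVectorizer (imported lazily)."""
    from sklearn.feature_extraction.text import CountVectorizer
    # token_pattern=None because the custom tokenizer replaces the default pattern.
    return CountVectorizer(
        tokenizer=pos_tokenizer, token_pattern=None, ngram_range=(1, 3)
    )


print(pos_tokenizer("strawberry is a good fruit"))  # ['strawberry', 'good', 'fruit']
```

The resulting vectorizer could then be passed through the `vectorizer` argument of `extract_keywords`. Note that with this design the n-grams are built from the filtered tokens, so words that were not adjacent in the original text can end up in the same candidate phrase.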
Thanks, will follow it up there.
The results of KeyBERT seem to change when we change the order of the documents that are passed to it, so the output does not appear to be deterministic per document. I have created a reproducible example as follows.
The output is as follows
result_1 for docs_1
result_2 for docs_2
As we can see, for the text "strawberry is a good fruit", "strawberry" is extracted in result_1 whereas "good fruit" is extracted in result_2. Any idea why this might be happening?
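A small helper can make this kind of comparison mechanical rather than visual. This is a sketch only: `phrases_by_doc` and `order_dependent_docs` are hypothetical names, the documents and scores below are placeholders rather than real KeyBERT output, and it assumes each document text appears only once per list.

```python
from typing import Dict, List, Sequence, Tuple

Keyword = Tuple[str, float]


def phrases_by_doc(
    docs: Sequence[str], results: Sequence[List[Keyword]]
) -> Dict[str, List[str]]:
    """Map each document text to the phrases extracted for it (scores dropped)."""
    return {doc: [phrase for phrase, _ in kws] for doc, kws in zip(docs, results)}


def order_dependent_docs(
    docs_a: Sequence[str], results_a: Sequence[List[Keyword]],
    docs_b: Sequence[str], results_b: Sequence[List[Keyword]],
) -> List[str]:
    """Return the documents whose extracted phrases differ between two runs."""
    a = phrases_by_doc(docs_a, results_a)
    b = phrases_by_doc(docs_b, results_b)
    return [doc for doc in docs_a if doc in b and a[doc] != b[doc]]


# Placeholder data (NOT real KeyBERT output): same documents, different order.
docs_a = ["text A", "text B"]
docs_b = ["text B", "text A"]
results_a = [[("alpha", 0.5)], [("beta", 0.5)]]
results_b = [[("gamma", 0.5)], [("alpha", 0.5)]]

print(order_dependent_docs(docs_a, results_a, docs_b, results_b))  # ['text B']
```

With real data, `results_a` and `results_b` would be the return values of `extract_keywords` for the two orderings of the same documents.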