LIAAD / yake

Single-document unsupervised keyword extraction
https://liaad.github.io/yake

Why does YAKE miss the COVID-19 keyword in its output? #57

Closed. gvalchca closed this issue 2 years ago

gvalchca commented 2 years ago

Hi, why would YAKE not return COVID-19 in any of the keywords for the following example:

occupational stress and mental health among anesthetists during the COVID-19 pandemic.

With default parameters, the output looks like this:

pandemic 0.04491197687864554
occupational stress 0.04940384002065631
stress and mental 0.09700399286574239
mental health 0.09700399286574239
health among anesthetists 0.09700399286574239
occupational 0.15831692877998726
stress 0.29736558256021506
mental 0.29736558256021506
health 0.29736558256021506
anesthetists 0.29736558256021506

arianpasquali commented 2 years ago

Hi @gvalchca, this is currently a limitation. YAKE does not handle tokens that contain special characters such as - or digits well.

In this case I would recommend normalising all COVID-19 mentions to simply COVID and it will work just fine.
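
As a rough illustration of that normalisation step, here is a minimal sketch that uses only the Python standard library. The helper name normalize_covid and the regular expression are assumptions made for this example; they are not part of YAKE.

```python
import re

# Hypothetical helper, not part of YAKE: rewrite "COVID-19" (any casing,
# with a hyphen, a space, or nothing before the digits) to the token "COVID".
_COVID_PATTERN = re.compile(r"\bCOVID[-\s]?19\b", flags=re.IGNORECASE)


def normalize_covid(text: str) -> str:
    """Return text with every COVID-19 mention replaced by COVID."""
    return _COVID_PATTERN.sub("COVID", text)


text = ("occupational stress and mental health among anesthetists "
        "during the COVID-19 pandemic.")
normalized = normalize_covid(text)
print(normalized)
```

The normalised string can then be passed to the keyword extractor in place of the original text. The pattern only targets COVID-19, so other tokens that mix letters and digits are left untouched.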

Ideally we should improve the algorithm to manage this better. If you have ideas, please send us a PR :)

arianpasquali commented 2 years ago

Related issue and explanation by @rncampos here

arianpasquali commented 2 years ago

Hi @gvalchca, this is not exposed by the API, but you could experiment with DataCore's tagsToDiscard parameter. By default it ignores digits.

Further explanation can be found here
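
To make the idea of tag-based discarding more concrete, here is a toy sketch. It does not call YAKE: the function names, the tag letters ("d" for numeric tokens, "u" for tokens that mix letters and digits, "p" for plain words) and the rules are assumptions chosen for illustration, and the real tagging inside DataCore may differ.

```python
# Illustrative only: a simplified tagger loosely modelled on the behaviour
# described in this thread. Tag names and rules are assumptions, not YAKE code.
def toy_tag(token: str) -> str:
    try:
        float(token.replace(",", ""))
        return "d"  # token parses as a number
    except ValueError:
        pass
    n_digits = sum(ch.isdigit() for ch in token)
    n_alpha = sum(ch.isalpha() for ch in token)
    if (n_digits > 0 and n_alpha > 0) or (n_digits == 0 and n_alpha == 0):
        return "u"  # mixes letters and digits, or contains neither
    return "p"  # plain alphabetic word


def keep_tokens(tokens, tags_to_discard=frozenset({"u", "d"})):
    """Drop every token whose toy tag is in tags_to_discard."""
    return [t for t in tokens if toy_tag(t) not in tags_to_discard]


tokens = ["stress", "COVID-19", "pandemic", "IL2", "IL6", "2020"]
print(keep_tokens(tokens))                                     # discards "u" and "d"
print(keep_tokens(tokens, tags_to_discard=frozenset({"d"})))   # discards only "d"
```

Under these assumed rules, COVID-19, IL2 and IL6 are all dropped with the default set and survive once the mixed-token tag is no longer discarded, which is the kind of effect that changing tagsToDiscard is meant to have.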

gvalchca commented 2 years ago

Hey, thanks for your answer, and sorry for duplicating the thread. However, that solution would not work for me, because biology and medicine have plenty of such abbreviations with meaningful numbers (e.g. IL2, IL6).