Notebook that compares keyword extraction approaches on longevity papers

PrimozGodec commented 3 years ago

This comparison was made for an article.

PrimozGodec commented 3 years ago

I am also attaching a word cloud for those articles.

Screenshot 2021-03-16 at 12 12 05

djukicn commented 3 years ago

There might be a reason for precision and recall being so low. If I'm not mistaken, this notebook originated from the SemEval evaluation, where the ground-truth keywords were already stemmed and the code kept them as they are. In the PubMed data, the keywords are not preprocessed, but the notebook stems the text, so extracted stems are compared against unstemmed keywords. Also, I'd recommend lemmatization instead of stemming, because stemming might take away too much information and reduce the score achieved by embedding-based keyword extraction. I also noticed that embedding_document_keywords is used with the Slovenian model instead of the English one.
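To illustrate the mismatch, here is a minimal sketch, not the notebook's actual code. The `toy_stem` function is a hypothetical stand-in for a real stemmer, the keyword lists are made-up examples, and `precision_recall` assumes exact-match scoring, which may differ from how the notebook scores keywords.

```python
def toy_stem(word):
    """Crude suffix stripper; a stand-in for a real stemmer, for illustration only."""
    for suffix in ("ing", "ity", "es", "s"):
        # Strip the first matching suffix, but keep at least three characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def precision_recall(predicted, truth):
    """Exact-match precision and recall between two keyword collections."""
    predicted, truth = set(predicted), set(truth)
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical data: keywords extracted from stemmed text vs. raw author keywords.
truth_raw = ["telomeres", "longevity", "sirtuins"]
extracted = [toy_stem(w) for w in truth_raw]  # what an extractor sees after stemming

# Comparing stems against unprocessed keywords: nothing matches.
mismatched = precision_recall(extracted, truth_raw)

# Applying the same preprocessing to the ground truth restores the matches.
matched = precision_recall(extracted, [toy_stem(w) for w in truth_raw])

print("stemmed vs raw truth:    ", mismatched)
print("stemmed vs stemmed truth:", matched)
```

Even a perfect extractor scores zero in the first comparison, because every stem differs from its unprocessed keyword. Whatever normalization is applied to the text should be applied to the ground-truth keywords too.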

ajdapretnar commented 3 years ago

That's a good find. One must indeed be careful with preprocessing!

PrimozGodec commented 3 years ago

@djukicn thank you for your observation. I totally overlooked that the true keywords were not preprocessed the same way as the text. It is fixed now.

Below, I am attaching the results produced with stemming and with lemmatization. As you correctly assumed, the embedding method gives better results with lemmatization. I do not fully understand the reason for that.

Stemming: Screenshot 2021-03-28 at 15 40 32

Lemmatization: Screenshot 2021-03-28 at 15 40 40

biolab / text-semantics

Notebook that compares keyword extraction approaches on longevity papers #64