Closed PrimozGodec closed 3 years ago
I am additionally appending a word cloud for those articles.
There might be a reason for precision and recall being so low. If I'm not mistaken, this notebook originated from the SemEval evaluation, where the ground-truth keywords were already stemmed and the code kept them as they are. In the PubMed data, the keywords are not preprocessed, but the notebook stems the text. Also, I'd recommend lemmatization instead of stemming, because stemming might take away too much information and reduce the score achieved by embedding-based keyword extraction. I also noticed that embedding_document_keywords is used with the Slovenian model instead of the English one.
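To illustrate the mismatch described above, here is a minimal sketch, not the notebook's actual code. The `toy_stem` function is a hypothetical stand-in for a real stemmer, and the keyword lists are made up for illustration. It shows how comparing stemmed extracted keywords against unpreprocessed ground-truth keywords can drive precision and recall to zero, while applying the same preprocessing to both sides recovers the matches.

```python
def toy_stem(word):
    """Crude suffix stripper; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ation", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def precision_recall(predicted, truth):
    """Exact-match precision and recall between two keyword collections."""
    predicted, truth = set(predicted), set(truth)
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Made-up example data: ground truth as provided (not preprocessed).
truth_raw = ["proteins", "sequencing", "mutations"]

# Keywords extracted from text that went through stemming.
extracted = [toy_stem(w) for w in ["proteins", "sequencing", "cells"]]

# Mismatched preprocessing: stemmed predictions vs. raw ground truth.
mismatched = precision_recall(extracted, truth_raw)

# Consistent preprocessing: stem the ground truth the same way.
truth_stemmed = [toy_stem(w) for w in truth_raw]
matched = precision_recall(extracted, truth_stemmed)

print("mismatched:", mismatched)
print("matched:", matched)
```

The same reasoning applies to lemmatization: whichever normalization is applied to the text should also be applied to the ground-truth keywords before scoring.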
That's a good find. One must indeed be careful with preprocessing!
@djukicn thank you for your observation. I totally overlooked that the true keywords were not preprocessed the same way as the text. It is fixed now.
Below, I am appending results produced with stemming and lemmatization. As you correctly assumed, the embedding method gives better results with lemmatization. I do not understand exactly what the reason for that is.
Stemming:
Lemmatization:
This comparison was made for an article.