Precision on wikipedia index preprocessing

dice-group / Palmetto

Palmetto is a quality measuring tool for topics

GNU Affero General Public License v3.0

Hi Evan Dufraisse,

Thank you for your interest in our project. :smiley:

We used the Stanford CoreNLP library for preprocessing (including lemmatization). I am 99% sure that the words are lower-cased (the effect can be seen in issue #19 :wink: ). I also think that we applied the lemmatizer first, before transforming the words into their lower-cased form (it simply makes more sense in this order :wink: ).
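
To make the order described above concrete, here is a minimal sketch. It does not use Stanford CoreNLP: the tiny lemma table and the names `TOY_LEMMAS`, `toy_lemmatize` and `preprocess` are hypothetical stand-ins, meant only to show "lemmatize first, lower-case second". It is not the code that built the index.

```python
from typing import List

# Hypothetical toy lemma table. This is NOT Stanford CoreNLP; it only
# stands in for a real lemmatizer so the order of the steps is visible.
TOY_LEMMAS = {
    "Topics": "Topic",
    "topics": "topic",
    "was": "be",
    "measured": "measure",
}


def toy_lemmatize(token: str) -> str:
    """Return the lemma from the toy table, or the token unchanged."""
    return TOY_LEMMAS.get(token, token)


def preprocess(tokens: List[str]) -> List[str]:
    """Lemmatize each token first, then lower-case the resulting lemma."""
    return [toy_lemmatize(token).lower() for token in tokens]


if __name__ == "__main__":
    print(preprocess(["Topics", "was", "measured", "Wikipedia"]))
```

If the index was indeed built this way, words compared against it would presumably need the same treatment (lemma first, then lower-case) to match its entries.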

You may also want to take into consideration that we created the index in 2014. Depending on the documents that you use to generate your topics, this might influence your results as well. In 2014, Donald Trump hadn't even started his presidential campaign and (obviously) COVID-19 did not exist. So if you process news articles, a newer reference corpus might be interesting :thinking:

Cheers, Michael Röder

Precision on wikipedia index preprocessing #60