dice-group / Palmetto

Palmetto is a quality measuring tool for topics
GNU Affero General Public License v3.0

Precision on wikipedia index preprocessing #60

EvanDufraisse closed this issue 2 years ago

EvanDufraisse commented 2 years ago

Dear Michael Roeder,

Thanks a lot for sharing this useful work!

As you point out in your instructions for Lucene index creation, the preprocessing steps applied to the indexed dataset must be the same as those applied to the dataset used for topic modelling.

Which preprocessor did you use for the lemmatization? Did you lower-case all words? I'd be glad if you still have that information, so I know whether I need to compute a new index.

Thanks again,

Evan Dufraisse

MichaelRoeder commented 2 years ago

Hi Evan Dufraisse,

Thank you for your interest in our project. :smiley:

We used the Stanford Core NLP library for preprocessing (including lemmatization). I am 99% sure that the words are lower-cased (the effect can be seen in issue #19 :wink: ). I also think that we applied the lemmatizer first, before transforming the words into their lower-cased form (it simply makes more sense in this order :wink: ).
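As a rough illustration of the order described above (lemmatize first, then lower-case), here is a minimal Python sketch. It does not call Stanford Core NLP: `toy_lemmatize` and `TOY_LEMMAS` are hypothetical stand-ins for a real lemmatizer, and `preprocess` is a name made up for this example. It only shows the idea of running one identical function over both the reference corpus and the topic words.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-in for a real lemmatizer such as Stanford Core NLP.
# It only knows the few case-sensitive entries listed here.
TOY_LEMMAS = {"Cats": "Cat", "cats": "cat", "running": "run", "was": "be"}


def toy_lemmatize(token: str) -> str:
    """Return the lemma from the toy table, or the token unchanged."""
    return TOY_LEMMAS.get(token, token)


def preprocess(tokens: Iterable[str],
               lemmatize: Callable[[str], str] = toy_lemmatize) -> List[str]:
    """Lemmatize each token first, then lower-case the resulting lemma."""
    return [lemmatize(token).lower() for token in tokens]


# The same function is applied to both sides so their vocabularies match.
reference_doc = preprocess(["Cats", "running"])
topic_words = preprocess(["cats", "was", "Running"])
```

With a real lemmatizer plugged in through the `lemmatize` argument, the point stays the same: whatever steps were used to build the index should also be applied to the words of the topics that are evaluated against it.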

You may also want to take into consideration that we created the index in 2014. Depending on the documents that you use to generate your topics, this might influence your results as well. In 2014, Donald Trump hadn't even started his presidential campaign and (obviously) COVID-19 did not exist. So if you process news articles, a newer reference corpus might be interesting :thinking:

Cheers, Michael Röder