Topic word not in reference corpus

dkltimon commented 4 years ago

Hi,

I have a question regarding the calculation of topic coherence (for example NPMI).

Let's say I have a topic of five very rare words. None of them occur in my reference corpus (Wikipedia). What result will I get? 0? If the result is 0, it doesn't reflect the true interpretability of this topic, does it? After all, NPMI can take negative values, which indicate that a topic is not very interpretable.

MichaelRoeder commented 4 years ago

The behavior of NPMI is not well defined for this case. The classical computation would give you NaN, because the ratio inside the logarithm becomes 0/0. However, in Palmetto, we have an additional check for the probabilities before we calculate NPMI. If either of the two words has a probability of 0.0, we set the NPMI to 0.
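To make the check described above concrete, here is a minimal Python sketch. It is not Palmetto's code (Palmetto is written in Java); the function name `npmi`, the parameter `undefined_value`, and the handling of the two edge cases `p_ab == 0` and `p_ab == 1` are assumptions made for illustration only.

```python
import math


def npmi(p_a, p_b, p_ab, undefined_value=0.0):
    """Normalized PMI of two words from their marginal and joint probabilities.

    undefined_value (hypothetical name) is returned when the score cannot be
    computed, mirroring the fallback of 0 described in the thread.
    """
    # A word that never occurs in the reference corpus: the ratio
    # p_ab / (p_a * p_b) would be 0/0, so return the fallback instead.
    if p_a == 0.0 or p_b == 0.0:
        return undefined_value
    # Assumption: both words occur but never together, so use the limit -1.
    if p_ab == 0.0:
        return -1.0
    # Assumption: both words occur everywhere; -log(1) = 0 would divide by zero.
    if p_ab == 1.0:
        return undefined_value
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


unseen = npmi(0.0, 0.0, 0.0)          # both words missing from the corpus
independent = npmi(0.5, 0.5, 0.25)    # statistically independent words
```

With the default fallback, a pair of unseen words gets the same score as a pair of statistically independent words, which is the "no information" reading of 0.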

This behavior can be changed by passing -1 to the constructor of the NPMI calculation. However, it is debatable whether this is better than 0. From my point of view, 0 reflects the actual situation (i.e., the system simply has no information about the terms of the topic) better than -1.
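The effect of the two fallback choices on a whole topic can be sketched as follows. This is only an illustration and does not use Palmetto's API: it assumes the topic score is a plain average of pairwise NPMI values, and the function `topic_coherence`, the `None` marker for undefined pairs, and the example scores are all hypothetical.

```python
from statistics import mean


def topic_coherence(pair_scores, undefined_value=0.0):
    """Average pairwise scores; None marks a pair involving an unseen word."""
    return mean(undefined_value if s is None else s for s in pair_scores)


# Five unseen words give 10 word pairs, none of which can be scored.
all_unseen = [None] * 10
neutral = topic_coherence(all_unseen)            # fallback 0
penalised = topic_coherence(all_unseen, -1.0)    # fallback -1

# Made-up scores for a three-word topic in which one word is unseen.
mixed = [0.4, None, None]
mixed_neutral = topic_coherence(mixed)
mixed_penalised = topic_coherence(mixed, -1.0)
```

Under these assumptions, the fully unseen topic scores 0 with the default fallback and -1 with the alternative, while in the mixed topic the fallback of -1 pulls an otherwise positive average below zero.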

If you change it (in your local Palmetto instance), I would suggest documenting this in your later publication, report, or whatever you use the numbers for :wink:

dkltimon commented 4 years ago

Thank you very much for your help!

dice-group / Palmetto

Topic word not in reference corpus #33