dice-group / Palmetto

Palmetto is a quality measuring tool for topics
GNU Affero General Public License v3.0
209 stars, 36 forks

c_v gave the full score, 1.0, for clusters where none of the words ever occurred in the same document #21

Closed: torhaa closed this 6 years ago

torhaa commented 6 years ago

I encountered this problem while evaluating fast-text clusters for a corpus of a million articles from nrk.no. I fixed it by redefining cosine similarity for the case where both term frequency vectors have length 0: the cosine similarity of two zero-length vectors is now set to 0. Clusters whose words never occur together now get a score of 0.
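To illustrate the redefinition, here is a minimal sketch in Python. It is not the actual patch to Palmetto; the function name `cosine_similarity` is hypothetical. It also assumes that the similarity should be 0 when either vector has zero length, which is slightly broader than the both-zero case described above.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors.

    Assumption for this sketch: if either vector has zero length (norm),
    the similarity is defined as 0.0 instead of being treated as a
    perfect match or causing a division by zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # No co-occurrence information at all: report no similarity.
        return 0.0
    return dot / (norm_u * norm_v)
```

With this definition, `cosine_similarity([0, 0], [0, 0])` yields 0.0, so a cluster whose words never share a document no longer receives the full score.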

MichaelRoeder commented 6 years ago

You are right :smile:

I implemented it from a theoretical point of view where it is fine to state that two vectors with length = 0 are equal. However, you are right that in the current use case this definition does not make much sense. Thanks for the fix :+1: