dkpro / dkpro-similarity

Word and text similarity measures
https://dkpro.github.io/dkpro-similarity

CosineSimilarity mishandles IDF scores #32

Closed: nicolaierbs closed this issue 9 years ago

nicolaierbs commented 9 years ago

Original issue 32 created by dkpro on 2014-09-03T17:26:40.000Z:

I think that CosineSimilarity is mishandling passed-in IDF scores. There is no documentation on whether the scores should be the raw count of token t in the corpus, or 1/t. Tests in CosineSimilarityTest, such as idfL2(), indicate that IDF scores should be in the form 1/t, and in r.538, CosineSimilarity clearly handles some IDF scores this way: [line 288] `score = idfScores.get(term);` [line 315] `vector.set(i, vector.get(i) * score);` which only makes sense if score = 1/t (0 <= score <= 1).
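For reference, here is a minimal standalone sketch, not the project's code, of the weighting step those lines perform: each term's vector entry is multiplied by its passed-in idf score, which only behaves sensibly if the scores are already in 1/t form (the terms, counts, and class name below are made up for illustration).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class IdfWeightingSketch {
    public static void main(String[] args) {
        String[] terms = { "similarity", "the", "cosine" };
        double[] vector = { 2.0, 5.0, 1.0 };        // raw term frequencies for one text

        // idf scores in the 1/t form that idfL2() implies
        Map<String, Double> idfScores = new HashMap<>();
        idfScores.put("similarity", 1.0 / 3.0);     // term occurs in 3 documents
        idfScores.put("the", 1.0 / 100.0);          // very common term
        idfScores.put("cosine", 1.0 / 2.0);

        for (int i = 0; i < terms.length; i++) {
            double score = idfScores.get(terms[i]); // analogous to line 288
            vector[i] = vector[i] * score;          // analogous to line 315
        }
        // common terms are down-weighted only because the scores are already 1/t
        System.out.println(Arrays.toString(vector));
    }
}
```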

However, in r.513: [line 302] `if (weightingModeIdf == WeightingModeIdf.BINARY) {` [line 303] `score = score >= 1 ? 1 : 0;` [line 304] `}`

I can't quite read this, but I think it could be rewritten as: `if (score >= 1) { score = 1; } else { score = 0; }`

This code appears to be a copy-paste from the TF handling: [line 270] `if (weightingModeTf == WeightingModeTf.BINARY) {` [line 271] `score = score >= 1 ? 1 : 0;` [line 272] `}` where it makes sense: for TF, it is the binary form of whether a token appeared in a document at all, instead of the count of how frequently it occurred in the document.
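For comparison, a tiny hypothetical sketch (made-up values, not the project's code) of what that binary TF branch does, which is why the ternary is reasonable there:

```java
public class BinaryTfSketch {
    public static void main(String[] args) {
        double[] termFrequencies = { 0.0, 1.0, 7.0 };   // counts within one document
        for (double score : termFrequencies) {
            score = score >= 1 ? 1 : 0;                 // same expression as line 271
            System.out.println(score);                  // prints 0.0, 1.0, 1.0
        }
    }
}
```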

There is a similar concept for IDF, namely document frequency (DF): the number of documents in a collection that a term occurs in. Instead of DF, a user might also supply collection frequency (CF): the total number of occurrences of a term in the collection. I think that the IDF BINARY code is trying to generate DF counts from CF counts. If this were possible, it would allow the user to pass in "idfScores" that are actually CF scores and convert them using WeightingModeIdf.BINARY. Of course, the necessary information has already been lost by that point, so this is an impossible task.
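To make the DF/CF distinction concrete, here is a small made-up example (not project code) showing that two terms can have identical CF but different DF, so DF cannot be reconstructed from CF alone:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

public class DfVsCf {
    public static void main(String[] args) {
        String[][] corpus = {
            { "cosine", "cosine", "cosine" },   // 1 document, 3 occurrences
            { "idf" }, { "idf" }, { "idf" }     // 3 documents, 3 occurrences
        };
        Map<String, Integer> df = new HashMap<>();  // document frequency
        Map<String, Integer> cf = new HashMap<>();  // collection frequency
        for (String[] doc : corpus) {
            for (String t : doc) {
                cf.merge(t, 1, Integer::sum);
            }
            for (String t : new HashSet<>(Arrays.asList(doc))) {
                df.merge(t, 1, Integer::sum);
            }
        }
        // Both terms have cf == 3, but df is 1 vs. 3, so a map of CF counts
        // cannot be turned into DF (or binary DF) after the fact.
        System.out.println("cf = " + cf + ", df = " + df);
    }
}
```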

There's a small possibility that the desired behavior is "if this term occurs in any document in the corpus, set the score to 1." I can't imagine a research application for this, though, and suggest that specific documentation be added if this is the desired behavior, since the name is so close to standard DF usage.

WeightingModeIdf.BINARY is rewritten in r.538 to do something completely different: [line 297] `if (weightingModeIdf == WeightingModeIdf.BINARY) {` [line 298] `if (score < 1.0) {` [line 299] `score = 0.0;` [line 300] `}` [line 301] `}`

This new code is harmless if the passed-in IDF scores are raw counts of token t: the counts are either >= 1 and left unchanged, or they are 0 anyway (which defeats minScore for Log weighting but is otherwise fine).

However, if the IDF scores are in the form 1/t and WeightingModeIdf.BINARY is used (which a user would presumably choose as the normal configuration for passed-in DF scores, because IDF cannot be null when using TFIDF), any term that occurs in more than one document ends up with score = 0, because its IDF score is < 1. Oddly, the old, different code caused the same problem.
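A minimal repro of that effect, with made-up scores (not project code): applying the r.538 BINARY branch to scores in 1/t form zeroes every term that occurs in more than one document:

```java
public class BinaryIdfEffect {
    public static void main(String[] args) {
        double[] idfScores = { 1.0, 1.0 / 2.0, 1.0 / 10.0 };  // terms with df = 1, 2, 10
        for (double score : idfScores) {
            if (score < 1.0) {            // the r.538 BINARY branch (lines 297-301)
                score = 0.0;
            }
            System.out.println(score);    // prints 1.0, 0.0, 0.0
        }
    }
}
```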

Summary: CosineSimilarity produces incorrect similarity scores if the idfScores are DF counts, because the DF scores are never inverted to IDF. However, CosineSimilarity also produces incorrect similarity scores if the idfScores are in IDF (1/t) form, because lines 297-301 reduce them to 0. So it is currently impossible to use IDF scores with WeightingModeIdf.BINARY.

Changes: I suggest the developers make a choice on the input format of idfScores, document it, and then we can all help rewrite the code to handle that format. I also suggest the developers consider combining WeightingModeIdf.BINARY and PASSTHROUGH, because, given only a `Map<String, Double> idfScores`, the CosineSimilarity class can't differentiate between DF and CF scores anyway (unless BINARY means "any document in the corpus" as described above). Otherwise, BINARY and PASSTHROUGH could be renamed to DOCUMENTFREQUENCY and COLLECTIONFREQUENCY, etc.
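Purely as an illustration of the renaming idea (these constants do not exist in dkpro-similarity and the names are only a suggestion), the contract could be spelled out along these lines:

```java
// Hypothetical replacement for WeightingModeIdf's BINARY/PASSTHROUGH
public enum IdfInputFormat {
    INVERSE_DOCUMENT_FREQUENCY,  // scores are already 1/df
    DOCUMENT_FREQUENCY,          // scores are df counts; the measure must invert them
    COLLECTION_FREQUENCY         // scores are cf counts; cannot be converted to df
}
```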

-Emily

nicolaierbs commented 9 years ago

Comment #1 originally posted by dkpro on 2014-09-03T19:54:35.000Z:

As your comment is quite long, let's take it step by step:

CosineSimilarity expects 1/df (aka idf) scores.
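In other words, the caller precomputes df per term and passes 1/df into the measure; a minimal sketch of that preprocessing (made-up counts and class name, not project code):

```java
import java.util.HashMap;
import java.util.Map;

public class IdfScoresFromDf {
    public static void main(String[] args) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        documentFrequency.put("cosine", 2);     // term occurs in 2 documents
        documentFrequency.put("the", 100);      // term occurs in 100 documents

        Map<String, Double> idfScores = new HashMap<>();
        for (Map.Entry<String, Integer> e : documentFrequency.entrySet()) {
            idfScores.put(e.getKey(), 1.0 / e.getValue());  // the expected 1/df form
        }
        System.out.println(idfScores);  // per-term 1/df values, e.g. 0.5 for "cosine"
    }
}
```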

The implementation never really explains what binary idf weighting should actually be. You are definitely right that the current code makes little sense. I am not even sure why we have this option (it might come from some original tf.idf paper).

I tend to just remove this option.

nicolaierbs commented 9 years ago

Comment #2 originally posted by dkpro on 2014-09-03T20:20:51.000Z:

Ok, the consensus is that CosineSimilarity should use 1/df scores. I agree that binary idf should be removed. Is passthrough idf now the prototypical tfidf case?

nicolaierbs commented 9 years ago

Comment #3 originally posted by dkpro on 2014-09-03T20:23:33.000Z:

I would say that "log" is standard as it is considered to be a bit more stable regarding the resulting features.
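For illustration only (the thread does not spell out the exact formula the LOG mode uses), the common log-idf formulation log(N/df) compresses the range compared with raw 1/df, which is the stability argument:

```java
public class LogIdfSketch {
    public static void main(String[] args) {
        int numDocuments = 1000;                          // assumed corpus size
        int[] documentFrequencies = { 1, 10, 100, 1000 };
        for (int df : documentFrequencies) {
            double rawIdf = 1.0 / df;                             // spans 1.0 down to 0.001
            double logIdf = Math.log((double) numDocuments / df); // spans ~6.9 down to 0.0
            System.out.printf("df=%4d  1/df=%.4f  log(N/df)=%.4f%n", df, rawIdf, logIdf);
        }
    }
}
```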

nicolaierbs commented 9 years ago

Comment #4 originally posted by dkpro on 2014-09-03T20:29:19.000Z:

<empty>