dkpro / dkpro-similarity

Word and text similarity measures
https://dkpro.github.io/dkpro-similarity

CosineSimilarity with IDF returns NaN #30

Closed nicolaierbs closed 9 years ago

nicolaierbs commented 9 years ago

Original issue 30 created by dkpro on 2014-08-15T19:00:24.000Z:

CosineSimilarity returns NaN instead of a valid double when in IDF or TFIDF weighting mode. When the CosineSimilarity NaN bug was first discussed in 2012 (Issue 4), only TF mode was tested; TF mode now works fine.

The current NaNs tend to occur when one of the two texts is very short (less than 15 characters, such as "thanks" or "Done!"), except for IDF weighting with LOG, which produces exclusively NaNs.

Some sample parameters that will reproduce this problem:

IDF weighting mode:
weightingModeTf = null;
weightingModeIdf = CosineSimilarity.WeightingModeIdf.BINARY; (or PASSTHROUGH, LOG, LOGPLUSONE)
normalizationMode = CosineSimilarity.NormalizationMode.L2; (or L1)

TFIDF weighting mode:
weightingModeTf = CosineSimilarity.WeightingModeTf.FREQUENCY_LOGPLUSONE; (or BINARY)
weightingModeIdf = CosineSimilarity.WeightingModeIdf.BINARY;
normalizationMode = CosineSimilarity.NormalizationMode.L2;

These findings used a corpus of 600 text pairs with an average length of three sentences each. An IDF Map<String, Double> was provided to CosineSimilarity. I used a current snapshot of DKPro Similarity.

nicolaierbs commented 9 years ago

Comment #1 originally posted by dkpro on 2014-08-15T19:12:45.000Z:

Thanks for reporting. It would be even better, though, if you could provide a minimal test case that shows what is going on. One test case with two texts from your collection that fail should be sufficient.

nicolaierbs commented 9 years ago

Comment #2 originally posted by dkpro on 2014-08-15T19:41:57.000Z:

I am not sure I can provide a minimal example, because the NaN values are calculated from the DF scores of the entire corpus. When the corpus changes, the DF values change too. I am currently using a DF Map<String, Double> with 4678 words (keys) and their DF counts.
The choice of IDF formula also changes which pairs produce NaN.
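For context, such a DF map can be built by counting, for each word, the number of documents that contain it at least once. A minimal sketch (hypothetical helper, not DKPro code):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class DfCounter {
    // Hypothetical helper (not DKPro code): count, for each word, the
    // number of documents in which it occurs at least once.
    static Map<String, Double> documentFrequencies(List<List<String>> docs) {
        Map<String, Double> df = new HashMap<>();
        for (List<String> doc : docs) {
            // Deduplicate within a document so each word counts once per doc.
            for (String token : new HashSet<>(doc)) {
                df.merge(token, 1.0, Double::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("NHC", "emailed"),
                Arrays.asList("emailed", "again", "emailed"));
        System.out.println(documentFrequencies(docs).get("emailed")); // 2.0
        System.out.println(documentFrequencies(docs).get("NHC"));     // 1.0
    }
}
```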

For me, this text pair fails (NaN) with the following parameters: WeightingModeTf.FREQUENCY_LOGPLUSONE, WeightingModeIdf.BINARY, NormalizationMode.L2
View1text: OK, I thought that was the case - so shouldn't this have a mention in the article?
View2text: Done.
Similarity: NaN

With those parameters, this text pair does not fail:
View1text: I really don't know how to cite this properly - help!!!
View2text: You're best to find the EDM800 in the official record of Parliament, Hansard, than on a website.
Similarity: 0.09995161063087137

Also for me, nearly all text pairs fail when the IDF weight is LOG. Here is an example with the parameters WeightingModeTf.FREQUENCY_LOGPLUSONE, WeightingModeIdf.LOG, NormalizationMode.L2:
View1text: Without telling them why, I have directed many of my French friends to this article. The first thing that hits them is the title with "noir" being lower case. I am not saying that the French should rule en:wiki, but that is their reaction.
View2text: Thanks for the work - I'm satisfied that this is as well documented as can be, and apologies if this turned into a tempest in a teapot. I don't remember seeing it this way, but my days of perusing Le monde and noting spelling oddities (between naps) have been regrettably limited. If this is moved/fixed now, I don't see any reasonable basis for objection.
Similarity: NaN

nicolaierbs commented 9 years ago

Comment #3 originally posted by dkpro on 2014-08-15T21:09:53.000Z:

If you cannot easily provide a test, could you maybe debug the process and report the line where the NaN occurs, along with the values of the most important variables (tf count, df count, etc.) at that point?

nicolaierbs commented 9 years ago

Comment #4 originally posted by dkpro on 2014-08-16T10:13:41.000Z:

I just found that my previously posted results were actually produced with DKPro Similarity 2.1.0, not 2.2.0-SNAPSHOT. 2.2.0-SNAPSHOT also produces NaN, but under different conditions.

Following the example of DKPro TC's Greedy String Tiling, and without reading CosineSimilarity too closely, I passed Strings to the getSimilarity(String string1, String string2) method, which is the central method in Greedy String Tiling. It turns out, however, that CosineSimilarity's central method is getSimilarity(Collection aTerms1, Collection aTerms2), and that its getSimilarity(String string1, String string2) method was changed from a convenience method into something completely different: character-based cosine similarity on the strings. I point this out to make sure the DKPro Similarity developers agree that this is a good idea.

The following results use: WeightingModeTf.FREQUENCY_LOGPLUSONE WeightingModeIdf.BINARY NormalizationMode.L2

This results in the following similarity scores for the text pair:
Text1: OK, I thought that was the case - so shouldn't this have a mention in the article?
Text2: Done.
2.1.0 String method: Similarity: NaN
2.1.0 Collection method: Similarity: 0.0
2.2.0-SNAPSHOT String method: Similarity: 0.4581651745778347
2.2.0-SNAPSHOT Collection method: Similarity: 0.0

Both of 2.2.0-SNAPSHOT's String and Collection methods seem to produce NaNs when one of the texts is only whitespace. Here is an example with accompanying vector values:
Text1: NHC emailed.
Text2: (whitespace only)

String method:
Text1 TF unweighted score vector: 1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Text1 DF unweighted score vector: 3.0,2.0,347.0,16.0,425.0,350.0
Text2 TF unweighted score vector: 0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
Text2 DF unweighted score vector: (empty)
Similarity: NaN
vector1: 0.6931471805599453,0.0,1.0986122886681098,0.6931471805599453,0.0,0.0,0.6931471805599453,0.0,0.0,0.6931471805599453,0.0,0.6931471805599453
vector2: 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

Collection method:
Text1 TF unweighted score vector: 1.0,1.0,1.0
Text1 DF unweighted score vector: 4.0,425.0
Text2 TF unweighted score vector: 0.0,0.0,0.0
Text2 DF unweighted score vector: (empty)
Similarity: NaN
vector1: 0.6931471805599453,0.6931471805599453,0.0
vector2: 0.0,0.0,0.0

In getSimilarity(Collection, Collection), the all-zero vector produces num and norm values of 0, so the method's final statement, return num / norm;, evaluates 0.0 / 0.0.
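This behaviour is reproducible outside DKPro: in Java, 0.0 / 0.0 evaluates to NaN. A minimal sketch of a guarded cosine (a hypothetical helper, not the actual DKPro code) that returns 0.0 for a zero-norm vector instead:

```java
public class CosineGuard {
    // Hypothetical guarded cosine (not the DKPro code): returns 0.0
    // instead of NaN when either vector has zero norm.
    static double cosine(double[] v1, double[] v2) {
        double num = 0.0, sq1 = 0.0, sq2 = 0.0;
        for (int i = 0; i < v1.length; i++) {
            num += v1[i] * v2[i];
            sq1 += v1[i] * v1[i];
            sq2 += v2[i] * v2[i];
        }
        double norm = Math.sqrt(sq1) * Math.sqrt(sq2);
        return norm == 0.0 ? 0.0 : num / norm;
    }

    public static void main(String[] args) {
        // The vectors from the Collection-method dump above.
        double[] v1 = {0.6931471805599453, 0.6931471805599453, 0.0};
        double[] v2 = {0.0, 0.0, 0.0};
        System.out.println(0.0 / 0.0);      // NaN: what "return num / norm" yields
        System.out.println(cosine(v1, v2)); // 0.0 with the zero-norm guard
    }
}
```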

IDF LOG: With 2.2.0-SNAPSHOT, just like 2.1.0, using WeightingModeIdf.LOG produces many more NaNs, with a different pattern. I provide a short-text example, but it also happens on longer texts.

The following results use: 2.2.0-SNAPSHOT WeightingModeTf.FREQUENCY_LOGPLUSONE WeightingModeIdf.LOG NormalizationMode.L2

Text1: The sentence covers all of 1861-65 as does the book. see the TOC on amazon.com Text2: Sounds good.

Collection method:
Text1 TF unweighted score vector: 1.0,1.0,1.0,1.0,1.0,0.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0
Text1 IDF unweighted score vector: 128.0,5.0,230.0,22.0,411.0,425.0,4.0,345.0,99.0,5.0,45.0,183.0,105.0
Text2 TF unweighted score vector: 0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Text2 IDF unweighted score vector: 52.0,425.0
Similarity: NaN
vector1: 3.3631710974274096,1.1155773512899807,-Infinity,3.7693893406414887,2.142547361536413,NaN,6.612100465940063,4.194988543937342,0.9609060278364028,4.050436337142097,-Infinity,3.1850943684558293,1.1155773512899807,2.6385773721275987,NaN,3.6109406390081076,3.2258794951494627
vector2: NaN,NaN,NaN,NaN,NaN,2.7387934432399104,NaN,4.194988543937342,NaN,NaN,NaN,NaN,NaN,NaN,-Infinity,NaN,NaN
Num: NaN
Norm: NaN

String method:
Text1 TF unweighted score vector: 1.0,15.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,0.0,1.0,0.0,1.0,10.0,1.0,3.0,4.0,4.0,8.0,2.0,2.0,1.0,3.0,1.0,0.0,3.0,5.0,1.0,1.0
Text1 IDF unweighted score vector: 425.0,95.0,17.0,4.0,15.0,9.0,3.0,3.0,2.0,4.0,21.0,347.0,6.0,1.0,16.0,1.0,5.0,14.0,2.0
Text2 TF unweighted score vector: 1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
Text2 IDF unweighted score vector: 425.0,5.0,3.0,6.0,1.0,3.0,14.0
Similarity: NaN
vector1: -Infinity,-Infinity,6.648899533095532,3.1565069280300024,3.1126029961986283,1.523000020837618,1.8770773617087897,1.523000020837618,-Infinity,-Infinity,-Infinity,NaN,0.761500010418809,NaN,0.761500010418809,1.6620943476182115,0.9609060278364028,4.220604287718965,9.414125062986725,2.8837256197384322,0.0,3.046000041675236,-Infinity,-Infinity,-Infinity,0.0,NaN,2.2311547025799614,4.728555960173844,-Infinity,0.4804530139182014
vector2: -Infinity,-Infinity,4.194988543937342,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,-Infinity,NaN,1.1155773512899807,1.206948960812582,NaN,NaN,NaN,NaN,1.2419530243370103,0.0,NaN,NaN,NaN,NaN,NaN,0.761500010418809,NaN,1.8292551473588745,NaN,NaN
Num: NaN
Norm: NaN
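The exact LOG formula inside CosineSimilarity is not shown here, but the -Infinity and NaN entries in the dumps above are consistent with Java's floating-point rules: the logarithm of 0 is -Infinity, the logarithm of a negative number is NaN, multiplying a 0.0 TF weight by -Infinity gives NaN, and any NaN term then poisons the dot product and norms. A small sketch of those rules:

```java
public class LogWeighting {
    public static void main(String[] args) {
        // Java's floating-point special cases that can explain the
        // -Infinity and NaN entries in the LOG-weighted vectors above.
        System.out.println(Math.log(0.0));        // -Infinity
        System.out.println(Math.log(-1.0));       // NaN
        System.out.println(0.0 * Math.log(0.0));  // NaN: 0 * -Infinity
        System.out.println(Math.log(1.0));        // 0.0
    }
}
```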

nicolaierbs commented 9 years ago

Comment #5 originally posted by dkpro on 2014-08-17T00:14:22.000Z:

Well, this is not exactly a minimal example, and I cannot quite follow your argument, especially at the beginning, but it seems you have identified some cases where the measure breaks.

Please have a look at CosineSimilarityTest, which already tests various conditions. It also shows how to provide IDF values manually. Please turn your examples above into runnable tests and keep them as minimal as possible.

nicolaierbs commented 9 years ago

Comment #6 originally posted by dkpro on 2014-08-17T11:16:10.000Z:

It turns out there already is a test for the IDF LOG problem I described earlier: the test tfidfLogPlusOne() considers NaN to be the correct output for the text pair Text1: "test String1" / Text2: "test String2". I don't understand why the desired output would be NaN: under what circumstances would a user ever want NaN?

Here are some tests for empty/whitespace strings. I presume that the desired similarity score for any pair of a non-empty and an empty text is 0.0.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;

[...]

@Test
public void oneEmptyListTest() throws Exception {
    CosineSimilarity measure = new CosineSimilarity();

    List<String> tokens1 = Arrays.asList("This is a sentence with seven tokens".split(" "));     
    List<String> tokens2 = new ArrayList<String>();    

    assertFalse(Double.isNaN(measure.getSimilarity(tokens1, tokens2)));
    assertEquals(0.0, measure.getSimilarity(tokens1, tokens2), 0.000001);
}
@Test
public void oneWhitespaceListTest() throws Exception {
    CosineSimilarity measure = new CosineSimilarity();

    List<String> tokens1 = Arrays.asList("This is a sentence with seven tokens".split(" "));     
    List<String> tokens2 = Arrays.asList("      ".split(" "));   

    assertFalse(Double.isNaN(measure.getSimilarity(tokens1, tokens2)));
    assertEquals(0.0, measure.getSimilarity(tokens1, tokens2), 0.000001);
}
@Test
public void twoWhitespaceListTest() throws Exception {
    CosineSimilarity measure = new CosineSimilarity();

    List<String> tokens1 = Arrays.asList("          \n".split(" "));    
    List<String> tokens2 = Arrays.asList("     ".split(" "));   

    assertEquals(1.0, measure.getSimilarity(tokens1, tokens2), 0.000001);
}

Confusion in the constructor CosineSimilarity(WeightingModeIdf modeIdf, NormalizationMode normMode, Map<String, Double> idfScores): there is no documentation, and I assumed that the Map held my corpus's DF counts and that the algorithm would invert them, i.e., compute "1/x". But the example in the tests shows that this assumption does not hold. I recommend at least adding documentation to clarify this, if not also switching the algorithm to accept DF counts instead of "1/DF" scores.
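For illustration, a hypothetical helper (not part of DKPro Similarity) that inverts raw DF counts into the "1/DF"-style scores the tests suggest the constructor expects:

```java
import java.util.HashMap;
import java.util.Map;

public class IdfFromDf {
    // Hypothetical helper (not part of DKPro Similarity): invert raw DF
    // counts into "1/DF"-style scores.
    static Map<String, Double> invert(Map<String, Double> dfCounts) {
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Double> e : dfCounts.entrySet()) {
            idf.put(e.getKey(), 1.0 / e.getValue());
        }
        return idf;
    }

    public static void main(String[] args) {
        Map<String, Double> df = new HashMap<>();
        df.put("NHC", 4.0);       // DF counts from the earlier example
        df.put("emailed", 425.0);
        System.out.println(invert(df).get("NHC")); // 0.25
    }
}
```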

Here is a test that directly addresses the example from my previous post (and produces the same vector scores, etc). The DF counts I originally supplied cause a NaN similarity score; the IDF counts like those used in a similar test cause a non-0.0 similarity score. I think the correct return should be 0.0.

@Test
public void tfidfOneWhitespaceListTest()
    throws Exception
{
    Map<String,Double> idfScores = new HashMap<String,Double>();
    // Documentation is unclear about which scores should be provided:
    // idfScores.put("NHC", 1.0 / 4.0);
    // idfScores.put("emailed", 1.0 / 425.0);
    idfScores.put("NHC", 4.0);
    idfScores.put("emailed", 425.0);

    String[] tokens1 = "NHC emailed .".split(" ");
    String[] tokens2 = new String[0];

    TextSimilarityMeasure measure = new CosineSimilarity(
            CosineSimilarity.WeightingModeTf.FREQUENCY_LOGPLUSONE,
            CosineSimilarity.WeightingModeIdf.BINARY,
            CosineSimilarity.NormalizationMode.L2,
            idfScores
    );

    assertFalse(Double.isNaN(measure.getSimilarity(tokens1, tokens2)));
    assertEquals(0.0, measure.getSimilarity(tokens1, tokens2), 0.000001);
}
nicolaierbs commented 9 years ago

Comment #7 originally posted by dkpro on 2014-08-18T17:19:18.000Z:

ok, thanks for the tests.

It seems that the current implementation always returns NaN if a word is not in the IDF list. This is definitely not the right thing to do, and the test should not have simply swallowed that.

I have tried to fix that, but I think that the Cosine implementation is generally buggy and the tests cannot be trusted as they seem to test for the values that some first implementation generated instead of analytically derived ones.

nicolaierbs commented 9 years ago

Comment #8 originally posted by dkpro on 2014-08-19T14:03:40.000Z:

This should be fixed now. Given the number of issues with this measure, I cannot give any guarantees, though.