feature23 / StringSimilarity.NET

A .NET port of java-string-similarity
Other
448 stars 70 forks source link

NGram simalarity are not correct #17

Closed piotr-nowicki closed 3 years ago

piotr-nowicki commented 5 years ago

I tried few of algorithms but at least at least one of them are not correct. Examples provided in description returns completly not expected results.

image

I am using .NET Core 2.2 console application.

paulirwin commented 5 years ago

Thanks @piotr-nowicki, looking into this now.

paulirwin commented 5 years ago

I can reproduce the results you're seeing, and I've added a unit test to match the code from the README that currently fails. Note that the README (and the code itself, including unit tests) came from the upstream Java project, just rewritten for use on .NET, so I will also have to validate if it's correct upstream as well. Thanks for bringing this to our attention, will continue to look into it.

paulirwin commented 5 years ago

Well that was easy to validate. Created a unit test in the upstream Java project to match their example, and it also gives the same result as .NET (so our code is in alignment) that doesn't match the comments. I'm guessing the comment in the example is incorrect.

Unit test (Java):

@Test
    public void exampleFromReadme() {
        // produces 0.416666
        NGram twogram = new NGram(2);
        assertEquals(0.416666, twogram.distance("ABCD", "ABTUIO"), 0.001);

        // produces 0.97222
        String s1 = "Adobe CreativeSuite 5 Master Collection from cheap 4zp";
        String s2 = "Adobe CreativeSuite 5 Master Collection from cheap d1x";
        NGram ngram = new NGram(4);
        assertEquals(0.97222, ngram.distance(s1, s2), 0.001);
    }

Unit test results (Java):

java.lang.AssertionError: 
Expected :0.416666
Actual   :0.5833333134651184

I'll file a bug with the upstream Java project and will update this issue once I hear back.

piotr-nowicki commented 5 years ago

Thanks mate.

Later on i checked java code and it seems that also they have a bug :).

Really weird that nobody tries to execute sample data :).

I started to think that .NET Core calculates it wrong ;)

paulirwin commented 5 years ago

I think the comment is incorrect and that the calculated results are right. It looks like this logic was changed as part of https://github.com/tdebatty/java-string-similarity/issues/22