Simmetrics / simmetrics

Similarity or Distance Metrics, e.g. Levenshtein, for Java
Apache License 2.0
338 stars 77 forks source link

Unexpected result using SmithWatermanGotoh #24

Closed Gh0s7 closed 8 years ago

Gh0s7 commented 8 years ago
StringMetric smithWatermanGotohMetric = StringMetricBuilder.with(new SmithWatermanGotoh()).build();
System.out.println(smithWatermanGotohMetric.compare("coliseo", "colosseum"));

Printed result is 0.5 while it should be 0.74, as reported here.

The result that I get is the same one returned by SmithWaterman's metric, so I'm starting to think that I'm missing something.

Can someone help me?

mpkorstanje commented 8 years ago

Sure. By the looks of it that website is using the 1.6.2 version of Simmetrics[1]. The 1.6.2 implementation had Smith-Waterman and Smith-Waterman-Gotoh reversed. This has been fixed since 3.0.0

The additional difference between the 1.6.2 implementation (mislabeled as Smith-Waterman-Gotoh) and 3.0.0 version is the way the substitution function works. The 1.6.2 implementation used the SubCost5_3_Minus3 which included several sets of characters that were considered an approximate match and thus given a good substitution score. In the 3.0.0 version approximate matches are no longer included. One of these sets includes aeiou which accounts for several difference between your two strings. As a result the 1.6.2 matcher thinks they're more similar then the 3.0.0 version.

Because Smith-Waterman is a string alignment algorithm, intended to match DNA sequences I thought an approximate match function to be a bit silly. Nor could I find any academic reference to include an approximate match function, hence I left the substitution out.

  1. The original author Sam Chapman included several metrics that are no longer included like Chapman Length Deviation. These are listed on the asecuritysite.com hence I believe this to be the 1.6.2 version.
Gh0s7 commented 8 years ago

Thanks for the reply.

Given that the results on the site are closer to what I need and that I can't find 1.6.2 on Maven, is there any way that I can use that cost function in v4?

mpkorstanje commented 8 years ago

Sure.

You can use SmithWaterman(Gap, Substitution, int)) or SmithWatermanGotoh(float, Substitution) to construct an instance of Smith-Waterman with your own implementation of the Substitution cost function.

E.g:

        Substitution substitution = new Substitution() {
            @Override
            public float compare(String a, int aIndex, String b, int bIndex) {
                // Implement
                return 0;
            }

            @Override
            public float max() {
                // Implement
                return 0;
            }

            @Override
            public float min() {
                // Implement
                return 0;
            }
        };

        StringMetric smithWaterman = new SmithWatermanGotoh(-0.5f, substitution);
        System.out.println(smithWaterman.compare("coliseo", "colosseum"));

Now I don't know your domain but you may want to check the original papers for the reasoning behind the values and what properties they should have.

Gh0s7 commented 8 years ago

Thanks again :)