Closed Gh0s7 closed 8 years ago
Sure. By the looks of it that website is using the 1.6.2 version of Simmetrics[1]. The 1.6.2 implementation had Smith-Waterman and Smith-Waterman-Gotoh reversed. This has been fixed since 3.0.0
The additional difference between the 1.6.2 implementation (mislabeled as Smith-Waterman-Gotoh) and 3.0.0 version is the way the substitution function works. The 1.6.2 implementation used the SubCost5_3_Minus3 which included several sets of characters that were considered an approximate match and thus given a good substitution score. In the 3.0.0 version approximate matches are no longer included. One of these sets includes aeiou
which accounts for several difference between your two strings. As a result the 1.6.2 matcher thinks they're more similar then the 3.0.0 version.
Because Smith-Waterman is a string alignment algorithm, intended to match DNA sequences I thought an approximate match function to be a bit silly. Nor could I find any academic reference to include an approximate match function, hence I left the substitution out.
Thanks for the reply.
Given that the results on the site are closer to what I need and that I can't find 1.6.2 on Maven, is there any way that I can use that cost function in v4?
Sure.
You can use SmithWaterman(Gap, Substitution, int)) or SmithWatermanGotoh(float, Substitution) to construct an instance of Smith-Waterman with your own implementation of the Substitution cost function.
E.g:
Substitution substitution = new Substitution() {
@Override
public float compare(String a, int aIndex, String b, int bIndex) {
// Implement
return 0;
}
@Override
public float max() {
// Implement
return 0;
}
@Override
public float min() {
// Implement
return 0;
}
};
StringMetric smithWaterman = new SmithWatermanGotoh(-0.5f, substitution);
System.out.println(smithWaterman.compare("coliseo", "colosseum"));
Now I don't know your domain but you may want to check the original papers for the reasoning behind the values and what properties they should have.
Thanks again :)
Printed result is
0.5
while it should be0.74
, as reported here.The result that I get is the same one returned by
SmithWaterman
's metric, so I'm starting to think that I'm missing something.Can someone help me?