GrammarViz2 / grammarviz2_src

GrammarViz 2.0 public release:
http://grammarviz2.github.io/grammarviz2_site
GNU General Public License v2.0
120 stars 39 forks source link

Possible bug in SAXFactory.strDistance() #14

Closed mmehrle closed 10 years ago

mmehrle commented 10 years ago

I ran into this today when parsing through my test's log:

2014-09-24_1615

Here are the original values which produced the two strings (if you want to test it):

Vector 1: Element # 1: -0.5 Element # 2: -1.0 Element # 3: -0.583 Element # 4: 1.0 Element # 5: -0.083 Vector 2: Element # 1: -0.2 Element # 2: -0.6 Element # 3: -1.0 Element # 4: 1.0 Element # 5: 0.2

When calling SAXFactory.strDistance() with cabfd and cbafd the result is 0. That is clearly wrong as they are not identical. Can you please take a look at this example?

mmehrle commented 10 years ago

By the way - here's how I'm calling your method:

strDistance = SAXFactory.strDistance(saxRep1.toCharArray(), saxRep2.toCharArray()); System.out.println("Distance between " + saxRep1 + " + " + saxRep2 + ": " + strDistance);

So I'm pretty certain I'm didn't trip over my own code somewhere ;-)

mmehrle commented 10 years ago

Here's another example:

Vector 1: Element # 1: -0.75 Element # 2: -0.5 Element # 3: -1.0 Element # 4: -0.583 Element # 5: 1.0 Vector 2: Element # 1: -1.0 Element # 2: -0.636 Element # 3: -0.182 Element # 4: -0.091 Element # 5: 1.0 Distance between bcbcf + abcdf: 0

seninp commented 10 years ago

Thanks for that, I'll check right now!

seninp commented 10 years ago

Well, it should be zero because the distance between symbols that are next in the alphabet doesn't count. It is SAX mindist thingy. I have assumed the alphabet size 6.

seninp commented 10 years ago

Ha! But description says otherwise. I see from where the misunderstanding grows... So, what would you think, should I make two variants and name them differently like strDistance and strSAXMinDistance?

mmehrle commented 10 years ago

This may be a misunderstanding on my part Senin and a result of my ignorance of the the mindist algo . I thought this was a straight comparison with mindist off (as if in setting it to OFF in the grammarviz console). I read the paper and although the math is a bit beyond my pay grade I do grasp the overall idea. So to answer your question - yes, two methods would be preferable. OR one method that passes it in as a param as such:

enum NumerosityReduction { Mindist, Off, Exact };

And then a call like this:

SAXFactory.strDistance(str1, st2, NumerosityReduction.Off);

This way you are more flexible further down the line. Adding methods with different names may design you in a corner. What do you think?

mmehrle commented 10 years ago

Oh, and until you change it - how do I call it with numerosity reduction off? I guess I could dig through your grammarviz code but since I have your attention ;-)

mmehrle commented 10 years ago

I just looked at the source and don't see anything fancy in there regarding MINDIST being off and on:

public static int strDistance(char[] a, char[] b) throws TSException { if (a.length == b.length) { int distance = 0; for (int i = 0; i < a.length; i++) { int tDist = Math.abs(Character.getNumericValue(a[i]) - Character.getNumericValue(b[i])); if (tDist > 1) { distance += tDist; } } return distance; } else { throw new TSException("Unable to compute SAX distance, string lengths are not equal"); } }

And then I saw this one:

public static double saxMinDist(char[] a, char[] b, double[][] distanceMatrix) throws TSException { if (a.length == b.length) { double dist = 0.0D; for (int i = 0; i < a.length; i++) { if (Character.isLetter(a[i]) && Character.isLetter(b[i])) { int numA = Character.getNumericValue(a[i]) - 10; int numB = Character.getNumericValue(b[i]) - 10; if (numA > 19 || numA < 0 || numB > 19 || numB < 0) { throw new TSException("The character index greater than 19 or less than 0!"); } double localDist = distanceMatrix[numA][numB]; dist += localDist; } else { throw new TSException("Non-literal character found!"); } } return dist; } else { throw new TSException("Data arrays lengths are not equal!"); } }

So this is basically what I was referring to. Shouldn't the first method do what I want? I.e. a comparison with numerosity reduction off? Please clarify.

seninp commented 10 years ago

I do not remember exactly, but I would guess that I was using this method to speed-up the computation. Comparing integers is a way faster that summing up doubles and making lookups in the table, at least I thought so back then. I'll try to see how to make API cleaner at this exact spot. Will keep you updated.

On Wed, Sep 24, 2014 at 6:11 PM, mmehrle notifications@github.com wrote:

I just looked at the source and don't see anything fancy in there regarding MINDIST being off and on:

public static int strDistance(char[] a, char[] b) throws TSException { if (a.length == b.length) { int distance = 0; for (int i = 0; i < a.length; i++) { int tDist = Math.abs(Character.getNumericValue(a[i]) - Character.getNumericValue(b[i])); if (tDist > 1) { distance += tDist; } } return distance; } else { throw new TSException("Unable to compute SAX distance, string lengths are not equal"); } }

This seems like a simple comparison. What am I missing here?

— Reply to this email directly or view it on GitHub https://github.com/GrammarViz2/grammarviz2_src/issues/14#issuecomment-56694872 .

Mahalo, Pavel.

mmehrle commented 10 years ago

Thanks - but the issue I have right now is that I don't know which call to make to get a simple (non-mindist) comparison. For some reason the simple call does not work properly, it has nothing to do with mindist. Can you please point me in the right direction?

seninp commented 10 years ago

I fixed that in SAXFactory, see latest commits.

On Wed, Sep 24, 2014 at 6:45 PM, mmehrle notifications@github.com wrote:

Thanks - but the issue I have right now is that I don't know which call to make to get a simple (non-mindist) comparison. Can you please point me in the right direction?

— Reply to this email directly or view it on GitHub https://github.com/GrammarViz2/grammarviz2_src/issues/14#issuecomment-56699999 .

Mahalo, Pavel.

mmehrle commented 10 years ago

I only see this change: http://tinyurl.com/puk248y - don't see how that one addresses it.

seninp commented 10 years ago

The previous one does the job

mmehrle commented 10 years ago

Excellent - you even wrote a test for the vectors I gave you (TestIssue14) - how awesome is that! :+1: Thanks a million Senin.

seninp commented 10 years ago

You are very welcome. Let me know if you'd have any new questions. I remember about panning, we'll discuss that with other developers (and I hope will make happen) soon.

mmehrle commented 10 years ago

I tested the heck out of it today and it seems to be working properly now. Excellent. FYI - I just implemented your SAX routines inside a Gaussian SAX based Kernel for libsvm (a bit like a DTW kernel). I don't think it's positive semi-definite but let's see how she rips.