Closed mmehrle closed 10 years ago
By the way - here's how I'm calling your method:
strDistance = SAXFactory.strDistance(saxRep1.toCharArray(), saxRep2.toCharArray()); System.out.println("Distance between " + saxRep1 + " + " + saxRep2 + ": " + strDistance);
So I'm pretty certain I'm didn't trip over my own code somewhere ;-)
Here's another example:
Vector 1: Element # 1: -0.75 Element # 2: -0.5 Element # 3: -1.0 Element # 4: -0.583 Element # 5: 1.0 Vector 2: Element # 1: -1.0 Element # 2: -0.636 Element # 3: -0.182 Element # 4: -0.091 Element # 5: 1.0 Distance between bcbcf + abcdf: 0
Thanks for that, I'll check right now!
Well, it should be zero because the distance between symbols that are next in the alphabet doesn't count. It is SAX mindist thingy. I have assumed the alphabet size 6.
Ha! But description says otherwise. I see from where the misunderstanding grows... So, what would you think, should I make two variants and name them differently like strDistance and strSAXMinDistance?
This may be a misunderstanding on my part Senin and a result of my ignorance of the the mindist algo . I thought this was a straight comparison with mindist off (as if in setting it to OFF in the grammarviz console). I read the paper and although the math is a bit beyond my pay grade I do grasp the overall idea. So to answer your question - yes, two methods would be preferable. OR one method that passes it in as a param as such:
enum NumerosityReduction { Mindist, Off, Exact };
And then a call like this:
SAXFactory.strDistance(str1, st2, NumerosityReduction.Off);
This way you are more flexible further down the line. Adding methods with different names may design you in a corner. What do you think?
Oh, and until you change it - how do I call it with numerosity reduction off? I guess I could dig through your grammarviz code but since I have your attention ;-)
I just looked at the source and don't see anything fancy in there regarding MINDIST being off and on:
public static int strDistance(char[] a, char[] b) throws TSException { if (a.length == b.length) { int distance = 0; for (int i = 0; i < a.length; i++) { int tDist = Math.abs(Character.getNumericValue(a[i]) - Character.getNumericValue(b[i])); if (tDist > 1) { distance += tDist; } } return distance; } else { throw new TSException("Unable to compute SAX distance, string lengths are not equal"); } }
And then I saw this one:
public static double saxMinDist(char[] a, char[] b, double[][] distanceMatrix) throws TSException { if (a.length == b.length) { double dist = 0.0D; for (int i = 0; i < a.length; i++) { if (Character.isLetter(a[i]) && Character.isLetter(b[i])) { int numA = Character.getNumericValue(a[i]) - 10; int numB = Character.getNumericValue(b[i]) - 10; if (numA > 19 || numA < 0 || numB > 19 || numB < 0) { throw new TSException("The character index greater than 19 or less than 0!"); } double localDist = distanceMatrix[numA][numB]; dist += localDist; } else { throw new TSException("Non-literal character found!"); } } return dist; } else { throw new TSException("Data arrays lengths are not equal!"); } }
So this is basically what I was referring to. Shouldn't the first method do what I want? I.e. a comparison with numerosity reduction off? Please clarify.
I do not remember exactly, but I would guess that I was using this method to speed-up the computation. Comparing integers is a way faster that summing up doubles and making lookups in the table, at least I thought so back then. I'll try to see how to make API cleaner at this exact spot. Will keep you updated.
On Wed, Sep 24, 2014 at 6:11 PM, mmehrle notifications@github.com wrote:
I just looked at the source and don't see anything fancy in there regarding MINDIST being off and on:
public static int strDistance(char[] a, char[] b) throws TSException { if (a.length == b.length) { int distance = 0; for (int i = 0; i < a.length; i++) { int tDist = Math.abs(Character.getNumericValue(a[i]) - Character.getNumericValue(b[i])); if (tDist > 1) { distance += tDist; } } return distance; } else { throw new TSException("Unable to compute SAX distance, string lengths are not equal"); } }
This seems like a simple comparison. What am I missing here?
— Reply to this email directly or view it on GitHub https://github.com/GrammarViz2/grammarviz2_src/issues/14#issuecomment-56694872 .
Mahalo, Pavel.
Thanks - but the issue I have right now is that I don't know which call to make to get a simple (non-mindist) comparison. For some reason the simple call does not work properly, it has nothing to do with mindist. Can you please point me in the right direction?
I fixed that in SAXFactory, see latest commits.
On Wed, Sep 24, 2014 at 6:45 PM, mmehrle notifications@github.com wrote:
Thanks - but the issue I have right now is that I don't know which call to make to get a simple (non-mindist) comparison. Can you please point me in the right direction?
— Reply to this email directly or view it on GitHub https://github.com/GrammarViz2/grammarviz2_src/issues/14#issuecomment-56699999 .
Mahalo, Pavel.
I only see this change: http://tinyurl.com/puk248y - don't see how that one addresses it.
The previous one does the job
Excellent - you even wrote a test for the vectors I gave you (TestIssue14) - how awesome is that! :+1: Thanks a million Senin.
You are very welcome. Let me know if you'd have any new questions. I remember about panning, we'll discuss that with other developers (and I hope will make happen) soon.
I tested the heck out of it today and it seems to be working properly now. Excellent. FYI - I just implemented your SAX routines inside a Gaussian SAX based Kernel for libsvm (a bit like a DTW kernel). I don't think it's positive semi-definite but let's see how she rips.
I ran into this today when parsing through my test's log:
Here are the original values which produced the two strings (if you want to test it):
Vector 1: Element # 1: -0.5 Element # 2: -1.0 Element # 3: -0.583 Element # 4: 1.0 Element # 5: -0.083 Vector 2: Element # 1: -0.2 Element # 2: -0.6 Element # 3: -1.0 Element # 4: 1.0 Element # 5: 0.2
When calling SAXFactory.strDistance() with cabfd and cbafd the result is 0. That is clearly wrong as they are not identical. Can you please take a look at this example?