CIRDLES / Squid

Squid3 is being developed by the Cyber Infrastructure Research and Development Lab for the Earth Sciences (CIRDLES.org) at the College of Charleston, Charleston, SC and Geoscience Australia as a re-implementation in Java of Ken Ludwig's Squid 2.5. - please contribute your expertise!
http://cirdles.org/projects/squid/
Apache License 2.0
12 stars 24 forks source link

Non-equivalence of identical eqns #615

Closed AllenKennedy closed 3 years ago

AllenKennedy commented 3 years ago

I get slightly different results for identical eqns. See the attached screen shots. I assume this is a rounding issue with my Macbook Pro, but would like that confirmed by the experts.

Screen Shot 2021-04-29 at 10 13 17 am Screen Shot 2021-04-29 at 10 16 13 am
bowring commented 3 years ago

Please send .squid file

bowring commented 3 years ago

Upon further investigation, the non-equivalence of the identical equations applies only to Squid3's RobReg, which uses a random-number generator in its calculation of slopes - in fact each equation will give a different result every time it is accessed. Ludwig states here in RobustReg2, which is the function we use, the following:

' Robust linear regression using median of all pairwise slopes/intercepts,
' after Hoaglin, Mosteller & Tukey, Understanding Robust & Exploratory Data Analysis,
' John Wiley & Sons, 1983, p. 160, with errors from code in Rock & Duffy, 1986
' (Comp. Geosci. 12, 807-818), derived from Vugrinovich (1981), J. Math. Geol. 13,
'  443-454).
' Has simple, rapid solution for errors.

RobustReg2 uses GetRobSlope, found here, which pertubates the values with a random number and then gives the medians as the slope and y-intercept.

The Java code that Squid3 uses to calculate the slope, which implements Ludwigs VBA, is:

  /**
     * Calculates slope and intercepts for a set of points. Does not implement
     * Ludwig's outlier rejection.
     *
     * @param xValues double [] array length n
     * @param yValues double [] array length n
     * @return double[4][] with row 0 containing slope, y-intercept,
     * x-intercept, row 1 containing slope array, row 2 containing y-intercept
     * array, and row 3 containing x-intercept array.
     *
     * @throws ArithmeticException
     */
    protected static double[][] getRobSlope(double[] xValues, double[] yValues)
            throws ArithmeticException {

        double[][] retVal = new double[][]{{0, 0, 0}};
        // check precondition of same size xValues and yValues and at least 3 points
        int n = xValues.length;
        if ((n == yValues.length) && n > 2) {
            // proceed
            Random random = new Random();
            int m = n * (n - 1) / 2;
            double[] slp = new double[m];
            double[] xInter = new double[m];
            double[] yInter = new double[m];
            int k = -1;

            for (int i = 0; i < (n - 1); i++) {
                for (int j = i + 1; j < n; j++) {
                    double vs = 0.0;
                    double vy;
                    k++;
                    if ((xValues[i] - xValues[j]) != 0.0) {
                        vs = (yValues[j] - yValues[i]) / (xValues[j] - xValues[i]);
                    }

                    vs += (0.5 - random.nextDouble()) * SQUID_EPSILON;
                    slp[k] = vs;

                    vy = yValues[i] - vs * xValues[i] + (0.5 - random.nextDouble()) * SQUID_EPSILON;
                    yInter[k] = vy;
                    xInter[k] = -vy / vs;
                } // end inner j loop
            } // end outer i loop

            double slope = Utilities.median(slp);
            double yInt = Utilities.median(yInter);
            double xInt = Utilities.median(xInter);

            retVal = new double[][]{{slope, yInt, xInt}, slp, yInter, xInter};
        }

        return retVal;
    }

This java code can be found here at line 48, and the Java for RobustReg2 can be found here at line 64.

The Squid3 implementation of RobReg (using RobustReg2) is found here, with the call to Ludwig's RobustReg2 (in Java) at line 75.

@AllenKennedy - great observation! I hope this explanation satisfies you.

bowring commented 3 years ago

One footnote that needs to be considered - the VBA random number generator always uses the same seed and thus returns the same series of random numbers unless specifically seeded with something transient like the date or initialized with Randomize() as explained here and here , but the Java random number generator always uses a new seed - see here, thus giving a different series of random numbers each time (unless seeded with the same number each time) and explaining the variations @AllenKennedy documented.

@sbodorkos - We need your vote here whether the random number generator should be constrained to the same series each time - as in Ludwig's VBA code that does not invoke Randomize() - or not, as currently implemented in Squid3?

ryanickert commented 3 years ago

I'm not able to follow the code exactly, but it's not clear why, say, the slope needs a random number generator. If this is the Kendall-Theil algorithm described by Rock which uses the median of all pairwise slopes, there is no need for random numbers. It is literally just a matter of calculating all pairwise slopes and then determining the median of these values.

bowring commented 3 years ago

I am with you, but we were tasked with reproducing Ludwig's methods exactly for the sake of replication. Interested to hear what @noahmclean has to say as well.

AllenKennedy commented 3 years ago

Thanks for the explanation. I guess I will keep my Mac. Most users will be uncomfortable with non-reproducibility.

sbodorkos commented 3 years ago

This is an interesting one. I have looked at that code, and like @ryanickert, I don't immediately see why Ludwig used random numbers in it. All I can guess is that he wanted to 'fuzz' the data, to avoid strange results in smallish datasets where too many of the pairwise comparisons yield exactly the same value, and the resulting median is strange? I don't know.

The 'fuzzing' perturbs each slope and intercept value by a maximum of ±5e-8 * its magnitude, so I don't think it is a 'key' part of the arithmetic. But there are a couple of different aspects to it.

The first is intra-app repeatability, as @AllenKennedy says, and it's definitely important that the same function, with the same arguments, delivers the same result each time. So in that respect, I think we do need Squid3 to seed the same value to the random number generator each time, to ensure that the specific intra-Squid3 issue raised by Allen does not recur.

The second one, which bothers me more, is whether there is any way to relate random numbers generated by GetRobSlope in SQUID 2.50 to the equivalent code-block in Squid3, to ensure that equations using RobReg in SQUID 2.50 give the identical result as expressions in Squid3. Jim @bowring presumably we would need to explicitly seed both random-number generators with the same number? We might need code changes in both Squids to make this work properly (if it's even possible).

I think we need to address this quite carefully, because I doubt this is the sole occurrence of random-number usage in the code; it's just the only one we have intersected thus far. Especially with respect to the broader 'LudwigLibrary', which encompasses a lot of the functionality embodied in Isoplot. I am thinking specifically of Isoplot's UnMix functionality; I'm certain that uses random numbers, and those results definitely needs to be repeatable inside one Squid, as well as (ideally) between Squids.

bowring commented 3 years ago

I have experimented with VBA 7.1 in Excel 16.1 for Mac and discovered that the Rnd function does provide a different number for each call unless it is prefaced by a call to Rnd(-1) - could be any negative number - that causes it to initialize to the same series each time. Thus, every invocation of test() and test2() below produces the same result.

Sub test()
    Rnd (-1)
    Debug.Print Rnd

    Rnd (-1)
    Debug.Print Rnd
End Sub

produces:

 0.03584582 
 0.03584582 

while

Sub test2()
    Rnd (-1)
    Debug.Print Rnd

    Debug.Print Rnd
End Sub

produces:

 0.03584582 
 0.08635235 

However, test3() produces a different result each time:

Sub test3()

    Debug.Print Rnd
    Debug.Print Rnd

End Sub

My inspection of Ludwig's code for GetRobSlope and RobustReg2 finds only calls to Rnd. This means that the Isoplot3 library is NOT producing reproducible results either unless Ludwig is initializing the random number generator elsewhere.

@sbodorkos - you may need to confirm whether you get the same random numbers every time inside GetRobSlope. If the answer is NO, then we need to change Squid2.5 to call Rnd(-1) to make the results reproducible. If the answer is YES, then Squid2.5 is good in this regard.

Then, once Squid2.5 is confirmed to generate reproducible results, we need to solve the problem of Squid3 reproducing the same results. We have three choices:

1) Make Squid3 internally reproducible by seeding the random number generator in the LudwigLibrary GetRobSlope - this will produce the same results each time but not identical to the results in Squid2.5.

2) Use Squid2.5 to generate the first 1000 random numbers it uses and provide them to the Java implementation of LudwigLibrary as a new pseudo random number generator that will give the same result as the generator inside the VBA code.

3) Dispense with the random numbers altogether in both Squid2.5 and Squid3 as discussed above.

Note: the above argument is for this specific application only inside of GetRobSlope. We will have to perform a study of the other functions in the Java implementation of LudwigLibrary (thus used by Squid3) that may also depend on random number generators to determine a way forward.

bowring commented 3 years ago

@AllenKennedy - In version 1.7.11 - to be released soon - this is fixed via a change in LudwigLibrary that seeds the random number generator thus guaranteeing reproducibility.