elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0

Imprecise variance calculation in MeanVariance.java #81

Closed: foolnotion closed this issue 3 years ago

foolnotion commented 3 years ago

Hi,

I ported some of your numerically stable methods to C++ and C# and noticed some weird results.

Consider the following series:

x = [
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47,
            150494407424305.47
]

Since x is constant, the variance should be zero or very close to zero. However, using the MeanVariance class, I get:

naive variance = 5.425347222222222e-05
sample variance = 5.9185606060606055e-05

When the series values are low:

x = [
            305.47,
            305.47,
            305.47,
            305.47,
            305.47,
            305.47,
            305.47,
            305.47,
            305.47,
            305.47,
            305.47,
            305.47
]

then the precision is good:

naive variance = 6.788729774740758e-28
sample variance = 7.405887026989918e-28

Unfortunately, I can't test the original Java code, as I don't have experience with Java and my attempt at building ELKI failed, but I think this should be fairly easy to confirm. My code is basically identical to this.

I realize the values are big but they are not that big. The same issue concerning variance is also present in the PearsonCorrelation class.

kno10 commented 3 years ago

Numerical precision is relative to the input magnitude. Compared to an input on the order of 1e+14, a value on the order of 1e-5 is a very reasonable numerical precision.

More precisely, if you print standard deviation / mean you should get a value of about 5e-17.

Since double precision floats offer about 16-17 decimal digits, this is the best you can expect from an inexpensive single-pass algorithm.

Note that ELKI, if you pass in an array of values rather than using the incremental API, uses the two-pass approach, which produces the result you expected:

    MeanVariance m = new MeanVariance();
    m.put(array);
    System.err.println(m.getPopulationVariance());

will print 0 for this array.

If you want a more challenging problem, consider the two data sets {1e20,1,-1e20} and {1e20,-1e20,1}, which obviously should have the same variance. But you'll have a hard time finding a solution for this that does not come with noticeable CPU overhead or fail in other, more important, cases.
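
As a minimal illustration of the underlying difficulty (my own sketch, not ELKI code): plain floating-point summation is already order-dependent, and any single-pass aggregation inherits some of that order dependence.

    // Plain double summation is not associative:
    System.out.println(1e20 + 1 - 1e20); // prints 0.0 -- the 1 is absorbed by 1e20
    System.out.println(1e20 - 1e20 + 1); // prints 1.0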

For further details, please see the references given in the source code.

foolnotion commented 3 years ago

Thanks for the clarification. I've gone through the references in the source code.

> Numerical precision is relative to the input magnitude. Compared to an input on the order of 1e+14, a value on the order of 1e-5 is a very reasonable numerical precision.

> More precisely, if you print standard deviation / mean you should get a value of about 5e-17.

I am not sure exactly what you mean. The standard deviation of that e-5 value is in the e-3 range.

> Since double precision floats offer about 16-17 decimal digits, this is the best you can expect from an inexpensive single-pass algorithm.

As far as I can tell, the non-weighted single-pass implementation in MeanVariance.java is actually the Youngs-Cramer algorithm. At least in this case it seems less accurate than Welford's algorithm (although in general they should have similar numerical behavior) but runs approximately 2x faster. That being said, thanks again; it is clear now why these results occur.
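
For reference, here is a minimal univariate sketch of the two update rules as I understand them (class and field names are mine, not taken from MeanVariance.java):

    // Welford: track the running mean and the sum of squared deviations from it.
    final class WelfordVariance {
      double n, mean, m2;
      void put(double x) {
        n += 1;
        double d = x - mean;      // deviation from the old mean
        mean += d / n;            // one division per element, on the critical path
        m2 += d * (x - mean);     // mixes old and new mean
      }
      double populationVariance() { return m2 / n; }
    }

    // Youngs-Cramer: track the running sum instead of the mean.
    final class YoungsCramerVariance {
      double n, sum, m2;
      void put(double x) {
        if (n <= 0) { n = 1; sum = x; return; }
        double delta = x * n - sum;          // equals oldN * (x - current mean)
        n += 1;
        m2 += delta * delta / (n * (n - 1)); // scale by 1 / (newN * oldN)
        sum += x;
      }
      double populationVariance() { return m2 / n; }
    }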

kno10 commented 3 years ago

You need to compare the value to the mean of the input data. The standard deviation is not 1e-5 times the input mean.
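
For a quick check with the numbers from the first post (my own snippet):

    double naiveVariance = 5.425347222222222e-05;
    double mean = 150494407424305.47;
    // standard deviation relative to the mean of the input:
    System.out.println(Math.sqrt(naiveVariance) / mean); // roughly 4.9e-17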

Try Welford's; it will likely have the same accuracy.

foolnotion commented 3 years ago

I implemented Welford's algorithm in the PearsonCorrelation class in C# (because the real problem was getting a large correlation between two series where one of them is constant):

    public void Add(double x, double y) {
      if (sumWe <= 0.0) {
        sumX = meanX = x;
        sumY = meanY = y;
        sumXX = sumYY = sumXY = 0;
        sumWe = 1;
        return;
      }

      #region Youngs-Cramer algorithm
      // Delta to previous mean
      //double deltaX = x * sumWe - sumX, deltaY = y * sumWe - sumY;
      //double oldWe = sumWe;
      //// Incremental update
      //sumWe += 1;
      //double f = 1.0 / (sumWe * oldWe);
      //// Update
      //sumXX += f * deltaX * deltaX;
      //sumYY += f * deltaY * deltaY;
      //// should equal weight * deltaY * deltaX!
      //sumXY += f * deltaX * deltaY;
      //// Update means
      //sumX += x;
      //sumY += y;
      #endregion

      #region Welford algorithm
      sumWe += 1;

      double dx = x - meanX;
      double dy = y - meanY;

      double f = 1.0 / sumWe;
      meanX += f * dx;
      meanY += f * dy;

      double dx1 = x - meanX;
      double dy1 = y - meanY;

      sumXX += dx * dx1;
      sumYY += dy * dy1;
      sumXY += dx1 * dy;
      #endregion
    }

Adding those large values for X returns zero variance using Welford (although probably there are other cases where Welford fails and Youngs-Cramer doesn't).
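
For context, deriving the correlation from these accumulators would look something like the sketch below (my own code, not the exact ELKI or C# version). The guard matters for the constant-series case: with the Youngs-Cramer-style update, sumXX may hold a tiny rounding residue instead of an exact zero (as in the first post), and dividing by it can make the correlation spuriously large.

    // Sketch: Pearson correlation from the accumulators above.
    static double correlation(double sumXY, double sumXX, double sumYY) {
      if (!(sumXX > 0.0) || !(sumYY > 0.0)) {
        return 0.0; // undefined when a series is constant; pick a convention
      }
      return sumXY / Math.sqrt(sumXX * sumYY);
    }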

kno10 commented 3 years ago

It actually makes sense that Welford works better for the constant case, because it computes the difference from the current mean (which will then be exactly 0), whereas the Youngs-Cramer approach suffers from the well-known incremental sum problem:

    sum = 0
    for i in range(10): sum += 0.1
    # sum is 0.9999999999999999

OTOH, the Youngs-Cramer style often seems to be faster because of better CPU pipelining, and a constant series is not the typical case where you would use this in the first place, because then the solution is trivial.

foolnotion commented 3 years ago

I was curious about the performance so I dug a little deeper and implemented SIMD versions of Welford and Youngs-Cramer using C++/Eigen. The results are interesting. Youngs-Cramer is insanely fast.

Measurements were taken with nanobench using 100M random doubles sampled from a U(-100, 100) uniform distribution. The test CPU is an AMD Ryzen 3900X. The Eigen code usually compiles down to AVX; the code was built with GCC 10.2.0 using -O3 -march=znver2.

What I call schubert below is the unmodified single-pass MeanVariance implementation taken from ELKI (just ported to C++ but otherwise identical).

| relative |               ns/op |                op/s |    err% |          ins/op |         bra/op |   miss% |     total | benchmark
|---------:|--------------------:|--------------------:|--------:|----------------:|---------------:|--------:|----------:|:----------
|   100.0% |      659,724,346.00 |                1.52 |    0.3% |1,700,000,701.00 | 200,000,675.00 |    0.0% |      7.26 | `welford`
|   141.8% |      465,238,978.00 |                2.15 |    0.1% |1,400,000,493.00 | 100,000,469.00 |    0.0% |      5.12 | `welford-batch`
|   570.3% |      115,671,531.00 |                8.65 |    0.3% |  400,000,248.00 |  25,000,125.00 |    0.0% |      1.27 | `welford-eigen`
|   230.5% |      286,269,511.00 |                3.49 |    0.2% |2,400,000,314.00 | 400,000,289.00 |    0.0% |      3.15 | `schubert`
|   236.3% |      279,231,306.00 |                3.58 |    0.3% |1,900,000,308.00 | 200,000,284.00 |    0.0% |      3.07 | `youngs-cramer`
|   473.8% |      139,236,883.00 |                7.18 |    0.4% |1,600,000,164.00 | 100,000,141.00 |    0.0% |      1.53 | `youngs-cramer-batch`
| 1,822.9% |       36,190,293.00 |               27.63 |    1.1% |  425,000,150.00 |  25,000,048.00 |    0.0% |      0.40 | `youngs-cramer-eigen`

The means and variances are reasonably close:

    schubert:            [0.00042451354762833076, 3332.9993823457685]
    welford:             [0.0004245135476293431, 3332.9993823457685]
    welford-batch:       [0.0004245135476293431, 3332.9993823457685]
    welford-eigen:       [0.00042451354762913676, 3332.999382346099]
    youngs-cramer:       [0.00042451354762833076, 3332.9993823457685]
    youngs-cramer-batch: [0.00042451354762833076, 3332.9993823457685]
    youngs-cramer-eigen: [0.0004245135476299173, 3332.999382346099]

The code:

I was quite surprised about these speed-ups but couldn't find any obvious bugs in the code; the returned values seem to check out between the versions.

You are right about the pipelining: in Welford, the division is the main bottleneck, presumably due to data dependencies (the generated code itself looks efficient, with AVX instructions).

kno10 commented 3 years ago

My experiments with hand-written AVX in the linked paper were similar. However, computing the variance of a double[] is a very simple case, and the two-pass algorithm also scores very well on it: it only does a single division between the passes. Things get more interesting when you want to integrate this into other problems, such as GMM clustering, where you need weighted covariances and do not want to keep all weights in memory, or data arriving as a stream; but in these cases, AVX parallelization of the aggregation operation is the least concern.
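
For reference, a minimal sketch of the two-pass scheme referred to here (my own illustration, not the ELKI code path):

    // Two-pass variance: the first pass computes the mean, the second pass sums
    // squared deviations from it. The only division between the passes is the
    // one producing the mean.
    static double twoPassPopulationVariance(double[] data) {
      double sum = 0;
      for (double v : data) {
        sum += v;
      }
      double mean = sum / data.length;
      double sq = 0;
      for (double v : data) {
        double d = v - mean;
        sq += d * d;
      }
      return sq / data.length;
    }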