boostorg / accumulators

An awesome library from Boost
http://boost.org/libs/accumulators
22 stars 54 forks source link

Computed Variance Can be Negative #26

Open NAThompson opened 5 years ago

NAThompson commented 5 years ago

This is an example from Golub's paper:

#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics.hpp>

template<class Real>
void golub_example()
{
    std::random_device rd{};
    auto seed = rd();
    std::mt19937 gen(seed);
    std::normal_distribution<Real> dis(500.0, 0.01);

    std::vector<Real> v(50000);
    for(size_t i = 0; i < v.size(); ++i)
    {
        v[i] = dis(gen);
    }

    accumulator_set<Real, stats<tag::variance(lazy)>> acc;
    acc = std::for_each(v.begin(), v.end(), acc);
    std::cout << "Variance = " << variance(acc) << "\n";
    std::cout << "Seed = " << seed << "\n";
}

int main()
{
   golub_example<float>();
}

Output:

Variance (accumulator) = -79.9219
Seed = 1925235287
NAThompson commented 5 years ago

@CromwellEnage : I saw on the mailing list that you are now looking to maintain this library; could you fix this bug? The fix is in Higham's book Accuracy and Stability of Numerical Algorithms, first chapter.

CromwellEnage commented 5 years ago

@NAThompson, I'll look into it and let you know when I or someone else have implemented a solution.

NAThompson commented 5 years ago

@CromwellEnage : This also implements the correct algorithm.

antoine79 commented 5 months ago

@NAThompson , variance(lazy) by definition is lazy and computes variance as the difference between the two accumulators. This issue can be closed in my opinion as variance(immediate) exists already and behaves very closely to the algorithm that you point out. See:

#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics.hpp>

using namespace boost::accumulators;

template<class Real>
void golub_example()
{
    std::mt19937 gen(1925235287);
    int N = 50000;
    std::normal_distribution<Real> dis(500.0, 0.01);

    std::vector<Real> v(N);
    for(size_t i = 0; i < v.size(); ++i)
    {
        v[i] = dis(gen);
    }

    accumulator_set<Real, stats<tag::variance(immediate)>> acc;
    acc = std::for_each(v.begin(), v.end(), acc);
    std::cout << "Variance = " << variance(acc) << "\n";
}

int main()
{
   golub_example<float>();
}

Output:

Variance = 0.000100213

Now, if N=500e3 instead of 50e3, then we have another numeric problem as the variance becomes 0.7 instead of 1e-4. Another algorithm becomes necessary, like the shifted algorithm where variance is computed by subtracting from each value the value of the first data point. I have a proposal if that's of interest to anyone.