At the moment, var does a naive sum by adding up the squared deviations from the mean. However, when var is called on a collection, we can speed it up and also reduce the floating-point error significantly by using pairwise summation with a recursive algorithm -- roughly:
(Note that this would require implementing fused statistics like mean_and_var from StatsBase, or else we would have to do more than one pass -- one for mean and one for var.)
Interesting. Do you have references about this? One tricky part would be computing the variance from the partial means without storing them in an intermediate array; otherwise the performance benefit would probably be lost.
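For what it's worth, here is a minimal sketch of how the recursive scheme could avoid an intermediate array of means. It is an assumption about what such an implementation might look like, not the code from the original post and not anything that exists in Statistics or StatsBase. The names `pairwise_moments` and `pairwise_mean_and_var` and the block size of 128 are hypothetical. Each half returns its count, mean, and sum of squared deviations, and the two halves are merged with the standard pairwise update formula (usually attributed to Chan, Golub and LeVeque).

```julia
# Hypothetical helper, not part of Statistics or StatsBase.
# Returns (n, mean, M2) for x[lo:hi], where M2 is the sum of squared
# deviations from the mean of that range.
function pairwise_moments(x::AbstractVector{<:Real}, lo::Int, hi::Int, blocksize::Int=128)
    n = hi - lo + 1
    if n <= blocksize
        # Base case: plain two-pass computation on a small block.
        m = sum(@view x[lo:hi]) / n
        m2 = zero(m)
        for i in lo:hi
            m2 += (x[i] - m)^2
        end
        return n, m, m2
    end
    # Recursive case: split in half and merge the two partial results.
    mid = lo + (hi - lo) ÷ 2
    na, ma, m2a = pairwise_moments(x, lo, mid, blocksize)
    nb, mb, m2b = pairwise_moments(x, mid + 1, hi, blocksize)
    δ = mb - ma
    m = ma + δ * nb / n
    m2 = m2a + m2b + δ^2 * na * nb / n
    return n, m, m2
end

# Hypothetical fused entry point returning (mean, variance).
function pairwise_mean_and_var(x::AbstractVector{<:Real}; corrected::Bool=true)
    isempty(x) && throw(ArgumentError("collection must be non-empty"))
    n, m, m2 = pairwise_moments(x, firstindex(x), lastindex(x))
    return m, m2 / (n - corrected)
end
```

The partial means only ever live as return values on the recursion stack, so nothing is allocated for them. Whether this actually beats the current two-pass `var` would need benchmarking; the base-case block size in particular is a guess.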