elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.95k stars 24.74k forks source link

Variance calculation should use more sophisticated algorithm #58275

Open polyfractal opened 4 years ago

polyfractal commented 4 years ago

ExtendedStats uses a naive algorithm for calculating variance which can lead to floating point errors when the difference between sumSq and sum is very small (see here and here). This won't be caught by our existing Kahan summation which is more concerned about accumulated errors when summing values.

There are alternative algos that behave better, which will be increasingly important as other tools build on variance (like t-test).

elasticmachine commented 4 years ago

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

mysticaltech commented 4 years ago

Yes exactly, the way it is done now with sum of squares can lead to a numerical phenomenon called Catastrophic Cancellation due to floating point arithmetic, that ultimately calculates an incorrect variance and even sometime negative values. The Welford algorithm is the key, and the way I see it done in Elastic Search is to first implement an aggregate called the M2 moment that the Welford algo uses, when this is done, the formula becomes straightforward. More info on all of this here https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm

vishnugt commented 4 years ago

Can I send a PR to change the existing algo to Welford algorithm?

mysticaltech commented 4 years ago

I think a PR never hurts, this would be great! 😇 🙏🏻

vishnugt commented 4 years ago

@polyfractal I read your comment here, so shall I leave this to be handled by a pro?

nik9000 commented 4 years ago

If you open a PR with the new algorithm one of us would be happy to advise on the backwards compatibility side. There is usually something fairly simple that'd make it work. The difficult thing is usually figuring out what that is.

On Mon, Jun 22, 2020, 08:09 Vishnu Gt notifications@github.com wrote:

@polyfractal https://github.com/polyfractal I read your comment here https://github.com/elastic/elasticsearch/pull/37384#issuecomment-645999659, so shall I leave this to be handled by a pro?

— You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub https://github.com/elastic/elasticsearch/issues/58275#issuecomment-647476322, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABUXIQEQBGYACTAXADR3MTRX5CWXANCNFSM4OAY6YPQ .

vishnugt commented 4 years ago

Awesome, I'll start working on the PR with the new algorithm.

polyfractal commented 4 years ago

Yeah, what Nik said :) Usually these sort of things fall into one of two camps:

  1. New algo is compatible with old algo somehow (maybe by falling back to loss of precision, etc), so it just has to be "translated" back and forth
  2. New algo is fundamentally incompatible, so we end up maintaining "old" in newer versions for a while while the BWC matrix overlaps.

Since this is a loss of precision situation, we can probably figure out a way to translate back/forth. But that's just a guess, haven't really thought about it much :)

vishnugt commented 4 years ago

I suspect the PR I submitted for this issue was somehow missed, so commenting here, just in case.

nik9000 commented 4 years ago

I suspect the PR I submitted for this issue was somehow missed, so commenting here, just in case.

Yup, it did. I've marked it for review on our side. I'll see about reviewing it in a bit if no one else gets to it first.

elasticsearchmachine commented 11 months ago

Pinging @elastic/es-analytics-geo (Team:Analytics)