Open polyfractal opened 4 years ago
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)
Yes exactly, the way it is done now with sum of squares can lead to a numerical phenomenon called Catastrophic Cancellation due to floating point arithmetic, that ultimately calculates an incorrect variance and even sometime negative values. The Welford algorithm is the key, and the way I see it done in Elastic Search is to first implement an aggregate called the M2 moment that the Welford algo uses, when this is done, the formula becomes straightforward. More info on all of this here https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
Can I send a PR to change the existing algo to Welford algorithm?
I think a PR never hurts, this would be great! 😇 🙏🏻
@polyfractal I read your comment here, so shall I leave this to be handled by a pro?
If you open a PR with the new algorithm one of us would be happy to advise on the backwards compatibility side. There is usually something fairly simple that'd make it work. The difficult thing is usually figuring out what that is.
On Mon, Jun 22, 2020, 08:09 Vishnu Gt notifications@github.com wrote:
@polyfractal https://github.com/polyfractal I read your comment here https://github.com/elastic/elasticsearch/pull/37384#issuecomment-645999659, so shall I leave this to be handled by a pro?
— You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub https://github.com/elastic/elasticsearch/issues/58275#issuecomment-647476322, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABUXIQEQBGYACTAXADR3MTRX5CWXANCNFSM4OAY6YPQ .
Awesome, I'll start working on the PR with the new algorithm.
Yeah, what Nik said :) Usually these sort of things fall into one of two camps:
Since this is a loss of precision situation, we can probably figure out a way to translate back/forth. But that's just a guess, haven't really thought about it much :)
I suspect the PR I submitted for this issue was somehow missed, so commenting here, just in case.
I suspect the PR I submitted for this issue was somehow missed, so commenting here, just in case.
Yup, it did. I've marked it for review on our side. I'll see about reviewing it in a bit if no one else gets to it first.
Pinging @elastic/es-analytics-geo (Team:Analytics)
ExtendedStats uses a naive algorithm for calculating variance which can lead to floating point errors when the difference between
sumSq
andsum
is very small (see here and here). This won't be caught by our existing Kahan summation which is more concerned about accumulated errors when summing values.There are alternative algos that behave better, which will be increasingly important as other tools build on variance (like t-test).