apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Geometric mean calculation #7849

Closed jmarxuach closed 4 years ago

jmarxuach commented 5 years ago

Description

Most people are familiar with the “arithmetic mean”, which is also commonly called an average. Geometric mean has utility in science, finance, and statistics.

A geometric mean, unlike an arithmetic mean, tends to dampen the effect of very high or low values, which might bias the mean if a straight average (arithmetic mean) were calculated.

Mathematical definition: The nth root of the product of n numbers.

Practical definition: The average of the logarithms of the values in a data set, converted back from the log scale. That is: (1/count) * (log(n1) + log(n2) + log(n3) + ...) = LogMean, where log is the base-10 logarithm. We then convert back by calculating 10^LogMean. All values must be positive for the logarithms to be defined.
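To make the equivalence of the two definitions concrete, here is a minimal sketch in plain JavaScript. The function names and the sample values are illustrative assumptions, not part of Druid, and the inputs are assumed to be strictly positive.

```javascript
// Geometric mean as the nth root of the product (mathematical definition).
function geoMeanProduct(values) {
  const product = values.reduce((acc, v) => acc * v, 1);
  return Math.pow(product, 1 / values.length);
}

// Geometric mean as 10^(mean of base-10 logs) (practical definition).
function geoMeanLog(values) {
  const logSum = values.reduce((acc, v) => acc + Math.log10(v), 0);
  return Math.pow(10, logSum / values.length);
}

// Illustrative values whose product is 64, so both results are close to 4.
const sample = [2, 8, 4];
console.log(geoMeanProduct(sample), geoMeanLog(sample));
```

The two results agree up to floating-point rounding. The log form is the one that fits an aggregation engine, because a sum of logs can be accumulated row by row and merged across partial results, whereas a running product can overflow quickly.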

Implementation in Druid: Druid currently has longSum, doubleSum, etc. To get the geometric mean we would need a logSum metric and a post-aggregation that calculates 10^(logSum/count). That's it.

I think the implementation is very simple, and the geometric mean is very useful, as I explain below.

Motivation

The geometric mean is used by scientists and biologists, and in many other fields, most notably financial reporting. When evaluating investment returns as annual percent change data over several years (or fluctuating interest rates), it is the geometric mean, not the arithmetic mean, that tells you what the average rate of return would have had to be over the entire investment period to achieve the end result. This rate is also called the Compound Annual Growth Rate (CAGR). Population biologists use the same calculation to determine average growth rates of populations; this growth rate is referred to as the Intrinsic Rate of Growth when the calculation is applied to estimates of population increases where there are no density-dependent forces regulating the population.
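As a small worked example of the financial case, the sketch below computes an average compound rate from yearly returns. The return figures are hypothetical and chosen only for illustration; each return r is first converted to a growth factor (1 + r) so that every input to the logarithm is positive.

```javascript
// Hypothetical yearly returns: +10%, -5%, +20% (illustrative only).
const yearlyReturns = [0.10, -0.05, 0.20];

// Convert each return to a growth factor (1 + r) so all inputs are positive.
const growthFactors = yearlyReturns.map((r) => 1 + r);

// Geometric mean of the growth factors via the log-sum form.
const logSum = growthFactors.reduce((acc, g) => acc + Math.log10(g), 0);
const meanFactor = Math.pow(10, logSum / growthFactors.length);

// Average compound rate per year, and the arithmetic mean for comparison.
const cagr = meanFactor - 1;
const arithmeticMean =
  yearlyReturns.reduce((acc, r) => acc + r, 0) / yearlyReturns.length;

// Compounding the CAGR over the period reproduces the actual end result.
const actualGrowth = growthFactors.reduce((acc, g) => acc * g, 1);
const compounded = Math.pow(1 + cagr, growthFactors.length);
console.log({ cagr, arithmeticMean, actualGrowth, compounded });
```

Compounding the geometric-mean rate for three years matches the actual growth up to rounding, while the arithmetic mean of the same returns is higher and would overstate the end result.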

Druid is a good fit for storing financial or biological events, and for that kind of data the geometric mean is essential.

Thanks.

jmarxuach commented 5 years ago

After some research, I found a solution.

I enabled JavaScript and created an aggregator and a post-aggregator.

Aggregator SUM_LOG = log(n1) + log(n2) + log(n3) + ... (base-10 logs, positive values only)


{
  "type": "javascript",
  "name": "SUM_LOG",
  "fieldNames": ["column"],
  "fnAggregate" : "function(current, column)    { if (column > 0) return current + (Math.log(column) / Math.log(10)); else return current; }",
  "fnCombine"   : "function(partialA, partialB) { return partialA + partialB; }",
  "fnReset"     : "function()                   { return 0; }"
}

Then I created a post-aggregator : 10^(logSum/count)

{
  "type": "javascript",
  "name": "GEO_MEAN",
  "fieldNames" : ["SUM_LOG", "count"],
  "function": "function(SUM_LOG, count) { return Math.pow(10, (SUM_LOG/count)); }"
}
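The aggregation logic can be checked outside Druid with a small simulation. This is only a sketch of the intended behaviour: it assumes non-positive values are skipped rather than resetting the running sum, it uses hypothetical rows split across two hypothetical segments, and it does not reproduce how Druid actually schedules or merges work.

```javascript
// Intended aggregation step: add log10(column) for positive values.
// Assumption: non-positive values are skipped and leave the sum unchanged.
const fnAggregate = function (current, column) {
  if (column > 0) return current + Math.log(column) / Math.log(10);
  return current;
};
const fnCombine = (partialA, partialB) => partialA + partialB;
const fnReset = () => 0;

// Post-aggregation step: 10^(SUM_LOG / count).
const postAggregate = (sumLog, count) => Math.pow(10, sumLog / count);

// Hypothetical rows split across two segments.
const segments = [[1, 10], [100, 1000]];

// Aggregate each segment independently, then merge the partial sums.
const partials = segments.map((rows) => rows.reduce(fnAggregate, fnReset()));
const sumLog = partials.reduce(fnCombine, fnReset());
const count = segments.reduce((n, rows) => n + rows.length, 0);

const geoMean = postAggregate(sumLog, count);
console.log(sumLog, count, geoMean);
```

For these rows the result is close to the square root of 1000, which is the fourth root of 1 * 10 * 100 * 1000. One caveat: a plain row count includes rows whose value was skipped for being non-positive, so if such rows can occur, the count used in the post-aggregator would need to exclude them for the result to be a true geometric mean.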

And that's it! The logSum feature I proposed is not needed if you enable JavaScript.

stale[bot] commented 4 years ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.