Open rmaheshkumarblr opened 2 years ago
I think the reason for doing the sum and then the division is to account for previous states when updating the mean. That said, this is indeed a bug: Double doesn't have the same precision as Long, so overflows will be missed.
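For illustration, here is a minimal sketch (not deequ's actual code; the names are made up) of why an analyzer would keep a `(sum, count)` state rather than a finished mean: two partial states can be merged exactly, while two means cannot be combined without their counts.

```scala
// Illustrative sketch only: a mergeable running-mean state.
final case class MeanState(sum: Double, count: Long) {
  // Merging partial states is exact; merging two finished means would not be.
  def merge(other: MeanState): MeanState =
    MeanState(sum + other.sum, count + other.count)

  def metricValue: Double =
    if (count == 0L) Double.NaN else sum / count
}

object MeanStateExample extends App {
  val previousRun = MeanState(sum = 1.0e6, count = 1000L) // state from an earlier run
  val newData     = MeanState(sum = 2.0e6, count = 1000L) // state from new data
  println(previousRun.merge(newData).metricValue)         // 1500.0
}
```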
There's a larger problem: all metrics are currently represented as Double, so we'll need to change some of the underlying architecture to support Long metric values as well.
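As a rough illustration of that architectural point (hypothetical names only, this is not deequ's actual Metric hierarchy), a metric value could be modeled as a small ADT instead of a bare Double:

```scala
// Hypothetical sketch: letting metric values carry more than Double.
sealed trait MetricValue
final case class DoubleValue(value: Double)      extends MetricValue
final case class LongValue(value: Long)          extends MetricValue
final case class DecimalValue(value: BigDecimal) extends MetricValue

object MetricValue {
  // Each variant keeps its own exactness when rendered or compared.
  def render(m: MetricValue): String = m match {
    case DoubleValue(v)  => v.toString
    case LongValue(v)    => v.toString
    case DecimalValue(v) => v.toString
  }
}
```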
Maybe we can go with a simple change? Move from:
Double -> BigDecimal
Long -> BigInt
Any ideas on how this should be solved? I'm happy to help here.
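To make the suggestion concrete, here is a hedged sketch (illustrative names, not deequ's code) of a mean state carried as BigDecimal/BigInt, so that neither Long overflow nor Double precision loss can corrupt the result:

```scala
// Illustrative sketch of the proposed Double -> BigDecimal, Long -> BigInt change.
final case class ExactMeanState(sum: BigDecimal, count: BigInt) {
  def merge(other: ExactMeanState): ExactMeanState =
    ExactMeanState(sum + other.sum, count + other.count)

  // Scala's BigDecimal division uses the default MathContext (DECIMAL128),
  // so non-terminating decimals are rounded instead of throwing.
  def metricValue: BigDecimal =
    if (count == 0) BigDecimal(0) else sum / BigDecimal(count)
}

object ExactMeanExample extends App {
  // Three values of Long.MaxValue would overflow a Long sum, but stay exact here.
  val state = ExactMeanState(BigDecimal(Long.MaxValue) * 3, BigInt(3))
  println(state.metricValue) // 9223372036854775807
}
```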
Mean is calculated incorrectly when the values in the column are very large (for example, an epoch timestamp) and the dataset is also very large.
Based on my analysis:
In Mean.scala, the mean is not calculated with Spark's mean function directly; instead, the sum is calculated, the count is calculated, and then a division is performed.
https://github.com/awslabs/deequ/blob/933417676189bc7833166f976fd024a4b2177292/src/main/scala/com/amazon/deequ/analyzers/Mean.scala#L32
Spark's sum returns a bigint for this column, so if the sum is really large an overflow occurs and the output is incorrect. Using Spark's mean function instead gives the correct result.
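A minimal standalone reproduction of what I mean might look like the following (assumptions: a local Spark session, ANSI mode off so an integral sum wraps instead of failing, and a made-up column name `epoch_ts`):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count, sum}

object MeanOverflowDemo extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("mean-overflow-demo")
    .getOrCreate()
  import spark.implicits._

  // Four values whose total exceeds Long.MaxValue (~9.22e18).
  val df = Seq.fill(4)(Long.MaxValue / 3).toDF("epoch_ts")

  // With ANSI mode off, sum over a bigint column wraps around silently,
  // so sum/count is wrong, while avg returns the expected value here.
  df.agg(sum($"epoch_ts"), count($"epoch_ts"), avg($"epoch_ts")).show(false)

  spark.stop()
}
```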
I don't have the full context behind calculating the sum and count and then deriving the mean. Would love to hear more about it.