Open rmaheshkumarblr opened 2 years ago
I think the reason for doing the sum and then the division is to account for previous states when updating the mean. That said, this is indeed a bug: Double doesn't have the same precision as Long, so overflows will be missed.
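For illustration, here is a minimal sketch (not deequ's actual code; the names are made up) of why an analyzer would keep a `(sum, count)` state rather than a finished mean: two partial states can be merged exactly, while two means cannot be combined without their counts.

```scala
// Illustrative sketch only: a mergeable running-mean state.
final case class MeanState(sum: Double, count: Long) {
  // Merging partial states is exact; merging two finished means would not be.
  def merge(other: MeanState): MeanState =
    MeanState(sum + other.sum, count + other.count)

  def metricValue: Double =
    if (count == 0L) Double.NaN else sum / count
}

object MeanStateExample extends App {
  val previousRun = MeanState(sum = 1.0e6, count = 1000L) // state from an earlier run
  val newData     = MeanState(sum = 2.0e6, count = 1000L) // state from new data
  println(previousRun.merge(newData).metricValue)         // 1500.0
}
```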
There's a larger problem: all metrics are currently represented as Double, so we'll need to change some of the underlying architecture to support Long metric values as well.
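As a rough illustration of that architectural point (hypothetical names only, this is not deequ's actual Metric hierarchy), a metric value could be modeled as a small ADT instead of a bare Double:

```scala
// Hypothetical sketch: letting metric values carry more than Double.
sealed trait MetricValue
final case class DoubleValue(value: Double)      extends MetricValue
final case class LongValue(value: Long)          extends MetricValue
final case class DecimalValue(value: BigDecimal) extends MetricValue

object MetricValue {
  // Each variant keeps its own exactness when rendered or compared.
  def render(m: MetricValue): String = m match {
    case DoubleValue(v)  => v.toString
    case LongValue(v)    => v.toString
    case DecimalValue(v) => v.toString
  }
}
```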
Maybe we can go with a simple change? Move from:
Double -> BigDecimal
Long -> BigInt
Any ideas on how this should be solved? I'm happy to help here.
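To make the suggestion concrete, here is a hedged sketch (illustrative names, not deequ's code) of a mean state carried as BigDecimal/BigInt, so that neither Long overflow nor Double precision loss can corrupt the result:

```scala
// Illustrative sketch of the proposed Double -> BigDecimal, Long -> BigInt change.
final case class ExactMeanState(sum: BigDecimal, count: BigInt) {
  def merge(other: ExactMeanState): ExactMeanState =
    ExactMeanState(sum + other.sum, count + other.count)

  // Scala's BigDecimal division uses the default MathContext (DECIMAL128),
  // so non-terminating decimals are rounded instead of throwing.
  def metricValue: BigDecimal =
    if (count == 0) BigDecimal(0) else sum / BigDecimal(count)
}

object ExactMeanExample extends App {
  // Three values of Long.MaxValue would overflow a Long sum, but stay exact here.
  val state = ExactMeanState(BigDecimal(Long.MaxValue) * 3, BigInt(3))
  println(state.metricValue) // 9223372036854775807
}
```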
Mean is calculated incorrectly when the values in the column are very large (for example, an epoch timestamp) and the dataset is also very large.
Based on my analysis:
In Mean.scala, the mean is not calculated with Spark's mean function directly; instead, the sum is calculated, the count is calculated, and then a division is performed.
https://github.com/awslabs/deequ/blob/933417676189bc7833166f976fd024a4b2177292/src/main/scala/com/amazon/deequ/analyzers/Mean.scala#L32
Spark's sum returns a bigint for this column, so if the sum is really large an overflow occurs and the output is incorrect. Using Spark's mean function instead gives the correct result.
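A minimal standalone reproduction of what I mean might look like the following (assumptions: a local Spark session, ANSI mode off so an integral sum wraps instead of failing, and a made-up column name `epoch_ts`):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count, sum}

object MeanOverflowDemo extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("mean-overflow-demo")
    .getOrCreate()
  import spark.implicits._

  // Four values whose total exceeds Long.MaxValue (~9.22e18).
  val df = Seq.fill(4)(Long.MaxValue / 3).toDF("epoch_ts")

  // With ANSI mode off, sum over a bigint column wraps around silently,
  // so sum/count is wrong, while avg returns the expected value here.
  df.agg(sum($"epoch_ts"), count($"epoch_ts"), avg($"epoch_ts")).show(false)

  spark.stop()
}
```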
I don't have the full context behind calculating the sum and count and then deriving the mean. Would love to hear more about it.