awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

Incorrect Mean calculation #426

Open rmaheshkumarblr opened 2 years ago

rmaheshkumarblr commented 2 years ago

Mean is calculated incorrectly when the column values are very large (example: epoch timestamps) and the dataset itself is large as well.

Based on my analysis:

I don't have the full context on why Sum and Count are calculated first and the Mean is then derived from them. Would love to hear more about it.

shehzad-qureshi commented 1 year ago

I think the reason for computing the sum and then dividing is to account for previous states when updating the mean. That being said, this is indeed a bug, because Double doesn't have the same precision as Long and overflows will be missed.
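
To illustrate the point above, here is a minimal sketch of a mergeable mean state. `MeanState`, `MeanStateDemo`, and `stateOf` are hypothetical names made up for this example and are not Deequ's actual classes. It shows why keeping sum and count separately makes merging earlier states easy, and how a Long value above 2^53 stops being exactly representable once it is converted to Double.

```scala
// Hypothetical stand-in for a mergeable mean state; not Deequ's actual class.
final case class MeanState(sum: Double, count: Long) {
  // Merging two partial states only needs additions, which is why
  // sum and count are kept separately instead of storing the mean.
  def merge(other: MeanState): MeanState =
    MeanState(sum + other.sum, count + other.count)

  def mean: Double = if (count == 0L) Double.NaN else sum / count
}

object MeanStateDemo {
  // Builds a state for one batch, converting each Long to Double first.
  def stateOf(values: Seq[Long]): MeanState =
    MeanState(values.map(_.toDouble).sum, values.length.toLong)

  def main(args: Array[String]): Unit = {
    val batch1 = stateOf(Seq(1L, 2L, 3L))
    val batch2 = stateOf(Seq(4L, 5L))
    println(batch1.merge(batch2).mean) // 3.0

    // A Double has a 53-bit significand, so not every Long above 2^53
    // survives a round trip through Double.
    val big = (1L << 53) + 1L
    println(big.toDouble.toLong == big) // false
  }
}
```

With small values the merged mean is exact. With large values such as sums of epoch timestamps over many rows, the Double sum can silently drop low-order digits before the division happens.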

There's a larger problem that all metrics are currently represented by Double so we'll need to change some of the underlying architecture to support Long metric values as well.

explicite commented 1 year ago

Maybe we can go with a simple change? Move from Double -> BigDecimal and Long -> BigInt.
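
A rough sketch of what I have in mind, using plain Scala collections rather than Deequ's analyzers. `ExactMean`, `exactSum`, and `exactMean` are hypothetical helper names for illustration only. Note that Scala's `BigDecimal` uses a 34-digit math context (`DECIMAL128`) by default, so the result is far more precise than Double but still not unbounded.

```scala
object ExactMean {
  // Sum in BigInt so that neither Long overflow nor Double rounding can occur.
  def exactSum(values: Seq[Long]): BigInt =
    values.foldLeft(BigInt(0))((acc, v) => acc + BigInt(v))

  // Divide as BigDecimal; None for an empty input instead of NaN.
  def exactMean(values: Seq[Long]): Option[BigDecimal] =
    if (values.isEmpty) None
    else Some(BigDecimal(exactSum(values)) / BigDecimal(values.length))

  def main(args: Array[String]): Unit = {
    val values = Seq(Long.MaxValue, Long.MaxValue)
    println(values.sum)        // Long arithmetic wraps around: -2
    println(exactSum(values))  // 18446744073709551614
    println(exactMean(values) == Some(BigDecimal(Long.MaxValue))) // true
  }
}
```

The sum and count can still be merged across states as before; only the numeric types change.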

Is there any idea how this should be solved? I'm happy to help here.