awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

fix ratio in constraint_message #524

Open Aigul9 opened 10 months ago

Aigul9 commented 10 months ago

Hello!

I've just set up the library and noticed this thing:

Here is the data example: image

The tests: image

And the sample of the results: image

As you can see, the first constraint_message says that 60% of the data didn't meet the requirement, although 60% of it did. The second row says that 0% didn't meet it, which would mean that 100% passed, though the opposite is true: none of the values in the ga_visits column is unique.

Description of changes: I propose changing the formula that calculates the ratio in constraint_message so that it becomes the ratio of mismatched values. If we use val ratio = mismatchCount.toDouble / primaryCount, the results for my case would be 4/10 = 0.4 and 10/10 = 1 "didn't meet the constraint requirement".
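To make the arithmetic concrete, here is a minimal standalone sketch of the two ratios. It is not the Deequ source: the object name RatioSketch and both helper functions are hypothetical, and the "matched" formula is only my reading of what the current message seems to report, based on the results above.

```scala
// Hypothetical standalone sketch; names are illustrative and not taken from Deequ.
object RatioSketch {

  // Share of rows that met the constraint. Assumption: this is what the
  // message appears to report today, judging by the sample results.
  def matchedRatio(primaryCount: Long, mismatchCount: Long): Double =
    (primaryCount - mismatchCount).toDouble / primaryCount

  // Share of rows that did NOT meet the constraint (the proposed value).
  def mismatchedRatio(primaryCount: Long, mismatchCount: Long): Double =
    mismatchCount.toDouble / primaryCount

  def main(args: Array[String]): Unit = {
    // First row of my example: 10 rows, 4 of them mismatched.
    println(matchedRatio(10, 4))      // 0.6
    println(mismatchedRatio(10, 4))   // 0.4

    // Second row: 10 rows, all 10 mismatched.
    println(matchedRatio(10, 10))     // 0.0
    println(mismatchedRatio(10, 10))  // 1.0
  }
}
```

Note that the sketch does not guard against primaryCount being 0; a real change would probably need to handle that case explicitly.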

Another approach is to omit the word "not" from the message; however, I'm not sure whether that fits the logic.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.