OHDSI / WhiteRabbit

WhiteRabbit is a small application that can be used to analyse the structure and contents of a database as preparation for designing an ETL. It comes with RabbitInAHat, an application for interactive design of an ETL to the OMOP Common Data Model with the help of the scan report generated by WhiteRabbit.
http://ohdsi.github.io/WhiteRabbit
Apache License 2.0

Numeric stats reveal values with small cell count in some cases #302

Open MaximMoinat opened 3 years ago

MaximMoinat commented 3 years ago

The combination of numeric statistics and value counts can reveal information about values that occur less frequently than the configured small cell count.

Observed behaviour

Let's say we have a field with 1000 records that is empty except for the one value of 2.0.

The value count correctly hides the value with a small cell count:

| LDEX | Frequency |
| --- | --- |
| | 999 |
| List truncated... | |

But from the field overview we can see that the one truncated value was 2, because every numeric statistic equals this single value.

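| Table | Field | Type | Max length | N rows | N rows checked | Fraction empty | N unique values | Average | Min | 0,25 | Median | 0,75 | Max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| table | field | VARCHAR | 1 | 1000 | 1000 | 100,0% | 2 | 2,00 | 2,00 | 2,00 | 2,00 | 2,00 | 2,00 |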
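
The leak can be reproduced outside WhiteRabbit. The following is a minimal Python sketch, not WhiteRabbit's own code (WhiteRabbit is written in Java); `numeric_stats` is a hypothetical helper that computes the same kind of summary over the non-empty values of a field. With a single non-empty value, every statistic collapses to that value.

```python
import statistics


def numeric_stats(values):
    """Summary statistics over the non-empty numeric values of a field.

    Hypothetical helper for illustration; not WhiteRabbit's implementation.
    """
    ordered = sorted(values)
    if len(ordered) == 1:
        # With one value, every quantile is that value.
        q1 = median = q3 = ordered[0]
    else:
        q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "average": statistics.fmean(ordered),
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }


# 1000 records, empty except for a single value of 2.0.
field = [None] * 999 + [2.0]
non_empty = [v for v in field if v is not None]

stats = numeric_stats(non_empty)
# The set of distinct statistic values: one element means the raw value is exposed.
leaked = set(stats.values())
```

Here `leaked` contains only the one hidden value, which is exactly what the field overview exposes.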

Expected behaviour

Values with a small cell count are also hidden from the numeric statistics. In this example, the numeric stats should not be shown at all, as they reveal too much information.

Proposed solution

Two options:

  1. Simple solution: if the number of unique values in a field is small (e.g. < 5 unique values), do not output the numeric statistics for that field.
  2. Rigorous solution: hide a numeric statistic if its value occurs in the data with a frequency lower than the small cell count.
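
Both options could look roughly like the sketch below. This is Python for illustration only, not a patch against WhiteRabbit's Java code; the function names, the `min_unique` threshold and the default `min_cell_count` of 5 are assumptions.

```python
from collections import Counter


def show_stats_simple(n_unique_values, min_unique=5):
    """Option 1: only show numeric statistics when the field has at least
    `min_unique` distinct values (threshold is an assumed example value)."""
    return n_unique_values >= min_unique


def mask_stats_rigorous(stats, values, min_cell_count=5):
    """Option 2: hide each statistic that equals a value occurring in the
    data with a frequency below `min_cell_count`.

    `stats` maps statistic names to numbers; `values` are the non-empty
    numeric values of the field. Hidden statistics are returned as None.
    """
    counts = Counter(values)
    masked = {}
    for name, stat in stats.items():
        freq = counts.get(stat, 0)
        # freq == 0 means the statistic is not an actual data value
        # (e.g. an average between values), so it is left visible.
        masked[name] = None if 0 < freq < min_cell_count else stat
    return masked


# The example from this issue: one non-empty value of 2.0.
example_stats = {"average": 2.0, "min": 2.0, "median": 2.0, "max": 2.0}
masked_example = mask_stats_rigorous(example_stats, [2.0])

# A field where only the maximum is a rare value.
mixed_values = [1.0] * 10 + [9.0]
mixed_stats = {"min": 1.0, "median": 1.0, "max": 9.0}
masked_mixed = mask_stats_rigorous(mixed_stats, mixed_values)
```

Option 1 is a single check per field but also suppresses statistics for fields where every value is frequent. Option 2 keeps more of the report intact at the cost of a lookup per statistic.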