WhiteRabbit is a small application that can be used to analyse the structure and contents of a database as preparation for designing an ETL. It comes with RabbitInAHat, an application for interactive design of an ETL to the OMOP Common Data Model with the help of the scan report generated by WhiteRabbit.
The combination of numeric statistics and value counts can reveal information about values that occur less frequently than the configured small cell count.
Observed behaviour
Let's say we have a field with 1000 records that is empty except for the one value of 2.0.
The value count correctly hides the value with a small cell count.
| LDEX | Frequency |
| --- | --- |
| | 999 |
| List truncated... | |
However, the field overview reveals that the one truncated value was 2, because every statistic equals this single value.
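| Table | Field | Type | Max length | N rows | N rows checked | Fraction empty | N unique values | Average | Min | 0,25 | Median | 0,75 | Max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| table | field | VARCHAR | 1 | 1000 | 1000 | 100,0% | 2 | 2,00 | 2,00 | 2,00 | 2,00 | 2,00 | 2,00 |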
Expected behaviour
Values with a small cell count should also be hidden from the numeric statistics. In this example, the numeric statistics should not be shown at all, as they reveal too much information.
Proposed solution
Two options:
Simple solution: if the number of unique values in a field is small (e.g. < 5 unique values), do not output the numeric statistics of that field.
Rigorous solution: hide a numeric statistic if its value occurs in the data with a frequency lower than the small cell count.
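
To make the difference between the two options concrete, here is a minimal Python sketch. It is not WhiteRabbit code: the function names (`numeric_stats`, `simple_suppression`, `rigorous_suppression`), the parameters `min_unique` and `min_cell_count`, and the reduced set of statistics are all assumptions made for illustration, and the input is assumed to be the list of non-empty numeric values of one field.

```python
from collections import Counter
from statistics import mean, median


def numeric_stats(values):
    """Compute a reduced set of summary statistics (hypothetical helper)."""
    return {
        "average": mean(values),
        "min": min(values),
        "median": median(values),
        "max": max(values),
    }


def simple_suppression(values, min_unique=5):
    """Simple option: withhold all statistics when the field has few unique values."""
    if len(set(values)) < min_unique:
        return None
    return numeric_stats(values)


def rigorous_suppression(values, min_cell_count=5):
    """Rigorous option: blank each statistic that equals a rare value in the data."""
    counts = Counter(values)
    stats = numeric_stats(values)
    return {
        # A statistic is hidden only if it matches an actual value
        # whose frequency is below the small cell count.
        name: (None if 0 < counts.get(stat, 0) < min_cell_count else stat)
        for name, stat in stats.items()
    }


# The field from this issue: a single non-empty value of 2.0.
print(simple_suppression([2.0]))
print(rigorous_suppression([2.0]))
```

For the field in this issue, both variants withhold every statistic. They differ on fields with many unique values: the simple option would still print a maximum that occurs only once, while the rigorous option hides that statistic and keeps the others.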