Just an idea - we might be able to remedy this by including the complete string values in the list of substrings and prioritizing longer substrings over shorter ones.
Or, if the known values represent more than some X% of the total number of distinct real values, just sample from the real known values instead? I.e. we only need to do synthesis if we don't have enough real values.
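A minimal sketch of how these two ideas could fit together (function and parameter names are hypothetical, not the synthesizer's actual API): complete values join the fragment pool and longer fragments are preferred, and when the known values already cover enough of the estimated distinct values we just sample from them directly.

```python
import random

def build_fragment_pool(known_values, min_len=2):
    """Collect substrings of the known (unsuppressed) values and include
    each complete value so it can be emitted verbatim."""
    pool = set(known_values)  # complete values go into the pool first
    for value in known_values:
        for length in range(min_len, len(value)):
            for start in range(len(value) - length + 1):
                pool.add(value[start:start + length])
    # Longer fragments first, so complete values beat short substrings.
    return sorted(pool, key=len, reverse=True)

def synthesize_string(known_values, distinct_count_estimate, coverage=0.8):
    """If the known values cover most of the estimated distinct values,
    sample one directly; otherwise fall back to fragment synthesis."""
    known_values = list(known_values)
    if len(known_values) / distinct_count_estimate >= coverage:
        return random.choice(known_values)
    pool = build_fragment_pool(known_values)
    # Simple length weighting: longer fragments are picked more often.
    weights = [len(fragment) for fragment in pool]
    return random.choices(pool, weights=weights, k=1)[0]
```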
This case should be captured anyway when we do the distinct value analysis - we only generate random substrings if (based on % of distinct values) the string values do not look like they are categorical. The problem is that we generate random strings from fragments without including longer strings, even when these are available from the distinct col analysis.
> Just an idea - we might be able to remedy this by including the complete string values in the list of substrings and prioritizing longer substrings over shorter ones.
I like this idea.
> This case should be captured anyway when we do the distinct value analysis - we only generate random substrings if (based on % of distinct values) the string values do not look like they are categorical. The problem is that we generate random strings from fragments without including longer strings, even when these are available from the distinct col analysis.
Currently, in order to determine categorical columns, we check that the ratio between the count of suppressed values and the number of distinct values is less than 10%. This relies on the assumption that every suppressed value is unique. For this particular case, that condition gives around 19% suppressed values. Computing the percentage differently, i.e. taking the number of suppressed rows as a fraction of the total number of rows in the table, gives less than 0.1% suppressed values for the same case.
I think that we can improve these approximations by computing the ratio between the median count for the distinct values and the number of distinct values.
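To make the comparison concrete, here is a small sketch of the three ratios being discussed (helper name and inputs are hypothetical; this is not the actual detection code):

```python
import statistics

def suppression_ratios(suppressed_value_count, value_counts, total_rows):
    """Contrast the categorical-detection ratios discussed above.

    value_counts: per-value row counts for the unsuppressed distinct values.
    suppressed_value_count: distinct values hidden by low-count suppression.
    """
    distinct = len(value_counts)

    # Current check: suppressed values relative to distinct values
    # (treats every suppressed value as unique); categorical if < 10%.
    vs_distinct = suppressed_value_count / distinct

    # Alternative: suppressed rows as a fraction of all rows in the table.
    suppressed_rows = total_rows - sum(value_counts)
    vs_rows = suppressed_rows / total_rows

    # Suggested refinement: median per-value count over the distinct count.
    median_over_distinct = statistics.median(value_counts) / distinct

    return vs_distinct, vs_rows, median_over_distinct
```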
I don't think normalising using the median of the counts will work as intended since the median of the suppressed counts is much lower than the median of the unsuppressed counts - see my comments here.
A better normalising factor might be half the low-count threshold; however, I'm not sure whether the low-count threshold is a constant or computed dynamically per query. (@sebastian?)
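One possible reading of this suggestion, sketched under the assumption that the low-count threshold is available as a single number (which, per the question above, may not hold if it is computed per query):

```python
def estimate_suppressed_distinct(suppressed_rows, low_count_threshold):
    """Rough estimate of how many distinct values were suppressed, assuming
    each suppressed value holds, on average, about half the low-count
    threshold's worth of rows. Purely illustrative, not Aircloak's definition."""
    return suppressed_rows / (low_count_threshold / 2)
```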
> Computing the percentage differently, i.e. taking the number of suppressed rows as a fraction of the total number of rows in the table, gives less than 0.1% suppressed values for the same case.
I think suppressed rows vs non-suppressed rows is the interesting factor here (rather than the fraction of suppressed distinct values, which requires some additional guesswork). If, as in this case, only 0.1% of the rows are suppressed, then randomly picking from the known unsuppressed values will in all likelihood yield a value one could also have obtained by picking from the un-anonymized dataset!
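A minimal sketch of that decision, assuming hypothetical inputs (a per-value row-count mapping and the table's total row count):

```python
import random

def sample_categorical(value_counts, total_rows, max_suppressed_fraction=0.01):
    """Sample a value directly from the known data when suppression only
    affects a small fraction of rows (e.g. the 0.1% case above)."""
    suppressed_rows = total_rows - sum(value_counts.values())
    if suppressed_rows / total_rows <= max_suppressed_fraction:
        values = list(value_counts)
        weights = list(value_counts.values())
        # Weighting by row count mirrors the original distribution closely.
        return random.choices(values, weights=weights, k=1)[0]
    return None  # too much suppression: fall back to fragment synthesis
```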
Closed with #291.
See the following two images. The first one is the anonymized result from Aircloak:
Clearly, a high number of job class titles exist that can be shown. All the same, the synthesizer improvises its own values when generating a sample table: