diffix / explorer

Tool to automatically explore and generate stats on data anonymized using Diffix
MIT License

Making up sample values where known values exist #264

Closed by sebastian 4 years ago

sebastian commented 4 years ago

See the following two images. The first one is the anonymized result from Aircloak:

[Screenshot: Screen Shot 2020-08-10 at 14 24 53]

Clearly, a high number of job class titles exist that can be shown. Nevertheless, the synthesizer improvises its own values when generating a sample table:

[Screenshot: Screen Shot 2020-08-10 at 14 25 03]

dandanlen commented 4 years ago

Just an idea - we might be able to remedy this by including the complete string values in the list of substrings, and prioritizing longer substrings over shorter ones.
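The idea above could be sketched roughly as follows. This is a minimal illustration, not the explorer's actual implementation; the function names (`build_fragments`, `sample_fragment`) and the length-proportional weighting scheme are assumptions made for the example.

```python
import random

def build_fragments(values, min_len=2):
    # Collect substrings of the known values, always including the
    # complete strings themselves (the key point of the suggestion).
    fragments = set(values)
    for v in values:
        for i in range(len(v)):
            for j in range(i + min_len, len(v) + 1):
                fragments.add(v[i:j])
    return sorted(fragments)

def sample_fragment(fragments, rng=random):
    # Weight each fragment by its length, so longer substrings - and
    # in particular the complete values - are picked more often.
    weights = [len(f) for f in fragments]
    return rng.choices(fragments, weights=weights, k=1)[0]
```

With this weighting, a complete value like "engineer" is eight times more likely to be drawn than any two-character fragment.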

sebastian commented 4 years ago

Or, if the known values represent more than some X% of the total number of distinct real values, then just sample from the real known values instead? I.e. we only need to do synthesis if we don't have enough real values.
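A minimal sketch of this fallback rule, with an illustrative X of 80% (the threshold value and the function name `pick_sampler` are assumptions, not anything from the explorer codebase):

```python
import random

def pick_sampler(known_values, total_distinct, threshold=0.8, rng=random):
    # If the known (unsuppressed) values cover at least `threshold`
    # of the distinct real values, sample from them directly;
    # otherwise return None to signal that synthesis is still needed.
    known = list(known_values)
    if total_distinct > 0 and len(known) / total_distinct >= threshold:
        return lambda: rng.choice(known)
    return None  # fall back to substring synthesis
```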

dandanlen commented 4 years ago

This case should be captured anyway by the distinct value analysis: we only generate random substrings if, based on the percentage of distinct values, the string values do not look categorical. The problem is that we generate random strings from fragments without including the longer strings, even when these are available from the distinct column analysis.

AndreiBozantan commented 4 years ago

> Just an idea - we might be able to remedy this by including the complete string values in the list of substrings, and prioritizing longer substrings over shorter ones.

I like this idea.

AndreiBozantan commented 4 years ago

> This case should be captured anyway by the distinct value analysis: we only generate random substrings if, based on the percentage of distinct values, the string values do not look categorical. The problem is that we generate random strings from fragments without including the longer strings, even when these are available from the distinct column analysis.

Currently, in order to determine categorical columns, we check that the ratio between the count of suppressed values and the number of distinct values is less than 10%. This relies on the assumption that every suppressed value is unique. In this particular case, that condition tells us that around 19% of the values are suppressed. Computing the percentage differently, i.e. as the ratio of suppressed values to the total number of rows in the table, we instead obtain that less than 0.1% of the values are suppressed in this case.

I think that we can improve these approximations by computing the ratio between the median count for the distinct values and the number of distinct values.
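The three candidate ratios discussed in this comment can be written out side by side. This is an illustrative sketch only; the function and key names are invented for the example, and `distinct_counts` is assumed to be the per-value row counts of the unsuppressed distinct values:

```python
from statistics import median

def suppression_metrics(distinct_counts, suppressed_distinct,
                        suppressed_rows, total_rows):
    n_distinct = len(distinct_counts)
    return {
        # current check: suppressed distinct values vs distinct values
        # (assumes every suppressed value is unique)
        "suppressed_per_distinct": suppressed_distinct / n_distinct,
        # alternative: suppressed rows vs total rows
        "suppressed_per_rows": suppressed_rows / total_rows,
        # proposed: median per-value count vs number of distinct values
        "median_per_distinct": median(distinct_counts) / n_distinct,
    }
```

The first two metrics can diverge sharply (19% vs 0.1% in the case above) because the first weights each suppressed value equally while the second weights by row count.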

dandanlen commented 4 years ago

I don't think normalising using the median of the counts will work as intended since the median of the suppressed counts is much lower than the median of the unsuppressed counts - see my comments here.

A better normalising factor might be half the low-count threshold; however, I'm not sure whether the low-count threshold is a constant or computed dynamically per query. (@sebastian?)

sebastian commented 4 years ago

> When using a different way to compute the percent of suppressed values, i.e. the number of suppressed values as a ratio of the number of total rows in the table, we obtain that less than 0.1% of values are suppressed, for this particular case.

I think suppressed rows vs non-suppressed rows is the interesting factor here (rather than the fraction of suppressed distinct values, which requires some additional guesswork). If, like in this case, only 0.1% of the rows are suppressed, then randomly picking from the known unsuppressed values will in all likelihood yield a value one would also get when picking from the un-anonymized dataset!
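This row-based criterion could be sketched as follows. The 1% cutoff and the name `sample_known_value` are assumptions chosen for illustration; drawing weighted by row count mimics picking a random row from the un-anonymized table:

```python
import random

def sample_known_value(value_counts, suppressed_rows,
                       max_suppressed_fraction=0.01, rng=random):
    # If only a tiny fraction of rows was suppressed, draw from the
    # known values weighted by their row counts; otherwise signal
    # that string synthesis is still needed.
    total_rows = sum(value_counts.values()) + suppressed_rows
    if total_rows == 0 or suppressed_rows / total_rows > max_suppressed_fraction:
        return None  # fall back to string synthesis
    values = list(value_counts)
    weights = [value_counts[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]
```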

AndreiBozantan commented 4 years ago

Closed with #291.