Measuring coverage for various anonymization schemes

Coverage ranges from 0 to 1. It is computed as the mean of the per-column coverage values, taken over all columns. The full set of columns is known from the raw dataset.

If a given column does not even exist in the anonymized dataset (i.e. it was removed by the anonymization), then the coverage value for that column is 0.

If a given continuous column exists in the anonymized dataset but range queries cannot be made over it, then the coverage value for that column is 0. If range queries can be made over the column, then the coverage value is 1. Note that range queries can be made for both Aircloak and raw datasets. We may have to establish a different test for range-query support for each anonymization scheme.

For enumerative columns that exist in the anonymized dataset, we compute coverage the same way we already do: the ratio of the number of distinct column values in the anonymized dataset to the number of distinct column values in the raw dataset that have more than one user.
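
The per-column rules above could be sketched roughly as follows. This is only an illustration, not the existing gda-score code: `ColumnInfo`, its field names, `column_coverage`, and `dataset_coverage` are all hypothetical. Two behaviours are assumptions the rules above do not state: a ratio above 1 is clamped to 1, and an enumerative column with no multi-user values in the raw data scores 0.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ColumnInfo:
    """Hypothetical per-column summary; all field names are illustrative."""
    kind: str                          # "continuous" or "enumerative"
    in_anon: bool                      # column exists in the anonymized dataset
    range_queryable: bool = False      # continuous only: range queries possible
    anon_distinct: int = 0             # enumerative: distinct values in anonymized data
    raw_distinct_multi_user: int = 0   # enumerative: raw distinct values with >1 user


def column_coverage(col: ColumnInfo) -> float:
    # A column removed by the anonymization contributes 0.
    if not col.in_anon:
        return 0.0
    # Continuous columns are all-or-nothing, based on range-query support.
    if col.kind == "continuous":
        return 1.0 if col.range_queryable else 0.0
    # Assumption: with no multi-user values in the raw data, score 0.
    if col.raw_distinct_multi_user == 0:
        return 0.0
    # Assumption: clamp so the per-column value stays within [0, 1].
    return min(1.0, col.anon_distinct / col.raw_distinct_multi_user)


def dataset_coverage(cols: list[ColumnInfo]) -> float:
    # Overall coverage is the mean over all columns known from the raw data.
    return mean(column_coverage(c) for c in cols)


example = [
    ColumnInfo(kind="enumerative", in_anon=False),
    ColumnInfo(kind="continuous", in_anon=True, range_queryable=True),
    ColumnInfo(kind="enumerative", in_anon=True,
               anon_distinct=3, raw_distinct_multi_user=4),
]
print(dataset_coverage(example))
```

In this example the three columns score 0, 1, and 0.75, so the overall coverage is their mean.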

Note that for some differential privacy anonymization schemes, you simply won't be able to make additional queries at some point. When this happens, any remaining unqueried columns will have a coverage value of 0. (I'll make a new issue for this when we have such an anonymization scheme in place.)
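
One way the exhausted-budget case could be handled, as a rough sketch: keep the scores for the columns that were queried before the scheme stopped answering, and treat every other raw column as 0 when averaging. The function name and the dictionary layout are hypothetical, and the sketch says nothing about how a particular scheme signals that no more queries are allowed.

```python
def coverage_with_exhausted_budget(raw_columns: list[str],
                                   scored: dict[str, float]) -> float:
    """Average coverage over all raw columns.

    `scored` holds per-column coverage only for columns queried before
    the query budget ran out; any column missing from it counts as 0.
    """
    if not raw_columns:
        raise ValueError("raw_columns must not be empty")
    return sum(scored.get(col, 0.0) for col in raw_columns) / len(raw_columns)


# Four raw columns, but only two could be queried before the budget ran out.
raw_columns = ["age", "zip", "gender", "income"]
scored = {"age": 1.0, "zip": 0.5}
print(coverage_with_exhausted_budget(raw_columns, scored))  # 0.375
```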
