gda-score / code

Tools for generating General Data Anonymity Scores (www.gda-score.org)
MIT License

Measure and store coverage information for raw table #14

Open yoid2000 opened 6 years ago

yoid2000 commented 6 years ago

As discussed in issue #15, for the purpose of computing coverage there are now two types of columns: continuous and enumerative. The contribution to the coverage value will be computed differently for each type. Specifically, we'll treat enumerative columns much as we have been, but separately from the accuracy measures. We'll also pre-compute the information we need from the raw table and store it in a separate table on the db server machines (db001.gda-score.org and others when they exist).

For enumerative columns, I'd like to pre-compute the number of column value combinations that have more than one distinct user in the raw table. Then for each anonymization method, we'll measure what fraction of these can be viewed in the anonymized table.
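To make the intended fraction concrete, here is a minimal sketch. It assumes that "can be viewed" means the value combination shows up in the output of the anonymized table. The issue does not define this precisely, and the function name `enumerative_coverage` is hypothetical.

```python
def enumerative_coverage(raw_multi_user_combos, anonymized_combos):
    """Fraction of multi-user value combinations from the raw table
    that also appear in the anonymized table's output.

    Both arguments are iterables of value tuples for the same columns.
    Returns None when the raw table has no multi-user combinations,
    since the fraction is undefined in that case (an assumption).
    """
    raw = set(raw_multi_user_combos)
    if not raw:
        return None
    visible = raw & set(anonymized_combos)
    return len(visible) / len(raw)
```

Combinations that appear only in the anonymized output are ignored here, because the denominator is fixed by the raw table.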

For any table tab, I want to create another table tab_cov which contains the enumerative coverage information for that table. Note that continuous columns can be completely ignored in the following.

tab_cov has the following columns:

  1. num_columns: This is the number of columns that comprise the information in the row.
  2. col_names: This is a string that contains the names of all of the columns for the row. Specifically, the string is formatted as ,col1,col2,col3... (each column name is prefixed with a comma).
  3. num_values: The number of distinct value combinations for the corresponding columns.
  4. num_single_uid: The number of value combinations for which there is one distinct user.
  5. num_multiple_uid: The number of value combinations for which there is more than one distinct user.
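As a rough illustration of how one tab_cov row could be derived, the sketch below works on rows already fetched into Python dicts. The user-id column name `uid`, the helper names, and the in-memory approach are all assumptions; a real implementation would more likely push the grouping into SQL on the db server.

```python
from collections import defaultdict


def format_col_names(columns):
    """Build the col_names string: every name is prefixed with a comma."""
    return "".join("," + name for name in columns)


def coverage_row(rows, columns, uid_col="uid"):
    """Compute the tab_cov measures for one set of enumerative columns.

    rows: iterable of dicts, one per raw-table row.
    uid_col: name of the user-id column (assumed to be "uid").
    """
    # Map each distinct value combination to the set of users that have it.
    users_per_combo = defaultdict(set)
    for row in rows:
        combo = tuple(row[name] for name in columns)
        users_per_combo[combo].add(row[uid_col])

    num_single = sum(1 for users in users_per_combo.values() if len(users) == 1)
    num_multiple = sum(1 for users in users_per_combo.values() if len(users) > 1)
    return {
        "num_columns": len(columns),
        "col_names": format_col_names(columns),
        "num_values": len(users_per_combo),
        "num_single_uid": num_single,
        "num_multiple_uid": num_multiple,
    }
```

With this definition, num_single_uid and num_multiple_uid always sum to num_values, which is a cheap sanity check on the stored rows.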

Note that, unlike coverage measures up to now, we should compute value combinations for more than two enumerative columns. You can do it like this:

First, compute the above measures for single columns.

Then, for all single columns where more than 1% of the values have multiple distinct users, generate pairs of columns and compute the above measures. Then iterate: for pairs where more than 1% of the value combinations have multiple distinct users, generate groups of three columns, etc.

I would also say that we don't need more than 100 instances of any given combination size. In other words, we don't need more than 100 single columns, 100 pairs of columns, 100 groups of 3 columns, etc.
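The iteration could be sketched roughly as follows. Several details are assumptions, since the issue leaves them open: the 1% test is read as num_multiple_uid / num_values > 0.01, a larger group is formed by adding one qualifying single column to a qualifying combination, the cap of 100 keeps the first 100 candidates in generation order, and the user-id column is called `uid`. The function names are hypothetical.

```python
from collections import defaultdict


def multi_uid_fraction(rows, columns, uid_col="uid"):
    """Fraction of distinct value combinations with more than one user."""
    users = defaultdict(set)
    for row in rows:
        users[tuple(row[c] for c in columns)].add(row[uid_col])
    if not users:
        return 0.0
    multiple = sum(1 for u in users.values() if len(u) > 1)
    return multiple / len(users)


def grow_combinations(rows, enum_columns, threshold=0.01,
                      max_per_size=100, uid_col="uid"):
    """Return {size: [column tuples to measure at that size]}."""
    order = {c: i for i, c in enumerate(enum_columns)}
    current = [(c,) for c in enum_columns][:max_per_size]
    measured = {}
    good_singles = []
    size = 1
    while current:
        measured[size] = current
        # Keep only combinations that pass the 1% test.
        good = [combo for combo in current
                if multi_uid_fraction(rows, combo, uid_col) > threshold]
        if size == 1:
            good_singles = [combo[0] for combo in good]
        # Assumption: extend each passing combination by one passing column.
        candidates, seen = [], set()
        for combo in good:
            for col in good_singles:
                if col in combo:
                    continue
                bigger = tuple(sorted(combo + (col,), key=order.get))
                if bigger not in seen:
                    seen.add(bigger)
                    candidates.append(bigger)
        current = candidates[:max_per_size]
        size += 1
    return measured
```

The loop ends on its own once no combination passes the threshold or no qualifying column is left to add, so no explicit maximum group size is needed.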

As with #12, please produce a file with SQL CREATE and INSERT commands for the tables.
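One possible shape for the generated file, as a sketch only: the column types (INTEGER, TEXT) are assumptions, since the issue does not specify them, and `coverage_sql` is a hypothetical helper. The table name is interpolated directly, so it is assumed to be a trusted identifier.

```python
COV_COLUMNS = ["num_columns", "col_names", "num_values",
               "num_single_uid", "num_multiple_uid"]


def sql_literal(value):
    """Render an int as-is and anything else as a quoted SQL string."""
    if isinstance(value, int):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def coverage_sql(table, cov_rows):
    """Return the CREATE and INSERT commands for <table>_cov as one string.

    cov_rows: iterable of dicts keyed by the names in COV_COLUMNS.
    """
    cov_table = f"{table}_cov"
    lines = [
        f"CREATE TABLE {cov_table} (",
        "    num_columns INTEGER,",       # assumed type
        "    col_names TEXT,",            # assumed type
        "    num_values INTEGER,",        # assumed type
        "    num_single_uid INTEGER,",    # assumed type
        "    num_multiple_uid INTEGER",   # assumed type
        ");",
    ]
    col_list = ", ".join(COV_COLUMNS)
    for row in cov_rows:
        values = ", ".join(sql_literal(row[c]) for c in COV_COLUMNS)
        lines.append(f"INSERT INTO {cov_table} ({col_list}) VALUES ({values});")
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file gives one CREATE statement followed by one INSERT per measured column combination.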

yoid2000 commented 6 years ago

@srnb I updated the issue. It is ready now.