diffix / explorer

Tool to automatically explore and generate stats on data anonymized using Diffix
MIT License

Don't make up string values when real values are known #217

Closed: sebastian closed this issue 4 years ago

sebastian commented 4 years ago

Take the gda_banking dataset on Attack:

The analysis results contain values such as:

distinct.values:

[
  {
    "count": 4500,
    "value": "OWNER"
  },
  {
    "count": 869,
    "value": "DISPONENT"
  }
]

exploration_info:

{
  "column": "type",
  "columnType": "text",
  "dataSource": "gda_banking",
  "table": "disp"
}

is_email:

{
  "isEmail": false
}

sample_values:

[
  "DISP",
  "DISP",
  "DISPNEN",
  "DISONE",
  "DISONE",
  "DISONE",
  "DISPNEN",
  "DIS",
  "DISPNEN",
  "DISPNEN",
  "DISONE",
  "DISONE",
  "DISPNEN",
  "DIS",
  "DISONE",
  "DIS",
  "DISONE",
  "DISP",
  "DISPNEN",
  "DISONE",
  "DISONE",
  "DISONE",
  "OWNONE",
  "DISP",
  "DISPNEN",
  "DISP",
  "DISONE",
  "DISONE",
  "DISPNEN",
  "DISONE"
]

text.length.counts:

{
  "nonSuppressedCount": 5369,
  "nonSuppressedNonNullCount": 5369,
  "nonSuppressedRows": 2,
  "nullCount": 0,
  "nullRows": 0,
  "suppressedCount": 0,
  "suppressedCountRatio": 0,
  "suppressedRowRatio": 0,
  "suppressedRows": 0,
  "totalCount": 5369,
  "totalRows": 2
}

It is clear from the result that all the values are known: suppressedCount is 0, and the two distinct values account for all 5369 rows (4500 + 869). All the same, the sample values are made up. In cases like this, where we know a significant portion of the data, it makes much more sense to pick samples from the known distribution than to make up values.

Otherwise we end up with needlessly nonsensical sample tables like this:

[Screenshot: Screen Shot 2020-07-22 at 15 19 38]
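
To make the proposal concrete, here is a minimal Python sketch of drawing sample values from the known distribution, weighted by the reported counts. It is not how the explorer actually does it. The function name `sample_from_distinct`, the `max_suppressed_ratio` threshold and its default of 0.1 are all assumptions made for illustration; only the `count`/`value` fields and `suppressedCountRatio` come from the output above.

```python
import random


def sample_from_distinct(distinct_values, suppressed_count_ratio, n=30,
                         max_suppressed_ratio=0.1, rng=None):
    """Draw n samples from the known distinct values, weighted by count.

    Returns None when too much of the column is suppressed (or nothing is
    known), so the caller can fall back to generating synthetic values.
    The 0.1 threshold is a hypothetical default, not a value from the project.
    """
    if not distinct_values or suppressed_count_ratio > max_suppressed_ratio:
        return None
    if rng is None:
        rng = random.Random()
    values = [d["value"] for d in distinct_values]
    weights = [d["count"] for d in distinct_values]
    # Weighted sampling with replacement preserves the observed proportions.
    return rng.choices(values, weights=weights, k=n)


# Values taken from the distinct.values and text.length.counts output above.
distinct = [
    {"count": 4500, "value": "OWNER"},
    {"count": 869, "value": "DISPONENT"},
]
samples = sample_from_distinct(distinct, suppressed_count_ratio=0,
                               rng=random.Random(0))
```

With this input every sample is either "OWNER" or "DISPONENT", in roughly the 4500:869 proportion, instead of fragments such as "DISPNEN" or "OWNONE".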

AndreiBozantan commented 4 years ago

Closed with #225.