Don't make up string values when real values are known

Take the gda_banking dataset on Attack:

The analysis results contain values such as:

distinct.values:

[
  {
    "count": 4500,
    "value": "OWNER"
  },
  {
    "count": 869,
    "value": "DISPONENT"
  }
]

exploration_info:

{
  "column": "type",
  "columnType": "text",
  "dataSource": "gda_banking",
  "table": "disp"
}

is_email:

{
  "isEmail": false
}

sample_values:

[
  "DISP",
  "DISP",
  "DISPNEN",
  "DISONE",
  "DISONE",
  "DISONE",
  "DISPNEN",
  "DIS",
  "DISPNEN",
  "DISPNEN",
  "DISONE",
  "DISONE",
  "DISPNEN",
  "DIS",
  "DISONE",
  "DIS",
  "DISONE",
  "DISP",
  "DISPNEN",
  "DISONE",
  "DISONE",
  "DISONE",
  "OWNONE",
  "DISP",
  "DISPNEN",
  "DISP",
  "DISONE",
  "DISONE",
  "DISPNEN",
  "DISONE"
]

text.length.counts:

{
  "nonSuppressedCount": 5369,
  "nonSuppressedNonNullCount": 5369,
  "nonSuppressedRows": 2,
  "nullCount": 0,
  "nullRows": 0,
  "suppressedCount": 0,
  "suppressedCountRatio": 0,
  "suppressedRowRatio": 0,
  "suppressedRows": 0,
  "totalCount": 5369,
  "totalRows": 2
}

It's clear from the result that all the values are known: the two distinct values account for 4500 + 869 = 5369 rows, which matches totalCount, and nothing is suppressed. Even so, the sample values are made up. In cases like this, where we know a significant portion of the data, it makes much more sense to pick samples from the known distribution than to make up values.

Otherwise we end up with needlessly nonsensical sample tables like this:

[Screenshot: Screen Shot 2020-07-22 at 15 19 38]
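To make the suggestion concrete, here is a minimal sketch in Python (not the explorer's actual code) of how samples could be drawn from the known distribution. The function name `sample_from_distinct` and the `coverage_threshold` parameter are hypothetical, and the 0.5 default is an arbitrary assumption about what counts as "a significant portion". The input shapes mirror the `distinct.values` and `totalCount` results shown above.

```python
import random


def sample_from_distinct(distinct_values, total_count, n,
                         coverage_threshold=0.5, rng=None):
    """Draw n samples from the known distinct values, weighted by count.

    Returns None when the known values cover too little of the column,
    so the caller can fall back to generating synthetic values.
    The name and the threshold are illustrative assumptions.
    """
    rng = rng or random.Random()
    if not distinct_values or total_count <= 0:
        return None

    known = sum(d["count"] for d in distinct_values)
    if known / total_count < coverage_threshold:
        return None  # not enough of the data is known

    values = [d["value"] for d in distinct_values]
    weights = [d["count"] for d in distinct_values]
    return rng.choices(values, weights=weights, k=n)


# Data taken from the analysis result in this issue.
distinct_values = [
    {"count": 4500, "value": "OWNER"},
    {"count": 869, "value": "DISPONENT"},
]
samples = sample_from_distinct(distinct_values, total_count=5369, n=30,
                               rng=random.Random(0))
print(samples)
```

With the data from this issue the coverage is 5369 / 5369, so every sample is either "OWNER" or "DISPONENT", roughly in a 5:1 ratio. Synthetic generation would only kick in when the known values cover less than the chosen threshold.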

diffix/explorer issue #217