diffix / explorer

Tool to automatically explore and generate stats on data anonymized using Diffix
MIT License

Explorer synthetic data should show the right amount of decimals for Real columns #373

Closed: fjab closed this issue 3 years ago

fjab commented 3 years ago
[Image: Screenshot 2020-11-13 at 14.34.44]

See attached screenshot – this is from the Clinic table on demo.aircloak. fee_billed only has up to two decimals in the dataset, so showing it down to 14 decimals is not an accurate representation of the data.

dandanlen commented 3 years ago

Since all data is bucketed, I don't think we can know for certain what the maximum precision of decimal values is. Obviously in this case it's a fee, so it makes sense to round to two decimals, but the values are generated by interpolating within buckets. We could approach it like "interpolate with a resolution equal to the next-smaller bucket size", so in this case, if the bucket size for the query was 0.1, we would interpolate in multiples of 0.05. I can look at this; it might be feasible in the time remaining, but I'm not sure.
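To make the idea concrete, here is a rough Python sketch of interpolating inside a bucket on a fixed grid rather than at full float precision. It is not the explorer's actual code: the function name `interpolate_in_bucket` and its parameters are made up for illustration, and the caller is assumed to supply the step (for example 0.05 for a bucket size of 0.1).

```python
import random
from decimal import Decimal


def interpolate_in_bucket(lower, bucket_size, step, rng=random):
    """Return a value in [lower, lower + bucket_size) that is lower plus a multiple of step.

    Hypothetical helper: `step` stands for the "next-smaller bucket size"
    from the comment above and is chosen by the caller.
    """
    # Go through str() so 0.1 becomes Decimal('0.1') rather than the binary float expansion.
    lower, bucket_size, step = (Decimal(str(x)) for x in (lower, bucket_size, step))
    # Number of grid points that fit in the bucket (at least one: the lower bound itself).
    slots = max(1, int(bucket_size / step))
    return lower + step * rng.randrange(slots)


# Bucket [12.3, 12.4) with step 0.05 can only yield 12.3 or 12.35.
value = interpolate_in_bucket(12.3, 0.1, 0.05)
```

Using `Decimal` keeps the generated values free of trailing float noise, which is what produces the 14-decimal output in the first place.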

In the general case, just because we can only extract information up to a resolution of 0.01 doesn't mean there are no values with a greater resolution... So it's not clear if the above is always the best approach.

fjab commented 3 years ago

Hm, that makes sense. Not trivial. However, we don't even need to bucket to get anything out in this case. I can actually output fee_billed directly, and many values make it through the anonymizer. So another simple solution would be to assume that the highest number of decimals among the values that pass the anonymizer is a good value (?).
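Roughly what I have in mind, as a Python sketch. The names `decimal_places` and `max_decimals` are invented here, and the input is assumed to be the list of raw fee_billed values that came back from an un-bucketed query:

```python
from decimal import Decimal


def decimal_places(value):
    """Count the significant digits after the decimal point of a finite number."""
    # normalize() strips trailing zeros, so 3.10 counts as one decimal place.
    d = Decimal(str(value)).normalize()
    return max(0, -d.as_tuple().exponent)


def max_decimals(values):
    """Highest number of decimal places seen among the returned values (0 if none)."""
    return max((decimal_places(v) for v in values), default=0)


# Illustrative values only, not real data from the Clinic table.
places = max_decimals([12.5, 3.25, 100.0])
```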

sebastian commented 3 years ago

Just discussed this with @AndreiBozantan. It's probably a safe assumption that fewer decimal places are actually needed. I.e. default to two decimal places. If there are enough sample values that came from the cloak and it's possible to derive a higher number of decimal places from those samples, then we can deviate from the default.
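A minimal Python sketch of that rule, for discussion only. The function name `choose_decimals` and the `min_samples` threshold of 10 are placeholders; what counts as "enough" samples is not decided here.

```python
from decimal import Decimal


def choose_decimals(samples, default=2, min_samples=10):
    """Pick the number of decimals to display for a real column.

    `samples` are assumed to be un-bucketed values returned by the cloak.
    `min_samples` is a hypothetical threshold for "enough sample values".
    """
    if len(samples) < min_samples:
        # Too little evidence: stay with the default.
        return default
    # Decimal places per sample, ignoring trailing zeros.
    observed = max(
        max(0, -Decimal(str(v)).normalize().as_tuple().exponent) for v in samples
    )
    # Only deviate upwards, when the samples show more decimals than the default.
    return max(default, observed)
```

With three samples the default of two decimals is kept regardless of their precision; with ten or more samples such as 0.125, the result becomes three.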

AndreiBozantan commented 3 years ago

Fixed with #379