fjab closed 3 years ago
Since all data is bucketed, I'm not sure there's a way for us to know what the maximum accuracy is for decimal values. Obviously in this case it's a fee, so it makes sense to round to two decimals, but the values are generated by interpolating within buckets. We could approach it like "interpolate with accuracy equal to the next-smaller bucket size", so in this case, if the bucket size for the query was 0.1, we would interpolate in multiples of 0.05. I can look at this. It might be feasible in the time remaining, but I'm not sure.
In the general case, just because we can only extract information up to a resolution of 0.01 doesn't mean there are no values with a greater resolution... So it's not clear if the above is always the best approach.
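For illustration, the "next-smaller bucket size" idea could look roughly like the sketch below. It assumes bucket sizes follow a 1-2-5 series (..., 0.05, 0.1, 0.2, 0.5, 1, ...), which is consistent with the 0.1 to 0.05 example above but is otherwise an assumption; `next_smaller_bucket` and `round_to_step` are hypothetical names, not functions from the project.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def next_smaller_bucket(size: Decimal) -> Decimal:
    """Return the next-smaller size in an ASSUMED 1-2-5 bucket series."""
    _, digits, exponent = size.normalize().as_tuple()
    if len(digits) != 1 or digits[0] not in (1, 2, 5):
        raise ValueError(f"{size} is not in the assumed 1-2-5 series")
    leading = digits[0]
    if leading == 1:
        # 1 * 10^e  ->  5 * 10^(e-1), e.g. 0.1 -> 0.05
        return Decimal(5).scaleb(exponent - 1)
    if leading == 2:
        # 2 * 10^e  ->  1 * 10^e
        return Decimal(1).scaleb(exponent)
    # 5 * 10^e  ->  2 * 10^e
    return Decimal(2).scaleb(exponent)

def round_to_step(value: Decimal, step: Decimal) -> Decimal:
    """Round an interpolated value to the nearest multiple of `step`."""
    multiples = (value / step).to_integral_value(rounding=ROUND_HALF_EVEN)
    return multiples * step

# Hypothetical interpolated value from a query bucketed at 0.1
step = next_smaller_bucket(Decimal("0.1"))
rounded = round_to_step(Decimal("12.3381"), step)
```

As noted above, this only limits the displayed resolution to what the query can support; it says nothing about whether the underlying data has finer resolution.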
Hm, that makes sense. Not trivial. However, we don't even need to bucket to get anything out in this case. I can actually output fee_billed directly, and many values bypass the anonymizer. So another simple solution would be to assume the highest number of decimals that passes the anonymizer is a good value (?).
Just discussed this with @AndreiBozantan. It's probably a safe assumption that fewer decimal places are actually needed. I.e. default to two decimal places. If there are enough sample values that came from the cloak and it's possible to derive a higher number of decimal places from those samples, then we can deviate from the default.
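A minimal sketch of that rule, under stated assumptions: the default is two decimal places, and a higher precision is used only when enough sample values are available. The function name `infer_decimal_places` and the `min_samples` threshold of 10 are hypothetical and chosen for illustration; this is not necessarily how #379 implements it.

```python
from decimal import Decimal
from typing import Sequence

def infer_decimal_places(
    samples: Sequence[Decimal],
    default: int = 2,       # default precision discussed above
    min_samples: int = 10,  # ASSUMED threshold for "enough sample values"
) -> int:
    """Pick a display precision, deviating from the default only upwards."""
    if len(samples) < min_samples:
        return default
    observed = 0
    for value in samples:
        # normalize() strips trailing zeros, so 12.50 counts as one decimal
        exponent = value.normalize().as_tuple().exponent
        observed = max(observed, -exponent if exponent < 0 else 0)
    return max(default, observed)

# Hypothetical sample values that came back unaltered
samples = [Decimal("12.125")] * 10
places = infer_decimal_places(samples)
formatted = f"{Decimal('48.033333333333'):.{places}f}"
```

With fewer than `min_samples` values, or with samples that never exceed two decimals, the function falls back to the default.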
Fixed with #379
See attached screenshot – this is from the Clinic table on demo.aircloak. fee_billed only has up to two decimals in the dataset, so showing it down to 14 decimals is not an accurate representation of the data.