diffix / explorer

Tool to automatically explore and generate stats on data anonymized using Diffix
MIT License

Reduce accuracy in real values when noise is high #356

Closed sebastian closed 3 years ago

sebastian commented 3 years ago

You produce real values such as "Maximum 309.68" (reported by Felix; I don't know exactly where it's from, but it's the Clinic DB for sure). Given the amount of noise we introduce, the high resolution is misleading.

How about rounding such values to an accuracy level the noise would allow?
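A minimal sketch of what such rounding could look like, assuming an estimate of the noise standard deviation is available for the value. The function name `round_to_noise` and the rule used (keep digits down to the decimal position of the noise's leading digit) are hypothetical illustrations, not how Explorer or Aircloak behave.

```python
import math


def round_to_noise(value: float, noise_sd: float) -> float:
    """Round `value` to the decimal position of the leading digit of `noise_sd`.

    Hypothetical helper: `noise_sd` is an assumed estimate of the noise
    standard deviation. If no usable estimate exists, the value is returned
    unchanged.
    """
    if not math.isfinite(noise_sd) or noise_sd <= 0:
        return value
    # Decimal exponent of the noise's leading digit,
    # e.g. 5.0 -> 0, 25.0 -> 1, 0.3 -> -1.
    exponent = math.floor(math.log10(noise_sd))
    # A positive exponent rounds to tens, hundreds, ...;
    # a negative exponent keeps that many decimal places.
    return round(value, -exponent)


print(round_to_noise(309.68, 5.0))
print(round_to_noise(309.68, 0.3))
```

With a noise estimate of 5.0, "Maximum 309.68" would be reported without decimals. With an estimate of 0.3, one decimal place would be kept.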

fjab commented 3 years ago

I think that would be great. Having said that, it's the same issue with Aircloak proper, no?

dandanlen commented 3 years ago

The min and max can come from one of two different sources

Let me know which column this was and I can take a closer look.

fjab commented 3 years ago

I saw this in the Clinic dataset, billings table, column fee_billed.

Although now that I think about it, I have no idea whether analysts would like that rounding or not. To me as an experimental physicist ;) it sounds wrong, but no idea how a data scientist sees that. Maybe we should check with them instead of just making assumptions.

dandanlen commented 3 years ago

When I run select max("fee_billed") from billings I get 314.81, so it looks like this mirrors the behavior in Aircloak.

Also, I didn't think there was a way to get at the noise value for min and max?

sebastian commented 3 years ago

> Also, I didn't think there was a way to get at the noise value for min and max?

You are right. We don't have a good estimate of that. So maybe this issue is moot after all.