diffix / reference

Reference implementation for the Open Diffix anonymization mechanism.
https://www.open-diffix.org

Add simpler count distinct AID variant to anonymizer #364

edongashi closed 2 years ago

edongashi commented 2 years ago

I want to avoid the complex count distinct version in favor of a simple, specialized count(distinct AID) function. Because I find the flattening code hard to understand, I need feedback on this particular function before moving on to the histogram aggregate. What you see here is my best guess, which is probably wrong.

I also need the single AID version of isLowCount.

cristianberneanu commented 2 years ago

> I also need the single AID version of isLowCount.

This made sense initially, but, after thinking about it for a while, I don't think it is the correct approach to the problem. Each histogram bar is a separate bucket, which we should treat like regular buckets. Meaning that we gather all associated AIDVs and use them together for applying the LCF.
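
To make the "gather all associated AIDVs and use them together" step concrete, here is a minimal sketch. It assumes a fixed threshold rather than the anonymizer's actual noisy one, and the names `gather_aid_sets` and `is_low_count` are hypothetical Python stand-ins, not the repository's functions.

```python
def gather_aid_sets(rows, aid_columns):
    """Collect the distinct AID values seen in one bucket, per AID column."""
    aid_sets = {column: set() for column in aid_columns}
    for row in rows:
        for column in aid_columns:
            value = row.get(column)
            if value is not None:  # rows without an AID value contribute nothing
                aid_sets[column].add(value)
    return aid_sets


def is_low_count(aid_sets, threshold):
    """Assumption: a bucket is low count if ANY AID column has too few entities.

    The fixed threshold is a simplification made for this sketch.
    """
    return any(len(values) < threshold for values in aid_sets.values())


# Hypothetical rows belonging to a single histogram bar (bucket).
bucket_rows = [
    {"ssn": "ssn-1", "account_id": "acc-1"},
    {"ssn": "ssn-1", "account_id": "acc-2"},
    {"ssn": "ssn-1", "account_id": "acc-3"},
    {"ssn": "ssn-2", "account_id": "acc-3"},
]

aid_sets = gather_aid_sets(bucket_rows, ["ssn", "account_id"])
low = is_low_count(aid_sets, threshold=3)
```

In this example the bucket holds three accounts but only two persons, so checking the account AID alone would let it through while checking both AIDs together suppresses it.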

Same thing for the count(distinct AID) function: I am not convinced anymore that we can use the simplified version in the general case, when multiple AIDs are present.

Let's take, for example, the scenario of a banking dataset, where the AIDs are the SSN and account_id columns. The same person can have multiple accounts, and an account can have multiple owners (shared ownership). If we want to count the number of distinct accounts, we still need to do flattening per person when summing contributions, in order to hide outlier individuals with many accounts, and vice versa.
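
A small sketch of that banking scenario, using made-up data, shows why one person can dominate a count of distinct accounts. It only measures per-entity contributions; it does not implement flattening itself, and every name and value in it is hypothetical.

```python
from collections import defaultdict

# Hypothetical (ssn, account_id) pairs; acc-4 and acc-5 have shared ownership.
ROWS = [
    ("ssn-1", "acc-1"),
    ("ssn-1", "acc-2"),
    ("ssn-1", "acc-3"),
    ("ssn-1", "acc-4"),
    ("ssn-2", "acc-4"),
    ("ssn-2", "acc-5"),
    ("ssn-3", "acc-5"),
    ("ssn-3", "acc-6"),
]


def distinct_per_key(rows, key_index, value_index):
    """Count the distinct values associated with each key."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_index]].add(row[value_index])
    return {key: len(values) for key, values in groups.items()}


def distinct_accounts_without(rows, ssn):
    """Distinct accounts that remain visible if one person is removed."""
    return len({account for person, account in rows if person != ssn})


accounts_per_person = distinct_per_key(ROWS, 0, 1)
owners_per_account = distinct_per_key(ROWS, 1, 0)
total_accounts = len({account for _, account in ROWS})
without_outlier = distinct_accounts_without(ROWS, "ssn-1")
```

Here removing the single person `ssn-1` drops the distinct account count from 6 to 3, which is the kind of outlier influence that per-person flattening is meant to hide. The same measurement in the other direction (`owners_per_account`) covers the "vice versa" case.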

edongashi commented 2 years ago

Ok, I won't merge until we think more about this.

> Each histogram bar is a separate bucket, which we should treat like regular buckets.

But in order to determine the histogram bin we need to know how many rows an AID has contributed. With multiple AIDs I'm not sure how to do the counting...

cristianberneanu commented 2 years ago

> But in order to determine the histogram bin we need to know how many rows an AID has contributed. With multiple AIDs I'm not sure how to do the counting...

The same way we do count(distinct column), but replace column with the counted AID.
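
One possible reading of this suggestion, as a rough sketch only: count rows per value of the counted AID, use that row count as the histogram bin, and let each bin collect the AID values of all AID columns so it can later be treated like a regular bucket. The function name `histogram_bars` and the data are hypothetical, and this is not the repository's implementation.

```python
from collections import Counter, defaultdict


def histogram_bars(rows, counted_aid, aid_columns):
    """Group rows into bars keyed by how many rows their counted-AID entity has.

    Each bar keeps the distinct AID values of every AID column, so a
    low count check can be applied to the bar as to any other bucket.
    """
    rows_per_entity = Counter(row[counted_aid] for row in rows)
    bars = defaultdict(lambda: {column: set() for column in aid_columns})
    for row in rows:
        bar = rows_per_entity[row[counted_aid]]  # bin = row count of this entity
        for column in aid_columns:
            bars[bar][column].add(row[column])
    return dict(bars)


# Hypothetical rows with two AID columns.
rows = [
    {"ssn": "ssn-1", "account_id": "acc-1"},
    {"ssn": "ssn-1", "account_id": "acc-2"},
    {"ssn": "ssn-1", "account_id": "acc-3"},
    {"ssn": "ssn-2", "account_id": "acc-3"},
    {"ssn": "ssn-3", "account_id": "acc-4"},
]

bars = histogram_bars(rows, counted_aid="ssn", aid_columns=["ssn", "account_id"])
```

With `ssn` as the counted AID, the bar for "3 rows" contains one person and three accounts, while the bar for "1 row" contains two persons and two accounts.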