Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform

Fishing attack for queries with 2 or 3 users #3982

Open yoid2000 opened 4 years ago

yoid2000 commented 4 years ago

There is an attack that allows an attacker to learn with 100% confidence column values associated with any two specific users that can be isolated together. This is not a critical attack for our customers, but should be fixed before we do the next challenge.

The attack is possible because of the hard lower-bound of 2 on the number of distinct UIDs used for LCF. In other words, a bucket with 1 distinct user will always be suppressed, but a bucket with 2 distinct users may be reported.

Suppose that there is a WHERE clause that isolates exactly two users (e.g. WHERE firstname = 'Bob', where there happen to be exactly two Bobs in the dataset). Then any query that produces any output row at all reveals the values associated with both users.

For instance, if the following query (this is the so-called "per-bucket" case, because of the group by):

SELECT age, count(*)
FROM table
WHERE firstname = 'Bob'
GROUP BY 1

produces this output:

age count(*)
28 2

then we know for certain that both Bobs are age 28.

Alternatively, the attacker could learn the same thing with this query (this is the so-called "global" case, because there is no GROUP BY):

SELECT count(*)
FROM table
WHERE firstname = 'Bob' AND age = 28

if the query returned a value of 2 or more.
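
To make the two-user case concrete, here is a minimal sketch in Python. It models only the hard lower bound of 2 distinct UIDs described above, not the noisy part of LCF; the function name and the UID labels are made up for illustration.

```python
from collections import Counter

HARD_MIN_UIDS = 2  # hard lower bound on distinct UIDs described in the post


def possibly_reported(values_by_uid):
    """Return the values whose bucket is not ruled out by the hard bound.

    `values_by_uid` maps each isolated user's UID to that user's column value.
    Noise may still suppress a bucket that clears the hard bound; this only
    lists the buckets that could appear in the output at all.
    """
    counts = Counter(values_by_uid.values())
    return {value for value, n in counts.items() if n >= HARD_MIN_UIDS}


print(possibly_reported({"bob1": 28, "bob2": 28}))  # {28}
print(possibly_reported({"bob1": 28, "bob2": 31}))  # set()
```

With exactly two isolated users, a bucket can only clear the bound if both users fall into it, which is why any reported row gives away both values.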

If the attacker can isolate three users (e.g. there are three Bobs), then the situation is in a way worse: there is a roughly 1/6 chance that a value will be reported if all three users share it, and if a value is reported, there is a good chance that it is because all three users have that value.

For instance, suppose that there are five values. The probability that three users share the same value is 1/25. The probability that two of the three users share the same value is around 1/5. But the probability that the value is reported if two users share it is quite low, around 1/750. Therefore, if there are three users, and a value is reported, it is highly likely in this case that all three users share the value.
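
As a rough check of that conclusion, the approximate figures above can be plugged into Bayes' rule. This is only a sketch: it takes the quoted probabilities at face value and treats "all three share" and "two share" as the only ways a value could be reported. The variable names are made up for illustration.

```python
from fractions import Fraction

# Approximate figures quoted above (five possible values, three isolated users).
p_all_three_share = Fraction(1, 25)
p_two_share = Fraction(1, 5)
p_report_given_three = Fraction(1, 6)
p_report_given_two = Fraction(1, 750)

# Joint probabilities of "this configuration occurs and a value is reported".
reported_and_three = p_all_three_share * p_report_given_three
reported_and_two = p_two_share * p_report_given_two

# Bayes' rule: probability that all three share the value, given a report.
posterior_three = reported_and_three / (reported_and_three + reported_and_two)
print(posterior_three)  # 25/26
```

Under these assumptions, a reported value means all three users share it with probability 25/26, or roughly 96%.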

I want to prevent this attack, but I don't want to raise the LCF threshold back up, at least not for the normal case.

I propose that we deal with this as follows. Define selected columns as the columns in the GROUP BY of an anonymizing query. For example, the query:

SELECT name, age, gender, count(*)
FROM table
GROUP BY 1,2,3

has name, age, and gender as selected columns. Define N as the number of selected columns (here N=3).

Define a bucket as one row of the cloak output. A bucket is defined by the values of the selected columns (each bucket has a distinct set of column values).

Define a bucket group as the set of one or more buckets where N-1 selected columns have the same value, and the Nth column value varies. So in the above query, you might have a bucket group where age=10 and gender='F', and then each bucket in the bucket group has a different name.

Compute the number of distinct UIDs for a bucket group as the greater of (a) the number of distinct UIDs recorded in the ac_min_uid and ac_max_uid values of the group's buckets, and (b) the ac_count_duid of the bucket with the most distinct UIDs.

For each group, compute an LCF using mean 7 and stddev 1. The seed for this computation should be based on the min and max UIDs across the whole group, and the number of distinct UIDs as computed in the previous paragraph.

If the group fails LCF, then all buckets in the group are treated as though they each individually failed LCF.

If a bucket does not fail any of the group LCF computations, then the normal LCF computation, with mean 4 and stddev 0.5, is applied to the individual bucket.
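
The proposal above could be sketched roughly as follows. This is Python for illustration only, not the cloak's implementation. The `Bucket` class, the function names, the string seed format, and the rule that a count below the noisy threshold fails LCF are all assumptions; only ac_min_uid, ac_max_uid, ac_count_duid and the mean/stddev values come from the description.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Bucket:
    values: tuple        # values of the N selected columns
    ac_min_uid: int      # smallest UID recorded for the bucket
    ac_max_uid: int      # largest UID recorded for the bucket
    ac_count_duid: int   # number of distinct UIDs in the bucket


def seeded_threshold(mean, stddev, seed):
    """Gaussian threshold from a generator seeded with `seed` (assumed scheme)."""
    return random.Random(seed).gauss(mean, stddev)


def bucket_groups(buckets):
    """Yield lists of buckets that agree on N-1 columns and vary in the Nth."""
    n = len(buckets[0].values) if buckets else 0
    for varying in range(n):
        groups = defaultdict(list)
        for b in buckets:
            key = b.values[:varying] + b.values[varying + 1:]
            groups[key].append(b)
        yield from groups.values()


def group_distinct_uids(group):
    """Greater of: distinct recorded min/max UIDs, or the largest bucket count."""
    recorded = {uid for b in group for uid in (b.ac_min_uid, b.ac_max_uid)}
    return max(len(recorded), max(b.ac_count_duid for b in group))


def passes(count, lo, hi, mean, stddev, threshold_fn):
    # Assumption: a count below the noisy threshold fails LCF.
    return count >= threshold_fn(mean, stddev, f"{lo}|{hi}|{count}")


def reported_buckets(buckets, threshold_fn=seeded_threshold):
    suppressed = set()
    for group in bucket_groups(buckets):
        count = group_distinct_uids(group)
        lo = min(b.ac_min_uid for b in group)
        hi = max(b.ac_max_uid for b in group)
        if not passes(count, lo, hi, 7, 1, threshold_fn):
            suppressed.update(group)  # every bucket in a failed group is dropped
    # Buckets that survived all group checks get the normal per-bucket LCF.
    return [
        b for b in buckets
        if b not in suppressed
        and passes(b.ac_count_duid, b.ac_min_uid, b.ac_max_uid, 4, 0.5, threshold_fn)
    ]
```

With a single selected column, all buckets of the query form one group. The two-Bob query therefore yields a group with 2 distinct UIDs, which would almost never pass a threshold drawn around 7, while a 2-user bucket inside a well-populated answer still only has to pass the normal per-bucket check. The `threshold_fn` parameter is there so the noise can be replaced with a fixed value when experimenting.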

yoid2000 commented 4 years ago

If it is not convenient to fix this before the challenge, I could probably live without it.