Various LR attacks using string functions

yoid2000 commented 4 years ago

... Updating to include new attacks from UPenn

General comments

All three of these attacks require randomness in the column values themselves.

Original attack

Reinhard after all found a linear reconstruction attack that works on text columns. The attack is possible on text columns that 1) have a lot of randomness built into them, and 2) are relatively long. He was able to successfully run the attack on the uid column of gda_taxi. That string is 32-characters long, each character being a hex value. (I don't exactly recall, but likely that column was generated using a random hash).

Recall that the attack requires that the attacker select a group of some 10s of users, and then generates 1000 - 2000 random subsets of those users.

Because of the randomness inherent in the column values, Reinhard was able to generate enough random groups using the clause:

WHERE substring(uid from x for y) LIKE '%z%'.

In particular, with 16 symbols, there is a 50% probability that a given 1-character string (i.e. 'A') appears within a substring of length 8, a slightly larger probability that it appears in a substring of length 9, and so on. In any event, Reinhard was able to make enough random groups this way to attack the gender column. (I've asked Reinhard to look into whether he can attack columns where there are more values.)

For the attack to work, the analyst needs essentially to know the set of values in the column as prior knowledge.

Two example queries for the attack are:

SELECT count(DISTINCT uid)
FROM table
WHERE random_col LIKE '%xyz%' and
        substring(random_col FROM 3 FOR 8) LIKE '%a%'

SELECT count(DISTINCT uid)
FROM table
WHERE random_col LIKE '%xyz%' and
        substring(random_col FROM 3 FOR 8) LIKE '%b%'

The random_col LIKE '%xyz%' clause is there to select a set of a few 10s of users from which to build random subsets. The three constants in the second clause are modified to get different sets.

UPenn attack 1

This attack relies on randomness in the last digits of the pickup_latitude column in the taxi database.

Basic form of the query is:

SELECT COUNT(*)
FROM rides
AND trip_distance = 0.87
AND SUBSTRING(CAST(pickup_latitude AS text),5,1) IN ('0','1','2','3','4')

Attacker changes the character position and set of selected digits to get different random user sets. This allows for around 1200 distinct queries. Note that the trip_distance clause is needed to limit the number of matching rows to a relatively small number (a few 10s) so that the linear equation can be computed.

This variant generates more random sets (around 300K):

SELECT COUNT(*)
FROM rides
AND trip_distance = 0.87
AND SUBSTRING(CAST(pickup_latitude AS text),5,1) IN ('0','1','2','3','4','6','7')  
AND SUBSTRING(CAST(pickup_latitude AS text),7,1) IN ('1','2','3','5','6','7','9')

UPenn attack 2:

This gives around 1800 sets:

SELECT COUNT(*)
FROM rides
AND trip_distance = 0.87
AND uid LIKE '%2%'
AND uid LIKE '%7%'
AND uid LIKE '%0%'
AND uid LIKE '%D%'

In the case of the taxi database, the med column is also pretty random, so by using both med and uid columns, the attacker could make around 30K sets.

Solutions

From the first attack, we need to prevent the combination of substring() and LIKE. From the second attack, we need to limit the number of instances of substring() (where we can regard substring() IN (x,y,z) as for instance three instances). From the third attack, we need to limit the number of instances of LIKE. Unfortunately these limits cannot be per-column but rather per-query, because one can mix and match between different columns to get the same effect (at least for the UPenn queries).

If I recall correctly, some of the TF queries had quite a few LIKE statements. So perhaps what we might want to do is to let the analyst whitelist columns that are safe (not random), and only apply limits to non-whitelisted columns.

Alternatively (and probably better), we should make it so that the admin can tag columns to have the limits applied so that in the normal case no extra configuration is required. (The normal case here not so much being where there are not columns with randomness, but rather where the customer is not really concerned with the attack.)

See proposed limits at https://github.com/Aircloak/aircloak/issues/4370#issuecomment-717037716 below

yoid2000 commented 4 years ago

Neither @sebastian nor I see a good reason not to just disallow the combination of substring() and LIKE on the same column. In other words, either substring() or LIKE are ok, but not both.

yoid2000 commented 4 years ago

I've labeled this as being needed in the ebony release. But it is not yet ready. I need to do a little more experimentation. Should be ready end of August.

yoid2000 commented 3 years ago

My former student Reinhard ran a bunch of measurements to see how effective different numbers of LIKE and substring conditions are on this attack.

My conclusion is that the fix should:

Allow the analyst to blacklist columns that have too much randomness.
Limit the total number of LIKE conditions across all blacklisted columns to one.
Limit the total number of substring() conditions across all blacklisted columns to two.
Disallow the same blacklisted column from having both substring() and LIKE conditions.

In practice I don't see any need for Telefonica to actually blacklist columns. To verify this I could write up a description of the attack and we can ask them if they think they need it.

I also don't expect to be running another bounty program with the current "proxy" form of cloak. Given this, and assuming that Telefonica doesn't ask for it, I really don't see a strong need to implement this at all. @sebastian ?

sebastian commented 3 years ago

Is there a correspondence between a column being isolating and it having this blacklist capability? Could we say just reuse the isolating property rather than introducing a new classification method?

Also blacklist is confusing it use. We already have the ability to exclude columns from being shown in the model altogether. That seems closer (in concept) to what a blacklist would be expected to do?

yoid2000 commented 3 years ago

We already have the ability to exclude columns from being shown in the model altogether.

But this ability still allows the columns to be used in WHERE clauses, yes?

yoid2000 commented 3 years ago

Is there a correspondence between a column being isolating and it having this blacklist capability?

Not really. The columns for this attack are those with inherent randomness in them. Such columns would tend to be isolating as well, but still it is different.

yoid2000 commented 3 years ago

Regarding which columns to whitelist, in strict terms we care about "how random" the symbols in the values are, and how many symbols there are. The more symbols and the more random, the better any given attack will be.

Actually, "random" here is a misnomer. What we really care about is how "mixed" the symbols are, in the sense of any given symbol is about as likely to appear in a given position as any other symbol.

Nevertheless I would regard a string or substring to be mixed enough when either the string has been generated through a strongly random process like a hash function or very noisy numeric data (high precision low accuracy digits), or through the assignment of incrementing values (the low-order digits).

If there are two or more such well-mixed digits, then the column may be blacklisted.

Note that if the column really isn't one where the attacker could know all the values in the column, or all the values that match some condition from another column (like all the values for a given day), then it really isn't necessary to blacklist the column.

yoid2000 commented 3 years ago

We already have the ability to exclude columns from being shown in the model altogether.

IIRC, these columns cannot be in the SELECT, but can be in WHERE clauses, yes?

Also blacklist is confusing it use.

I agree. We need some different tagging designation.

sebastian commented 3 years ago

We already have the ability to exclude columns from being shown in the model altogether. But this ability still allows the columns to be used in WHERE clauses, yes?

Only in a really restricted manner. Take a look at our design doc: https://docs.google.com/document/d/11jdS3lRnbGOCBhM7_kaKcy4dYYoQlIyYsnS6JYfykzE/edit#heading=h.awfkdr6xy69m

We explicitly started out with a limited set of capabilities. Specifically a column marked as unselectable can only be used in a WHERE-clause under the following circumstances:

If marked as a key and then only with another key-column

sebastian commented 3 years ago

... how many symbols there are...

What do you mean by "symbol" in this context?

yoid2000 commented 3 years ago

What do you mean by "symbol" in this context?

A single character in a string (or number converted to a string).

yoid2000 commented 3 years ago

If marked as a key and then only with another key-column

I don't quite understand what this means in the context of WHERE clause. Can you give an example. It may certainly well be the case that a greylisted column is safe from these LR attacks...

edongashi commented 3 years ago

We use the following terminology

Isolating column (obvious)
Unselectable (greylisted) column - Unselectable columns can only be joined together, counted, and/or grouped by.
Excluded (blacklisted, ignored) column - Excluded columns will not appear in the data source and cannot be referenced in any way from queries.

Any expression that is somehow derived from an unselectable column is also considered unselectable. Those expressions cannot appear in SELECT (except COUNT) or WHERE.

sebastian commented 3 years ago

I don't quite understand what this means in the context of WHERE clause.

The following queries would be allowed:

SELECT count(*)
FROM a, b
WHERE a.unselectableKeyColumn = b.unselectableKeyColumn 

SELECT count(*)
FROM a INNER JOIN b ON a.unselectableKeyColumn = b.unselectableKeyColumn

More or less everything else is forbidden.

yoid2000 commented 3 years ago

More or less everything else is forbidden.

Ok. Then all grey-listed columns are safe from this attack.

But of course grey-listing is much more restrictive than the restrictions described here.

sebastian commented 3 years ago

But of course grey-listing is much more restrictive than the restrictions described here.

Question still is: should we maybe say that for such columns (and maybe we provide some way to help determine if a column is in fact in this category or not) they should strictly speaking be marked as unelectable (greylisted).

yoid2000 commented 3 years ago

The counter example, though, would be pickup_latitude in the taxi table. It has a lot of randomness that can be exploited, but you still want to use it in where clauses. You couldn't greylist that. And limiting the number of LIKE and substring would not hinder usefulness. (Though arguably you could push it through a view that does some rounding and gets rid of most of the randomness.)

sebastian commented 3 years ago

Let's hold off on implementing this until we know for a fact that Telefonica needs it.

sebastian commented 3 years ago

Telefonica is happy with not having this for now.

Aircloak / aircloak