IN() values must be from shadow table

yoid2000 commented 4 years ago

Values in IN() must be from the shadow table or else we are susceptible to a linear reconstruction attack that simply puts randomly selected values that each isolate a different user in the IN() statement.

cristianberneanu commented 4 years ago

Is this practical with non-isolating columns?
How about NOT IN values?
Should this be included in the next release?

yoid2000 commented 4 years ago

Is this practical with non-isolating columns?

I don't understand the question.

How about NOT IN values?

These are already constrained to be shadow values, right? (equivalent to col <> val AND col <> val ...)

Should this be included in the next release?

There are now several vulnerability fixes queued up (#4472, #4434, #4370) that all add additional restrictions, and none of which is individually very complex. We should probably either include none of them or all of them in the next release. @fjab ?

cristianberneanu commented 4 years ago

I don't understand the question.

Let me rephrase it then: is it practical to find enough values which isolate a single user for a column that is non-isolating?

These are already constrained to be shadow values, right? (equivalent to col <> val AND col <> val ...)

No. The shadow values check is only done for <> and not like.

yoid2000 commented 4 years ago

Let me rephrase it then: is it practical to find enough values which isolate a single user for a column that is non-isolating?

I still don't understand. You mean practical for the attacker? To run the attack, the attacker needs to constrain the query in such a way that the shadow-table values end up isolating users. This is theoretically possible but probably pretty hard. Does this answer the question?

No. The shadow values check is only done for <> and not like.

But NOT IN is equivalent to <> so I think it is already covered.

cristianberneanu commented 4 years ago

You mean practical for the attacker?

I mean practical for the attacker in the current design, where the column has to be non-isolating.

But NOT IN is equivalent to <> so I think it is already covered.

You are right, I forgot about the normalization we are doing.

yoid2000 commented 4 years ago

I mean practical for the attacker in the current design, where the column has to be non-isolating.

Ah. Well that's always a question ;)

Non-isolating columns can still have a lot of distinct values. The bounty attackers have had some success in attacking non-isolating columns after all, so the mechanism (or the setting of 80%) isn't as effective as I had supposed.

I think in any event that with the recent changes (broader definition of implicit range etc.) we may be able to get rid of the isolated column designation altogether.

So the short answer is that yes, at least in a bounty setting, the attack is practical...

cristianberneanu commented 4 years ago

@yoid2000 If we force IN values to be in the shadow table, is there a point in also forcing the IN columns to be non-isolating? I see the former check as a super-set of the latter.

Also, what happens if the shadow analysis is turned off? This change will prevent any IN conditions in that case.

cristianberneanu commented 4 years ago

Also, the subject of an IN condition has to be clear. Does this requirement make sense anymore?

yoid2000 commented 4 years ago

If we force IN values to be in the shadow table, is there a point in also forcing the IN columns to be non-isolating? I see the former check as a super-set of the latter.

I agree. There would be point in making it non-isolating only.

Also, what happens if the shadow analysis is turned off?

It's a policy decision. But IN() is probably pretty useful, and I would think a customer might want the option of enabling IN() even when shadow analysis is turned off. Especially given that the attack this defends against (linear reconstruction) is not so easy to run and has limited scope. @sebastian ?

Also, the subject of an IN condition has to be clear. Does this requirement make sense anymore?

Yes, because this is how we are able to seed the noise layers without floating.

cristianberneanu commented 4 years ago

@sebastian This means IN will be broken by default when shadow analysis is disabled. A similar thing happens now with WHEN clauses. We can either:

allow any value when shadow analysis is turned off;
add configuration settings to disable frequent values checking; if so, should these be merge together in a single setting or do you want them separate?

sebastian commented 4 years ago

I agree. There would be point in making it non-isolating only.

@yoid2000 is the a "no" missing in that sentence? I.e. there is no point in making it non-isolating only anymore if one applies the restriction that the value must be in the shadow db?

On the other hand, if we allow you to turn of the shadow db restriction for those that have the shadow db analysis queries turned off, maybe it still makes sense to have the restriction that the value cannot be isolating?

allow any value when shadow analysis is turned off;

I don't like this, as it's a somewhat non-obvious side effect that, while described in the docs, for all intents and purposes will not be known by the administrator.

add configuration settings to disable frequent values checking;

I would much rather we have an explicit setting that allows turning of the check.

if so, should these be merge together in a single setting or do you want them separate?

with which other setting?

cristianberneanu commented 4 years ago

with which other setting?

The one which allows any values in WHEN clauses (from #4656).

sebastian commented 4 years ago

Merging them probably makes sense. Less knobs to expose.

cristianberneanu commented 4 years ago

@yoid2000 How do you want us to handle aggregated IN conditions, i.e. something like HAVING MAX(column) IN (1,2,3)?

BTW, this change will lead to more cumbersome usage of the system: we only store 200 frequent values per column and we don't list them anywhere. So some values with many users won't be allowed and one can't know which values are frequent either..

fjab commented 4 years ago

Would it be problematic from an anonymization perspective to list the most frequent values? Is there a situation in which the analyst might look at the list to understand why certain queries work / don't work / create unexpected results?

sebastian commented 4 years ago

Would it be problematic from an anonymization perspective to list the most frequent values?

How would we do it?

Adding additional fake tables that can be queried?
Expand the SQL query language somehow?
Show it just in the air web interface?

edongashi commented 4 years ago

Show it just in the air web interface?

Should be the easiest to use. The popovers could be modified to support it, or they could provide a link to a separate page.

sebastian commented 4 years ago

I don't like it very much when information that is needed to write queries is only available through the web interface. I suppose it could be exposed as an API endpoint too, or through a query such as SHOW COMMON VALUES FOR column FROM table. I prefer the query to an additional API endpoint.

fjab commented 4 years ago

Agree; goal should be that Aircloak can be used as fully as possible without the web interface. If there is an additional visual way to show it in the web interface, that's a nice bonus. Re: API or SQL, I have a hard time judging that. Let's say someone is querying Aircloak from a Jupyter notebook, could they easily access such an API endpoint?

sebastian commented 4 years ago

Let's say someone is querying Aircloak from a Jupyter notebook, could they easily access such an API endpoint?

They could, but likely it would require a lot of extra effort (assuming they would otherwise use the Postgres interface). If we support an SQL query to provide the list of common values then that can be run whichever way one queries Aircloak (REST API, Postgres interface, or web interface).

cristianberneanu commented 4 years ago

The discussion here side-tracked from the main topic, so I made separate issue for it: #4672

cristianberneanu commented 4 years ago

@yoid2000 Hope you didn't miss my question in the extra noise: How do you want us to handle aggregated IN conditions, i.e. something like HAVING MAX(column) IN (1,2,3)? I don't see a good way to handle this case.

yoid2000 commented 4 years ago

@cristianberneanu am finishing something up then will take a close look, probably just after lunch.

yoid2000 commented 4 years ago

@cristianberneanu I think I'm missing the significance of your question.

How do you want us to handle aggregated IN conditions, i.e. something like HAVING MAX(column) IN (1,2,3)?

Why not the same limitation? (i.e. the values in the IN expression need to be in the shadow table).

cristianberneanu commented 4 years ago

Why not the same limitation? (i.e. the values in the IN expression need to be in the shadow table).

Because that won't make much sense for most aggregates. If you have a per-user aggregate like COUNT/SUM/AVG(column), how do you compute frequent values for it from a list of global frequent values?

yoid2000 commented 4 years ago

Ah, yes of course. For MAX() itself it would still make sense to require the value be in the shadow table, since the max is still a specific value from the column. count() I think is safe (meaning we don't need to care what the values in the IN() statement are). Exploiting count would require that different users have different counts, most per-user counts are unique from other users, and the attacker knows what they are.

sum() and avg() are tricky because if we are aggregating per user, and there is only one row per user, then sum(column) is the same as column, in which case we want the IN value to be in the shadow table. But I presume that we cannot easily know when this is the case. So I believe that we would need to disallow sum and avg in combination with IN. Or maybe simpler just to disallow all numeric aggregates with IN. (Actually, I was a bit surprised that this wasn't already the case....)

@sebastian how damaging do you think this would be?

cristianberneanu commented 4 years ago

So, in conclusion:

count and count(distinct) are always safe;
max and min have to be in the shadow table;
avg, sum, stddev are forbidden;

Is this correct?

yoid2000 commented 4 years ago

Yes this is how I see it.

On Tue, Aug 25, 2020, 10:23 Cristian Berneanu notifications@github.com wrote:

So, in conclusion:

count and count(distinct) are always safe;

max and min have to be in the shadow table;

avg, sum, stddev are forbidden;

Is this correct?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Aircloak/aircloak/issues/4477#issuecomment-679881074, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAQP5KNCVTPB3NR4UXPNZPDSCNYHXANCNFSM4OGJ2WWA .

Aircloak / aircloak

IN() values must be from shadow table #4477