Closed yoid2000 closed 4 years ago
NOT IN
values?Is this practical with non-isolating columns?
I don't understand the question.
How about NOT IN values?
These are already constrained to be shadow values, right? (equivalent to col <> val AND col <> val ...
)
Should this be included in the next release?
There are now several vulnerability fixes queued up (#4472, #4434, #4370) that all add additional restrictions, and none of which is individually very complex. We should probably either include none of them or all of them in the next release. @fjab ?
I don't understand the question.
Let me rephrase it then: is it practical to find enough values which isolate a single user for a column that is non-isolating?
These are already constrained to be shadow values, right? (equivalent to col <> val AND col <> val ...)
No. The shadow values check is only done for <>
and not like
.
Let me rephrase it then: is it practical to find enough values which isolate a single user for a column that is non-isolating?
I still don't understand. You mean practical for the attacker? To run the attack, the attacker needs to constrain the query in such a way that the shadow-table values end up isolating users. This is theoretically possible but probably pretty hard. Does this answer the question?
No. The shadow values check is only done for <> and not like.
But NOT IN
is equivalent to <>
so I think it is already covered.
You mean practical for the attacker?
I mean practical for the attacker in the current design, where the column has to be non-isolating.
But NOT IN is equivalent to <> so I think it is already covered.
You are right, I forgot about the normalization we are doing.
I mean practical for the attacker in the current design, where the column has to be non-isolating.
Ah. Well that's always a question ;)
Non-isolating columns can still have a lot of distinct values. The bounty attackers have had some success in attacking non-isolating columns after all, so the mechanism (or the setting of 80%) isn't as effective as I had supposed.
I think in any event that with the recent changes (broader definition of implicit range etc.) we may be able to get rid of the isolated column designation altogether.
So the short answer is that yes, at least in a bounty setting, the attack is practical...
@yoid2000 If we force IN
values to be in the shadow table, is there a point in also forcing the IN
columns to be non-isolating?
I see the former check as a super-set of the latter.
Also, what happens if the shadow analysis is turned off? This change will prevent any IN
conditions in that case.
Also, the subject of an IN
condition has to be clear. Does this requirement make sense anymore?
If we force IN values to be in the shadow table, is there a point in also forcing the IN columns to be non-isolating? I see the former check as a super-set of the latter.
I agree. There would be point in making it non-isolating only.
Also, what happens if the shadow analysis is turned off?
It's a policy decision. But IN()
is probably pretty useful, and I would think a customer might want the option of enabling IN()
even when shadow analysis is turned off. Especially given that the attack this defends against (linear reconstruction) is not so easy to run and has limited scope. @sebastian ?
Also, the subject of an IN condition has to be clear. Does this requirement make sense anymore?
Yes, because this is how we are able to seed the noise layers without floating.
@sebastian This means IN
will be broken by default when shadow analysis is disabled. A similar thing happens now with WHEN
clauses. We can either:
I agree. There would be point in making it non-isolating only.
@yoid2000 is the a "no" missing in that sentence? I.e. there is no point in making it non-isolating only anymore if one applies the restriction that the value must be in the shadow db?
On the other hand, if we allow you to turn of the shadow db restriction for those that have the shadow db analysis queries turned off, maybe it still makes sense to have the restriction that the value cannot be isolating?
allow any value when shadow analysis is turned off;
I don't like this, as it's a somewhat non-obvious side effect that, while described in the docs, for all intents and purposes will not be known by the administrator.
add configuration settings to disable frequent values checking;
I would much rather we have an explicit setting that allows turning of the check.
if so, should these be merge together in a single setting or do you want them separate?
with which other setting?
with which other setting?
The one which allows any values in WHEN
clauses (from #4656).
Merging them probably makes sense. Less knobs to expose.
@yoid2000 How do you want us to handle aggregated IN
conditions, i.e. something like HAVING MAX(column) IN (1,2,3)
?
BTW, this change will lead to more cumbersome usage of the system: we only store 200 frequent values per column and we don't list them anywhere. So some values with many users won't be allowed and one can't know which values are frequent either..
Would it be problematic from an anonymization perspective to list the most frequent values? Is there a situation in which the analyst might look at the list to understand why certain queries work / don't work / create unexpected results?
Would it be problematic from an anonymization perspective to list the most frequent values?
How would we do it?
Show it just in the air web interface?
Should be the easiest to use. The popovers could be modified to support it, or they could provide a link to a separate page.
I don't like it very much when information that is needed to write queries is only available through the web interface. I suppose it could be exposed as an API endpoint too, or through a query such as SHOW COMMON VALUES FOR column FROM table
. I prefer the query to an additional API endpoint.
Agree; goal should be that Aircloak can be used as fully as possible without the web interface. If there is an additional visual way to show it in the web interface, that's a nice bonus. Re: API or SQL, I have a hard time judging that. Let's say someone is querying Aircloak from a Jupyter notebook, could they easily access such an API endpoint?
Let's say someone is querying Aircloak from a Jupyter notebook, could they easily access such an API endpoint?
They could, but likely it would require a lot of extra effort (assuming they would otherwise use the Postgres interface). If we support an SQL query to provide the list of common values then that can be run whichever way one queries Aircloak (REST API, Postgres interface, or web interface).
The discussion here side-tracked from the main topic, so I made separate issue for it: #4672
@yoid2000 Hope you didn't miss my question in the extra noise: How do you want us to handle aggregated IN conditions, i.e. something like HAVING MAX(column) IN (1,2,3)
? I don't see a good way to handle this case.
@cristianberneanu am finishing something up then will take a close look, probably just after lunch.
@cristianberneanu I think I'm missing the significance of your question.
How do you want us to handle aggregated
IN
conditions, i.e. something likeHAVING MAX(column) IN (1,2,3)
?
Why not the same limitation? (i.e. the values in the IN
expression need to be in the shadow table).
Why not the same limitation? (i.e. the values in the IN expression need to be in the shadow table).
Because that won't make much sense for most aggregates. If you have a per-user aggregate like COUNT/SUM/AVG(column)
, how do you compute frequent values for it from a list of global frequent values?
Ah, yes of course. For MAX()
itself it would still make sense to require the value be in the shadow table, since the max
is still a specific value from the column. count()
I think is safe (meaning we don't need to care what the values in the IN()
statement are). Exploiting count
would require that different users have different counts, most per-user counts are unique from other users, and the attacker knows what they are.
sum()
and avg()
are tricky because if we are aggregating per user, and there is only one row per user, then sum(column)
is the same as column
, in which case we want the IN
value to be in the shadow table. But I presume that we cannot easily know when this is the case. So I believe that we would need to disallow sum
and avg
in combination with IN
. Or maybe simpler just to disallow all numeric aggregates with IN
. (Actually, I was a bit surprised that this wasn't already the case....)
@sebastian how damaging do you think this would be?
So, in conclusion:
count
and count(distinct)
are always safe;max
and min
have to be in the shadow table;avg
, sum
, stddev
are forbidden;Is this correct?
Yes this is how I see it.
On Tue, Aug 25, 2020, 10:23 Cristian Berneanu notifications@github.com wrote:
So, in conclusion:
- count and count(distinct) are always safe;
- max and min have to be in the shadow table;
- avg, sum, stddev are forbidden;
Is this correct?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Aircloak/aircloak/issues/4477#issuecomment-679881074, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAQP5KNCVTPB3NR4UXPNZPDSCNYHXANCNFSM4OGJ2WWA .
Values in IN() must be from the shadow table or else we are susceptible to a linear reconstruction attack that simply puts randomly selected values that each isolate a different user in the IN() statement.