Closed: yoid2000 closed this issue 4 years ago.
As discussed offline, we should look into getting proper low-effect detection and implementing it correctly.
@yoid2000 will come up with a design.
This is a blocker for the upcoming attack challenge.
Note: we don't have a proper low-effect detection mechanism in place yet, so we should think of an alternative permanent solution here... I added a label to indicate it is a blocker for the attack challenge.
The shadow table prevents this attack. The variant that spreads the attack over multiple queries is fixed by setting the number of allowed unsafe <> conditions to zero instead of one.
So in short this is not blocking the attack challenge.
> The shadow table prevents this attack. The variant that spreads the attack over multiple queries is fixed by setting the number of allowed unsafe <> conditions to zero instead of one.
Is that the way it is documented in the diffix paper? It's certainly not the way we describe our restrictions in our restrictions paper... This is highly inconsistent with how we describe our system as operating. As such, it's very unsatisfactory.
You are referring to allowing one instead of zero?
In the diffix doc, it states that you cannot have any negative conditions that are not in the shadow...
So yes it is currently inconsistent...
We should make it consistent and decide on one way or the other.
Currently it is configurable and the default is 1.
We could keep it configurable, and make the default 0.
If this screws up TF, we could then tell them to either set it to 1, or compute a shadow table...
I don't understand what needs to be done here.
@sebastian please confirm that we should set the default value to zero for the allowed number of negands or IN values that are not in the shadow table.
Paul, yes, so it seems. That way our system is safe. Please also document this well.
@cristianberneanu ok?
@sebastian I had already documented it that way ;)
> @sebastian I had already documented it that way ;)
I meant in changelog
> I meant in changelog
I don't know what this means.
I'll make the change and update the changelog.
BTW @yoid2000, should we take this chance to increase the number of stored entries in the shadow table? 200 seems pretty low; I think 10k would be a more reasonable amount.
We had 10K before, and I had it changed to 200.
The danger is opening up cases where a shadow entry can be chaff or an isolator in a limited data context.
I did think more about this in the last couple days, and making an attack is harder than I previously thought. We may be able to increase the number, but probably not to 10K. Let me think about this some more.
But for now leave it at 200.
By the way, I understand we allow the admin to manually label columns as isolating.
Do we also have a way to allow the admin to manually generate shadow tables? This might be an acceptable solution for TF.
> The danger is opening up cases where a shadow entry can be chaff or an isolator in a limited data context.
Shouldn't this be correlated with the number of users a value belongs to instead?
> Do we also have a way to allow the admin to manually generate shadow tables?
AFAIK, no.
> Shouldn't this be correlated with the number of users a value belongs to instead?
I don't think so. Well perhaps it should also be correlated with the number of users a value belongs to (currently we have a cutoff of 10, but that is in fact probably too small).
The general situation is that an attacker is running an attack based on selecting some fraction of users (enough users that there won't be much suppression, but ideally no more than that). The attacker wants to come up with a set of five negands that are chaff, so in other words none of the selected users have those negand values.
If you only correlate with the number of users per value, then if the dataset has a very large number of users, you could still have a situation where there are very many shadow values, and so among a small set of users it is easy to find 5 of them that belong to no users.
> then if the dataset has a very large number of users
Then the threshold can be made relative instead of absolute.
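Purely for illustration of what a relative threshold could look like (the table `per_value_user_counts`, the column names, and the 0.1% figure are made-up assumptions, not the actual design):

```sql
-- Hypothetical sketch: keep a value in the shadow table only if it belongs to a
-- minimum fraction of all users, with the existing absolute floor of 10 users.
SELECT value
FROM per_value_user_counts   -- assumed precomputed: (value, user_count)
WHERE user_count >= greatest(10, 0.001 * (SELECT count(DISTINCT uid) FROM source_table));
```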
> The attacker wants to come up with a set of five negands that are chaff, so in other words none of the selected users have those negand values.
We could probe for such cases dynamically. I think a nicer analyst experience would be given by something like:
- if `max_rare_negative_conditions = 0`, then shadow tables are disabled;
- if `max_rare_negative_conditions = 1`, then shadow tables contain 10k entries;
- if `max_rare_negative_conditions > 1`, then negative conditions are probed.

@cristianberneanu I don't want to have 10K entries. I'm at the moment only comfortable with 200.
Probing for negative conditions is essentially what we are trying to do with LED, if I understand you? If and when we do that, then we won't need the shadow table...
The probing doesn't need to happen on the real data (though that would be ideal), it could be performed over the shadow table statistics.
If we know the average user id density and we store the min/max user ids for each shadow value, we could compute the density of the intersection between different conditions and apply a heuristic on that.
But maybe this would be too vulnerable to be worth doing. I was hoping we could find an algorithm that is more analyst-friendly than the current design (which might reject too much) while, at the same time, being easier to implement than full-blown LED...
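A rough sketch of how such a heuristic might be expressed, purely as an illustration (the shadow table layout and the density estimate below are assumptions, not the actual design):

```sql
-- Assumed layout: shadow(column_name, value, user_count, min_uid, max_uid).
-- For a set of candidate negand values, look at the overlap of their uid ranges.
SELECT max(min_uid)    AS overlap_lo,   -- start of the intersection of uid ranges
       min(max_uid)    AS overlap_hi,   -- end of the intersection of uid ranges
       sum(user_count) AS covered_users
FROM shadow
WHERE column_name = 'last_name'
  AND value IN ('rare_value_1', 'rare_value_2');
-- With avg_density = total_users / (global_max_uid - global_min_uid), the estimated
-- size of the intersection is roughly avg_density * greatest(0, overlap_hi - overlap_lo);
-- if that estimate falls below some cutoff, the condition set would be probed or rejected.
```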
By the way, my thinking behind having 200 shadow table values, in addition of course to protecting privacy, was that any disallowed values would just not have that much effect on the query. On average, a single negand would only change your result by 0.5%. Of course, for a given query a disallowed negand may have a much bigger effect, but overall my hope was that 200 shadow table values would still be pretty good.
Anyway, can we put this discussion on hold? At the moment no customer is complaining about the limitation, and I don't see an easy fix, so maybe table it for now...
> Anyway, can we put this discussion on hold? At the moment no customer is complaining about the limitation, and I don't see an easy fix, so maybe table it for now...
Sure, it was mostly food for future thought.
In #2273, I describe a negative condition chaff attack where the noise generated by multiple zero-effect conditions is used to detect presence or absence of the victim in a difference attack. The short-term fix (which turned out to be a huge pain in the ass) uses the shadow table described in #2486. The idea here is to allow only a single negative condition for values that are not in the shadow table.
This, however, is not enough. It turns out that you can get the same effect by issuing multiple queries, each with a different chaff negative condition, and then taking the average of the difference over time.
So for instance, while the attack in #2273 made the following attack pair:
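(The original query pair from #2273 is not reproduced in this copy of the thread; the following is only an illustrative reconstruction with made-up table, column, and value names, not the actual queries.)

```sql
-- Left query of the pair: counts everyone per bucket.
SELECT dept, count(*) FROM employees GROUP BY dept;

-- Right query of the pair: the reverse negative condition excludes the victim
-- (the only female in her bucket), and several chaff negands that match no users
-- are added solely to pick up extra noise layers.
SELECT dept, count(*) FROM employees
WHERE gender <> 'F'
  AND last_name <> 'no_such_name_1'
  AND last_name <> 'no_such_name_2'
  AND last_name <> 'no_such_name_3'
GROUP BY dept;
```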
Instead we would turn that into multiple pairs of queries like this:
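Again only as an illustrative sketch with the same made-up schema; each pair differs by a single chaff negand, varied from pair to pair:

```sql
-- Pair i, left query:
SELECT dept, count(*) FROM employees
WHERE gender <> 'F'
GROUP BY dept;

-- Pair i, right query: identical except for one chaff negand; a different
-- non-existent value is used in each pair.
SELECT dept, count(*) FROM employees
WHERE gender <> 'F'
  AND last_name <> 'no_such_name_i'
GROUP BY dept;
```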
For the bucket holding the victim, the chaff condition results in a bigger difference between the bucket counts. If we take the average of the difference over a bunch of these pairs, then we'll see which bucket has the noisier difference and find our victim.
Note that this attack relies on the reverse negative condition (i.e. only one female), because our current restrictions prevent two negative conditions on rare values, so it is not as bad as it might otherwise be. However, I have always had a concern that in fact the shadow table approach isn't quite good enough, because the shadow table runs over the full data, and is less valid when other conditions are present (some context).
The real solution to all this is to count the number of users that each negative condition isolates, and to drop low-effect conditions. We've discussed doing this both with probing and with `CASE`. The latter seems preferable because we don't need multiple queries. (Though the downside is that the `CASE` query itself is more expensive.)

Maybe the time has come to really implement low-effect detection/prevention.
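As a rough sketch of the `CASE` idea (not the actual implementation; schema, condition handling, and thresholds are made up), the rewritten query could count in a single pass how many distinct users each negative condition would exclude:

```sql
SELECT
  count(DISTINCT uid) AS total_users,
  -- users that the negand last_name <> 'smith' would exclude:
  count(DISTINCT CASE WHEN last_name = 'smith' THEN uid END) AS excluded_by_negand_1,
  -- users that the negand zip <> 12345 would exclude:
  count(DISTINCT CASE WHEN zip = 12345 THEN uid END) AS excluded_by_negand_2
FROM employees
WHERE dept = 'sales';
-- A negand whose excluded count is zero (or below a noisy threshold) is low-effect
-- (chaff or an isolator), so the engine could drop it or reject the query.
```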