Closed yoid2000 closed 4 years ago
@cristianberneanu @sebastian FYI
so we are stuck with a noise layer per node. ... One implication of a low LE threshold is that we cannot inform the analyst when LE has taken place.
Just for clarification. We are stuck with noise layers even for conditions that we consider having no effect, and that we compensate for, right?
I wonder if there isn't some way to distinguish between us compensating for the lack of a value, and there being a value... I don't have a clear attack, it's just something in the back of my mind.
> Just for clarification. We are stuck with noise layers even for conditions that we consider having no effect, and that we compensate for, right?
Yes. I looked for a way but couldn't find one. The problem is that if you have no noise layer for zero-effect (0e) nodes, then you also need to have no noise layer for one-effect (1e) nodes (in which the answer is adjusted to remove the effect of the user). Otherwise the attacker can distinguish between the two by the presence or absence of noise.
But if you remove noise for both 0e and 1e nodes, then the attacker can start concentrating on the difference between buckets that have 2 users + noise, and buckets that have 1 user but which get adjusted to 0 users and no noise. We could perhaps take the position that it is hard to find such cases, but what I call compound posors (posors with multiple AND'd conditions, like `OR (A AND B AND C)`) give the attacker much more flexibility in choosing which users are selected with the posor.
> I wonder if there isn't some way to distinguish between us compensating for the lack of a value, and there being a value...
Not sure what you are referring to here.
> I wonder if there isn't some way to distinguish between us compensating for the lack of a value, and there being a value...
>
> Not sure what you are referring to here.
If I read one of your other previous issues correctly, your proposal for compensating for 1e users was to add 1x of noise to the query to compensate for the user? Rather than to actually adjust the aggregate by the contribution of the 1e user?
Oh, and you are wondering if an attacker can detect that we've done this, presumably because of the difference between our adjustment and the true effect of dropping the condition?
> Oh, and you are wondering if an attacker can detect that we've done this, presumably because of the difference between our adjustment and the true effect of dropping the condition?
Yes, exactly that.
I discuss this in #3914.
Just a quick heads up. I'm going to change how we make noise layers in the design here. We can have fewer noise layers if we only make noise layers for conditions that fail to pass a noisy LCF-style threshold. Will let you know when the change is complete.
Ok, I posted the changes in the first comment. Will work on some more examples, especially sub-queries.
@cristianberneanu at the moment I'm not seeing any reason for limiting `OR` to a single sub-query. Since we don't drop conditions per se (but instead adjust), the fact that a node spans multiple sub-queries doesn't prevent us from making the necessary noise layers and adjustments.
If you can find a counter example I'd like to see it.
In what follows I use the term 'node' to refer to a single posor, negand, or range condition, where posor is `OR (c1 AND [NOT] c2 AND [NOT] c3)` for one or more conditions `cN`, negand is `AND NOT (c1 and [NOT] c2 and [NOT] c3)` for one or more conditions `cN`, and range is `AND col BETWEEN X and Y`. I'm not including posands when I say "node".
This is not 100% clear to me. For the following condition:
`a = 1 or a <> 2 or (a between 0 and 100) or (a <> 3 and b between 1 and 4)`
which are the nodes?
> final-sd is the standard deviation of the noise layer (as normally computed)

What does this mean? A noisy aggregator has a final SD, a noise layer adds to that SD.
Are adjustments done to the SD for computing the aggregators or to the aggregators themselves? How is an aggregator like `sum(x)` adjusted?
> The noise associated with LE conditions is not factored into the noise reporting
This will really hurt usability. The value of the noise estimation consists entirely in it being an upper bound on the real noise added. We could add less noise, but adding more will make it useless for analysts (they can't use it to make any assumptions about the result).
> (col3 = 'a' and col4 <> 'j' and col4 <> 'k') OR (col3 = 'b' and col4 <> 'j' and col4 <> 'k')

From what I see in the example, the probe checks each posor, but it then doesn't correctly check each negand in the context of the posor. The checks in the query are simply `col4 = 'j'`, when they should at least be `col3 = 'a' and col4 = 'j'` plus `col3 = 'b' and col4 = 'j'`.
Furthermore, when multiple negands are present, all possible combinations should be checked. Otherwise, we won't detect if both `col4 <> 'j'` and `col4 <> 'k'` are LE in the context of the posands. This will greatly complicate the probe query.
I would also like to see an example of a query with a having filter and a where filter in the inner-most subquery. Something like:

```sql
select cli_district_id, sum(cnt)
from (
    select uid, cli_district_id, count(*) as cnt
    from transactions
    where acc_district_id in (1,2)
    group by 1,2
    having count(*) between 0 and 1000
) t
group by 1
```
I don't have a good feeling about normalization across subqueries. It can be done in some scenarios, but it will get really hairy when combining joins, grouping, having and where filters. I think we should leave it for the next version.
Overall, I feel that this change will be very complex and I am not optimistic that we will finish sooner than 3-4 months. And that's if we don't find other problems with it during implementation ... maybe it would be better to have one release fully allocated to this purpose?
> This is not 100% clear to me. For the following condition:
>
> `a = 1 or a <> 2 or (a between 0 and 100) or (a <> 3 and b between 1 and 4)`
>
> which are the nodes?
Normalization means turning the expression into a set of and groups separated by `OR` and up to one `NOT OR`.
The above expression has four and group expressions. Each such expression is one node. In addition, three of the four and group expressions have one or more negands or ranges. Each of these is a node that needs to be checked if the containing and group is not LE. There are four of these negand/range nodes.
That means there are a total of eight nodes here, all of which need to be checked. The `CASE` statements would include the following:

```sql
-- and groups (must be checked):
CASE WHEN a = 1...
CASE WHEN a <> 2...
CASE WHEN a between 0 and 100...
CASE WHEN (a <> 3 and b between 1 and 4)
-- negands and ranges (checked if the corresponding and group is not LE)
CASE WHEN a = 2...
CASE WHEN a < 0 or a >= 100...
CASE WHEN a = 3
CASE WHEN b < 1 or b >= 4
```
But shouldn't this be:

```sql
CASE WHEN a = 2...
CASE WHEN a NOT BETWEEN 0 AND 100
CASE WHEN a = 3 AND b BETWEEN 1 AND 4
CASE WHEN a <> 3 AND b NOT BETWEEN 1 AND 4
```
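As a sketch of how such probe `CASE` statements could count the distinct users affected by each node, here is a toy demo using SQLite. The table, column names, and data are made up for illustration; the real probe query is more involved:

```python
import sqlite3

# Hypothetical toy data: uid plus two filter columns a and b.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (uid INTEGER, a INTEGER, b INTEGER)")
con.executemany("INSERT INTO t VALUES (?,?,?)",
                [(1, 1, 2), (2, 2, 3), (3, 50, 1), (4, 3, 2), (5, 200, 9)])

# One count per node: each CASE reverses the node's condition and yields
# the uid for matching rows (NULL otherwise, which COUNT DISTINCT ignores).
probe = """
SELECT
  count(DISTINCT CASE WHEN a = 2 THEN uid END),              -- negand a <> 2
  count(DISTINCT CASE WHEN a < 0 OR a >= 100 THEN uid END),  -- range a BETWEEN 0 AND 100
  count(DISTINCT CASE WHEN a = 3 AND b BETWEEN 1 AND 4 THEN uid END)
FROM t
"""
print(con.execute(probe).fetchone())  # per-node distinct-user counts: (1, 1, 1)
```

A node whose count falls below the (noisy) LE threshold would then be flagged for adjustment.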
> final-sd is the standard deviation of the noise layer (as normally computed)
>
> What does this mean? A noisy aggregator has a final SD, a noise layer adds to that SD. Are adjustments done to the SD for computing the aggregators or to the aggregators themselves? How is an aggregator like `sum(x)` adjusted?
No changes are made to the final-sd itself, or to the currently computed noise layers. The changes take place in addition to the current mechanism.
So let's give an example. Say that you have a bucket, and you've computed the final-sd for that bucket as normal, and the value is `final-sd = 1`. Say there are three normal noise layers (i.e. not associated with any LE nodes). You compute three noise values and sum them together. Let's say that the resulting noise is `n = 2.11`. Up to now this is normal operation and doesn't change.
Now suppose there is an LE posor node with 2e. Because it is a posor, you want to adjust down. Since `final-sd = 1`, you will adjust down by 2. Thus the resulting noise will be `n = 0.11`.
As another example, say that you have a bucket and the aggregator is `sum(x)`, and you've computed the final-sd for that bucket as `final-sd = 311.1`. Say there are two normal noise layers, and so based on normal operation you compute a noise of `n = -681.1`.
Now suppose there is an LE negand node with 1e. Since it is a negand, you want to adjust up. Since `final-sd = 311.1`, you will add 311.1 to the noise value, ending with a final noise of `n = -370`.
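The two worked examples above can be condensed into a small numeric sketch. The helper below is hypothetical (not the cloak's actual code); the effect sizes, final-sd values, and noise values are the ones from the text:

```python
def adjust_noise(noise, final_sd, le_nodes):
    """Shift the summed noise by one final-sd per effect unit of each LE node.

    le_nodes is a list of (effect, direction) pairs: direction -1 for a
    posor (adjust down), +1 for a negand (adjust up).
    """
    for effect, direction in le_nodes:
        noise += direction * effect * final_sd
    return noise

# First example: final-sd = 1, summed noise 2.11, one LE posor with 2e.
print(adjust_noise(2.11, 1.0, [(2, -1)]))      # ~0.11

# Second example: final-sd = 311.1, noise -681.1, one LE negand with 1e.
print(adjust_noise(-681.1, 311.1, [(1, +1)]))  # ~-370
```

Note that only the sampled noise value is shifted; the reported final-sd and the existing noise layers are untouched.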
> But shouldn't this be:
>
> `CASE WHEN a NOT BETWEEN 0 AND 100`

Yes, with the exception that, strictly speaking, in standard SQL `a NOT BETWEEN 0 AND 100` is equivalent to `a < 0 or a > 100`; but since the cloak treats `between` a little differently, we want to do `a < 0 or a >= 100` instead.

> `CASE WHEN a = 3 AND b BETWEEN 1 AND 4`
> `CASE WHEN a <> 3 AND b NOT BETWEEN 1 AND 4`
Yes, I think you are right.
> The noise associated with LE conditions is not factored into the noise reporting
>
> This will really hurt usability. The value of the noise estimation consists entirely in it being an upper bound on the real noise added. We could add less noise, but adding more will make it useless for analysts (they can't use it to make any assumptions about the result).
This is the result of a trade-off. Either we can set the LE noisy threshold to 2-3-4, or we can set it to 1-2. If we do the latter, then we cannot report the noise and so the analyst has less information to work with. If we do the former, then we'll end up with more nodes checking positive for LE, and therefore more adjustments and more noise overall.
I think it is better to have less noise but less knowledge of the noise than more noise.
One way we can mitigate this lack of reporting is to report when at least several conditions are LE (but don't say which ones obviously). This way at least the analyst can know that a substantial amount of LE took place. If the analyst really cares, he or she could check the conditions one by one using LCF...
By the way, the noise estimate is not an upper bound on the noise added. It is the SD of the noise added. There could be a lot more noise due to flattening, which we cannot report to the analyst. Of course it could also turn out that the random noise is unusually high (several standard deviations), though this is rare.
> Now suppose there is an LE posor node with 2e. Because it is a posor, you want to adjust down. Since final-sd = 1, you will adjust down by 2. Thus the resulting noise will be n=0.11.
So the mean of the noise is what is changed, correct? Instead of a mean of 0, we will have a mean of +/- 1-2 SD. I think putting it like this is clearer. And this will have to be done in all cases, including when dropping outliers, getting the top users, etc., right?
> Furthermore, when multiple negands are present, all possible combinations should be checked. Otherwise, we won't detect if both col4 <> 'j' and col4 <> 'k' are LE in the context of the posands. This will greatly complicate the probe query.
We can't possibly check all combinations, and I don't think it is necessary. I think it would be difficult for an analyst to come up with a query in practice where two or more negands are individually not LE but are together LE with 1e. Given the cost of defending against this, it is not worth it. Let's not worry about it.
> So the mean of the noise is what is changed, correct? Instead of a mean of 0, we will have a mean of +/- 1-2 SD. I think putting it like this is clearer.

Yes, it could also be seen this way. We shift the mean. But the shift is +/- 1-2 SD per LE node. There is also the possible shift for non-LE nodes (see R6).
> And this will have to be done in all cases, including when dropping outliers, getting the top users, etc., right?
I don't think we'd want to do this. We want to approximate what would happen if the LE node were dropped from the query. I don't know how dropping the LE node would affect these other aspects, but I don't think it would be like this. I guess this is the danger of thinking of the adjustment in terms of an adjustment on the mean. That is only the case for the final noise value. In this sense, it might be clearer to really think of it as a final adjustment...
> (col3 = 'a' and col4 <> 'j' and col4 <> 'k') OR (col3 = 'b' and col4 <> 'j' and col4 <> 'k')
>
> From what I see in the example, the probe checks each posor, but it then doesn't correctly check each negand in the context of the posor. The checks in the query are simply `col4 = 'j'`, when they should at least be `col3 = 'a' and col4 = 'j'` plus `col3 = 'b' and col4 = 'j'`.
Yes you are right. Each negand/range should be checked in the context of the rest of the posor. But not all combinations (as I say above).
I'll fix the example.
I fixed the example. Please have a look. The normalized conditions were:

`(col3 = 'a' and col4 <> 'j' and col4 <> 'k') OR (col3 = 'b' and col4 <> 'j' and col4 <> 'k')`

In the fix, I now have this:

```sql
CASE WHEN col4 = 'j' AND col4 <> 'k' THEN uid ....
CASE WHEN col4 = 'k' AND col4 <> 'j' THEN uid ....
```
The reason I don't include `col3 = 'a'` or `col3 = 'b'` is that those are already in the inner-most SELECT, so the conditions are already in effect.
But I'll admit that this is pretty confusing. On one hand, we are doing the normalization and using that for the CASE statement, but on the other hand, in the inner SELECT we are filtering on pre-normalized conditions.
So it is under-specified at this point what filter conditions go into the inner SELECT and what goes in the subsequent CASE....
I'll try to clarify this.
> I don't have a good feeling ...

This always makes me think of Star Wars...

> ... about normalization across subqueries. It can be done in some scenarios, but it will get really hairy when combining joins, grouping, having and where filters. I think we should leave it for the next version.

Maybe so.
> Overall, I feel that this change will be very complex and I am not optimistic that we will finish sooner than 3-4 months. And that's if we don't find other problems with it during implementation ... maybe it would be better to have one release fully allocated to this purpose?
No kidding :(
> One way we can mitigate this lack of reporting is to report when at least several conditions are LE
This would leak information as well. By trying different combinations, the analyst can determine which are LE and which are not, by looking at the warning.
> We can't possibly check all combinations, and I don't think it is necessary. I think it would be difficult for an analyst to come up with a query in practice where two or more negands are individually not LE but are together LE with 1e. Given the cost of defending against this, it is not worth it. Let's not worry about it.

I don't think it is that simple. For one, negands can be easily combined with ranges. The resulting combination could be LE, but it won't be detected if we check the range and the negand separately. Then the algorithm has to be resistant to chaff conditions.
How about this condition: `dept = 'cs' (c1) and gender <> 'm' (c2) and salary <> 1000 (c3)`?
To detect if `c2` is LE, or `c3` is LE, or `c2 and c3` is LE, we need to compute the counts for `c1 and c2 and c3`, `c1 and c2`, `c1 and c3`, and `c1`.
> But the shift is +- 1-2 SD per LE node.

So the final mean is +/- 1-2 final_SD * LE node count?

> In this sense, it might be clearer to really think of it as a final adjustment

This "final adjustment" has a vague definition, in my current understanding.
> CASE WHEN col4 = 'j' AND col4 <> 'k' THEN uid ....

The second condition is redundant in this case, but it might not be in other cases.
> One way we can mitigate this lack of reporting is to report when at least several conditions are LE
>
> This would leak information as well. By trying different combinations, the analyst can determine which are LE and which are not, by looking at the warning.

We should be able to prevent this by again using a noisy threshold of some sort.
> So the final mean is +/- 1-2 final_SD * LE node count?

The final mean would be `sum(all LE adjustments) * final_sd`.
So if we have four LE nodes, with adjustments 0, 1, -2, and -1 respectively, then the final mean would be `-2 * final_sd`.
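The four-node example works out as follows (a tiny sketch; the `final_sd` value is made up for illustration):

```python
final_sd = 311.1                 # hypothetical final-sd for the bucket
le_adjustments = [0, 1, -2, -1]  # per-LE-node adjustments from the example

# The mean of the final noise is shifted by the summed adjustments,
# scaled by final-sd.
mean_shift = sum(le_adjustments) * final_sd
print(mean_shift)                # equals -2 * final_sd
```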
> How about this condition: `dept = 'cs' (c1) and gender <> 'm' (c2) and salary <> 1000 (c3)`? To detect if `c2` is LE, or `c3` is LE, or `c2 and c3` is LE, we need to compute the counts for `c1 and c2 and c3`, `c1 and c2`, `c1 and c3`, and `c1`.

My point is that we don't need to detect if `c2 and c3` is LE.
Please remember that the goal is not to prevent LE conditions at all costs, the goal is to prevent reasonable attacks. We often get trapped into this way of thinking (we get completely focused on the mechanism and lose sight of the attacks we are defending against).
So if you want to argue for checking every combination of negands and ranges, you should come up with an attack that is realistic and exploits such a combination. Maybe there is one, but I don't see it. The attacker would have to come up with some `(not a and not b)` where `not a` is not LE, `not b` is not LE, `(not a and not b)` is LE with 1e (for a difference attack) or 0e (for a chaff attack), and where the 1e case happens to isolate the intended victim. I struggle to come up with an example...
> CASE WHEN col4 = 'j' AND col4 <> 'k' THEN uid ....
>
> The second condition is redundant in this case, but it might not be in other cases.

Why is it redundant? I don't see it.
> Why is it redundant? I don't see it.

`col4 = 'j'` implies `col4 <> 'k'`

> `col4 = 'j'` implies `col4 <> 'k'`

duh. right.
@cristianberneanu maybe you can confirm something for me.
I made the statement somewhere that the inner select should retain the posands and posors in the original query, but remove the negands and negors.
The goal being that the data that flows into the CASE statements should include everything that you are testing for, but no more (for efficiency's sake).
It seems to me this means that, conceptually at least, you simply replace any negands and negors with `True`.
Does this sound right to you?
> It seems to me this means that, conceptually at least, you simply replace any negands and negors with True.
You could do that, but it would be just as simple to drop the conditions entirely. I don't see the advantage.
> The goal being that the data that flows into the CASE statements should include everything that you are testing for, but no more (for efficiency's sake). The reason I don't include col3 = 'a' or col3 = 'b' is that those are already in the inner-most SELECT so the conditions are already in effect.

They have to be included in both locations: in the inner-most SELECT in order to reduce the amount of data processed, and in each CASE in order to compute the correct counts.
If you don't include the right posands for each posor in each corresponding CASE statement, invalid values will be aggregated. In the case of the given example, `CASE WHEN col4 = 'j' AND col4 <> 'k' THEN uid ...` will aggregate both posors at the same time, i.e. `(col3 = 'a' or col3 = 'b') and col4 = 'j' and col4 <> 'k'`, which is not useful.
> It seems to me this means that, conceptually at least, you simply replace any negands and negors with True.
>
> You could do that, but it would be just as simple to drop the conditions entirely. I don't see the advantage.
Yes of course, that's why "conceptually".
> If you don't include the right posands for each posor in each corresponding CASE statement, invalid values will be aggregated. In the case of the given example, `CASE WHEN col4 = 'j' AND col4 <> 'k' THEN uid ...` will aggregate both posors at the same time, i.e. `(col3 = 'a' or col3 = 'b') and col4 = 'j' and col4 <> 'k'`, which is not useful.

But I think this is exactly what we want (to aggregate both posors). This is because `col4 <> 'j' and col4 <> 'k'` appear in both posors of the normalized expression, and so we want to check them against `(col3 = 'a' or col3 = 'b')`. Because the thing we want to evaluate is how the removal of one of those negands would affect the output of the whole query.
> But I think this is exactly what we want (to aggregate both posors). This is because `col4 <> 'j' and col4 <> 'k'` appear in both posors of the normalized expression, and so we want to check them against `(col3 = 'a' or col3 = 'b')`. Because the thing we want to evaluate is how the removal of one of those negands would affect the output of the whole query.
Hmmmm, ok I'm not so sure, because in another query the analyst could remove just one of the negands...
> If you don't include the right posands for each posor in each corresponding CASE statement, invalid values will be aggregated. In the case of the given example, `CASE WHEN col4 = 'j' AND col4 <> 'k' THEN uid ...` will aggregate both posors at the same time, i.e. `(col3 = 'a' or col3 = 'b') and col4 = 'j' and col4 <> 'k'`, which is not useful.
@cristianberneanu
Ok, I agree with you. The negands and ranges in a given posor have to be treated independently from duplicated negands and ranges in other posors. Each is tested independently, and each can contribute its own noise and adjustment.
I'll update the first comment to reflect this.
@cristianberneanu I updated the bullet items and example 1. I'll do more updating later.
> So if you want to argue for checking every combination of negands and ranges, you should come up with an attack that is realistic and exploits such a combination. Maybe there is one but I don't see it. The attacker would have to come up with some (not a and not b) where not a is not LE, not b is not LE, (not a and not b) is LE with 1e (for a difference attack) or 0e (for a chaff attack), and where the 1e case happens to isolate the intended victim. I struggle to come up with an example...
Let's take the case of the single female in the CS department. The combined condition `dept <> 'CS' and gender <> 'f'` is LE, according to the definition for LE used throughout this topic, while each of the individual sub-conditions is not LE. The current algorithm won't detect this case, which means one can use such conditions to exclude a single user from the result set.
My intuition is that an attacker can then use filters such as
`dept <> 'CS' and gender <> 'f' and salary <> 1000`
`dept <> 'CS' and gender <> 'f' and salary <> 1001`
....
to find out additional information about the victim, such as the salary range, by exploiting the difference between 0e, 1e and 2+e conditions, but maybe I am missing something and this is not a feasible attack.
> Ok, I agree with you. The negands and ranges in a given posor have to be treated independently from duplicated negands and ranges in other posors. Each is tested independently, and each can contribute its own noise and adjustment.
This greatly complicates the cases where filtering is done before and after grouping, as the resulting `CASE` statements will get very complex. This is why I would like to see an example for a query containing `WHERE` and `HAVING` filters on columns different than the ones being grouped.
> Let's take the case of the single female in the CS department. The combined condition `dept <> 'CS' and gender <> 'f'` is LE, according to the definition for LE used throughout this topic, while each of the individual sub-conditions is not LE.

I don't see this as LE (combined or otherwise). This query excludes all CS dept members and all females. If you drop both conditions, then all of these are included, which is a lot of individuals and so not LE.
> This greatly complicates the cases where filtering is done before and after grouping, as the resulting CASE statements will get very complex. This is why I would like to see an example for a query containing WHERE and HAVING filters on columns different than the ones being grouped.
I'll work on this now.
> This query excludes all CS dept members and all females.

There is an `AND` here between the filters (not an `OR`). The combined condition excludes all female members of the CS department, which should be a single individual in this case.
It could be rewritten as `NOT (dept = 'CS' and gender = 'f')`.
Ok, let me add back what I removed/scratched.
`dept <> 'CS' and gender <> 'f'` is the same as `not (dept = 'CS' or gender = 'f')`. `NOT (dept = 'CS' and gender = 'f')` is an altogether different thing, and would be equal to `dept <> 'CS' or gender <> 'f'`.
Here is a photo of the two sets, the top one being `dept <> 'CS' and gender <> 'f'` and the bottom one being `not (dept = 'CS' and gender = 'f')`:
@sebastian no I think you are right.
Actually, the expression that excludes the lone woman would be:
`WHERE dept <> CS or gender <> F`
Here's the logic:

```
CS woman   --> false or false --> false --> exclude
CS man     --> false or true  --> true  --> include
math woman --> true  or false --> true  --> include
math man   --> true  or true  --> true  --> include
```

So this is two negors. Currently I think what to do with negors is underspecified. But if we normalize as:
`WHERE not (dept = CS and gender = F)`
then the entire `WHERE` clause would be removed from the probe query. In the `CASE` statement we would test for:
`CASE WHEN dept = CS and gender = F then .....`
This would be detected as LE, and we would adjust and add a noise layer accordingly.
So it looks like this works, but I don't think it was clear from the stated rules to do this...
You guys are right, there should be an `OR` there; I got confused while thinking about the inverted condition.
It is super confusing actually. I frequently have to build truth tables to convince myself of one thing or another. I think the problem is that SQL logic and natural language don't match. Saying "men and women" is the reverse of saying `gender = 'M' and gender = 'F'`. The former includes everyone, the latter includes no one.
@cristianberneanu I added a new example (number 6) which has negands split over two sub-queries.
I still need to double check it, but the basic idea should be right.
@cristianberneanu ok Example 6 looks correct...
> Ok, I agree with you. The negands and ranges in a given posor have to be treated independently from duplicated negands and ranges in other posors. Each is tested independently, and each can contribute its own noise and adjustment.
>
> This greatly complicates the cases where filtering is done before and after grouping, as the resulting CASE statements will get very complex. This is why I would like to see an example for a query containing WHERE and HAVING filters on columns different than the ones being grouped.
This doesn't mean, however, that you need to check each combination. If you have N negands, you one-by-one reverse each one to a posand while holding the others as negands. So you still end up with N CASE statements for the negands (more if you have posors...).
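The one-by-one reversal above can be sketched as a small generator. This is a hypothetical helper (not the cloak's code), shown on the `col3`/`col4` example from earlier in the thread:

```python
def probe_conditions(posands, negands):
    """For each negand, reverse it to a positive check while keeping the
    other negands in place: N negands yield N probe conditions, with no
    combinations checked."""
    probes = []
    for i, (col, val) in enumerate(negands):
        kept = [f"{c} <> '{v}'" for j, (c, v) in enumerate(negands) if j != i]
        probes.append(" AND ".join(posands + [f"{col} = '{val}'"] + kept))
    return probes

for p in probe_conditions(["col3 = 'a'"], [('col4', 'j'), ('col4', 'k')]):
    print(p)
# col3 = 'a' AND col4 = 'j' AND col4 <> 'k'
# col3 = 'a' AND col4 = 'k' AND col4 <> 'j'
```

The posand is included in each probe, matching the point above that the posands belong both in the inner SELECT and in each CASE.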
@cristianberneanu @sebastian for now I'm done here.
Cristian please look and let me know what is confusing or still seems wrong.
@cristianberneanu I added a new example (number 6) which has negands split over two sub-queries.
@yoid2000 This is not exactly what I had in mind. The reason I asked for this specific query:
```sql
select cli_district_id, sum(cnt)
from (
    select uid, cli_district_id, count(*) as cnt
    from transactions
    where acc_district_id in (1,2)
    group by 1,2
    having count(*) between 0 and 1000
) t
group by 1
```
was because there are a few edge cases in it that I don't understand how they should be handled.
More specifically, the `acc_district_id` column needs to be floated, so that the two posors can be checked separately. The act of floating a non-grouped-by column will distort any computed aggregates, including the `count(*)` one which needs to be checked against the range.
The resulting `CASE` statements will need to filter the data for the right posor, compute the aggregate, then check the range condition, something that I don't think will be easy to do or even possible in the general case (especially for more complex aggregates, like `stddev`).
> The act of floating a non-grouped-by column will distort any computed aggregates, including the count(*) one which needs to be checked against the range.
Ok, I see your point now. Good catch.
One question @cristianberneanu. As it now stands, do we make a noise layer from that having clause (i.e. from a clause that uses an aggregate like `count(*)` as the thing being evaluated)?
> One question @cristianberneanu. As it now stands, do we make a noise layer from that having clause (i.e. from a clause that uses an aggregate like count(*) as the thing being evaluated)?
I am not sure what happens now. I don't see any explicit mention of aggregates in the noise-layers code.
In what follows I use the term 'node' to refer to a single posor, negand, or range condition, where posor is `OR (c1 AND [NOT] c2 AND [NOT] c3)` for one or more conditions `cN`, negand is `AND NOT (c1 and [NOT] c2 and [NOT] c3)` for one or more conditions `cN`, and range is `AND col BETWEEN X and Y`. I'm not including posands when I say "node".

High-level design:

(The bullet list here was garbled in extraction. The surviving fragments refer to: and groups separated by `OR` and up to one `NOT OR`; each and group being a posor and an LE-checked node; `LIKE` wildcard symbols; replacing conditions with `True` in the expression for the inner SELECT; `IN` being replaced per the above two bullets; `count(*)` being used to compute final-sd in part; and `OR (col1=a AND col2=b AND col3=c)`.)
R1: First, since we are making the LE threshold quite low, the distortion due to LE nodes is relatively small, so it is not as important to tell the analyst as it otherwise would be. Second, the analyst can figure out on his own if a condition has a good chance of being LE by seeing if the condition is LCF. Third, telling the analyst which conditions are LE is complex, especially given normalization of expressions. Finally, since we are pushing the LE threshold so low, telling the analyst would be giving him more information than I'm comfortable with. The fact that the LE threshold is a little noisy helps some, but not as much as I'd like (the LCF threshold is already pushing the boundaries of my comfort zone).
R2: Note that we no longer have the notion of bucket group. That is because we are adjusting answers instead of dropping SQL conditions, so we can adjust on a per-bucket basis.
R3: I had been talking about a hard threshold for LE checking (threshold = 2), but that opened us up to attacks where the attacker knows that a given condition will affect either 1 or 2 users (1e/2e attack). One of the users is the victim, and the other is a "dummy" user (there only for the purpose of boosting the count to 1 or 2). The attacker has to know that the dummy user will be in the answer for sure, and is trying to determine if the victim is in the answer or not. If the victim is in the answer, then there are two users affected by the condition, and no adjustment is made. If the victim is not in the answer, then there is one user affected by the condition and the answer is adjusted to 0.
It is very rare to find the conditions where this can occur using one attribute. However, we allow `OR (a AND b AND c)`, where `a`, `b`, and `c` can be attributes from any columns. This allows an attacker to fine-tune a posor to select two specific users. I haven't looked into how likely this is in our datasets, but I would not want to assume that it is hard.
The fact that we still have uid noise layers helps here, but I don't want to depend on us always having uid noise layers. Anyway, by making the LE threshold at least a little noisy, we introduce some uncertainty into the attack.
R4: I want to minimize the amount of noise we generate with nodes, especially because there could be many of them. Towards this end, I'm proposing a composite noise layer that is seeded from multiple nodes. The danger with a composite noise layer is that chaff conditions can be used to generate an arbitrary number of different seeds, thus allowing the noise layer to be averaged away. This means that LE nodes cannot be part of the composite. As a result, we put nodes that are not LE in the single composite noise layer, and each LE node contributes a noise layer.
In principle, one would have hoped that by doing LED, one could eliminate the noise layers altogether for conditions that are dropped. This may be possible if we set the threshold for LED the same as we set it for LCF. However, this has two problems. First, substantially more conditions would be dropped than is the case with the current design, and the resulting amount of distortion is probably similar. Second, dropping introduces a systematic bias in the data, whereas noise does not (it has a mean of zero).
R5: We use one unit of final-sd to approximately remove the effect of the condition. If final-sd is 1, then this exactly removes the effect. If final-sd is more than 1, then we will tend to over-compensate, because the final-sd usually tries to cover the more extreme data points.
TODO: Experiment to determine if final-sd is good enough, or if we should scale it in some way.
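The R5 compensation rule can be stated as a one-line formula (a sketch; the function and variable names are mine, not the design's):

```python
def compensate_1e(raw_answer: float, final_sd: float) -> float:
    """Approximately remove a single user's contribution by subtracting
    one unit of the final noise layer's standard deviation (final-sd).

    If final_sd is 1, this removes the contribution exactly; if it is
    more than 1, it tends to over-compensate, because final-sd usually
    tries to cover the more extreme data points.
    """
    return raw_answer - final_sd

assert compensate_1e(10.0, 1.0) == 9.0  # exact removal when final-sd is 1
```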
R6: This is new and probably utterly confusing. It is in response to an equations-based averaging attack I recently thought of that is made possible by posors in particular, and to a lesser extent by negands. The attack is designed to remove the noise from data aggregates, especially counts of distinct users.
The attack generates random sets of values and requests the count. So for instance `WHERE age IN (5,10,18,22,28,...)`, then `WHERE age IN (2,3,13,17,20,31,...)`, and so on. Each of these queries can be formulated as an equation, and then the set of equations solved to produce exact counts.
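The attack can be simulated end-to-end in a few lines. This is a self-contained sketch under my own assumptions (per-value counts, subset masks, fresh Gaussian noise per query, as would happen if every `IN` list produced a different seed); it is not the system's actual noise model.

```python
import random

def solve3(A, b):
    """Solve a 3x3 linear system A x = b by Gaussian elimination
    with partial pivoting."""
    n = 3
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

rng = random.Random(0)
true_counts = [5.0, 7.0, 3.0]   # hidden per-value user counts

# Each "query" counts users over a random subset of values and gets
# fresh noise, as if every IN list produced a different seed.
masks, answers = [], []
for _ in range(500):
    mask = [rng.randint(0, 1) for _ in range(3)]
    noisy = sum(m * c for m, c in zip(mask, true_counts)) + rng.gauss(0, 1)
    masks.append(mask)
    answers.append(noisy)

# Least squares via the normal equations: A^T A x = A^T b.
AtA = [[sum(m[i] * m[j] for m in masks) for j in range(3)] for i in range(3)]
Atb = [sum(m[i] * a for m, a in zip(masks, answers)) for i in range(3)]
recovered = solve3(AtA, Atb)

# The noise is averaged away; the solved counts are nearly exact.
assert all(abs(r - t) < 1.0 for r, t in zip(recovered, true_counts))
```

With static per-query noise (one fixed offset per seed) the solved values would instead carry a consistent bias, which is why fresh seeds per query are what make this attack work.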
I believe that this attack would not work with per-node static noise because each value would have a consistent bias, and this bias would appear in the final answers. With composite noise layers, this would no longer be the case. Each query would result in a different seed. To defend against this without introducing individual noise layers, we consistently adjust answers for each value up, down, or not at all. This adjustment will persist in the set of equations, and the final answers should be slightly off.
Note that this would be just the first step in an attack that, for instance, then exploits external knowledge or tries to continue with an intersection attack.
The reason we don't adjust when the count-sd is relatively high is that I presume it is much harder to attack individuals when different individuals contribute different counts, and so the attack has less value. Thus we only do the adjustment when count-sd is relatively low.
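The per-value adjustment defense can be sketched like this. All names, the salt, and the adjustment magnitude of at most 1 are illustrative assumptions; the point is only that the nudge is a deterministic function of the value, not of the query.

```python
import hashlib

def value_adjustment(value: str, salt: str = "secret-salt") -> int:
    """Deterministically nudge the count for a value up, down, or not
    at all (-1, 0, or +1), keyed by the value itself.

    Because the nudge depends only on the value (not the query), it
    persists through any system of equations an attacker builds from
    IN-list queries, leaving the solved counts slightly off.
    """
    digest = hashlib.sha256((salt + str(value)).encode()).digest()
    return digest[0] % 3 - 1   # maps to -1, 0, or +1

# The same value gets the same adjustment in every query:
assert value_adjustment("age=28") == value_adjustment("age=28")
assert value_adjustment("age=28") in (-1, 0, 1)
```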
TODO: validate this attack and the defense
TODO: determine the appropriate number of non-LE nodes we require to start adjusting (currently conservatively set as three).
R7: This is a consequence of handling noise differently for LE and non-LE nodes. If we report the noise accurately, then an attacker could determine whether a condition was LE simply by observing the reported noise value. As such, we can under-report the amount of noise. Again, if accurate noise reporting is very important to the analyst, they should avoid LE conditions, or use LCF to better approximate which conditions may be contributing noise. Note that if we simply added a full noise layer for all nodes, then we could report noise accurately.
R8: The reason we need to use the floated values for the column, rather than the floated values affected by the nodes, is that with 0e nodes there are no floated values.
Design Examples
Example 1
By way of example, consider the following query:
Normalized nodes look like this:
The `col1 = ?` part comes from the selected `col1`. The value of course becomes known when the bucket is materialized by the DB.

The `col1 = ? and col2 = 'x'` part of both nodes is redundant. The `col2 = 'x'` part is redundant because the query filters for those conditions in the probe query, and so by definition the condition will always be `True`. So we can simplify as:

We would make a probe query like the following (there might be bugs here ... this just gives the basic idea). Note that this probe pretends that the seed material cannot be generated from SQL inspection.
The inner-most `SELECT` filters on posands and posors. This gives us everything we need to check LE and gather seed values. (Note that this might as well be a good opportunity to float col2 as well. Then you don't have to do it in the main query. As long as you are making a full scan here anyway, the extra work won't cost much.) Note that the `CASE` statements for col4 reverse the condition from negative to positive.

To check for LE (per bucket), we would first check `count(DISTINCT col3_a_uid)` and `count(DISTINCT col3_b_uid)`. These correspond to the two nodes `(col1 = ? and col2 = 'x' and col3 = 'a' and col4 <> 'j' and col4 <> 'k')` and `(col1 = ? and col2 = 'x' and col3 = 'b' and col4 <> 'j' and col4 <> 'k')`. If the counts are 0 or 1, then they are LE. If 3 or more, they are not LE. If 2, then we need to seed a noisy threshold. The seed consists of the set of seed elements of all five conditions in the node. In this particular example we could seed from just SQL inspection plus the value of `col1` in the bucket, but for this example let's suppose that is not the case. The values used for the seed material for the first posor come from the following (all the other symbols are as we do today):

- `col1 = ?`: from the bucket value
- `col2 = 'x'`: from `min(col2)` and `max(col2)`
- `col3 = 'a'`: from `min(col3)` and `max(col3)`
- `col4 <> 'j'`: if `count(DISTINCT col4_a_j_uid)` is 0, then from `min(col4)` and `max(col4)`; if it is not 0, then from `min(col4_a_j_col)` and `max(col4_a_j_col)`
- `col4 <> 'k'`: if `count(DISTINCT col4_a_k_uid)` is 0, then from `min(col4)` and `max(col4)`; if it is not 0, then from `min(col4_a_k_col)` and `max(col4_a_k_col)`

If either of these nodes is LE for a given bucket, then we make a noise layer from the same seed and add this noise layer to the final answer for that bucket. We also adjust the final answer down (because these are posors) by 1 or 2 if the node is 1e or 2e respectively. Note that if both nodes are LE for a given bucket, then most likely the bucket will fail LCF, since LCF is performed after adjustment.
If for a given bucket a posor node is not LE, then the two negands inside the posor are each checked for LE. Note that if the same condition is LE within different posors, then to avoid having duplicate seeds, the second and subsequent seeds have an additional symbol which is a counter and differs for each subsequent layer.
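The counter mechanism for duplicate LE conditions can be sketched as follows (the function name, the `#dup` marker, and the use of SHA-256 are illustrative assumptions; only the counter idea comes from the text):

```python
import hashlib

def node_seeds(le_conditions):
    """Build one seed per LE node, appending a counter symbol when the
    same condition appears (as LE) in more than one posor, so that no
    two noise layers share a seed."""
    seen = {}
    seeds = []
    for cond in le_conditions:
        n = seen.get(cond, 0)
        seen[cond] = n + 1
        # First occurrence seeds from the condition alone; repeats get
        # an extra counter symbol in their seed material.
        material = cond if n == 0 else f"{cond}#dup{n}"
        seeds.append(hashlib.sha256(material.encode()).hexdigest())
    return seeds

seeds = node_seeds(["col4 <> 'j'", "col4 <> 'j'"])
assert seeds[0] != seeds[1]   # duplicate condition, distinct seeds
```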
Looking at `col4 <> 'j'`, the check follows the same steps as for the posor above. Namely, if `count(DISTINCT col4_a_j_uid)` is 0 or 1, the negand node is LE; if 3 or more, it is not LE; and if 2, we make a seed from the node using `min(col4_a_j_col)` and `max(col4_a_j_col)` as the values, and use the resulting seed for a noisy threshold check. If the node is LE, then we'll generate a separate noise layer for this node in the final answer, and adjust the final answer by 1 or 2 if 1e or 2e respectively.
Finally, if either or both posors are not LE, then we'll make a composite noise layer. This noise layer is seeded from all conditions in each posor that is not LE, but without duplicate conditions.
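The composite seeding rule can be sketched as a small function (a sketch under my assumptions: conditions are plain strings, and LE conditions are excluded because they get their own layers; the expected outputs mirror the Example 1 cases described below):

```python
def composite_seed_material(posors, le_posors, le_conditions=()):
    """Collect the conditions of every non-LE posor, skipping
    conditions that are themselves LE and dropping duplicates, to form
    the composite noise layer's seed material."""
    seen = set()
    material = []
    for posor, is_le in zip(posors, le_posors):
        if is_le:
            continue   # LE posors get their own individual layer
        for cond in posor:
            if cond in le_conditions or cond in seen:
                continue
            seen.add(cond)
            material.append(cond)
    return material

posor_a = ["col1 = ?", "col2 = 'x'", "col3 = 'a'", "col4 <> 'j'", "col4 <> 'k'"]
posor_b = ["col1 = ?", "col2 = 'x'", "col3 = 'b'", "col4 <> 'j'", "col4 <> 'k'"]

# Posor A is LE, posor B is not, and the negand col4 <> 'j' is LE:
mat = composite_seed_material([posor_a, posor_b], [True, False], {"col4 <> 'j'"})
assert mat == ["col1 = ?", "col2 = 'x'", "col3 = 'b'", "col4 <> 'k'"]

# Both posors non-LE, negand col4 <> 'j' LE:
mat2 = composite_seed_material([posor_a, posor_b], [False, False], {"col4 <> 'j'"})
assert set(mat2) == {"col1 = ?", "col2 = 'x'", "col3 = 'a'", "col3 = 'b'", "col4 <> 'k'"}
```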
By way of example, suppose that the posor `(col1 = ? and col2 = 'x' and col3 = 'a' and col4 <> 'j' and col4 <> 'k')` is LE and the posor `(col1 = ? and col2 = 'x' and col3 = 'b' and col4 <> 'j' and col4 <> 'k')` is not LE. Further suppose that the negand `col4 <> 'j'` is LE. Then the composite noise layer would be seeded from `col1 = ?`, `col2 = 'x'`, `col3 = 'b'`, and `col4 <> 'k'`.

If on the other hand both posors are not LE, and the negand `col4 <> 'j'` is LE, then the composite noise layer would be seeded from `col1 = ?`, `col2 = 'x'`, `col3 = 'b'`, `col3 = 'a'`, and `col4 <> 'k'`.

Example 2
The expression would be normalized to:
Then we would make a probe query like follows (once again assuming that the seeds cannot be composed from column inspection):
The inner-most select has the normalized expression minus the negand.
The `CASE` statements filter for three things: the negand, and each of the two posors (both of which consist of multiple posands and negands).

For the purpose of seeding, the conditions in the two posors are:
Otherwise, LE checks are made as in the preceding example. That is, first checking each posor for LE for each bucket, and if the second posor is not LE, then checking the negand.
Example 3
Note that this query works on the `gda_banking` data source of attack.aircloak.com.

An example of a probe query is this (assuming that seeds can be made from SQL inspection):
In this probe query, the seed material for both conditions can be taken from SQL inspection, so no need for floating.
After normalization, the filter expression is this:
In this case, since we are selecting the same column in the `WHERE` clause, there will be duplicate conditions in the seed material, so one of them will be dropped for seeding.

Ranges (`BETWEEN`) must be individually LE checked just as negands are checked. In other words, if the parent posor is not LE, then the ranges inside the posor are LE checked. As with negands, the range is reversed, so `cnt >= 0 AND cnt < 1000` becomes `(cnt < 0 OR cnt >= 1000)`.

It so happens that in this query neither posor is LE, so we need to check the range to see if it is LE. For seeding, we use the same seed as we do for `BETWEEN` conditions today. In this case, the range is LE, so we need a separate noise layer for the range.

The composite noise layer in this case would be seeded from the two posors, but without the range. Therefore, the seeding material comes from the two conditions `cli_district_id = 1` and `cli_district_id = 2`.

Example 4
This is the same query as Example 3, but with different numbers. The probe query would be the same as well; it is repeated here for convenience:
After normalization, the conditions look like this:
In this case, the second node is LE (0e) because there are no `cli_district_id` values of 999. The first node has 222 distinct UIDs (it is not LE), and so the range must also be checked. The range is not LE: it has 441 distinct UIDs.

This means that the second node has its own separate noise layer, and all conditions of the first node contribute to the composite noise layer.
Example 5
The normalized expression for this query is:
The nodes of this expression are spread over two sub-queries. The probe is this:
The output of the above query is this:
This means that for bucket 55, the first posor is not LE but the others are. For bucket 74, the second and fourth posors are not LE.
Regarding bucket 55, there would be four noise layers. One is a composite built from the conditions in the first posor (`acct_district_id = 1` and `cli_district_id = 55`), and the other three are individual node noise layers built from the other three posors respectively.

Regarding bucket 74, there are three noise layers. One is a composite built from the conditions in the second and fourth nodes (`acct_district_id = 1`, `acct_district_id = 60`, and `cli_district_id = 74`). The other two are built from the first and third nodes respectively.

Example 6
Analyst query:
This example has negands across two sub-queries.
Rule 1: remove negands from the probe inner select, which means the `having` condition and the `operation` condition are gone from the inner select.

The normalized expression is:
Two posors, each with two negands. This leads to a total of six `CASE` statements (one per posor, plus one per negand per posor).