Closed yoid2000 closed 4 years ago
On a first look, it seems aggregates are ignored and the inner expression is treated as a regular filter, i.e. `count(*)` doesn't generate any additional noise layers, while `sum(price)` generates noise layers for `price`.
@cristianberneanu what do you think of this?
```sql
SELECT cli_district_id,
       count(DISTINCT ac3_uid) AS ac3,
       count(DISTINCT ac4_uid) AS ac4,
       count(DISTINCT ac1_uid) AS ac1,
       count(DISTINCT ac2_uid) AS ac2
FROM (
  select uid, cli_district_id,
         CASE WHEN acct_district_id = 1 and (cnt < 0 or cnt >= 1000)
              THEN uid ELSE NULL END AS ac3_uid,
         CASE WHEN acct_district_id = 2 and (cnt < 0 or cnt >= 1000)
              THEN uid ELSE NULL END AS ac4_uid,
         CASE WHEN (acct_district_id = 1 and cnt between 0 and 1000)
              THEN uid ELSE NULL END AS ac1_uid,
         CASE WHEN (acct_district_id = 2 and cnt between 0 and 1000)
              THEN uid ELSE NULL END AS ac2_uid
  from (
    select distinct t1.uid as uid, t1.cli_district_id as cli_district_id,
           t2.acct_district_id as acct_district_id, t1.cnt as cnt
    from (
      select uid, cli_district_id, count(*) as cnt
      from transactions
      where acct_district_id in (1,2)
      group by 1,2
    ) t1
    join (
      select uid, cli_district_id, acct_district_id
      from transactions
      where acct_district_id in (1,2)
    ) t2
    on t1.uid = t2.uid and t1.cli_district_id = t2.cli_district_id
  ) t
) t
GROUP BY 1
```
Probably there are nicer ways to write it, but I think it is a way of carrying up `acct_district_id` without messing up the `GROUP BY`. (I wonder if there is a way to do it without requiring that `DISTINCT` at the join point.)
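The conditional-count trick in the query above can be exercised on a toy table. This is a minimal sqlite sketch with invented data (not the real `transactions` table), showing how the `CASE` marker columns let a single outer `GROUP BY` count several conditions separately:

```python
import sqlite3

# Toy stand-in for the transactions table (hypothetical data).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (uid INT, cli_district_id INT, acct_district_id INT)")
rows = [(1, 7, 1), (1, 7, 1), (2, 7, 2), (3, 7, 1), (3, 7, 2)]
con.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# Same trick as above: a CASE expression turns a condition into a
# per-user marker (uid or NULL), so one outer GROUP BY can count each
# condition separately without the condition disturbing the grouping.
q = """
SELECT cli_district_id,
       count(DISTINCT CASE WHEN acct_district_id = 1 THEN uid END) AS ac1,
       count(DISTINCT CASE WHEN acct_district_id = 2 THEN uid END) AS ac2
FROM t
GROUP BY 1
"""
print(con.execute(q).fetchall())  # [(7, 2, 2)]
```

`count(DISTINCT ...)` ignores the `NULL`s the `CASE` produces, so each marker column counts exactly the users matching its condition.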
> @cristianberneanu what do you think of this?
I think it looks pretty ugly and complex. It is very hard to check the correctness. We should add some limitations, at least in the first version, in order to ease the implementation.
This would probably be easier to compute using window functions. But we don't support this feature yet and I am not sure if we can even support it everywhere.
Addendum: It would be simpler to check each posor independently and then combine the results using JOINs or issuing multiple queries.
> I think it looks pretty ugly and complex. It is very hard to check the correctness. We should add some limitations, at least in the first version, in order to ease the implementation.
This I agree with. I just wanted to have some way of doing it on paper (just to see if it is even possible). This join thing does look to me like a basic way of floating a value that is hidden behind a `GROUP BY`. But having a limitation that you can't have posors or negands behind a `GROUP BY` is probably not a particularly painful limitation.
> Addendum: It would be simpler to check each posor independently and then combine the results using JOINs or issuing multiple queries.
Simpler maybe, though multiple queries of course gets expensive...
@cristianberneanu what other edge cases are you worried about?
Now that this is a good basic design, might be good to decide what functionality we really need at this point (versus stuff we can add later) and come up with an MVP that anyway satisfies Telefonica.
> This join thing does look to me like a basic way of floating a value that is hidden behind a group by.
>
> Simpler maybe, though multiple queries of course gets expensive...
The same JOIN trick could be used to combine the results into a single query. That would potentially be easier to understand and implement. The performance of the method would depend on the query context: if indexes could be used, it might be faster, if a full scan is needed, it could be slower.
But using window functions would be the best fit and would have the best performance. The drawback being that they probably aren't supported on MongoDB.
> @cristianberneanu what other edge cases are you worried about?
Beside the last one, there aren't others on my mind right now.
@sebastian @cristianberneanu
Ok. I think it would be good if we have a separate issue that lays down the requirements/features of a first implementation. Then we can interpret the design given here in light of those requirements...
I'll start that issue in a bit.
@cristianberneanu @sebastian please see #3926
> On a first look, it seems aggregates are ignored and the inner expression is treated as a regular filter, i.e. `count(*)` doesn't generate any additional noise layers, while `sum(price)` generates noise layers for `price`.
Do we need to change something here?
> Do we need to change something here?
No, I don't think so. I vaguely remember deciding that it would be pretty hard to generate an attack using `count(*)` as the isolating condition. If I remember right, there were some tricky aspects to floating that or deciding what the seed should be, so we just let it go.
@cristianberneanu @sebastian
By the way, it occurs to me that we could stick `HAVING count(distinct uid) > 1` at the bottom of the probe query and in some cases avoid getting a lot of buckets transmitted that would otherwise be LCF filtered in the main query.
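The effect can be sketched on a toy sqlite table (schema and the threshold of 1 are invented for illustration; the real LCF threshold is noisy):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (uid INT, bucket TEXT)")
# Toy data: bucket 'a' has 3 distinct users, bucket 'b' only 1.
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(1, "a"), (2, "a"), (3, "a"), (9, "b")])

# Without HAVING, the probe transmits every bucket, including ones the
# main query would drop under LCF anyway.
all_buckets = con.execute(
    "SELECT bucket, count(DISTINCT uid) FROM t GROUP BY 1 ORDER BY 1"
).fetchall()

# With HAVING count(DISTINCT uid) > 1, single-user buckets are
# suppressed already at the probe, saving their transmission.
filtered = con.execute(
    "SELECT bucket, count(DISTINCT uid) FROM t GROUP BY 1 "
    "HAVING count(DISTINCT uid) > 1 ORDER BY 1"
).fetchall()

print(all_buckets)  # [('a', 3), ('b', 1)]
print(filtered)     # [('a', 3)]
```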
overtaken by #3994
In what follows I use the term 'node' to refer to a single posor, negand, or range condition, where posor is `OR (c1 AND [NOT] c2 AND [NOT] c3)` for one or more conditions `cN`, negand is `AND NOT (c1 and [NOT] c2 and [NOT] c3)` for one or more conditions `cN`, and range is `AND col BETWEEN X and Y`. I'm not including posands when I say "node".

High-level design:

- The expression is normalized into `AND` groups separated by `OR` and `NOT OR`; each `AND` group is a posor, and each posor is an LE-checked node.
- […] `LIKE` wildcard symbols […]
- […] (`True`) from the expression for the inner SELECT […]
- […] (`IN`) are replaced by the above two bullets.
- […] `count(*)` used to compute final-sd in part […]
- […] `count(*)` used to compute final-sd in part (… `OR (col1=a AND col2=b AND col3=c)`)
R1: First, since we are making the LE threshold quite low, the distortion due to LE nodes is relatively small, so not as important to tell the analyst as otherwise would be. Second, the analyst can figure out on his own if a condition has a good chance of being LE by seeing if the condition is LCF. Third, telling the analyst which conditions are LE is complex, especially given normalization of expressions. Finally, since we are pushing the LE threshold so low, telling the analyst would be giving him more information than I'm comfortable with. The fact that the LE threshold is a little noisy helps some, but not as much as I'd like (the LCF threshold is already pushing the boundaries of my comfort zone).
R2: Note that we no longer have the notion of bucket group. That is because we are adjusting answers instead of dropping SQL conditions, so we can adjust on a per-bucket basis.
R3: I had been talking about a hard threshold for LE checking (threshold = 2), but that opened us up to attacks where the attacker knows that a given condition will affect either 1 or 2 users (1e/2e attack). One of the users is the victim, and the other is a "dummy" user (there only for the purpose of boosting the count to 1 or 2). The attacker has to know that the dummy user will be in the answer for sure, and is trying to determine if the victim is in the answer or not. If the victim is in the answer, then there are two users affected by the condition, and no adjustment is made. If the victim is not in the answer, then there is one user affected by the condition and the answer is adjusted to 0.
It is very rare to find the conditions where this can occur using one attribute. However, we allow `OR (a AND b AND c)`, where `a`, `b`, and `c` can be attributes from any columns. This allows an attacker to fine-tune a posor to select two specific users. I haven't looked into how likely this is in our datasets, but I would not want to assume that it is hard.

The fact that we still have uid noise layers helps here, but I don't want to depend on us always having uid noise layers. Anyway, by making the LE threshold at least a little noisy, we introduce some uncertainty into the attack.
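The difference between the hard and the noisy threshold can be sketched directly. The offset distribution below is purely illustrative (the document doesn't specify one); only the structure of the argument matters:

```python
import hashlib
import random

# Illustrative sketch (not the real distribution): an LE threshold of
# base 2 plus a small, seeded, per-condition noise offset.
def noisy_le_threshold(seed_material: str, base: int = 2) -> int:
    h = hashlib.sha256(seed_material.encode()).digest()
    return base + random.Random(h).choice([0, 1, 1, 2])

def adjusted(count_affected: int, threshold: int) -> bool:
    # A node is treated as low-effect (and the answer adjusted) when
    # fewer users than the threshold are affected by it.
    return count_affected < threshold

# 1e/2e attack with a HARD threshold of 2: the dummy user alone gives
# count 1 (adjusted), dummy plus victim gives count 2 (not adjusted),
# so the adjustment bit reveals the victim's presence exactly.
assert adjusted(1, 2) and not adjusted(2, 2)

# With the noisy threshold, a count of 2 is adjusted under some seeds
# and not under others, so observing an adjustment no longer pins
# down whether the victim was present.
thresholds = {noisy_le_threshold(f"condition-{i}") for i in range(50)}
print(sorted(thresholds))
```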
R4: I want to minimize the amount of noise we generate with nodes, especially because there could be many of them. Towards this end, I'm proposing a composite noise layer that is seeded from multiple nodes. The danger with a composite noise layer is that chaff conditions can be used to generate an arbitrary number of different seeds, thus allowing the noise layer to be averaged away. This means that LE nodes cannot be part of the composite. As a result, we put nodes that are not LE in the single composite noise layer, and each LE node contributes a noise layer.
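A minimal sketch of this seeding discipline, with invented seed-element strings (the real seed format is not specified here):

```python
import hashlib

def layer_seed(parts) -> bytes:
    # De-duplicate and sort so the seed is order-independent and a
    # condition repeated across posors contributes only once.
    material = "|".join(sorted(set(parts)))
    return hashlib.sha256(material.encode()).digest()

def composite_seed(nodes):
    # Only nodes that are NOT low-effect contribute to the composite.
    # A chaff condition matches nobody, hence is LE, hence cannot be
    # used to mint fresh composite seeds and average the noise away.
    return layer_seed(e for (e, is_le) in nodes if not is_le)

nodes = [("col1=?;v=7", False), ("col2='x'", False), ("col4<>'j'", True)]
base = composite_seed(nodes)

# Adding an LE chaff node leaves the composite seed unchanged ...
assert composite_seed(nodes + [("chaff=1", True)]) == base
# ... while each LE node instead gets its own per-node layer seed.
assert layer_seed(["col4<>'j'"]) != base
```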
In principle, one would have hoped that by doing LED, one could eliminate the noise layers altogether for conditions that are dropped. This may be possible if we set the threshold for LED the same as we set it for LCF. However, this has two problems. First, substantially more conditions would be dropped than is the case with the current design, and the resulting amount of distortion is probably similar. Second, dropping introduces a systematic bias in the data, whereas noise does not (it has a mean of zero).
R5: We use one unit of final-sd to approximately remove the effect of the condition. If final-sd is 1, then this exactly removes the effect. If final-sd is more than 1, then we will tend to over-compensate, because the final-sd usually tries to cover the more extreme data points.
TODO: Experiment to determine if final-sd is good enough, or if we should scale it in some way.
R6: This is new and probably utterly confusing. This is in response to an equations-based averaging attack I recently thought of that is made possible by posors in particular, but to a lesser extent by negands. The attack is designed to remove the noise from data aggregates, especially counts of distinct users.
The attack generates random sets of values, and requests the count. So for instance: `WHERE age IN (5,10,18,22,28,...)`, then `WHERE age IN (2,3,13,17,20,31,...)`, and so on. Each of these queries can be formulated as an equation, and then the set of equations solved to produce exact counts.
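The averaging can be sketched numerically: when each query carries fresh, independent noise, ordinary least squares over enough random `IN`-set queries recovers the underlying counts almost exactly. This toy model uses invented counts and Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
true_counts = rng.integers(0, 50, size=20)   # per-age exact counts (toy data)

# Each query selects a random subset of ages (WHERE age IN (...)) and
# returns the noisy total; the rows of A are the subset indicators.
n_queries = 200
A = rng.integers(0, 2, size=(n_queries, 20))
noise = rng.normal(0, 2, size=n_queries)     # fresh noise per query
observed = A @ true_counts + noise

# Solving the overdetermined system averages the independent noise
# away, leaving errors far smaller than the per-query noise scale.
recovered, *_ = np.linalg.lstsq(A, observed, rcond=None)
print(float(np.abs(recovered - true_counts).max()))
```

A consistent per-value bias, by contrast, would sit inside `A @ true_counts` itself and survive the solve, which is the intuition behind the defense described next.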
I believe that this attack would not work with per-node static noise because each value would have a consistent bias, and this bias would appear in the final answers. With composite noise layers, this would no longer be the case. Each query would result in a different seed. To defend against this without introducing individual noise layers, we consistently adjust answers for each value up, down, or not at all. This adjustment will persist in the set of equations, and the final answers should be slightly off.
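A minimal sketch of such a consistent per-value adjustment, using a keyed hash (the {-1, 0, +1} mapping and the salt are illustrative assumptions, not the proposed mechanism's exact form):

```python
import hashlib

def value_adjustment(value, secret: bytes = b"salt") -> int:
    # Consistently nudge each value's contribution up, down, or not at
    # all (-1, 0, +1), derived from a keyed hash of the value itself.
    h = hashlib.sha256(secret + repr(value).encode()).digest()
    return h[0] % 3 - 1

# The adjustment for a value is identical in every query that uses it,
# so it behaves like a constant in the attacker's system of equations
# and survives into the solved "exact" answers as a small bias.
assert value_adjustment(18) == value_adjustment(18)
ages = [5, 10, 18, 22, 28]
print([value_adjustment(a) for a in ages])
```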
Note that this would be just the first step in an attack that, for instance, then exploits external knowledge or tries to continue with an intersection attack.
The reason we don't adjust when the count-sd is relatively high is that I presume that it is much harder to attack individuals when different individuals contribute different counts, and so the attack has less value. Thus we only do it when count-sd is relatively low.
TODO: validate this attack and the defense
TODO: determine the appropriate number of non-LE nodes we require to start adjusting (currently conservatively set as three).
R7: This is a consequence of handling noise differently for LE and non-LE nodes. If we report the noise accurately, then an attacker could determine if a condition was LE simply by observing the reported noise value. As such, we can under-report the amount of noise. Again if this is very important to the analyst, they should avoid LE conditions, or use LCF to better approximate which conditions may be contributing noise. Note that if we simply added a full noise layer for all nodes, then we could report noise accurately.
R8: The reason we need to use the floated values for the column rather than the floated values that are affected by the nodes is that with 0e nodes, there are no floated values affected by the node.
Design Examples
Example 1
By way of example, consider the following query:
Normalized nodes look like this:
The `col1 = ?` part comes from the selected `col1`. The value of course becomes known when the bucket is materialized by the DB.

The `col1 = ? and col2 = 'x'` part of both nodes is redundant. The `col2 = 'x'` part is redundant because the query filters for those conditions in the probe query, and so by definition the condition will always be `True`. So we can simplify as:

We would make a probe query like follows (might be bugs here... this just gives the basic idea). Note that this probe pretends that the seed material cannot be generated from SQL inspection.
The inner-most `SELECT` filters on posands and posors. This gives us everything we need to check LE and gather seed values. (Note that this might as well be a good opportunity to float `col2` as well. Then you don't have to do it in the main query. As long as you are making a full scan here anyway, the extra work won't cost much.)

Note that the `CASE` statements for `col4` reverse the condition from negative to positive.

To check for LE (per bucket), we would first check `count(DISTINCT col3_a_uid)` and `count(DISTINCT col3_b_uid)`. These correspond to the two nodes `(col1 = ? and col2 = 'x' and col3 = 'a' and col4 <> 'j' and col4 <> 'k')` and `(col1 = ? and col2 = 'x' and col3 = 'b' and col4 <> 'j' and col4 <> 'k')`. If the counts are 0 or 1, then they are LE. If 3 or more, they are not LE. If 2, then we need to seed a noisy threshold. The seed consists of the set of seed elements of all five conditions in the node. In this particular example we could seed from just SQL inspection plus the value of `col1` in the bucket, but for this example let's suppose that is not the case. The values used for the seed material for the first posor come from the following (all the other symbols are as we do today):

- `col1 = ?`: from the bucket value
- `col2 = 'x'`: from `min(col2)` and `max(col2)`
- `col3 = 'a'`: from `min(col3)` and `max(col3)`
- `col4 <> 'j'`: if `count(DISTINCT col4_a_j_uid)` is 0, then from `min(col4)` and `max(col4)`; if `count(DISTINCT col4_a_j_uid)` is not 0, then from `min(col4_a_j_col)` and `max(col4_a_j_col)`.
- `col4 <> 'k'`: if `count(DISTINCT col4_a_k_uid)` is 0, then from `min(col4)` and `max(col4)`; if `count(DISTINCT col4_a_k_uid)` is not 0, then from `min(col4_a_k_col)` and `max(col4_a_k_col)`.

If either of these nodes is LE for a given bucket, then we make a noise layer from the same seed and add this noise layer to the final answer for that bucket. We also adjust the final answer down (because these are posors) by 1 or 2 if the node is 1e or 2e respectively. Note that if both nodes are LE for a given bucket, then most likely the bucket will fail LCF since LCF is performed after adjustment.
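The per-bucket seed construction above might be sketched like this, with invented floated values and an assumed seed-string format (the real seed material and encoding are not specified in this thread):

```python
import hashlib

# Hypothetical floated aggregates for one bucket; names mirror the
# probe query's columns, values are invented.
floated = {"min(col2)": "x", "max(col2)": "x",
           "min(col3)": "a", "max(col3)": "a",
           "min(col4_a_j_col)": "j", "max(col4_a_j_col)": "j"}

def condition_seed_part(cond: str, sources, bucket: dict) -> str:
    # A condition contributes its text plus the floated values that
    # stand in for its constant (per the list above).
    vals = ",".join(str(bucket[s]) for s in sources)
    return f"{cond}[{vals}]"

parts = [
    condition_seed_part("col2 = 'x'", ["min(col2)", "max(col2)"], floated),
    condition_seed_part("col3 = 'a'", ["min(col3)", "max(col3)"], floated),
    condition_seed_part("col4 <> 'j'",
                        ["min(col4_a_j_col)", "max(col4_a_j_col)"], floated),
]
# Sorting makes the node seed independent of condition order.
node_seed = hashlib.sha256("|".join(sorted(parts)).encode()).hexdigest()
print(node_seed[:16])
```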
If for a given bucket a posor node is not LE, then the two negands inside the posor are each checked for LE. Note that if the same condition is LE within different posors, then to avoid having duplicate seeds, the second and subsequent seeds have an additional symbol which is a counter and differs for each subsequent layer.
Looking at `col4 <> 'j'`, the check follows the same steps as for the posor above. Namely, if `count(DISTINCT col4_a_j_uid)` is 0 or 1, the negand node is LE, if 3 or more it is not LE, and if 2 we make a seed from the node using `min(col4_a_j_col)` and `max(col4_a_j_col)` as the values, and use the resulting seed for a noisy threshold check. If the node is LE, then we'll generate a separate noise layer for this node in the final answer, and adjust the final answer by 1 or 2 if 1e or 2e respectively.
Finally, if either or both posors are not LE, then we'll make a composite noise layer. This noise layer is seeded from all conditions in each posor that is not LE, but without duplicate conditions.
By way of example, suppose that the posor `(col1 = ? and col2 = 'x' and col3 = 'a' and col4 <> 'j' and col4 <> 'k')` is LE and the posor `(col1 = ? and col2 = 'x' and col3 = 'b' and col4 <> 'j' and col4 <> 'k')` is not LE. Further suppose that the negand `col4 <> 'j'` is LE. Then the composite noise layer would be seeded from `col1 = ?`, `col2 = 'x'`, `col3 = 'b'`, and `col4 <> 'k'`.

If on the other hand both posors are not LE, and the negand `col4 <> 'j'` is LE, then the composite noise layer would be seeded from `col1 = ?`, `col2 = 'x'`, `col3 = 'b'`, `col3 = 'a'`, and `col4 <> 'k'`.

Example 2
The expression would be normalized to:
Then we would make a probe query like follows (once again assuming that the seeds cannot be composed from column inspection):
The inner-most select has the normalized expression minus the negand.
The `CASE` statements filter for three things: the negand, and each of the two posors (both of which consist of multiple posands and negands).

For the purpose of seeding, the conditions in the two posors are:
Otherwise, LE checks are made as in the preceding example. That is, first checking each posor for LE for each bucket, and if the second posor is not LE, then checking the negand.
Example 3
Note that this query works on the `gda_banking` data source of attack.aircloak.com.

An example of a probe query is this (assuming that seeds can be made from SQL inspection):
In this probe query, the seed material for both conditions can be taken from SQL inspection, so no need for floating.
After normalization, the filter expression is this:
In this case, since we are selecting the same column in the `WHERE` clause, there will be duplicate conditions in the seed material, so one of them will be dropped for seeding.

Ranges (`BETWEEN`) must be individually LE checked just as negands are checked. In other words, if the parent posor is not LE, then the ranges inside the posor are LE checked. As with negands, the range is reversed, so `cnt >= 0 AND cnt < 1000` becomes `(cnt < 0 OR cnt >= 1000)`.

It so happens that in this query neither posor is LE, so we need to check the range to see if it is LE. For seeding, we use the same seed as we do for `BETWEEN` conditions today. In this case, the range is LE, so we need a separate noise layer for the range.

The composite noise layer in this case would be seeded from the two posors, but without the range. Therefore, the seeding material comes from the two conditions `cli_district_id = 1` and `cli_district_id = 2`.
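The range reversal can be illustrated with a tiny hypothetical helper (a string-level sketch only; a real implementation would presumably operate on parsed condition nodes):

```python
# Sketch of reversing a range node for LE checking: the probe must
# count the users the range EXCLUDES, so the BETWEEN is negated.
def reverse_range(col: str, lo, hi) -> str:
    # cnt >= 0 AND cnt < 1000  ->  (cnt < 0 OR cnt >= 1000)
    return f"({col} < {lo} OR {col} >= {hi})"

print(reverse_range("cnt", 0, 1000))  # (cnt < 0 OR cnt >= 1000)
```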
Example 4
This is the same query as Example 3, but with different numbers. The probe query would be the same as well, but here it is just for convenience:
After normalization, the conditions look like this:
In this case, the second node is LE (0e) because there are no `cli_district_id` with value 999. The first node has 222 distinct UIDs (it is not LE), and so the range must also be checked. The range is not LE: it has 441 distinct UIDs.

This means that the second node has its own separate noise layer, and all conditions of the first node contribute to the composite noise layer.
Example 5
The normalized expression for this query is:
The nodes of this expression are spread over two sub-queries. The probe is this:
The output of the above query is this:
This means that for bucket 55, the first posor is not LE but the others are. For bucket 74, the second and fourth posors are not LE.
Regarding bucket 55, there would be four noise layers. One is a composite built from the conditions in the first posor (`acct_district_id = 1` and `cli_district_id = 55`), and the other three are individual node noise layers built from the other three posors respectively.

Regarding bucket 74, there are three noise layers. One is a composite built from the conditions in the second and fourth nodes (`acct_district_id = 1`, `acct_district_id = 60` and `cli_district_id = 74`). The other two are built from the first and third nodes respectively.

Example 6
Analyst query:
This example has negands across two sub-queries.
Rule 1: remove negands from the probe inner select, which means the `having` condition and `operation` condition are gone from the inner select.

Normalized expression is:
Two posors, and each posor has two negands. This leads to a total of 6 `CASE` statements (one per posor, and one per negand per posor).