yoid2000 commented 4 years ago

In what follows I use the term 'node' to refer to a single posor, negand, or range condition, where posor is OR (c1 AND [NOT] c2 AND [NOT] c3) for one or more conditions cN, negand is AND NOT (c1 and [NOT] c2 and [NOT] c3) for one or more conditions cN, and range is AND col BETWEEN X and Y. I'm not including posands when I say "node".

High-level design:

What conditions are LE checked? (LE checked conditions are called "nodes")
- Expressions are normalized to and groups separated by OR and NOT OR
- Each and group is a posor, and each posor is an LE-checked node
- Each such posor can have posands, negands, and ranges internally
- Negands are ranges are LE-checked if the parent posor is not LE
- Note: if the same negand or range is in different posors, each is treated individually from its replicas
- In future, probably we check LIKE wildcard symbols
What conditions go into the inner SELECT?
- Negands and ranges are removed (i.e. set to True) from the expression for the inner SELECT
What nodes are considered LE?
- Nodes are considered LE or not on a per-bucket basis (which means that we need to select all buckets in the probe query). (R2)
- Nodes that affect zero or one user are LE
- Nodes that affect two users are LE with 1/3 probability (based on seeded noisy threshold) (R3)
What noise layers do nodes have? (R4)
- LE nodes (nodes that are deemed LE by above rules) have individual noise layers
- Non-LE nodes are part of a composite noise layer
- There are no other layers associated with nodes (negands, posors, negors, ranges). In other words, the current layers associated with negands and posors (IN) are replaced by the above two bullets.
- All posands have their own noise layers just as today
What is final-sd, count-sd, and count-avg?
- final-sd is the standard deviation of the noise layer (as normally computed)
- count-sd is the interim measure (for stats based) standard deviation of count(*) used to compute final-sd in part
- count-avg is the interim measure (for stats based) average of count(*) used to compute final-sd in part
What adjustments are made?
- LE nodes: (R5)
- Adjust down for posors, up for negands
- Adjust one unit of final-sd for 1e nodes (nodes that affect one user for the given bucket)
- Adjust two units of final-sd for 2e nodes (nodes that affect two users for the given bucket)
- Non-LE nodes: (R6)
- No adjustment if count-sd > count-avg
- Otherwise, adjust up with 1/3 probability, down with 1/3 probability, and none with 1/3 probability
  - Probability based on seeded noisy threshold
- Adjust one unit of count-avg
LCF checks are made after adjustment
How are the seeds for nodes computed?
- Seeds have the same elements as normal seeds (table name, column name, salt, symbols, values, ...)
- Seeds are static (do not contain uid)
- Seeds are per-node, which means that a seed can have multiple columns and values (e.g. OR (col1=a AND col2=b AND col3=c))
- Seed values are based on SQL inspection where possible
- Otherwise:
- If 0e, use the floated min and max value of the column in the probe query (R8)
- Else use the floated min and max value of rows that are affected by the node
- If a duplicate node (i.e. same negand or range in multiple posors) has duplicate layers, then a counter is used in subsequent layers to generate a different seed.
What the analyst is told:
- The analyst is not informed when a condition is LE (R1)
- The noise associated with LE conditions is not factored into the noise reporting (R7)

R1: First, since we are making the LE threshold quite low, the distortion due to LE nodes is relatively small, so not as important to tell the analyst as otherwise would be. Second, the analyst can figure out on his own if a condition has a good chance of being LE by seeing if the condition is LCF. Third, telling the analyst which conditions are LE is complex, especially given normalization of expressions. Finally, since we are pushing the LE threshold so low, telling the analyst would be giving him more information than I'm comfortable with. The fact that the LE threshold is a little noisy helps some, but not as much as I'd like (the LCF threshold is already pushing the boundaries of my comfort zone).

R2: Note that we no longer have the notion of bucket group. That is because we are adjusting answers instead of dropping SQL conditions, so we can adjust on a per-bucket basis.

R3: I had been talking about a hard threshold for LE checking (threshold = 2), but that opened us up to attacks where the attacker knows that a given condition will affect either 1 or 2 users (1e/2e attack). One of the users is the victim, and the other is a "dummy" user (there only for the purpose of boosting the count to 1 or 2). The attacker has to know that the dummy user will be in the answer for sure, and is trying to determine if the victim is in the answer or not. If the victim is in the answer, then there are two users affected by the condition, and no adjustment is made. If the victim is not in the answer, then there is one user affected by the condition and the answer is adjusted to 0.

It is very rare to find the conditions where this can occur using one attribute. However, we allow OR (a AND b AND c), where a, b, and c can be attributes from any columns. This allows an attacker to fine-tune a posor to select two specific users. I haven't looked into how likely this is in our datasets, but I would not want to assume that it is hard.

The fact that we still have uid noise layers helps here, but I don't want to depend on us always having uid noise layers. Anyway, by making the LE threshold at least a little noisy, we introduce some uncertainty into the attack.

R4: I want to minimize the amount of noise we generate with nodes, especially because there could be many of them. Towards this end, I'm proposing a composite noise layer that is seeded from multiple nodes. The danger with a composite noise layer is that chaff conditions can be used to generate an arbitrary number of different seeds, thus allowing the noise layer to be averaged away. This means that LE nodes cannot be part of the composite. As a result, we put nodes that are not LE in the single composite noise layer, and each LE node contributes a noise layer.

In principle, one would have hoped that by doing LED, one could eliminate the noise layers altogether for conditions that are dropped. This may be possible if we set the threshold for LED the same as we set it for LCF. However, this has two problems. First, substantially more conditions would be dropped than is the case with the current design, and the resulting amount of distortion is probably similar. However, dropping introduces a systematic bias in the data, whereas noise does not (has a mean of zero).

R5: We use one unit of final-sd to approximately remove the effect of the condition. If final-sd is 1, then this exactly removes the effect. If final-sd is more than 1, then we will tend to over-compensate, because the final-sd usually tries to cover the more extreme data points.

TODO: Experiment to determine if final-sd is good enough, or if we should scale it in some way.

R6: This is new and probably utterly confusing. This is in response to an equations-based averaging attack I recently thought of that is made possible by posors in particular, but to a lesser extend by negands. The attack is designed to remove the noise from data aggregates, especially counts of distinct users.

The attack generates random sets of values, and requests the count. So for instance:

SELECT count(DISTINCT uid)
FROM table
WHERE age IN (3,11,18,20,28, ....)

then WHERE age IN (5,10,18,22,28,...), WHERE age IN (2,3,13,17,20,31,...) and so on.

Each of these queries can be formulated as an equation, and then the set of equations solved to produce exact counts.

I believe that this attack would not work with per-node static noise because each value would have a consistent bias, and this bias would appear in the final answers. With composite noise layers, this would no longer be the case. Each query would result in a different seed. To defend against this without introducing individual noise layers, we consistently adjust answers for each value up, down, or not at all. This adjustment will persist in the set of equations, and the final answers should be slightly off.

Note that this would be just the first step in an attack that, for instance, then exploits external knowledge or tries to continue with an intersection attack.

The reason we don't adjust when the count-sd is relatively high is that I presume that it is much harder to attack individuals when different individuals contribute different counts, and so the attack has less value. Thus we only do it when count-sd is relatively low.

TODO: validate this attack and the defense

TODO: determine the appropriate number of non-LE nodes we require to start adjusting (currently conservatively set as three).

R7: This is a consequence of handling noise differently for LE and non-LE nodes. If we report the noise accurately, then an attacker could determine if a condition was LE simply by observing the reported noise value. As such, we can under-report the amount of noise. Again if this is very important to the analyst, they should avoid LE conditions, or use LCF to better approximate which conditions may be contributing noise. Note that if we simply added a full noise layer for all nodes, then we could report noise accurately.

R8: The reason we need to use the floated values for the column rather than the floated values that are affected by the nodes is because with 0e nodes, there are no floated values.

Design Examples

Example 1

By way of example, consider the following query:

SELECT col1, count(*)
FROM table
WHERE col2 = 'x' AND
      col3 IN ('a','b') AND
      col4 NOT IN ('j','k')
GROUP BY 1

Normalized nodes look like this:

(col1 = ? and col2 = 'x' and col3 = 'a' and col4 <> 'j' and col4 <> 'k') OR
(col1 = ? and col2 = 'x' and col3 = 'b' and col4 <> 'j' and col4 <> 'k')

The col1 = ? part comes from the selected col1. The value of course becomes known when the bucket is materialized by the DB.

The col1 = ? and col2 = 'x' part of both nodes is redundant. The col2 = 'x' part is redundant because the query filters for those conditions in the probe query, and so by definition the condition will always be True. So we can simplify as:

(col3 = 'a' and col4 <> 'j' and col4 <> 'k') OR
(col3 = 'b' and col4 <> 'j' and col4 <> 'k')

We would make a probe query like follows (might be bugs here...this just gives the basic idea). Note that this probe pretends that the seed material cannot be generated from SQL inspection.

SELECT col1,
       min(col2), max(col2),  -- to seed noise layers
       min(col3), max(col3),  -- to seed noise layers that are 0e
       min(col4), max(col4),  -- to seed noise layers that are 0e
       min(col3_a_col), max(col3_a_col),  -- to seed noise layer if not 0e
       min(col3_b_col), max(col3_b_col),  -- to seed noise layer if not 0e
       count(DISTINCT col3_a_uid),        -- to determine if LE
       count(DISTINCT col3_b_uid),        -- to determine if LE
       min(col4_a_j_col), max(col4_a_j_col),  -- to seed noise layer if not 0e
       min(col4_a_k_col), max(col4_a_k_col),  -- to seed noise layer if not 0e
       count(DISTINCT col4_a_j_uid),        -- to determine if LE
       count(DISTINCT col4_a_k_uid)         -- to determine if LE
       min(col4_b_j_col), max(col4_b_j_col),  -- to seed noise layer if not 0e
       min(col4_b_k_col), max(col4_b_k_col),  -- to seed noise layer if not 0e
       count(DISTINCT col4_b_j_uid),        -- to determine if LE
       count(DISTINCT col4_b_k_uid)         -- to determine if LE
FROM (
    SELECT col2,     -- needed to seed noisy thresholds
        col3, col4,  -- need to select these columns in case condition is 0e
        CASE WHEN col3 = 'a' AND
                 col4 = 'j' AND col4 <> 'k' THEN uid
             ELSE NULL
        END AS col4_a_j_uid,        -- tests col4<>'j', if parent node not LE
        CASE WHEN col3 = 'a' AND
                 col4 = 'j' AND col4 <> 'k' THEN col4
             ELSE NULL
        END AS col4_a_j_col,        -- matching values
        CASE WHEN col3 = 'a' AND
                 col4 = 'k' AND col4 <> 'j' THEN uid
             ELSE NULL
        END AS col4_a_k_uid,        -- tests col4<>'k', if parent node not LE
        CASE WHEN col3 = 'a' AND
                 col4 = 'k' AND col4 <> 'j' THEN col4
             ELSE NULL
        END AS col4_a_k_col,        -- matching values
        CASE WHEN col3 = 'b' AND
                 col4 = 'j' AND col4 <> 'k' THEN uid
             ELSE NULL
        END AS col4_b_j_uid,        -- tests col4<>'j', if parent node not LE
        CASE WHEN col3 = 'b' AND
                 col4 = 'j' AND col4 <> 'k' THEN col4
             ELSE NULL
        END AS col4_b_j_col,        -- matching values
        CASE WHEN col3 = 'b' AND
                 col4 = 'k' AND col4 <> 'j' THEN uid
             ELSE NULL
        END AS col4_b_k_uid,        -- tests col4<>'k', if parent node not LE
        CASE WHEN col3 = 'b' AND
                 col4 = 'k' AND col4 <> 'j' THEN col4
             ELSE NULL
        END AS col4_b_k_col,        -- matching values
        CASE WHEN col3 = 'a' AND
             col4 <> 'j' and col4 <> 'k' THEN uid
             ELSE NULL
        END AS col3_a_uid,        -- matching UIDs for first OR node
        CASE WHEN col3 = 'a' AND
             col4 <> 'j' and col4 <> 'k' THEN col3
             ELSE NULL
        END AS col3_a_col,        -- matching values
        CASE WHEN col3 = 'b' AND
             col4 <> 'j' and col4 <> 'k' THEN uid
             ELSE NULL
        END AS col3_b_uid,        -- matching UIDs for second OR node
        CASE WHEN col3 = 'b' AND
             col4 <> 'j' and col4 <> 'k' THEN col3
             ELSE NULL
        END AS col3_b_col        -- matching values
    FROM (
        SELECT uid, col2, col3, col4
        FROM table
        -- this WHERE clause filters for the needed data (posands and posors)
        WHERE col2 = 'x' AND
              col3 IN ('a','b')
    ) t
) t
GROUP BY 1

The inner-most SELECT filters on posands and posors. This gives us everything we need to check LE and gather seed values. (Note that this might as well be a good opportunity to float col2 as well. Then you don't have to do it in the main query. As long as you are making a full scan here anyway, the extra work won't cost much.)

Note that the CASE statements for col4 reverse the condition from negative to positive.

To check for LE (per bucket), we would first check count(DISTINCT col3_a_uid) and count(DISTINCT col3_b_uid). These correspond to the two nodes (col1 = ? and col2 = 'x' and col3 = 'a' and col4 <> 'j' and col4 <> 'k') and (col1 = ? and col2 = 'x' and col3 = 'b' and col4 <> 'j' and col4 <> 'k'). If the counts are 0 or 1, then they are LE. If 3 or more, they are not LE.

If 2, then we need to seed a noisy threshold. The seed consists of the set of seed elements of all the five conditions in the node. In this particular example we could seed from just SQL inspection plus the value of col1 in the bucket, but for this example lets suppose that is not the case. The values used for the seed material for the first posor comes from the following (all the other symbols are as we do today):

col1 = ?: from the bucket value
col2 = 'x': frommin(col2)andmax(col2)`
col3 = 'a': frommin(col3)andmax(col3)`
`col4 <> 'j':
- If count(DISTINCT col4_a_j_uid) is 0, then from min(col4) and max(col4)
- If count(DISTINCT col4_a_j_uid) is not 0, then from min(col4_a_j_col) and max(col4_a_j_col).
`col4 <> 'k':
- If count(DISTINCT col4_a_k_uid) is 0, then from min(col4) and max(col4)
- If count(DISTINCT col4_a_k_uid) is not 0, then from min(col4_a_k_col) and max(col4_a_k_col).

If either if these nodes are LE for a given bucket, then we make a noise layer from the same seed and add this noise layer to the final answer for that bucket. We also adjust the final answer down (because these are posors) by 1 or 2 if the node is 1e or 2e respectively. Note that if both nodes are LE for a given bucket, then most likely the bucket will fail LCF since LCF is performed after adjustment.

If for a given bucket a posor node is not LE, then the two negands inside the posor are each checked for LE. Note that if the same condition is LE within different posors, then to avoid having duplicate seeds, the second and subsequent seeds have an additional symbol which is a counter and differs for each subsequent layer.

Looking at col4 <> 'j', the check follows the same steps as for the posor above. Namely, if count(DISTINCT col4_a_j_uid) is 0 or 1, the negand node is LE, if 3 or more it is not LE, and if 2 we make a seed from the node using min(col4_a_k_col) and max(col4_a_k_col) as the values, and use the resulting seed for a noisy threshold check.

If the node is LE, the we'll generate a separate noise layer for this node in the final answer, and adjust the final answer by 1 or 2 if 1e or 2e respectively.

Finally, if either or both posors are not LE, then we'll make a composite noise layer. This noise layer is seeded from all conditions in each posor that is not LE, but without duplicate conditions.

By way of example, suppose that the posor (col1 = ? and col2 = 'x' and col3 = 'a' and col4 <> 'j' and col4 <> 'k') is LE and the posor (col1 = ? and col2 = 'x' and col3 = 'b' and col4 <> 'j' and col4 <> 'k') is not LE. Further suppose that the negand col4 <> 'j' is LE. Then the composite noise layer would be seeded from col1 = ?, col2 = 'x', col3 = 'b', and col4 <> 'k'.

If on the other hand both posors are not LE, and the negand col4 <> 'j' is LE, then the composite noise layer would be seeded from col1 = ?, col2 = 'x', col3 = 'b', col3 = 'a', andcol4 <> 'k'`.

Example 2

SELECT col1, col2, count(*)
FROM table
WHERE col3 = 'x' AND
      (col4 = 'j' OR (col4 = 'k' and col4 <> 'l'))
GROUP BY 1,2

The expression would be normalized to:

(col3 = 'x' AND col4 = 'j') OR
(col3 = 'x' AND col4 = 'k' and col4 <> 'l')

Then we would make a probe query like follows (once again assuming that the seeds cannot be composed from column inspection):

SELECT col1, col2,
       min(col3), max(col3),  -- to seed noise layer if 0e
       min(col4), max(col4),  -- to seed noise layer if 0e
       min(col3_xj_col), max(col3_xj_col),  -- to seed noise layer if not 0e
       min(col4_xj_col), max(col4_xj_col),  -- to seed noise layer if not 0e
       min(col3_xkl_col), max(col3_xkl_col),  -- to seed noise layer if not 0e
       min(col4_xkl_col), max(col4_xkl_col),  -- to seed noise layer if not 0e
       count(DISTINCT col34_xj_uid),          -- to determine if LE
       count(DISTINCT col34_xkl_uid),         -- to determine if LE
       count(DISTINCT col4_l_uid)             -- to determine if LE
FROM (
    SELECT col1, col2,    -- need so that we can check bucket groups
           col3, col4,    -- need to select these columns in case condition is 0e
        -- first posor
        CASE WHEN col3 = 'x' AND col4 = 'j' THEN uid
             ELSE NULL
        END AS col34_xj_uid,        -- matching UIDs
        CASE WHEN col3 = 'x' AND col4 = 'j' THEN col3
             ELSE NULL
        END AS col3_xj_col,        -- matching values, col3
        CASE WHEN col3 = 'x' AND col4 = 'j' THEN col4
             ELSE NULL
        END AS col4_xj_col,        -- matching values, col4
        -- second posor
        CASE WHEN col3 = 'x' AND col4 = 'k' and col4 <> 'l' THEN uid
             ELSE NULL
        END AS col34_xkl_uid,        -- matching UIDs
        CASE WHEN col3 = 'x' AND col4 = 'k' and col4 <> 'l' THEN col3
             ELSE NULL
        END AS col3_xkl_col,        -- matching values, col3
        CASE WHEN col3 = 'x' AND col4 = 'k' and col4 <> 'l' THEN col4
             ELSE NULL
        END AS col4_xkl_col,        -- matching values, col4
        -- negand (in case posor is not LE)
        CASE WHEN col4 = 'l' THEN uid
             ELSE NULL
        END AS col4_l_uid,        -- matching UIDs negand
        CASE WHEN col4 = 'l' THEN col4
             ELSE NULL
        END AS col4_l_col,        -- matching values negand
    FROM (
        SELECT uid, col1, col2, col3, col4
        FROM table
        -- this WHERE clause filters for the needed data (posands and posors)
        WHERE (col3 = 'x' AND col4 = 'j') OR
              (col3 = 'x' AND col4 = 'k')
    ) t
) t
GROUP BY 1,2

The inner-most select has the normalized expression minus the negand.

The CASE statements filter for three things. The negand, and the each of the two posors (both of which consist of multiple posands and negands).

For the purpose of seeding, the conditions in the two posors are:

(col1 = ? and col2 = ? and col3 = 'x' and col4 = 'j')
(col1 = ? and col2 = ? and col3 = 'x' and col4 = 'k' and col4 <> 'l')

Otherwise, LE checks are made as in the preceeding example. That is, first checking each posor for LE for each bucket, and if the second posor is not LE, then checking the negand.

Example 3

select cli_district_id, sum(cnt)
from (
    select uid, cli_district_id, count(*) as cnt
    from transactions
    group by 1,2
    having count(*) between 0 and 1000
) t
where cli_district_id in (1,2)
group by 1

Note that this query works on the gda_banking data source of attack.aircloak.com.

An example of a probe query is this (assuming that seeds can be made from SQL inspection):

SELECT cli_district_id,
       count(DISTINCT have_uid),          -- to determine if LE in case others are not
       count(DISTINCT cli_1_uid),         -- to determine if LE
       count(DISTINCT cli_2_uid)          -- to determine if LE
FROM (
    select uid, cli_district_id,
        CASE WHEN cnt < 0 OR cnt >= 1000 THEN uid
             ELSE NULL
        END AS have_uid,        -- UIDs outside of HAVING statement
        CASE WHEN cli_district_id = 1 AND
                cnt >= 0 AND cnt < 1000 THEN uid
             ELSE NULL
        END AS cli_1_uid,        -- matching UIDs
        CASE WHEN cli_district_id = 2 AND
                cnt >= 0 AND cnt < 1000 THEN uid
             ELSE NULL
        END AS cli_2_uid        -- matching UIDs
    from (
        select uid, cli_district_id, count(*) as cnt
        from transactions
        group by 1,2
    ) t
    where cli_district_id in (1,2)
) t
GROUP BY 1

In this probe query, the seed material for both conditions can be taken from SQL inspection, so no need for floating.

After normalization, the filter expression is this:

(cli_district_id = ? AND cli_district_id = 1 AND cnt >= 0 AND cnt < 1000) OR
(cli_district_id = ? AND cli_district_id = 2 AND cnt >= 0 AND cnt < 1000)

In this case, since we are selecting the same column in the WHERE clause, there will be duplicate conditions in the seed material so one of them will be dropped for seeding.

Ranges (BETWEEN) must be individually LE checked just as negands are checked. In other words, if the parent posor is not LE, then the ranges inside the posor are LE checked. As with negands, the range is reversed, so cnt >= 0 AND cnt < 1000 becomes (cnt < 0 OR cnt >= 1000).

It so happens that in this query neither posor is LE, so we need to check the range to see if it is LE. For seeding, we use the same seed as we do for BETWEEN conditions today. In this case, the range is LE, so we need a separate noise layer for the range.

The composite noise layer in this case would be seeded from two posors, but without the range. Therefore, the seeding material comes from the two conditions cli_district_id = 1 and cli_district_id = 2.

Example 4

select cli_district_id, sum(cnt)
from (
    select uid, cli_district_id, count(*) as cnt
    from transactions
    group by 1,2
    having count(*) between 100 and 200
) t
where cli_district_id in (1,999)
group by 1

This is the same query as Example 3, but with different numbers. The probe query would be the same as well, but here it is just for convenience:

SELECT cli_district_id,
       count(DISTINCT have_uid),          -- to determine if LE in case others are not
       count(DISTINCT cli_1_uid),         -- to determine if LE
       count(DISTINCT cli_999_uid)          -- to determine if LE
FROM (
    select uid, cli_district_id,
        CASE WHEN cnt < 100 OR cnt >= 200 THEN uid
             ELSE NULL
        END AS have_uid,        -- UIDs outside of HAVING statement
        CASE WHEN cli_district_id = 1 AND
                cnt >= 100 AND cnt < 200 THEN uid
             ELSE NULL
        END AS cli_1_uid,        -- matching UIDs
        CASE WHEN cli_district_id = 999 AND
                cnt >= 100 AND cnt < 200 THEN uid
             ELSE NULL
        END AS cli_999_uid        -- matching UIDs
    from (
        select uid, cli_district_id, count(*) as cnt
        from transactions
        group by 1,2
    ) t
    where cli_district_id in (1,999)
) t
GROUP BY 1

After normalization, the conditions look like this:

(cli_district_id = ? AND cli_district_id = 1 AND cnt >= 100 AND cnt < 200) OR
(cli_district_id = ? AND cli_district_id = 999 AND cnt >= 100 AND cnt < 200)

In this case, the second node is LE (0e) because there are no cli_district_id with value 999. The first node has 222 distinct UIDs (is not LE), and so the range must also be checked. The range is not LE: it has 441 distinct UIDs.

This means that the second node has its own separate noise layer, and all conditions of the first node contribute to the composite noise layer.

Example 5

select cli_district_id, sum(cnt)
from (
    select uid, cli_district_id, count(*) as cnt
    from transactions
    where acct_district_id in (1,60)
    group by 1,2
) t
where cli_district_id in (55,74)
group by 1

The normalized expression for this query is:

(cli_district_id = ? and acct_district_id = 1 AND cli_district_id = 55) OR
(cli_district_id = ? and acct_district_id = 1 AND cli_district_id = 74) OR
(cli_district_id = ? and acct_district_id = 60 AND cli_district_id = 55) OR
(cli_district_id = ? and acct_district_id = 60 AND cli_district_id = 74)

The nodes of this expression are spread over two sub-queries. The probe is this:

SELECT cli_district_id,
       count(DISTINCT ac_1_55_uid) AS ac_1_55,
       count(DISTINCT ac_1_74_uid) AS ac_1_74,
       count(DISTINCT ac_60_55_uid) AS ac_60_55,
       count(DISTINCT ac_60_74_uid) AS ac_60_74
FROM (
    select uid, cli_district_id,
        CASE WHEN acct_district_id = 1 AND cli_district_id = 55 THEN uid
             ELSE NULL
        END AS ac_1_55_uid,
        CASE WHEN acct_district_id = 1 AND cli_district_id = 74 THEN uid
             ELSE NULL
        END AS ac_1_74_uid,
        CASE WHEN acct_district_id = 60 AND cli_district_id = 55 THEN uid
             ELSE NULL
        END AS ac_60_55_uid,
        CASE WHEN acct_district_id = 60 AND cli_district_id = 74 THEN uid
             ELSE NULL
        END AS ac_60_74_uid
    from (
        select uid, cli_district_id, acct_district_id, count(*) as cnt
        from transactions
    where acct_district_id in (1,60)
        group by 1,2,3
    ) t
    where cli_district_id in (55,74)
) t
GROUP BY 1

The output of the above query is this:

cli_district_id	ac_1_55	ac_1_74	ac_6055	ac_60_74
55	4	0	0	0
74	0	3	0	4

This means that for bucket 55, the first posor is not LE but the others are. For bucket 74, the second and fourth posors are not LE.

Regarding bucket 55, there would be four noise layers. One is a composite built from the conditions in the first posor (acct_district_id = 1 and cli_district_id = 55), and the other three are individual node noise layers built from the other three posors respectively.

Regarding bucket 74, there are three noise layers. One is a composite built from the conditions in the second and fourth nodes (acct_district_id = 1, acct_district_id = 60 and cli_district_id = 74). The other two are built from the first and third nodes respectively.

Example 6

Analyst query:

This example has negands across two sub-queries.

select cli_district_id, sum(cnt)
from (
    select uid, k_symbol, cli_district_id, operation, count(*) as cnt
    from transactions
    group by 1,2,3,4
    having max(acct_district_id) <> 1
) t
where k_symbol in ('SIPO','UVER') and
      operation <> 'VYBER'
group by 1

Rule 1: remove negands from probe inner select, which means having condition and operation condition are gone from inner select.

Normalized expression is:

(cli_district_id = ? and k_symbol = 'SIPO' and max(acct_district_id) <> 1 and operation <> 'VYBER') OR
(cli_district_id = ? and k_symbol = 'UVER' and max(acct_district_id) <> 1 and operation <> 'VYBER')

Two posors, and each posor has two negands. This leads to a total of 6 CASE statements (one per posor, one per negand per posor).

SELECT cli_district_id,
       count(DISTINCT ac_S_V_uid) AS ac_S_V,
       count(DISTINCT ac_U_V_uid) AS ac_U_V,
       count(DISTINCT ac_S_1_uid) AS ac_S_1,
       count(DISTINCT ac_U_1_uid) AS ac_U_1,
       count(DISTINCT ac_1_S_V_uid) AS ac_1_S_V,
       count(DISTINCT ac_1_U_V_uid) AS ac_1_U_V
FROM (
    select uid, cli_district_id,
        CASE WHEN max_acct <> 1 AND k_symbol = 'SIPO' AND
                  operation = 'VYBER' THEN uid
             ELSE NULL
        END AS ac_S_V_uid,
        CASE WHEN max_acct <> 1 AND k_symbol = 'UVER' AND
                  operation = 'VYBER' THEN uid
             ELSE NULL
        END AS ac_U_V_uid,
        CASE WHEN max_acct = 1 AND k_symbol = 'SIPO' AND
                  operation <> 'VYBER' THEN uid
             ELSE NULL
        END AS ac_S_1_uid,
        CASE WHEN max_acct = 1 AND k_symbol = 'UVER' AND
                  operation <> 'VYBER' THEN uid
             ELSE NULL
        END AS ac_U_1_uid,
        CASE WHEN max_acct <> 1 AND k_symbol = 'SIPO' AND
                  operation <> 'VYBER' THEN uid
             ELSE NULL
        END AS ac_1_S_V_uid,
        CASE WHEN max_acct <> 1 AND k_symbol = 'UVER' AND
                  operation <> 'VYBER' THEN uid
             ELSE NULL
        END AS ac_1_U_V_uid
    from (
        select uid, cli_district_id, acct_district_id, k_symbol,
               operation, max(acct_district_id) as max_acct
        from transactions
    where acct_district_id in (1,60)
        group by 1,2,3,4,5
    ) t
    where k_symbol in ('SIPO','UVER')
) t
GROUP BY 1

yoid2000 commented 4 years ago

@cristianberneanu @sebastian FYI

sebastian commented 4 years ago

so we are stuck with a noise layer per node. ... One implication of a low LE threshold is that we cannot inform the analyst when LE has taken place.

Just for clarification. We are stuck with noise layers even for conditions that we consider having no effect, and that we compensate for, right?

I wonder if there isn't some way to distinguish between us compensating for the lack of a value, and there being a value... I don't have a clear attack, it's just something in the back of my mind.

yoid2000 commented 4 years ago

Just for clarification. We are stuck with noise layers even for conditions that we consider having no effect, and that we compensate for, right?

Yes. I looked for a way but couldn't find one. The problem is that if you have no noise layer for zero-effect (0e) nodes, then you also need to have no noise layer for one-effect (1e) noise layers (in which the answer is adjusted to remove the effect of the user). Otherwise the attacker can distinguish between the two by the presence or absence of noise.

But if you remove noise for both 0e and 1e nodes, then the attacker can start concentrating on the difference between buckets that have 2 users + noise, and buckets that have 1 user but which get adjusted to 0 users and no noise. We could perhaps take the position that it is hard to find such cases, but what I call compound posors (posors with multiple AND'd conditions, like OR (A and B and C)) give the attacker much more flexibility in choosing which users are selected with the posor.

yoid2000 commented 4 years ago

I wonder if there isn't some way to distinguish between us compensating for the lack of a value, and there being a value...

Not sure what you are referring to here.

sebastian commented 4 years ago

I wonder if there isn't some way to distinguish between us compensating for the lack of a value, and there being a value...

Not sure what you are referring to here.

If I read one of your other previous issues correctly your proposal for compensating for 1e users was by add 1x of noise to the query to compensate for the user? Rather than to actually adjust the aggregate by the contribution of the 1e user?

yoid2000 commented 4 years ago

3914 suggests to adjust the output of the cloak rather than modify the query itself. So instead of dropping a condition, we add/subtract 1xSD of noise to approximate the same effect.

Oh, and you are wondering if an attacker can detect that we've done this, presumably because of the difference between our adjustment and the true effect of dropping the condition?

sebastian commented 4 years ago

Oh, and you are wondering if an attacker can detect that we've done this, presumably because of the difference between our adjustment and the true effect of dropping the condition?

Yes, exactly that.

yoid2000 commented 4 years ago

I discuss this in #3914.

yoid2000 commented 4 years ago

Just a quick heads up. I'm going to change how we make noise layers in the design here. We can have fewer noise layers if we only make noise layers for conditions that fail to pass a noisy LCF-style threshold. Will let you know when the change is complete.

yoid2000 commented 4 years ago

Ok, I posted the changes in the first comment. Will work on some more examples, especially sub-queries.

yoid2000 commented 4 years ago

@cristianberneanu at the moment I'm not seeing any reason for limiting OR to a single sub-query. Since we don't drop conditions per se (but instead adjust), the fact that a node spans multiple sub-queries doesn't prevent us from making the necessary noise layers and adjustments.

If you can find a counter example I'd like to see it.

cristianberneanu commented 4 years ago

In what follows I use the term 'node' to refer to a single posor, negand, or range condition, where posor is OR (c1 AND [NOT] c2 AND [NOT] c3) for one or more conditions cN, negand is AND NOT (c1 and [NOT] c2 and [NOT] c3) for one or more conditions cN, and range is AND col BETWEEN X and Y. I'm not including posands when I say "node".

This is not 100% clear to me. For the following condition:

a = 1 or a <> 2 or (a between 0 and 100) or (a <> 3 and b between 1 and 4)

which are the nodes?

final-sd is the standard deviation of the noise layer (as normally computed)

What does this mean? A noisy aggregator has a final SD, a noise layer adds to that SD. Are adjustments done to the SD for computing the aggregators or to the aggregators themselves? How is an aggregator like sum(x) adjusted?

The noise associated with LE conditions is not factored into the noise reporting

This will really hurt usability. The value of the noise estimation consists entirely in it being an upper bound on the real noise added. We could add less noise, but adding more will make it useless for analysts (they can't use it to make any assumptions about the result).

(col3 = 'a' and col4 <> 'j' and col4 <> 'k') OR (col3 = 'b' and col4 <> 'j' and col4 <> 'k')

From what I see in the example, the probe checks each posor, but it then doesn't correctly check for each negands in the context of the posor. The checks in the query are simply col4 = 'j', when they should be at the least be col3 = 'a' and col4 = 'j' plus col3 = 'b' and col4 = 'j'.

Furthermore, when multiple negands are present, all possible combinations should be checked. Otherwise, we won't detect if both col4 <> 'j' and col4 <> 'k' are LE in the context of the posands. This will greatly complicate the probe query.

I would also like to see an example of a query with a having filter and a where filter in the inner most subquery. Something like:

select cli_district_id, sum(cnt)
from (
    select uid, cli_district_id, count(*) as cnt
    from transactions
    where acc_district_id in (1,2)
    group by 1,2
    having count(*) between 0 and 1000
) t
group by 1

I don't have a good feeling about normalization across subqueries. It can be done in some scenarios, but it will get really hairy when combining joins, grouping, having and where filters. I think we should leave it for the next version.

Overall, I feel that this change will be very complex and I am not optimistic that we will finish sooner than 3-4 months. And this if we don't find other problems with it during implementation ... maybe it would be better to have one release fully allocated to this purpose?

yoid2000 commented 4 years ago

This is not 100% clear to me. For the following condition: a = 1 or a <> 2 or (a between 0 and 100) or (a <> 3 and b between 1 and 4) which are the nodes?

Normalization means turning the expression into a set of and groups seperated by OR and up to one NOT OR.

The above expression has four and group expressions. Each such expression is one node. In addition three of the four and group expressions have one or more of negand or range. Each of these is a node that needs to be checked if the containing and group is not LE. There are four of these negand/range nodes.

That means there are a total of eight nodes here, all of which need to be checked. The CASE statements would include the following:

-- and groups (must be checked):
CASE WHEN a = 1...
CASE WHEN a <> 2...
CASE WHEN a between 0 and 100...
CASE WHEN (a <> 3 and b between 1 and 4)
-- negands and ranges (checked if the corresponding and group is not LE)
CASE WHEN a = 2...
CASE WHEN a < 0 or >= 100...
CASE WHEN a = 3
CASE WHEN b < 1 or >= 4

cristianberneanu commented 4 years ago

-- negands and ranges (checked if the corresponding and group is not LE)

But shouldn't this be:

CASE WHEN a = 2...
CASE WHEN a NOT BETWEEN 0 AND 100
CASE WHEN a = 3 AND b BETWEEN 1 AND 4
CASE WHEN a <> 3 AND b NOT BETWEEN 1 AND 4

yoid2000 commented 4 years ago

final-sd is the standard deviation of the noise layer (as normally computed)

What does this mean? A noisy aggregator has a final SD, a noise layer adds to that SD. Are adjustments done to the SD for computing the aggregators or to the aggregators themselves? How is an aggregator like sum(x) adjusted?

No changes are made to the final-sd itself, or to the currently computed noise layers. The changes take place in addition to the current mechanism.

So let's give an example. Say that you have a bucket, and you've computed the final-sd for that bucket as normal, and the value is final-sd = 1. Say there are three normal noise layers (i.e. not associated with any LE nodes). You compute three noise values and sum them together. Let's say that the resulting noise is n=2.11. Up to now this is normal operation and doesn't change.

Now suppose there is an LE posor node with 2e. Because it is a posor, you want to adjust down. Since final-sd = 1, you will adjust down by 2. Thus the resulting noise will be n=0.11.

As another example, say that you have a bucket and the aggregator is sum(x), and you've computed the final-sd for that bucket as final-sd = 311.1. Say there are two normal noise layers, and so based on normal operation you compute a noise as n = -681.1.

Now suppose there is an LE negand node with 1e. Since it is a negand, you want to adjust up. Since final-sd = 311.1, you will add 311.1 to the noise value, ending with a final noise of n = -370.

yoid2000 commented 4 years ago

But shouldn't this be:

CASE WHEN a NOT BETWEEN 0 AND 100

Yes, with the exception that strictly speaking in standard SQL a not between 0 and 100 is equivalent a < 0 or a > 100, but since the cloak treats between a little different, we want to do a < 0 or a >= 100 instead.

CASE WHEN a = 3 AND b BETWEEN 1 AND 4 CASE WHEN a <> 3 AND b NOT BETWEEN 1 AND 4

Yes, I think you are right.

yoid2000 commented 4 years ago

The noise associated with LE conditions is not factored into the noise reporting

This will really hurt usability. The value of the noise estimation consists entirely in it being an upper bound on the real noise added. We could add less noise, but adding more will make it useless for analysts (they can't use it to make any assumptions about the result).

This is the result of a trade-off. Either we can set the LE noisy threshold to 2-3-4, or we can set it to 1-2. If we do the latter, then we cannot report the noise and so the analyst has less information to work with. If we do the former, then we'll end up with more nodes checking positive for LE, and therefore more adjustments and more noise overall.

I think it is better to have less noise but less knowledge of the noise than more noise.

One way we can mitigate this lack of reporting is to report when at least several conditions are LE (but don't say which ones obviously). This way at least the analyst can know that a substantial amount of LE took place. If the analyst really cares, he or she could check the conditions one by one using LCF...

By the way, the noise estimate is not an upper bound on the noise added. It is the SD of the noise added. There could be a lot more noise due to flattening, which we cannot report to the analyst. Of course it would also turn out that the random noise is unusually high (several standard deviations), though this is rare.

cristianberneanu commented 4 years ago

Now suppose there is an LE posor node with 2e. Because it is a posor, you want to adjust down. Since final-sd = 1, you will adjust down by 2. Thus the resulting noise will be n=0.11.

So the mean of the noise is what is changed, correct? Instead of a mean of 0, we will have a mean of +/- 1-2 SD. I think putting it like this is clearer. And this will have to be done in all cases, including when dropping outliers, getting the top users, etc., right?

yoid2000 commented 4 years ago

Furthermore, when multiple negands are present, all possible combinations should be checked. Otherwise, we won't detect if both col4 <> 'j' and col4 <> 'k' are LE in the context of the posands. This will greatly complicate the probe query.

We can't possibly check all combinations, and I don't think it is necessary. I think it would be difficult for an analyst to come up with a query in practice where two or more negands are individually not LE but are together LE with 1e. Given the cost of defending against this, it is not worth it. Let's not worry about it.

yoid2000 commented 4 years ago

So the mean of the noise is what is changed, correct? Instead of a mean of 0, we will have a mean of +/- 1-2 SD. I think putting it like this is clearer.

Yes, it could also be seen this way. We shift the mean. But the shift is +- 1-2 SD per LE node. There is also the possible shift for non-LE nodes (see R6).

And this will have to be done in all cases, including when dropping outliers, getting the top users, etc., right?

I don't think we'd want to do this. We want to approximate what would happen if the LE node were dropped from the query. I don't know how dropping the LE node would affect these other aspects, but I don't think it would be like this. I guess this is the danger of thinking of the adjustment in terms of an adjustment on the mean. That is only the case for the final noise value. In this sense, it might be clearer to really think of it as a final adjustment...

yoid2000 commented 4 years ago

(col3 = 'a' and col4 <> 'j' and col4 <> 'k') OR (col3 = 'b' and col4 <> 'j' and col4 <> 'k')

From what I see in the example, the probe checks each posor, but it then doesn't correctly check for each negands in the context of the posor. The checks in the query are simply col4 = 'j', when they should be at the least be col3 = 'a' and col4 = 'j' plus col3 = 'b' and col4 = 'j'.

Yes you are right. Each negand/range should be checked in the context of the rest of the posor. But not all combinations (as I say above).

I'll fix the example.

yoid2000 commented 4 years ago

I fixed the example. Please have a look. The normalized conditions were:

(col3 = 'a' and col4 <> 'j' and col4 <> 'k') OR (col3 = 'b' and col4 <> 'j' and col4 <> 'k')

In the fix, I now have this:

        CASE WHEN col4 = 'j' AND col4 <> 'k' THEN uid  ....
        CASE WHEN col4 = 'k' AND col4 <> 'j' THEN uid ....

The reason I don't include col3 = 'a' or col3 = 'b' is that those are already in the inner-most SELECT so the conditions are already in effect.

But I'll admit that this is pretty confusing. On one hand, we are doing the normalization and using that for the CASE statement, but on the other hand, in the inner SELECT we are filtering on pre-normalized conditions.

So it is under-specified at this point what filter conditions go into the inner SELECT and what goes in the subsequent CASE....

I'll try to clarify this.

yoid2000 commented 4 years ago

I don't have a good feeling ...

This always makes me think of star wars...

about normalization across subqueries. It can be done in some scenarios, but it will get really hairy when combining joins, grouping, having and where filters. I think we should leave it for the next version.

Maybe so.

Overall, I feel that this change will be very complex and I am not optimistic that we will finish sooner than 3-4 months. And this if we don't find other problems with it during implementation ... maybe it would be better to have one release fully allocated to this purpose?

No kidding :(

cristianberneanu commented 4 years ago

One way we can mitigate this lack of reporting is to report when at least several conditions are LE

This would leak information as well. By trying different combinations, the analyst can determine which are LE and which are not, by looking at the warning.

We can't possibly check all combinations, and I don't think it is necessary. I think it would be difficult for an analyst to come up with a query in practice where two or more negands are individually not LE but are together LE with 1e. Given the cost of defending against this, it is not worth it. Let's not worry about it.

I don't think it is that simple. For one, negands can be easily combined with ranges. The resulting combination could be LE, but it won't be detected if we check the range and the negand separately. Then the algorithm has to be resistant to chaff conditions.

How about this condition: dept='cs' (c1) and gender <> 'm' (c2) and salary <> 1000 (c3)? To detect if c2 is LE or c3 is LE or c2 and c3 is LE we need to compute the counts for c1 and c2 and c3, c1 and c2, c1 and c3 and c1.

But the shift is +- 1-2 SD per LE node.

So the final mean is +/-1-2 final_SD * LE node count?

In this sense, it might be clearer to really think of it as a final adjustment

This "final adjustemnt" has a vague definition, in my current understanding.

CASE WHEN col4 = 'j' AND col4 <> 'k' THEN uid ....

The second condition is redundant in this case, but it might not be in other cases.

yoid2000 commented 4 years ago

One way we can mitigate this lack of reporting is to report when at least several conditions are LE

This would leak information as well. By trying different combinations, the analyst can determine which are LE and which are not, by looking at the warning.

We should be able to prevent this by again using noisy threshold of some sort.

yoid2000 commented 4 years ago

So the final mean is +/-1-2 final_SD * LE node count?

The final mean would be sum(All LE adjustments) * final_sd

So if we have four LE nodes, with adjustments 0, 1, -2, -1 respectively, then the final mean would be -2 * final_sd

yoid2000 commented 4 years ago

How about this condition: dept='cs' (c1) and gender <> 'm' (c2) and salary <> 1000 (c3)? To detect if c2 is LE or c3 is LE or c2 and c3 is LE we need to compute the counts for c1 and c2 and c3, c1 and c2, c1 and c3 and c1.

My point is that we don't need to detect if c2 and c3 is LE.

Please remember that the goal is not to prevent LE conditions at all costs, the goal is to prevent reasonable attacks. We often get trapped into this way of thinking (we get completely focused on the mechanism and lose sight of the attacks we are defending against).

So if you want to argue for checking every combination of negands and ranges, you should come up with an attack that is realistic and exploits such a combination. Maybe there is one but I don't see it. The attacker would have to come up with some (not a and not b) where not a is not LE, not b is not LE, (not a and not b) is LE with e1 (for a difference attack) or e0 (for a chaff attack), and if e1 happens to isolate the intended victim. I struggle to come up with an example...

yoid2000 commented 4 years ago

CASE WHEN col4 = 'j' AND col4 <> 'k' THEN uid ....

The second condition is redundant in this case, but it might not be in other cases.

Why is it redundant? I don't see it.

cristianberneanu commented 4 years ago

Why is it redundant? I don't see it.

col4 = 'j' implies col4 <> 'k'

yoid2000 commented 4 years ago

col4 = 'j' implies col4 <> 'k'

duh. right.

yoid2000 commented 4 years ago

@cristianberneanu maybe you can confirm something for me.

I made the statement somewhere that the inner select should retain the posands and posors in the original query, but remove the negands and negors.

The goal to this being that the data that flows into the CASE statements should include everything that you are testing for, but no more (for efficiency's sake).

It seems to me this means that, conceptually at least, you simply replace any negands and negors with True.

Does this sound right to you?

cristianberneanu commented 4 years ago

It seems to me this means that, conceptually at least, you simply replace any negands and negors with True.

You could do that, but it would be just as simple to drop the conditions entirely. I don't see the advantage.

The goal to this being that the data that flows into the CASE statements should include everything that you are testing for, but no more (for efficiency's sake). The reason I don't include col3 = 'a' or col3 = 'b' is that those are already in the inner-most SELECT so the conditions are already in effect.

They have to be included in both locations: in the inner-most SELECT in order to reduce the amount of data processed and in each CASE in order to compute the correct counts. If you don't include the right posands for each posor in each coressponding CASE statement, invalid values will be aggregated. In the case of the given example, CASE WHEN col4 = 'j' AND col4 <> 'k' THEN uid ... will aggregate both posors at the same time, i.e. (col3 = 'a' or col3 = 'b') and col4 = 'j' and col4 <> 'k', which is not useful.

yoid2000 commented 4 years ago

It seems to me this means that, conceptually at least, you simply replace any negands and negors with True.

You could do that, but it would be just as simple to drop the conditions entirely. I don't see the advantage.

Yes of course, that's why "conceptually".

If you don't include the right posands for each posor in each coressponding CASE statement, invalid values will be aggregated. In the case of the given example, CASE WHEN col4 = 'j' AND col4 <> 'k' THEN uid ... will aggregate both posors at the same time, i.e. (col3 = 'a' or col3 = 'b') and col4 = 'j' and col4 <> 'k', which is not useful.

But I think this is exactly what we want (to aggregate both posors). This is because col4 <> 'j' and col4 <> 'k' appear in both posors of the normalized expression, and so we want to check them against (col3 = 'a' or col3 = 'b'). Because the thing we want to evaluate is how the removal of one of those negands would affect the output of the whole query.

yoid2000 commented 4 years ago

But I think this is exactly what we want (to aggregate both posors). This is because col4 <> 'j' and col4 <> 'k' appear in both posors of the normalized expression, and so we want to check them against (col3 = 'a' or col3 = 'b'). Because the thing we want to evaluate is how the removal of one of those negands would affect the output of the whole query.

Hmmmm, ok I'm not so sure, because in another query the analyst could remove just one of the negands...

yoid2000 commented 4 years ago

If you don't include the right posands for each posor in each coressponding CASE statement, invalid values will be aggregated. In the case of the given example, CASE WHEN col4 = 'j' AND col4 <> 'k' THEN uid ... will aggregate both posors at the same time, i.e. (col3 = 'a' or col3 = 'b') and col4 = 'j' and col4 <> 'k', which is not useful.

@cristianberneanu

Ok, I agree with you. The negands and ranges in a given posor have to be treated independently from duplicated negands and ranges in other posors. Each is tested independently, and each can contribute its own noise and adjustment.

I'll update the first comment to reflect this.

yoid2000 commented 4 years ago

@cristianberneanu I updated the bullet items and example 1. I'll do more updating later.

cristianberneanu commented 4 years ago

So if you want to argue for checking every combination of negands and ranges, you should come up with an attack that is realistic and exploits such a combination. Maybe there is one but I don't see it. The attacker would have to come up with some (not a and not b) where not a is not LE, not b is not LE, (not a and not b) is LE with e1 (for a difference attack) or e0 (for a chaff attack), and if e1 happens to isolate the intended victim. I struggle to come up with an example...

Let's take the case of the single female in the CS department. The combined condition dept <> 'CS' and gender <> 'f' is LE, according to the definition for LE used throughout this topic, while each of the individual sub-conditions are not LE. The current algorithm won't detect this case, which means one can use such conditions to exclude a single user from the result set.

My intuition is that an attacker can then use filters such as

dept <> 'CS' and gender <> 'f' and salary <> 1000
dept <> 'CS' and gender <> 'f' and salary <> 1001
....

to find out additional information about the victim, such as the salary range, by exploiting the difference between 0e, 1e and 2+e conditions, but maybe I am missing something and this is not a feasible attack.

Ok, I agree with you. The negands and ranges in a given posor have to be treated independently from duplicated negands and ranges in other posors. Each is tested independently, and each can contribute its own noise and adjustment.

This greatly complicates the cases where filtering is done before and after grouping, as the resulting CASE statements will get very complex. This why I would like to see an example for a query containing WHERE and HAVING filters on columns different than the ones being grouped.

yoid2000 commented 4 years ago

Let's take the case of the single female in the CS department. The combined condition dept <> 'CS' and gender <> 'f' is LE, according to the definition for LE used throughout this topic, while each of the individual sub-conditions are not LE.

I don't see this as LE (combined or otherwise). This query excludes all CS dept members and all females. If you drop both conditions, then all of these are included, which is a lot of individuals and so not LE.

This greatly complicates the cases where filtering is done before and after grouping, as the resulting CASE statements will get very complex. This why I would like to see an example for a query containing WHERE and HAVING filters on columns different than the ones being grouped.

I'll work on this now.

cristianberneanu commented 4 years ago

This query excludes all CS dept members and all females.

There is an AND here between the filters (not an OR). The combined condition excludes all female members of the CS department, which should be a single individual in this case. It could be rewritten as NOT (dept = 'CS' and gender = 'f').

sebastian commented 4 years ago

Ok, let me add back what I removed/scratched.

dept <> 'CS' and gender <> 'f' is the same as not (deps = 'CS' or gender = 'f').
NOT (dept = 'CS' and gender = 'f') is an altogether different thing, and would be equal to dept <> 'CS' or gender <> 'f'.

Here is a photo of the two sets. The top one being dept <> 'CS' and gender <> 'f' and the bottom one being not (dept = 'CS' and gender = 'f')

yoid2000 commented 4 years ago

@sebastian no I think you are right.

Actually, the expression that excludes the lone woman would be:

WHERE dept <> CS or gender <> F

Here's the logic: CS woman --> false or false --> false --> exclude CS man --> false or true --> true --> include math woman --> true or false --> true --> include math man --> true or true --> true --> include

So this is two negors. Currently I think what to do with negors is underspecified. But if we normalize as:

WHERE not (dept = CS and gender = F)

Then the entire WHERE clause would be removed from the probe query. In the CASE statement we would test for:

CASE WHEN dept = CS and gender = F then .....

This would be detected as LE, and we would adjust and add noise layer accordingly.

So it looks like this works, but I don't think it was clear from the stated rules to do this...

cristianberneanu commented 4 years ago

You guys are right, there should be an OR there, I got confused while thinking about the inverted condition.

yoid2000 commented 4 years ago

It is super confusing actually. I frequently have to build truth tables to convince myself of one thing or another. I think the problem is that SQL logic and natural language don't match. Saying "men and woman" is the reverse of saying gender = M and gender = F. The former includes everyone, the latter includes noone.

yoid2000 commented 4 years ago

@cristianberneanu I added a new example (number 6) which has negands split over two sub-queries.

I still need to double check it, but basic idea should be right.

yoid2000 commented 4 years ago

@cristianberneanu ok Example 6 looks correct...

yoid2000 commented 4 years ago

Ok, I agree with you. The negands and ranges in a given posor have to be treated independently from duplicated negands and ranges in other posors. Each is tested independently, and each can contribute its own noise and adjustment.

This greatly complicates the cases where filtering is done before and after grouping, as the resulting CASE statements will get very complex. This why I would like to see an example for a query containing WHERE and HAVING filters on columns different than the ones being grouped.

This doesn't mean, however, that you need to check each combination. If you have N negands, you one-by-one reverse each to posand while holding the others as negands. So you still end up with N CASE statements (for the negands ... more if you have posors....)

yoid2000 commented 4 years ago

@cristianberneanu @sebastian for now I'm done here.

Cristian please look and let me know what is confusing or still seems wrong.

cristianberneanu commented 4 years ago

@cristianberneanu I added a new example (number 6) which has negands split over two sub-queries.

@yoid2000 This is not exactly what I had in mind. The reason I asked for this specific query:

select cli_district_id, sum(cnt)
from (
    select uid, cli_district_id, count(*) as cnt
    from transactions
    where acc_district_id in (1,2)
    group by 1,2
    having count(*) between 0 and 1000
) t
group by 1

was because there are a few edge cases in it that I don't understand how they should be handled.

More specific, the acc_district_id column needs to be floated, so that the two posors can be checked separately. The act of floating a non-grouped_by column will distort any computed aggregates, including the count(*) one which needs to be checked against the range.

The resulting CASE statements will need to filter the data for the right posor, compute the aggregate, then check the range condition, something that I don't think will be easy to do or even possible in the general case (especially for more complex aggregates, like stddev).

yoid2000 commented 4 years ago

The act of floating a non-grouped_by column will distort any computed aggregates, including the count(*) one which needs to be checked against the range.

Ok, I see your point now. Good catch.

One question @cristianberneanu. As it now stands, do we make a noise layer from that having clause (i.e. from a clause that uses an aggregate like count(*) as the thing being evaluated)?

cristianberneanu commented 4 years ago

One question @cristianberneanu. As it now stands, do we make a noise layer from that having clause (i.e. from a clause that uses an aggregate like count(*) as the thing being evaluated)?

I am not sure what happens now. I don't see any explicit mentions to aggregates in the noise layers code.

Aircloak / aircloak

Low, hard threshold for LED and related noise layers #3917

Design Examples

Example 1

Example 2

Example 3

Example 4

Example 5

Example 6

3914 suggests to adjust the output of the cloak rather than modify the query itself. So instead of dropping a condition, we add/subtract 1xSD of noise to approximate the same effect.