Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform

Low, hard threshold for LED and related noise layers #3917

Closed yoid2000 closed 4 years ago

yoid2000 commented 4 years ago

In what follows I use the term 'node' to refer to a single posor, negand, or range condition, where posor is OR (c1 AND [NOT] c2 AND [NOT] c3) for one or more conditions cN, negand is AND NOT (c1 AND [NOT] c2 AND [NOT] c3) for one or more conditions cN, and range is AND col BETWEEN X AND Y. I'm not including posands when I say "node".

High-level design:

R1: First, since we are making the LE threshold quite low, the distortion due to LE nodes is relatively small, so it is less important to tell the analyst than it otherwise would be. Second, the analyst can figure out on his own whether a condition has a good chance of being LE by seeing if the condition is LCF. Third, telling the analyst which conditions are LE is complex, especially given normalization of expressions. Finally, since we are pushing the LE threshold so low, telling the analyst would give him more information than I'm comfortable with. The fact that the LE threshold is a little noisy helps some, but not as much as I'd like (the LCF threshold is already pushing the boundaries of my comfort zone).

R2: Note that we no longer have the notion of bucket group. That is because we are adjusting answers instead of dropping SQL conditions, so we can adjust on a per-bucket basis.

R3: I had been talking about a hard threshold for LE checking (threshold = 2), but that opened us up to attacks where the attacker knows that a given condition will affect either 1 or 2 users (1e/2e attack). One of the users is the victim, and the other is a "dummy" user (there only for the purpose of boosting the count to 1 or 2). The attacker has to know that the dummy user will be in the answer for sure, and is trying to determine if the victim is in the answer or not. If the victim is in the answer, then there are two users affected by the condition, and no adjustment is made. If the victim is not in the answer, then there is one user affected by the condition and the answer is adjusted to 0.

It is very rare to find the conditions where this can occur using one attribute. However, we allow OR (a AND b AND c), where a, b, and c can be attributes from any columns. This allows an attacker to fine-tune a posor to select two specific users. I haven't looked into how likely this is in our datasets, but I would not want to assume that it is hard.

The fact that we still have uid noise layers helps here, but I don't want to depend on us always having uid noise layers. Anyway, by making the LE threshold at least a little noisy, we introduce some uncertainty into the attack.
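To make the idea concrete, here is a sketch of a seeded noisy LE threshold in Python. The hash construction, the salt, the base threshold of 2, and the ±1 spread are all illustrative assumptions, not the actual Aircloak parameters:

```python
import hashlib

def noisy_le_threshold(seed_material, salt="secret-salt", base=2, spread=1):
    # Hash the salted seed material; the same seed always yields the same
    # threshold, so repeating a query gives no fresh samples to average away.
    digest = hashlib.sha256((salt + "|" + "|".join(seed_material)).encode()).digest()
    offset = digest[0] % (2 * spread + 1) - spread   # offset in {-1, 0, +1}
    return base + offset                             # threshold in {1, 2, 3}

def is_low_effect(distinct_uid_count, seed_material):
    # 0 or 1 affected users -> LE; 3 or more -> not LE;
    # exactly 2 -> the seeded noisy threshold decides.
    if distinct_uid_count <= 1:
        return True
    if distinct_uid_count >= 3:
        return False
    return distinct_uid_count < noisy_le_threshold(seed_material)
```

Because the threshold is derived deterministically from the seed material, the attacker cannot resample the 1e/2e boundary by re-running the same query, but also cannot predict on which side of it a given condition falls.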

R4: I want to minimize the amount of noise we generate with nodes, especially because there could be many of them. Towards this end, I'm proposing a composite noise layer that is seeded from multiple nodes. The danger with a composite noise layer is that chaff conditions can be used to generate an arbitrary number of different seeds, thus allowing the noise layer to be averaged away. This means that LE nodes cannot be part of the composite. As a result, we put nodes that are not LE in the single composite noise layer, and each LE node contributes a noise layer.

In principle, one would have hoped that by doing LED, one could eliminate the noise layers altogether for conditions that are dropped. This may be possible if we set the threshold for LED the same as we set it for LCF. However, this has two problems. First, substantially more conditions would be dropped than is the case with the current design, and the resulting amount of distortion is probably similar. Second, dropping introduces a systematic bias in the data, whereas noise does not (it has a mean of zero).

R5: We use one unit of final-sd to approximately remove the effect of the condition. If final-sd is 1, then this exactly removes the effect. If final-sd is more than 1, then we will tend to over-compensate, because the final-sd usually tries to cover the more extreme data points.

TODO: Experiment to determine if final-sd is good enough, or if we should scale it in some way.

R6: This is new and probably utterly confusing. It is in response to an equations-based averaging attack I recently thought of that is made possible by posors in particular, but to a lesser extent by negands. The attack is designed to remove the noise from data aggregates, especially counts of distinct users.

The attack generates random sets of values, and requests the count. So for instance:

SELECT count(DISTINCT uid)
FROM table
WHERE age IN (3,11,18,20,28, ....)

then WHERE age IN (5,10,18,22,28,...), WHERE age IN (2,3,13,17,20,31,...) and so on.

Each of these queries can be formulated as an equation, and then the set of equations solved to produce exact counts.
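As a minimal illustration (with noiseless answers for clarity, and hypothetical counts), three overlapping IN-lists are already enough to solve for every individual count:

```python
# Hypothetical per-value distinct-user counts the system is supposed to hide.
true_counts = {"a": 7, "b": 4, "c": 9}

def query(values):
    # Stand-in for SELECT count(DISTINCT uid) ... WHERE age IN (...).
    # Answers are exact here; with a per-query composite noise layer, each
    # answer would instead carry an independent noise sample.
    return sum(true_counts[v] for v in values)

# Three overlapping IN-lists give three linear equations in three unknowns.
q_ab = query(["a", "b"])   # x_a + x_b
q_bc = query(["b", "c"])   # x_b + x_c
q_ac = query(["a", "c"])   # x_a + x_c

# Solving the system recovers every individual count exactly.
x_a = (q_ab + q_ac - q_bc) // 2
x_b = (q_ab + q_bc - q_ac) // 2
x_c = (q_bc + q_ac - q_ab) // 2
```

With per-query composite noise the attacker over-determines the system with many more equations than unknowns; since each equation carries independent noise, solving (e.g. by least squares) averages the noise away.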

I believe that this attack would not work with per-node static noise because each value would have a consistent bias, and this bias would appear in the final answers. With composite noise layers, this would no longer be the case. Each query would result in a different seed. To defend against this without introducing individual noise layers, we consistently adjust answers for each value up, down, or not at all. This adjustment will persist in the set of equations, and the final answers should be slightly off.
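A sketch of the consistent per-value adjustment; the hashing, the salt, and the {-1, 0, +1} range are assumptions for illustration only:

```python
import hashlib

def value_adjustment(value, salt="secret-salt"):
    # Deterministic adjustment in {-1, 0, +1}, keyed on the value and a
    # secret salt (both hypothetical). Because it depends only on the
    # value, it is identical in every query containing that value, so it
    # persists through the solved equations instead of averaging away.
    digest = hashlib.sha256((salt + "|" + str(value)).encode()).digest()
    return digest[0] % 3 - 1

def adjusted_count(values, counts):
    # Apply the consistent per-value adjustment to an IN-list answer.
    return sum(counts[v] + value_adjustment(v) for v in values)
```

Unlike per-query noise, this bias appears identically in every equation that mentions the value, so the solved "exact" counts remain slightly off.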

Note that this would be just the first step in an attack that, for instance, then exploits external knowledge or tries to continue with an intersection attack.

The reason we don't adjust when the count-sd is relatively high is that I presume that it is much harder to attack individuals when different individuals contribute different counts, and so the attack has less value. Thus we only do it when count-sd is relatively low.

TODO: validate this attack and the defense

TODO: determine the appropriate number of non-LE nodes we require to start adjusting (currently conservatively set as three).

R7: This is a consequence of handling noise differently for LE and non-LE nodes. If we report the noise accurately, then an attacker could determine whether a condition was LE simply by observing the reported noise value. As such, we under-report the amount of noise. Again, if this is very important to the analyst, they should avoid LE conditions, or use LCF to better approximate which conditions may be contributing noise. Note that if we simply added a full noise layer for all nodes, then we could report noise accurately.

R8: The reason we need to use the floated values for the column, rather than the floated values affected by the nodes, is that with 0e nodes there are no floated values.


Design Examples

Example 1

By way of example, consider the following query:

SELECT col1, count(*)
FROM table
WHERE col2 = 'x' AND
      col3 IN ('a','b') AND
      col4 NOT IN ('j','k')
GROUP BY 1

Normalized nodes look like this:

(col1 = ? and col2 = 'x' and col3 = 'a' and col4 <> 'j' and col4 <> 'k') OR
(col1 = ? and col2 = 'x' and col3 = 'b' and col4 <> 'j' and col4 <> 'k')

The col1 = ? part comes from the selected col1. The value of course becomes known when the bucket is materialized by the DB.

The col1 = ? and col2 = 'x' parts of both nodes are redundant, because the probe query filters for those conditions, so by definition they will always be true. We can therefore simplify to:

(col3 = 'a' and col4 <> 'j' and col4 <> 'k') OR
(col3 = 'b' and col4 <> 'j' and col4 <> 'k')

We would make a probe query as follows (there might be bugs here... this just gives the basic idea). Note that this probe pretends that the seed material cannot be generated from SQL inspection.

SELECT col1,
       min(col2), max(col2),  -- to seed noise layers
       min(col3), max(col3),  -- to seed noise layers that are 0e
       min(col4), max(col4),  -- to seed noise layers that are 0e
       min(col3_a_col), max(col3_a_col),  -- to seed noise layer if not 0e
       min(col3_b_col), max(col3_b_col),  -- to seed noise layer if not 0e
       count(DISTINCT col3_a_uid),        -- to determine if LE
       count(DISTINCT col3_b_uid),        -- to determine if LE
       min(col4_a_j_col), max(col4_a_j_col),  -- to seed noise layer if not 0e
       min(col4_a_k_col), max(col4_a_k_col),  -- to seed noise layer if not 0e
       count(DISTINCT col4_a_j_uid),        -- to determine if LE
       count(DISTINCT col4_a_k_uid),        -- to determine if LE
       min(col4_b_j_col), max(col4_b_j_col),  -- to seed noise layer if not 0e
       min(col4_b_k_col), max(col4_b_k_col),  -- to seed noise layer if not 0e
       count(DISTINCT col4_b_j_uid),        -- to determine if LE
       count(DISTINCT col4_b_k_uid)         -- to determine if LE
FROM (
    SELECT col2,     -- needed to seed noisy thresholds
        col3, col4,  -- need to select these columns in case condition is 0e
        CASE WHEN col3 = 'a' AND
                 col4 = 'j' AND col4 <> 'k' THEN uid
             ELSE NULL
        END AS col4_a_j_uid,        -- tests col4<>'j', if parent node not LE
        CASE WHEN col3 = 'a' AND
                 col4 = 'j' AND col4 <> 'k' THEN col4
             ELSE NULL
        END AS col4_a_j_col,        -- matching values
        CASE WHEN col3 = 'a' AND
                 col4 = 'k' AND col4 <> 'j' THEN uid
             ELSE NULL
        END AS col4_a_k_uid,        -- tests col4<>'k', if parent node not LE
        CASE WHEN col3 = 'a' AND
                 col4 = 'k' AND col4 <> 'j' THEN col4
             ELSE NULL
        END AS col4_a_k_col,        -- matching values
        CASE WHEN col3 = 'b' AND
                 col4 = 'j' AND col4 <> 'k' THEN uid
             ELSE NULL
        END AS col4_b_j_uid,        -- tests col4<>'j', if parent node not LE
        CASE WHEN col3 = 'b' AND
                 col4 = 'j' AND col4 <> 'k' THEN col4
             ELSE NULL
        END AS col4_b_j_col,        -- matching values
        CASE WHEN col3 = 'b' AND
                 col4 = 'k' AND col4 <> 'j' THEN uid
             ELSE NULL
        END AS col4_b_k_uid,        -- tests col4<>'k', if parent node not LE
        CASE WHEN col3 = 'b' AND
                 col4 = 'k' AND col4 <> 'j' THEN col4
             ELSE NULL
        END AS col4_b_k_col,        -- matching values
        CASE WHEN col3 = 'a' AND
             col4 <> 'j' and col4 <> 'k' THEN uid
             ELSE NULL
        END AS col3_a_uid,        -- matching UIDs for first OR node
        CASE WHEN col3 = 'a' AND
             col4 <> 'j' and col4 <> 'k' THEN col3
             ELSE NULL
        END AS col3_a_col,        -- matching values
        CASE WHEN col3 = 'b' AND
             col4 <> 'j' and col4 <> 'k' THEN uid
             ELSE NULL
        END AS col3_b_uid,        -- matching UIDs for second OR node
        CASE WHEN col3 = 'b' AND
             col4 <> 'j' and col4 <> 'k' THEN col3
             ELSE NULL
        END AS col3_b_col        -- matching values
    FROM (
        SELECT uid, col2, col3, col4
        FROM table
        -- this WHERE clause filters for the needed data (posands and posors)
        WHERE col2 = 'x' AND
              col3 IN ('a','b')
    ) t
) t
GROUP BY 1

The inner-most SELECT filters on posands and posors. This gives us everything we need to check LE and gather seed values. (Note that this might as well be a good opportunity to float col2 as well. Then you don't have to do it in the main query. As long as you are making a full scan here anyway, the extra work won't cost much.)

Note that the CASE statements for col4 reverse the condition from negative to positive.

To check for LE (per bucket), we would first check count(DISTINCT col3_a_uid) and count(DISTINCT col3_b_uid). These correspond to the two nodes (col1 = ? and col2 = 'x' and col3 = 'a' and col4 <> 'j' and col4 <> 'k') and (col1 = ? and col2 = 'x' and col3 = 'b' and col4 <> 'j' and col4 <> 'k'). If the counts are 0 or 1, then they are LE. If 3 or more, they are not LE.

If 2, then we need to seed a noisy threshold. The seed consists of the set of seed elements of all five conditions in the node. In this particular example we could seed from just SQL inspection plus the value of col1 in the bucket, but for this example let's suppose that is not the case. The values used for the seed material for the first posor then come from min(col3_a_col) and max(col3_a_col) in the probe (all the other symbols are as we do today).

If either of these nodes is LE for a given bucket, then we make a noise layer from the same seed and add this noise layer to the final answer for that bucket. We also adjust the final answer down (because these are posors) by 1 or 2 if the node is 1e or 2e respectively. Note that if both nodes are LE for a given bucket, then most likely the bucket will fail LCF, since LCF is performed after adjustment.

If for a given bucket a posor node is not LE, then the two negands inside the posor are each checked for LE. Note that if the same condition is LE within different posors, then to avoid having duplicate seeds, the second and subsequent seeds have an additional symbol which is a counter and differs for each subsequent layer.

Looking at col4 <> 'j', the check follows the same steps as for the posor above. Namely, if count(DISTINCT col4_a_j_uid) is 0 or 1, the negand node is LE; if 3 or more, it is not LE; and if 2, we make a seed from the node using min(col4_a_j_col) and max(col4_a_j_col) as the values, and use the resulting seed for a noisy threshold check.

If the node is LE, then we'll generate a separate noise layer for this node in the final answer, and adjust the final answer by 1 or 2 if 1e or 2e respectively.
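The per-bucket adjustment described above might be sketched as follows. The direction of the adjustment for LE negands and ranges is my assumption (the text only fixes the direction for posors), and the per-node noise layers are omitted:

```python
def adjust_bucket_answer(answer, le_nodes):
    # le_nodes: list of (kind, affected_count) with affected_count in {0, 1, 2}.
    # An LE posor matched 1 or 2 users, so its contribution is removed by
    # adjusting the count down; reversing the direction for LE negands and
    # ranges (which *excluded* 1 or 2 users) is an assumption here, not
    # something the design text states. Each LE node would also get its own
    # noise layer, omitted in this sketch.
    for kind, affected in le_nodes:
        if kind == "posor":
            answer -= affected
        else:  # "negand" or "range"
            answer += affected
    return answer
```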

Finally, if either or both posors are not LE, then we'll make a composite noise layer. This noise layer is seeded from all conditions in each posor that is not LE, but without duplicate conditions.

By way of example, suppose that the posor (col1 = ? and col2 = 'x' and col3 = 'a' and col4 <> 'j' and col4 <> 'k') is LE and the posor (col1 = ? and col2 = 'x' and col3 = 'b' and col4 <> 'j' and col4 <> 'k') is not LE. Further suppose that the negand col4 <> 'j' is LE. Then the composite noise layer would be seeded from col1 = ?, col2 = 'x', col3 = 'b', and col4 <> 'k'.

If on the other hand both posors are not LE, and the negand col4 <> 'j' is LE, then the composite noise layer would be seeded from col1 = ?, col2 = 'x', col3 = 'b', col3 = 'a', and col4 <> 'k'.

Example 2

SELECT col1, col2, count(*)
FROM table
WHERE col3 = 'x' AND
      (col4 = 'j' OR (col4 = 'k' and col4 <> 'l'))
GROUP BY 1,2

The expression would be normalized to:

(col3 = 'x' AND col4 = 'j') OR
(col3 = 'x' AND col4 = 'k' and col4 <> 'l')

Then we would make a probe query like follows (once again assuming that the seeds cannot be composed from column inspection):

SELECT col1, col2,
       min(col3), max(col3),  -- to seed noise layer if 0e
       min(col4), max(col4),  -- to seed noise layer if 0e
       min(col3_xj_col), max(col3_xj_col),  -- to seed noise layer if not 0e
       min(col4_xj_col), max(col4_xj_col),  -- to seed noise layer if not 0e
       min(col3_xkl_col), max(col3_xkl_col),  -- to seed noise layer if not 0e
       min(col4_xkl_col), max(col4_xkl_col),  -- to seed noise layer if not 0e
       count(DISTINCT col34_xj_uid),          -- to determine if LE
       count(DISTINCT col34_xkl_uid),         -- to determine if LE
       count(DISTINCT col4_l_uid)             -- to determine if LE
FROM (
    SELECT col1, col2,    -- needed for per-bucket checks
           col3, col4,    -- need to select these columns in case condition is 0e
        -- first posor
        CASE WHEN col3 = 'x' AND col4 = 'j' THEN uid
             ELSE NULL
        END AS col34_xj_uid,        -- matching UIDs
        CASE WHEN col3 = 'x' AND col4 = 'j' THEN col3
             ELSE NULL
        END AS col3_xj_col,        -- matching values, col3
        CASE WHEN col3 = 'x' AND col4 = 'j' THEN col4
             ELSE NULL
        END AS col4_xj_col,        -- matching values, col4
        -- second posor
        CASE WHEN col3 = 'x' AND col4 = 'k' and col4 <> 'l' THEN uid
             ELSE NULL
        END AS col34_xkl_uid,        -- matching UIDs
        CASE WHEN col3 = 'x' AND col4 = 'k' and col4 <> 'l' THEN col3
             ELSE NULL
        END AS col3_xkl_col,        -- matching values, col3
        CASE WHEN col3 = 'x' AND col4 = 'k' and col4 <> 'l' THEN col4
             ELSE NULL
        END AS col4_xkl_col,        -- matching values, col4
        -- negand (in case posor is not LE)
        CASE WHEN col4 = 'l' THEN uid
             ELSE NULL
        END AS col4_l_uid,        -- matching UIDs negand
        CASE WHEN col4 = 'l' THEN col4
             ELSE NULL
        END AS col4_l_col        -- matching values negand
    FROM (
        SELECT uid, col1, col2, col3, col4
        FROM table
        -- this WHERE clause filters for the needed data (posands and posors)
        WHERE (col3 = 'x' AND col4 = 'j') OR
              (col3 = 'x' AND col4 = 'k')
    ) t
) t
GROUP BY 1,2

The inner-most select has the normalized expression minus the negand.

The CASE statements filter for three things: the negand, and each of the two posors (both of which consist of multiple posands and negands).

For the purpose of seeding, the conditions in the two posors are:

(col1 = ? and col2 = ? and col3 = 'x' and col4 = 'j')
(col1 = ? and col2 = ? and col3 = 'x' and col4 = 'k' and col4 <> 'l')

Otherwise, LE checks are made as in the preceding example. That is, first checking each posor for LE for each bucket, and if the second posor is not LE, then checking the negand.


Example 3

select cli_district_id, sum(cnt)
from (
    select uid, cli_district_id, count(*) as cnt
    from transactions
    group by 1,2
    having count(*) between 0 and 1000
) t
where cli_district_id in (1,2)
group by 1

Note that this query works on the gda_banking data source of attack.aircloak.com.

An example of a probe query is this (assuming that seeds can be made from SQL inspection):

SELECT cli_district_id,
       count(DISTINCT have_uid),          -- to determine if LE in case others are not
       count(DISTINCT cli_1_uid),         -- to determine if LE
       count(DISTINCT cli_2_uid)          -- to determine if LE
FROM (
    select uid, cli_district_id,
        CASE WHEN cnt < 0 OR cnt > 1000 THEN uid
             ELSE NULL
        END AS have_uid,        -- UIDs outside of HAVING statement
        CASE WHEN cli_district_id = 1 AND
                cnt >= 0 AND cnt <= 1000 THEN uid
             ELSE NULL
        END AS cli_1_uid,        -- matching UIDs
        CASE WHEN cli_district_id = 2 AND
                cnt >= 0 AND cnt <= 1000 THEN uid
             ELSE NULL
        END AS cli_2_uid        -- matching UIDs
    from (
        select uid, cli_district_id, count(*) as cnt
        from transactions
        group by 1,2
    ) t
    where cli_district_id in (1,2)
) t
GROUP BY 1

In this probe query, the seed material for both conditions can be taken from SQL inspection, so no need for floating.

After normalization, the filter expression is this:

(cli_district_id = ? AND cli_district_id = 1 AND cnt >= 0 AND cnt <= 1000) OR
(cli_district_id = ? AND cli_district_id = 2 AND cnt >= 0 AND cnt <= 1000)

In this case, since the same column appears both in the selected columns and in the WHERE clause, there will be duplicate conditions in the seed material, so one of them will be dropped for seeding.

Ranges (BETWEEN) must be individually LE checked just as negands are checked. In other words, if the parent posor is not LE, then the ranges inside the posor are LE checked. As with negands, the range is reversed, so cnt >= 0 AND cnt <= 1000 becomes (cnt < 0 OR cnt > 1000).
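A tiny sketch of the reversal for an inclusive BETWEEN (the helper names are hypothetical):

```python
def reverse_range(col, lo, hi):
    # Reverse an inclusive `col BETWEEN lo AND hi` for the LE probe: the
    # users matching the reversed predicate are exactly the ones the range
    # excluded, which is what the probe's CASE expression counts.
    return f"{col} < {lo} OR {col} > {hi}"

def matches_reversed(value, lo, hi):
    # Row-level check equivalent to the reversed predicate above.
    return value < lo or value > hi
```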

It so happens that in this query neither posor is LE, so we need to check the range to see if it is LE. For seeding, we use the same seed as we do for BETWEEN conditions today. In this case, the range is LE, so we need a separate noise layer for the range.

The composite noise layer in this case would be seeded from two posors, but without the range. Therefore, the seeding material comes from the two conditions cli_district_id = 1 and cli_district_id = 2.


Example 4

select cli_district_id, sum(cnt)
from (
    select uid, cli_district_id, count(*) as cnt
    from transactions
    group by 1,2
    having count(*) between 100 and 200
) t
where cli_district_id in (1,999)
group by 1

This is the same query as Example 3, but with different numbers. The probe query would be the same as well, but here it is just for convenience:

SELECT cli_district_id,
       count(DISTINCT have_uid),          -- to determine if LE in case others are not
       count(DISTINCT cli_1_uid),         -- to determine if LE
       count(DISTINCT cli_999_uid)          -- to determine if LE
FROM (
    select uid, cli_district_id,
        CASE WHEN cnt < 100 OR cnt > 200 THEN uid
             ELSE NULL
        END AS have_uid,        -- UIDs outside of HAVING statement
        CASE WHEN cli_district_id = 1 AND
                cnt >= 100 AND cnt <= 200 THEN uid
             ELSE NULL
        END AS cli_1_uid,        -- matching UIDs
        CASE WHEN cli_district_id = 999 AND
                cnt >= 100 AND cnt <= 200 THEN uid
             ELSE NULL
        END AS cli_999_uid        -- matching UIDs
    from (
        select uid, cli_district_id, count(*) as cnt
        from transactions
        group by 1,2
    ) t
    where cli_district_id in (1,999)
) t
GROUP BY 1

After normalization, the conditions look like this:

(cli_district_id = ? AND cli_district_id = 1 AND cnt >= 100 AND cnt <= 200) OR
(cli_district_id = ? AND cli_district_id = 999 AND cnt >= 100 AND cnt <= 200)

In this case, the second node is LE (0e) because there are no rows with cli_district_id = 999. The first node has 222 distinct UIDs (it is not LE), and so the range must also be checked. The range is not LE: it has 441 distinct UIDs.

This means that the second node has its own separate noise layer, and all conditions of the first node contribute to the composite noise layer.


Example 5

select cli_district_id, sum(cnt)
from (
    select uid, cli_district_id, count(*) as cnt
    from transactions
    where acct_district_id in (1,60)
    group by 1,2
) t
where cli_district_id in (55,74)
group by 1

The normalized expression for this query is:

(cli_district_id = ? and acct_district_id = 1 AND cli_district_id = 55) OR
(cli_district_id = ? and acct_district_id = 1 AND cli_district_id = 74) OR
(cli_district_id = ? and acct_district_id = 60 AND cli_district_id = 55) OR
(cli_district_id = ? and acct_district_id = 60 AND cli_district_id = 74)

The nodes of this expression are spread over two sub-queries. The probe is this:

SELECT cli_district_id,
       count(DISTINCT ac_1_55_uid) AS ac_1_55,
       count(DISTINCT ac_1_74_uid) AS ac_1_74,
       count(DISTINCT ac_60_55_uid) AS ac_60_55,
       count(DISTINCT ac_60_74_uid) AS ac_60_74
FROM (
    select uid, cli_district_id,
        CASE WHEN acct_district_id = 1 AND cli_district_id = 55 THEN uid
             ELSE NULL
        END AS ac_1_55_uid,
        CASE WHEN acct_district_id = 1 AND cli_district_id = 74 THEN uid
             ELSE NULL
        END AS ac_1_74_uid,
        CASE WHEN acct_district_id = 60 AND cli_district_id = 55 THEN uid
             ELSE NULL
        END AS ac_60_55_uid,
        CASE WHEN acct_district_id = 60 AND cli_district_id = 74 THEN uid
             ELSE NULL
        END AS ac_60_74_uid
    from (
        select uid, cli_district_id, acct_district_id, count(*) as cnt
        from transactions
    where acct_district_id in (1,60)
        group by 1,2,3
    ) t
    where cli_district_id in (55,74)
) t
GROUP BY 1

The output of the above query is this:

cli_district_id  ac_1_55  ac_1_74  ac_60_55  ac_60_74
             55        4        0         0         0
             74        0        3         0         4

This means that for bucket 55, the first posor is not LE but the others are. For bucket 74, the second and fourth posors are not LE.

Regarding bucket 55, there would be four noise layers. One is a composite built from the conditions in the first posor (acct_district_id = 1 and cli_district_id = 55), and the other three are individual node noise layers built from the other three posors respectively.

Regarding bucket 74, there are three noise layers. One is a composite built from the conditions in the second and fourth nodes (acct_district_id = 1, acct_district_id = 60 and cli_district_id = 74). The other two are built from the first and third nodes respectively.


Example 6

This example has negands across two sub-queries.

Analyst query:

select cli_district_id, sum(cnt)
from (
    select uid, k_symbol, cli_district_id, operation, count(*) as cnt
    from transactions
    group by 1,2,3,4
    having max(acct_district_id) <> 1
) t
where k_symbol in ('SIPO','UVER') and
      operation <> 'VYBER'
group by 1

Rule 1: remove negands from the probe inner select, which means the HAVING condition and the operation condition are gone from the inner select.

Normalized expression is:

(cli_district_id = ? and k_symbol = 'SIPO' and max(acct_district_id) <> 1 and operation <> 'VYBER') OR
(cli_district_id = ? and k_symbol = 'UVER' and max(acct_district_id) <> 1 and operation <> 'VYBER')

Two posors, and each posor has two negands. This leads to a total of 6 CASE statements (one per posor, one per negand per posor).

SELECT cli_district_id,
       count(DISTINCT ac_S_V_uid) AS ac_S_V,
       count(DISTINCT ac_U_V_uid) AS ac_U_V,
       count(DISTINCT ac_S_1_uid) AS ac_S_1,
       count(DISTINCT ac_U_1_uid) AS ac_U_1,
       count(DISTINCT ac_1_S_V_uid) AS ac_1_S_V,
       count(DISTINCT ac_1_U_V_uid) AS ac_1_U_V
FROM (
    select uid, cli_district_id,
        CASE WHEN max_acct <> 1 AND k_symbol = 'SIPO' AND
                  operation = 'VYBER' THEN uid
             ELSE NULL
        END AS ac_S_V_uid,
        CASE WHEN max_acct <> 1 AND k_symbol = 'UVER' AND
                  operation = 'VYBER' THEN uid
             ELSE NULL
        END AS ac_U_V_uid,
        CASE WHEN max_acct = 1 AND k_symbol = 'SIPO' AND
                  operation <> 'VYBER' THEN uid
             ELSE NULL
        END AS ac_S_1_uid,
        CASE WHEN max_acct = 1 AND k_symbol = 'UVER' AND
                  operation <> 'VYBER' THEN uid
             ELSE NULL
        END AS ac_U_1_uid,
        CASE WHEN max_acct <> 1 AND k_symbol = 'SIPO' AND
                  operation <> 'VYBER' THEN uid
             ELSE NULL
        END AS ac_1_S_V_uid,
        CASE WHEN max_acct <> 1 AND k_symbol = 'UVER' AND
                  operation <> 'VYBER' THEN uid
             ELSE NULL
        END AS ac_1_U_V_uid
    from (
        select uid, cli_district_id, k_symbol, operation,
               max(acct_district_id) as max_acct
        from transactions
        group by 1,2,3,4
    ) t
    where k_symbol in ('SIPO','UVER')
) t
GROUP BY 1

cristianberneanu commented 4 years ago

On a first look, it seems aggregates are ignored and the inner expression is treated as a regular filter, i.e. count(*) doesn't generate any additional noise layers, while sum(price) generates noise layers for price.

yoid2000 commented 4 years ago

@cristianberneanu what do you think of this?

SELECT cli_district_id,
       count(DISTINCT ac3_uid) AS ac3,
       count(DISTINCT ac4_uid) AS ac4,
       count(DISTINCT ac1_uid) AS ac1,
       count(DISTINCT ac2_uid) AS ac2
FROM (
    select uid, cli_district_id,
        CASE WHEN acct_district_id = 1 and (cnt < 0 or cnt > 1000)
             THEN uid
             ELSE NULL
        END AS ac3_uid,
        CASE WHEN acct_district_id = 2 and (cnt < 0 or cnt > 1000)
             THEN uid
             ELSE NULL
        END AS ac4_uid,
        CASE WHEN (acct_district_id = 1 and cnt between 0 and 1000)
             THEN uid
             ELSE NULL
        END AS ac1_uid,
        CASE WHEN (acct_district_id = 2 and cnt between 0 and 1000)
             THEN uid
             ELSE NULL
        END AS ac2_uid
    from (
        select distinct t1.uid as uid, t1.cli_district_id as cli_district_id,
               t2.acct_district_id as acct_district_id, t1.cnt as cnt
        from (
            select uid, cli_district_id, count(*) as cnt
            from transactions
            where acct_district_id in (1,2)
            group by 1,2
        ) t1
        join (
            select uid, cli_district_id, acct_district_id
            from transactions
            where acct_district_id in (1,2)
        ) t2
        on t1.uid = t2.uid and t1.cli_district_id = t2.cli_district_id
    ) t
) t
GROUP BY 1

There are probably nicer ways to write it, but I think it is a way of carrying up acct_district_id without messing up the group by. (I wonder if there is a way to do it without requiring that distinct at the join point.)

cristianberneanu commented 4 years ago

@cristianberneanu what do you think of this?

I think it looks pretty ugly and complex. It is very hard to check the correctness. We should add some limitations, at least in the first version, in order to ease the implementation.

This would probably be easier to compute using window functions. But we don't support this feature yet and I am not sure if we can even support it everywhere.

cristianberneanu commented 4 years ago

Addendum: It would be simpler to check each posor independently and then combine the results using JOINs or issuing multiple queries.

yoid2000 commented 4 years ago

I think it looks pretty ugly and complex. It is very hard to check the correctness. We should add some limitations, at least in the first version, in order to ease the implementation.

This I agree with. I just wanted to have some way of doing it on paper (just to see if it is even possible). This join thing does look to me like a basic way of floating a value that is hidden behind a group by. But having a limitation that you can't have posors or negands behind a group by is probably not a particularly painful limitation.

Addendum: It would be simpler to check each posor independently and then combine the results using JOINs or issuing multiple queries.

Simpler maybe, though multiple queries of course gets expensive...

yoid2000 commented 4 years ago

@cristianberneanu what other edge cases are you worried about?

Now that we have a good basic design, it might be good to decide what functionality we really need at this point (versus stuff we can add later) and come up with an MVP that in any case satisfies Telefonica.

cristianberneanu commented 4 years ago

This join thing does look to me like a basic way of floating a value that is hidden behind a group by.

Simpler maybe, though multiple queries of course gets expensive...

The same JOIN trick could be used to combine the results into a single query. That would potentially be easier to understand and implement. The performance of the method would depend on the query context: if indexes could be used, it might be faster, if a full scan is needed, it could be slower.

But using window functions would be the best fit and would have the best performance. The drawback being that they probably aren't supported on MongoDB.

@cristianberneanu what other edge cases are you worried about?

Besides the last one, there aren't any others on my mind right now.

yoid2000 commented 4 years ago

@sebastian @cristianberneanu

Ok. I think it would be good if we have a separate issue that lays down the requirements/features of a first implementation. Then we can interpret the design given here in light of those requirements...

I'll start that issue in a bit.

yoid2000 commented 4 years ago

@cristianberneanu @sebastian please see #3926

cristianberneanu commented 4 years ago

On a first look, it seems aggregates are ignored and the inner expression is treated as a regular filter, i.e. count(*) doesn't generate any additional noise layers, while sum(price) generates noise layers for price.

Do we need to change something here?

yoid2000 commented 4 years ago

Do we need to change something here?

No, I don't think so. I vaguely remember deciding that it would be pretty hard to generate an attack using count(*) as the isolating condition. If I remember right, there were some tricky aspects to floating that or deciding what the seed should be, so we just let it go.

yoid2000 commented 4 years ago

@cristianberneanu @sebastian

By the way, it occurs to me that we could stick HAVING count(distinct uid) > 1 at the bottom of the probe query and in some cases avoid getting a lot of buckets transmitted that would otherwise be LCF filtered in the main query.

yoid2000 commented 4 years ago

overtaken by #3994