Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform

LED: working design for sum(col), count(col), count(*), and count(distinct uid) #3994

Open yoid2000 opened 4 years ago

yoid2000 commented 4 years ago

Notes on terminology:


The following is my final design for LED. I have coded up this design and tested it against a large number of queries, and am as confident as I can be that it works.

This is an adjustment-based mechanism. That is, it uses one or two probes to determine by how much each bucket needs to be adjusted, and then applies those adjustments to the answers to the normal query. The adjustment is made in a way that mimics what would happen if LE conditions were dropped from the query. The result is that the two queries of a difference-attack pair will always produce identical pre-noise answers. In addition, the noise from any layers that are present in both queries is also identical. The only difference between the answers of the two queries comes from the noise layers present in one query but not the other.

This design covers sum, count(), and count(distinct uid) aggregators. It works with inner SELECT having GROUP BY on UID (though my tests so far are only for queries without inner SELECT).

This design does not include negors.

As with prior designs, this design assumes normalized conditions consisting of one or more and-groups OR'd together. An and-group consists of one or more posands and negands. Also as with prior designs, we detect LE posors and negands. We do not detect LE posands. The term "node" refers to posor and negand conditions.

All posands have both static and uid noise layers. All non-LE nodes also have both static and uid noise layers. All LE nodes have only a static noise layer.

A node is determined to be LE or not using a noisy threshold, similar to how we determine LCF. The mean for the noisy threshold is 4, and the standard deviation is 1. A node is LE if its count of distinct UIDs is less than or equal to the threshold. We set the max noisy threshold to be six. (The purpose for this is to filter out rows with a higher count returned by probe1 for efficiency.)

The design consists of two probes. The first probe is mandatory for any query with nodes (negands or posors). It can be made in parallel with the normal query. The first probe determines if each node is LE or not, or in the case of duplicate negands (two negands in the same posor that have the same effect), whether the duplicate negands might be LE. The first probe produces one row per bucket. The number of CASE statements for the first probe grows linearly with the number of nodes. The number of WHERE conditions in the first probe grows linearly with the number of posors.

The first probe and the results of the normal query are used together to determine if adjustments are needed. If yes, then the second probe is made. The second probe gathers per-UID per-bucket information for the UIDs and buckets that are affected by the LE nodes. In other words, the second probe has a row per UID per bucket, but only for the UIDs and buckets affected by LE nodes (which by definition comprises a small number of UIDs). The number of CASE statements in the second probe grows linearly with the number of LE nodes. The number of WHERE conditions in the second probe grows as the number of bucket/LE-node combinations.

It is important to note that we would like to minimize the amount of adjusting we do. This has two benefits. First it minimizes distortion. Second, the second probe can be costly, so to the extent that we minimize cases where we need to adjust, we minimize the cost of the second probe (or eliminate the need to make it).

To help minimize adjusting, we require that an LE node has certain properties across a so-called "bucket group". A bucket group is a set of buckets whereby all but one of the bucket values are the same. For example, if a query selects two columns, dept and salary, then one bucket group would be all buckets with dept='CS', another bucket group would be all buckets with dept='math', another would be all buckets with salary=10000, and so on.
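
For concreteness, here is a minimal sketch (Python, with hypothetical names) of how buckets could be partitioned into bucket groups; the actual cloak implementation may organize this differently:

from collections import defaultdict

def bucket_groups(buckets):
    # Each bucket is a tuple of selected column values, e.g. (dept, salary).
    # A bucket group fixes all but one column; buckets that agree on the fixed
    # values (and differ only in the remaining column) fall into one group.
    groups = defaultdict(list)
    for bucket in buckets:
        for free_col in range(len(bucket)):
            fixed = tuple((i, v) for i, v in enumerate(bucket) if i != free_col)
            groups[fixed].append(bucket)
    return groups

buckets = [("CS", 10000), ("CS", 20000), ("math", 10000), ("math", 20000)]
for fixed, members in bucket_groups(buckets).items():
    print(fixed, members)
# One group is all buckets with dept='CS', another is all buckets with
# salary=10000, and so on.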

Probe 1

Probe 1 determines, for each bucket, whether each node is:

  1. LE, or
  2. Potentially LE.

A bucket is a single row in the output of probe 1 (or for that matter the normal query). Each bucket has a distinct set of selected column values.

To determine if each node is LE, Probe 1 gathers the following information for each node:

  1. the count of distinct UIDs matching the node,
  2. the minimum matching UID, and
  3. the maximum matching UID.

These three pieces of information are called the "UID signature".

The UID signature is gathered for full-context posor checks, and for both full-context and partial-context negand checks. A full-context posor or negand check is one that includes the context of all negands in the posor. A partial-context negand check is one that does not include other negands, but rather only includes the posands within the posor.

The reason for checking negands in partial-context is as follows. If there are two semantically identical negands in the same posor (i.e. the two negands exclude the same rows), then reversing one of the duplicate negands in a full-context check will result in a UID count of zero. For example, suppose that a posor has two duplicate negands, col <> 5 and col+1 <> 6. When gathering the above information about col <> 5 in full context, the negand is reversed and checked in the context of the complete posor. As a result, the check will include col = 5 AND col+1 <> 6. This necessarily excludes all rows and gives a count of zero UIDs. However, we cannot tell from this whether the negand alone has zero effect, or whether the negand has some effect but is a duplicate.

The partial-context check allows us to determine if a negand is truly zero-effect, or rather is a duplicate. If it is zero-effect, then it will also be zero-effect in the partial-context. If it is not zero-effect, then it must be a duplicate. The partial-context check also allows us to determine which negands are duplicates of each other. We assume that this is the case when two or more negands with zero-effect in the full-context check have the same uid count, min uid, and max uid in the partial-context check.

By way of example, consider the following analyst query:

SELECT frequency AS c1,
       sum(amount) AS agg1
FROM transactions
WHERE (birth_number = '535812')
  OR (acct_district_id = 2
      AND (birth_number <> '333333')
      AND (bank <> 'CD'))
GROUP BY 1

This query has two posors. The first consists of a single posand, and the second has one posand and two negands.

For the normal query, the cloak would transform the query into something like this:

SELECT frequency AS c1,
       min(uid) AS min_uid,
       max(uid) AS max_uid,
       count(distinct uid) AS cnt_uid,
       sum(agg) AS agg1
FROM
  (SELECT frequency,
          uid,
          sum(amount) as agg
   FROM transactions
   WHERE (birth_number = '535812')
     OR (acct_district_id = 2
         AND (birth_number <> '333333')
         AND (bank <> 'CD'))
   GROUP BY 1,
            2) t
GROUP BY 1

The following is an example of probe 1 for the above query:

     1  SELECT frequency AS c1,
     2         count(DISTINCT p1_uid) AS p1_full_cnt,
     3         min(p1_uid) AS p1_full_min,
     4         max(p1_uid) AS p1_full_max,
     5         count(DISTINCT p2n1_uid) AS p2n1_full_cnt,
     6         min(p2n1_uid) AS p2n1_full_min,
     7         max(p2n1_uid) AS p2n1_full_max,
     8         count(DISTINCT p2n1_dup) AS p2n1_part_cnt,
     9         min(p2n1_dup) AS p2n1_part_min,
    10         max(p2n1_dup) AS p2n1_part_max,
    11         count(DISTINCT p2n2_uid) AS p2n2_full_cnt,
    12         min(p2n2_uid) AS p2n2_full_min,
    13         max(p2n2_uid) AS p2n2_full_max,
    14         count(DISTINCT p2n2_dup) AS p2n2_part_cnt,
    15         min(p2n2_dup) AS p2n2_part_min,
    16         max(p2n2_dup) AS p2n2_part_max,
    17         count(DISTINCT p2_uid) AS p2_full_cnt,
    18         min(p2_uid) AS p2_full_min,
    19         max(p2_uid) AS p2_full_max
    20  FROM
    21    (SELECT frequency,
    22            uid,
    23            CASE
    24                WHEN birth_number = '535812' THEN uid
    25                ELSE NULL
    26            END AS p1_uid,
    27            CASE
    28                WHEN acct_district_id = 2
    29                     AND birth_number = '333333'
    30                     AND bank <> 'CD' THEN uid
    31                ELSE NULL
    32            END AS p2n1_uid,
    33            CASE
    34                WHEN acct_district_id = 2
    35                     AND birth_number = '333333' THEN uid
    36                ELSE NULL
    37            END AS p2n1_dup,
    38            CASE
    39                WHEN acct_district_id = 2
    40                     AND birth_number <> '333333'
    41                     AND bank = 'CD' THEN uid
    42                ELSE NULL
    43            END AS p2n2_uid,
    44            CASE
    45                WHEN acct_district_id = 2
    46                     AND bank = 'CD' THEN uid
    47                ELSE NULL
    48            END AS p2n2_dup,
    49            CASE
    50                WHEN acct_district_id = 2
    51                     AND birth_number <> '333333'
    52                     AND bank <> 'CD' THEN uid
    53                ELSE NULL
    54            END AS p2_uid
    55     FROM
    56       (SELECT frequency,
    57               uid,
    58               birth_number,
    59               acct_district_id,
    60               bank
    61        FROM transactions
    62        WHERE (birth_number = '535812')
    63          OR (acct_district_id = 2) ) t) t
    64  GROUP BY 1
    65  HAVING count(DISTINCT p1_uid) < 7
    66  OR count(DISTINCT p2_uid) < 7
    67  OR count(DISTINCT p2n1_uid) < 7
    68  OR count(DISTINCT p2n2_uid) < 7

The inner-most select (lines 56-63) has been modified so that the negands are removed from the WHERE clause. This allows us to measure the effect of the negands in the next higher select (CASE statements). In addition, the columns in the original WHERE clause (birth_number, acct_district_id, and bank) are selected so that their effect can be measured.

There are six case statements. Two are for the two posors (labeled p1_uid and p2_uid). Both are full context. You can see this in p2_uid, which contains both negands in the WHEN clause (lines 51 and 52).

Of the other four CASE statements, two are for the full-context negands, and two for the partial-context negands. The full-context CASE statements are p2n1_uid and p2n2_uid, and the partial-context CASE statements are p2n1_dup and p2n2_dup. Note for example that p2n1_uid has the reversed negand (birth_number = '333333') and the other negand, while p2n1_dup does not have the other negand.

The column col for each CASE statement is used to compute three aggregates at the next level up: count(DISTINCT col), min(col), and max(col) (i.e. the UID signature). This results in 18 such columns plus the original selected column (frequency). The column names contain full for full-context measures, and part for partial-context measures. Columns starting with pX_ are for posors, X being the posor number (1,2,...). Columns starting with pXnY_ are for negands, Y being the negand number (1,2,...).

Note finally that I have added a HAVING clause to the outer-most SELECT. This is an optimization that filters out any probe 1 output buckets for which there are definitely no LE nodes.

Ranges are treated like negands in that they are removed from the WHERE clause, and reversed in the CASE statements to include everything but the range. It seems to me that this can sometimes be grossly inefficient. For instance if an analyst asks for a range of one week, if we remove this from the WHERE clause we'll end up processing the entire database. It should in general be possible to determine in advance if a range is very unlikely to be LE, and then exclude it from all the computation. I'll make a separate issue for this.

LIKE conditions can be treated as in #3927.

The output of probe 1 is processed in conjunction with the output of the normal query as follows. The ultimate goal is to label each bucket/node combination with two tags. The first tag is a boolean that indicates whether the combination requires a second probe or not. The second tag takes one of three values indicating whether 1) the node is LE for the bucket (and therefore does not get a UID-noise layer), 2) potentially LE (in which case it is a duplicate and the second probe is needed to determine if it is LE), or 3) not LE (in which case the node gets a UID noise layer but no adjustment).

A bucket/node is determined to be LE as follows.

To determine if a bucket/node is LE, a noisy threshold with hard bounds is used. The hard bounds are 2 and 6. Any bucket/node with a distinct UID count less than 2 is LE, and any bucket/node with a distinct UID count greater than 6 is not LE. For distinct counts between 2 and 6, the noise value is seeded with the UID signature. The mean for the noisy threshold is 4, and the standard deviation is 1. A node is LE if its count of distinct UIDs is less than or equal to the threshold.
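
A minimal sketch of this decision, assuming the noise is drawn from a Gaussian seeded by the UID signature (the actual seeding and PRNG used by the cloak may differ):

import hashlib
import random

def is_le(distinct_uids, uid_signature, mean=4, sd=1, lower=2, upper=6):
    # Hard bounds: below 2 is always LE, above 6 is never LE.
    if distinct_uids < lower:
        return True
    if distinct_uids > upper:
        return False
    # Seed the noisy threshold deterministically from the UID signature
    # (count, min uid, max uid), so repeated queries get the same decision.
    seed = hashlib.sha256(repr(uid_signature).encode()).hexdigest()
    threshold = random.Random(seed).gauss(mean, sd)
    threshold = max(lower, min(upper, threshold))   # clamp to the hard bounds
    return distinct_uids <= threshold

print(is_le(1, (1, "u17", "u17")))   # True: below the hard lower bound
print(is_le(9, (9, "u01", "u98")))   # False: above the hard upper bound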

If a bucket/node is a posor and has zero UIDs, then the bucket/posor is labeled as LE, but a second probe is not needed since no adjustment is needed. If the bucket/posor is LE with one or more UIDs, then we need to determine if the LE posor might be participating in a difference attack. If not, then a second probe and subsequent adjustment is not needed.

To determine if an LE posor might be participating in a difference attack, we check its UID signature across all buckets in each bucket group. If, for any bucket group, the posor has one or two distinct UID signatures, then it might be participating in a difference attack and a second probe is needed to determine its adjustment. If, on the other hand, the posor has more than two distinct UID signatures for all bucket groups, then it is assumed not to be participating in a difference attack, and no second probe or adjustment is needed.
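
A sketch of this check, with UID signatures represented as (count, min, max) tuples and bucket groups keyed arbitrarily (the names are illustrative):

def needs_second_probe(signatures_by_group):
    # signatures_by_group: {bucket_group: [uid_signature for each bucket in the group]}
    # The node might be part of a difference attack if, in any bucket group,
    # it shows only one or two distinct UID signatures across the buckets.
    return any(len(set(sigs)) <= 2 for sigs in signatures_by_group.values())

# A node that isolates the same victim in every bucket of a group produces
# the same signature throughout, i.e. a single distinct signature.
print(needs_second_probe({"dept=CS": [(1, "u7", "u7"), (1, "u7", "u7"), (1, "u7", "u7")]}))  # True
print(needs_second_probe({"dept=CS": [(4, "u1", "u9"), (3, "u2", "u8"), (5, "u1", "u7")]}))  # False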

If a bucket/posor is subject to a second probe and adjustment, then the negands in the posor do not need to be checked. Otherwise the negands need to be checked, as follows.

If the full-context UID signature for a given bucket/negand has a UID count of zero, then a check needs to be made to determine if the negand is a duplicate.

Two (or more) negands are duplicates if 1) they all have full-context UID counts of zero, 2) the partial-context UID signatures are identical (same min, max, and count), and 3) these two conditions hold across all buckets for at least one bucket group. (The idea here is that if two negands are not duplicates across all buckets of a bucket group, then they are not semantically identical, and so it would be hard to find two such negands that nevertheless enable a difference attack.)
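
A sketch of the duplicate test for a single pair of negands over one bucket group (the check must succeed for at least one bucket group; names are illustrative):

def are_duplicates(negand_a, negand_b, buckets_in_group, info):
    # info[bucket][negand] -> (full_context_uid_count, partial_context_signature)
    # where the partial-context signature is a (count, min, max) tuple.
    for bucket in buckets_in_group:
        full_a, part_a = info[bucket][negand_a]
        full_b, part_b = info[bucket][negand_b]
        if full_a != 0 or full_b != 0 or part_a != part_b:
            return False
    return True

info = {"b1": {"p2n1": (0, (3, "u1", "u9")), "p2n2": (0, (3, "u1", "u9"))}}
print(are_duplicates("p2n1", "p2n2", ["b1"], info))   # True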

All but one of the duplicate negands are subsequently ignored.

A further check of the remaining duplicate negand is made to determine whether it is potentially LE or not LE. If there is more than one set of duplicate negands in the posor, then we consider them all to be potentially LE. If there is only one set of duplicate negands, then we take the sum of the full-context UID counts of the other negands in the posor and subtract it from the partial-context UID count of the remaining duplicate negand, producing a modified UID count. Using this modified UID count, we apply a noisy threshold to decide whether the remaining duplicate negand is LE. If it is not LE, it is regarded as not LE and no second probe is needed for it. Otherwise it is regarded as potentially LE and a second probe is needed.

If a negand is not a duplicate, then its full-context UID signature is used to determine if the negand is LE or not. If it has a UID count of zero, then it is LE and a second probe is not needed for this negand. If it is not LE, then a second probe is not needed.

If it is LE with a non-zero UID count, then we determine if the negand might be participating in a difference attack, in the same way as described for posors above. If it might be, then a second probe and adjustment is needed.

Probe 2

After the above procedure, some set of bucket/nodes may be tagged as requiring a second probe. The following description is limited to these bucket/nodes.

Bucket/nodes that are LE will require adjustment. For bucket/nodes that are potentially LE (because they were duplicates), the information gained from the second probe is used first to determine whether they are LE, and if so to make the needed adjustment.

The second probe generates per-UID rows. However, it generates such rows only for the UIDs that are affected by the tagged bucket/nodes. To do this, the WHERE clause of the inner-most SELECT filters for only these UIDs. It does this by specifically filtering for the bucket values and partial-context node conditions.

The reason that we need per-UID information is because we need to know how each individual UID is affected to adjust properly. For instance, consider the case where a negand excludes UIDs u1, u2, and u3 for a given bucket, and is LE for that bucket. In our adjustment, we wish to mimic what would happen if the negand were dropped. If the UIDs were not affected by any other negands, then we would want to adjust upwards to mimic the rows associated with the UIDs being included. However, suppose another non-LE negand in the same posor happens to exclude u2. In this case, u2 would still be excluded even if the LE negand were dropped. So in order to know whether to adjust for each UID, we need to know how it is affected by other nodes.

In addition, in the case of count(*) and sum(col) aggregates, we need to know how much the UID contributes to those aggregates. (In the case of count(DISTINCT uid), we know that the contribution is 1.) Therefore we also need per-UID information to know by how much to adjust.

More specifically, the following rules are used to determine how to adjust with respect to each UID (a sketch follows the list):

  1. If any posor that is not LE includes the UID, then the query is safe with respect to that UID (does not need adjustment)
  2. If a UID is excluded from all posors, and no negands that explicitly exclude the UID are LE, then the query is safe with respect to that UID (does not need adjustment)
  3. If all posors that include the UID are LE, then the query is not safe with respect to that UID, and an adjustment down should be made.
  4. If all posors exclude the UID, but one or more negands that explicitly exclude the UID are LE, then the query is not safe with respect to that UID, and an adjustment up should be made.
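
A minimal sketch of these four rules (Python; the inputs are assumed to have been derived from probe 2):

def adjustment_for_uid(including_posors_le, excluding_le_negands):
    # including_posors_le: for each posor that includes the UID, whether it is LE.
    # excluding_le_negands: LE negands that explicitly exclude the UID.
    if any(not is_le for is_le in including_posors_le):
        return "none"   # rule 1: some non-LE posor includes the UID
    if including_posors_le:
        return "down"   # rule 3: the UID is included only via LE posors
    if excluding_le_negands:
        return "up"     # rule 4: excluded, but only because of LE negands
    return "none"       # rule 2: excluded for reasons unrelated to LE nodes

print(adjustment_for_uid([True], []))          # 'down'
print(adjustment_for_uid([], ["p2n2"]))        # 'up'
print(adjustment_for_uid([True, False], []))   # 'none'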

The following is an example of the second probe continuing the above examples. Recall that in the above example, there are two posors, and the second posor has two negands. It so happens that for this query there are three possible values for the selected column frequency: 'POPLATEK MESICNE', 'POPLATEK PO OBRATU', and 'POPLATEK TYDNE'.

 1  SELECT frequency AS c1,
 2         uid,
 3         sum(p1_agg) AS p1_adj1,
 4         sum(p2n1_agg) AS p2n1_adj1,
 5         sum(p2n2_agg) AS p2n2_adj1,
 6         sum(p2_agg) AS p2_adj1
 7  FROM
 8    (SELECT frequency,
 9            uid,
10            CASE
11                WHEN birth_number = '535812' THEN amount
12                ELSE NULL
13            END AS p1_agg,
14            CASE
15                WHEN acct_district_id = 2
16                     AND birth_number = '333333' THEN amount
17                ELSE NULL
18            END AS p2n1_agg,
19            CASE
20                WHEN acct_district_id = 2
21                     AND bank = 'CD' THEN amount
22                ELSE NULL
23            END AS p2n2_agg,
24            CASE
25                WHEN acct_district_id = 2 THEN amount
26                ELSE NULL
27            END AS p2_agg
28     FROM
29       (SELECT frequency,
30               uid,
31               amount,
32               birth_number,
33               acct_district_id,
34               bank
35        FROM transactions
36        WHERE (frequency IN ('POPLATEK MESICNE')
37               AND birth_number = '535812')
38          OR (frequency IN ('POPLATEK PO OBRATU')
39              AND acct_district_id = 2)
40          OR (frequency IN ('POPLATEK MESICNE')
41              AND acct_district_id = 2
42              AND bank = 'CD') ) t) t
43  GROUP BY 1,2

The WHERE clause selects only the UIDs that may need adjustment. In this particular case, posor p1 (birth_number = '535812') requires the second probe for frequency='POPLATEK MESICNE', lines 36-37.

Posor p2 (acct_district_id = 2 AND birth_number <> '333333' AND bank <> 'CD') requires the second probe for frequency='POPLATEK PO OBRATU', lines 38-39. Note this is partial-context so that the UIDs that would otherwise have been excluded by the negands are included.

Negand p2n2 (bank <> 'CD') requires the second probe for frequency='POPLATEK MESICNE', lines 40-42.

The aggregate of the original analyst query is sum(amount). The CASE statements therefore select amount for each of the nodes, and the outermost SELECT computes the corresponding sums (lines 3-6). Note that the CASE statements here are also partial context. This allows us to see exactly what UIDs each node is affecting.

With this example as background, let's state the complete rules for generating probe 2.

The inner-most select has a WHERE clause that filters for rows affected by LE nodes. As such, there is an expression for each LE node, and these expressions are OR'd together.

WHERE ->           node1-expression OR node2-expression OR ...
node-expression -> part1 AND part2
part1 ->           bucket1 OR bucket2 OR ...
bucket ->          col1=val1 AND col2=val2 AND ...   (GROUP BY columns)
part2 ->           partial-context node 

Each expression consists of two parts, each part AND'd together. The first part includes each of the buckets to which the node applies, OR'd together. If the bucket consists of multiple columns (i.e. the GROUP BY columns), then each is AND'd together. In the above example I show col1=val1, but these could of course be ranges or string functions, etc. The second part is the partial-context node itself.
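
A sketch of how the probe-2 WHERE clause could be assembled from this grammar (illustrative string building only; real bucket values might be ranges or other expressions rather than simple equalities):

def probe2_where(le_nodes):
    # le_nodes: list of (partial_context_condition, buckets), where each bucket
    # is a dict mapping a GROUP BY column to its value for that bucket.
    def bucket_expr(bucket):
        return "(" + " AND ".join(f"{col} = '{val}'" for col, val in bucket.items()) + ")"

    node_exprs = []
    for condition, buckets in le_nodes:
        part1 = " OR ".join(bucket_expr(b) for b in buckets)   # buckets the node applies to
        node_exprs.append(f"(({part1}) AND ({condition}))")     # part1 AND part2
    return "WHERE " + " OR ".join(node_exprs)

print(probe2_where([
    ("birth_number = '535812'", [{"frequency": "POPLATEK MESICNE"}]),
    ("acct_district_id = 2 AND bank = 'CD'", [{"frequency": "POPLATEK MESICNE"}]),
]))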

The next SELECT consists of CASE statements, one per LE node per analyst aggregate. The WHEN clause is the partial-context node. If the aggregate is sum(col), then the THEN clause is col (as in the above example where col is amount). If the aggregate is any count (count(*), count(col), or count(DISTINCT uid)), then the THEN clause is 1. (If count(col), then the WHEN must additionally filter out NULL rows with AND col IS NOT NULL.) Note in particular that if the aggregate is count(DISTINCT uid), the THEN should be 1, not uid. The reason is explained later.
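
A sketch of this rule for building the probe-2 CASE expression (the aggregate labels are illustrative):

def probe2_case(node_condition, aggregate, column=None):
    # sum(col)                      -> THEN col
    # count(*) / count(DISTINCT uid) -> THEN 1
    # count(col)                    -> THEN 1, with NULL rows filtered in the WHEN
    if aggregate == "sum(col)":
        when, then = node_condition, column
    elif aggregate == "count(col)":
        when, then = f"{node_condition} AND {column} IS NOT NULL", "1"
    else:  # "count(*)" or "count(DISTINCT uid)": the contribution is 1, not uid
        when, then = node_condition, "1"
    return f"CASE WHEN {when} THEN {then} ELSE NULL END"

print(probe2_case("acct_district_id = 2 AND bank = 'CD'", "sum(col)", "amount"))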

The next SELECT does a GROUP BY on the GROUP BY columns of the original analyst query plus the uid column. This is followed by one aggregate per CASE statement which is the sum of the CASE statement column.

The answer to probe 2 is processed as follows.

Note that for each aggregate and each bucket, the normal cloak query generates four items: the aggregate itself as well as the UID signature (min, max, and count distinct). When adjusting, all four of these items need to be accounted for and potentially modified. The output of the adjustment procedure is the change required to each of the four items for each bucket. These changes are then applied to the corresponding items in the normal cloak query.

Each UID potentially changes each of the items. The final change to the aggregate value and UID count is the sum of changes from each UID. The final change to the min and max UID is that of the remaining min and max UID after individual UIDs have been added and removed.

Note that it is possible for a UID to cause a change to the aggregate value without causing a change to the uid count. This can happen in cases where a given node affects some of the database rows associated with a given UID, but not others. So for instance a UID might have 100 rows total in the database. A given LE negand might exclude all 100 of those rows, while a non-LE posor includes 50 of the rows. In this case, we only want to adjust up for the 50 rows actually excluded by the normal query. Furthermore, the UID is included in any event, so the UID signature is not adjusted at all.

The basic idea in determining the change per UID per bucket is to first determine if the UID is included or excluded from the original query, and if included, how much it contributes to each aggregate in the bucket. I call this the "native" result for the UID/bucket. Then we determine the same thing, but this time ignoring the effect of any LE nodes. I call this the "non-LE" result for the UID/bucket. The change is the difference between the native and non-LE results.

Note that the change in the UID count is always correct. There is a change of 1 if the UID is included in one result and excluded from the other. The contribution to the aggregate, on the other hand, may not be strictly speaking correct. It is possible, for instance, for a given posor to include certain rows, and another posor to include certain other rows, with some overlap. Probe 2 cannot detect this overlap. As such, we simply take the contribution from the posor that contributes the most as the total contribution. This is safe because it is not really possible to execute a difference attack with nodes that only partially include or exclude a UID.
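
A sketch of the per-UID change computation under this approximation (the inputs are the per-posor contributions of one UID to one bucket, first as in the native query and then ignoring LE nodes):

def per_uid_change(native_contribs, non_le_contribs):
    # Probe 2 cannot see overlap between posors, so the total contribution of a
    # UID is approximated by its largest single-posor contribution.
    native_total = max(native_contribs, default=0)
    non_le_total = max(non_le_contribs, default=0)
    agg_change = non_le_total - native_total
    # The UID count changes by 1 only when inclusion/exclusion itself changes.
    uid_count_change = int(bool(non_le_contribs)) - int(bool(native_contribs))
    return agg_change, uid_count_change

# A UID excluded by an LE negand in the native query, but contributing 50 to
# sum(amount) once that negand is ignored: adjust the sum up by 50, the count by 1.
print(per_uid_change([], [50]))    # (50, 1)
# A UID whose inclusion does not change, only its contribution: count unchanged.
print(per_uid_change([50], [100])) # (50, 0)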

Adjusting the per-bucket min or max UID works as follows. We make a list of known UIDs, including the min and max UID from the normal cloak query and the UIDs from probe 2. Obviously this doesn't include all UIDs that ultimately contribute to the normal answer, rather only the known ones. When we remove a UID during adjustment, we remove it from the list. When we add a UID during adjustment, we keep it in the list. We then take the min and max from this list, and use it for the adjusted min and max (which we then use as seed material in UID noise layers).
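
A minimal sketch of this bookkeeping (names are illustrative):

def adjusted_min_max(normal_min, normal_max, probe2_uids, removed):
    # Known UIDs: the min and max from the normal query plus the probe-2 UIDs.
    # UIDs removed by the adjustment leave the list; UIDs added by it stay in
    # it (they are already among the probe-2 UIDs).
    known = {normal_min, normal_max} | set(probe2_uids)
    known -= set(removed)
    return min(known), max(known)

# The adjusted min/max are then used as seed material for the UID noise layers.
print(adjusted_min_max("u03", "u98", ["u01", "u50"], removed=["u01"]))  # ('u03', 'u98')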

Note that the min or max we end up with may not be the true min or max from the underlying data that contributes to the cloak answer. This is OK, because both the left and right adjusted queries ultimately generate the same final list. So long as both the adjusted left and right queries agree on the min and max UID, the answers will be the same.

When the aggregate is sum(col), and the column is real, then there can be machine representation errors between the cloak and the database. Unless accounted for, this could be exploited by an attacker to determine if an adjustment took place or not. As a result, we need to add an additional noise component, whose magnitude is comparable to the machine representation error, to all sum(col) aggregates on real columns, whether or not an adjustment took place, or indeed whether or not there are any negands or posors. Alternatively we could do some kind of rounding to hide any such error. We can put this in a separate issue to determine how best to do it.

yoid2000 commented 4 years ago

I generated a walk-through example for the query illustrated in this issue. You can find it here:

https://gist.github.com/yoid2000/cb344a5edfed1f77bf806cebc88462b3

The code itself is posted in these two files:

https://gist.github.com/yoid2000/ba437fd39965e7f0cddc2aa077e80891 and https://gist.github.com/yoid2000/11f22aca0bf91edb27eb0146c6606163.

yoid2000 commented 4 years ago

I placed a second example here:

https://gist.github.com/yoid2000/eef3c33ffda9314401b7a48b6b0568e4

It demonstrates duplicate negands.

@sebastian @cristianberneanu that is all the examples I plan to put for now. Awaiting your comments...

sebastian commented 4 years ago
  1. Note: I added some glossary information to the top of the issue to clarify some terms which might not be obvious to new(er) readers
  2. Thanks for the hard work, @yoid2000
  3. This implementation scares me! It's so complex that it's difficult to understand and reason about, going to be very hard to implement correctly, and it's going to be a daunting task to review the correctness of that implementation.
  4. I believe @cristianberneanu is right in that a selected CASE statement likely is easier. The reason I have come to appreciate is that a posor or negand in the WHERE-clause can affect the contents of any of the selected columns and aggregates (hence needs a very thorough handling) whereas a selected CASE statement is limited to the resulting column (and subsequent uses in say an aggregate) alone.
  5. From the "analysis of OR-usage by Telefonica" issue it is clear that we get the most bang for the buck by first implementing support for selected CASE in and outside of aggregates with support for inequalities (<, <=, >=, and >) as well as simple equalities (col = constant) [gets us to 62% of the OR-usage we have seen]. It's not clear to me how this LED machinery needs to be adjusted to support this, but I believe you said you would think through how to support selected CASE @yoid2000, did I understand that correctly?

This design covers sum, count(), and count(distinct uid) aggregators. It works with inner SELECT having GROUP BY on UID (though my tests so far are only for queries without inner SELECT).

What about the other aggregates? I suppose we get avg from sum and count, but that leaves min, max, and stddev. We cannot release something that does not work across all the aggregates. It would lead to very confusing and inconsistent query behavior.

The first probe and the results of the normal query are used together to determine if adjustments are needed. If yes, then the second probe is made.

I am still a bit woolly on exactly the conditions that would trigger the second probe, but at least at first read this sounds like it makes for a very clear and enticing timing attack? I.e. I want to attack victim V and add a clause isolating V on the property I want to detect in such a way that the second probe is issued when the condition I am after verifying is met. Hence irrespective of what the corrected answer is I'll already know the true value in the database based on the execution time alone.

sebastian commented 4 years ago

Just some additional notes. Even though I noted that I believe selecting a CASE-statement is easier (in point 4 and 5 above) it clearly is not without danger either!

Consider the two following example queries:

SELECT
  ...
FROM (
  SELECT 
    uid,
    CASE
      WHEN ... THEN ...
      ...
    END as col1
  FROM table
  GROUP BY uid
)
WHERE col1 = 'bar'

SELECT
  col1,
  aggregate(...)
FROM (
  SELECT 
    uid,
    CASE
      WHEN ... THEN ...
      ...
    END as col1,
    ...
  FROM table
  GROUP BY uid
)
GROUP BY col1

In the first one the WHERE-clause on col1 effectively is a complex OR-clause needing treatment much like in @yoid2000's design above.

In the second case the selected CASE-statement has a direct impact on the subsequent aggregate that needs to be accounted for.

yoid2000 commented 4 years ago

This implementation scares me! It's so complex that it's difficult to understand and reason about, going to be very hard to implement correctly, and it's going to be a daunting task to review the correctness of that implementation.

Yes. In the end I had to do a quite thorough fuzz test to convince myself that this was solid. The fuzz test only checks to make sure that a left and right query, where the right query has had one or more dropped conditions, result in the same answer.

yoid2000 commented 4 years ago

It's not clear to me how this LED machinery needs to be adjusted to support this, but I believe you said you would think through how to support selected CASE @yoid2000, did I understand that correctly?

In a way I'm still confused as to what TF wants. Most of the examples they give don't have OR semantics at all. The kinds of CASE statements that have OR semantics are this sort:

WHEN kat_2 = 'Hardware'
     AND kat_4 LIKE '%unvollst%' THEN
        'Y'
WHEN kat_2 = 'Hardware'
     AND kat_4 LIKE 'Unangemess%' THEN
        'Y'
WHEN kat_2 = 'Hardware'
     AND kat_4 LIKE 'Nichterha%' THEN
        'Y'

which is equivalent to:

  WHERE ( kat_2 = 'Hardware' AND kat_4 LIKE '%unvollst%' ) OR
        ( kat_2 = 'Hardware' AND kat_4 LIKE 'Unangemess%' ) OR
        ( kat_2 = 'Hardware' AND kat_4 LIKE 'Nichterha%' )

But the vast majority of the TF examples don't have OR semantics.

So we need to decide what we want to implement.

yoid2000 commented 4 years ago

I suppose we get avg from sum and count, but that leaves min, max, and stddev.

We should very seriously consider dropping native support for min and max, and rather get it from Aircloak Explorer (i.e. by taking a histogram and returning the biggest and smallest bucket).

In fact, we should do the same for stddev.

Speaking of count(distinct col), a difference attack is harder to carry out with this, but not impossible. I'm unsure where we stand on implementation of stats-based count(distinct col).

yoid2000 commented 4 years ago

I am still a bit woolly on exactly the conditions that would trigger the second probe, but at least at first read this sounds like it makes for a very clear and enticing timing attack? I.e. I want to attack victim V and add a clause isolating V on the property I want to detect in such a way that the second probe is issued when the condition I am after verifying is met.

This doesn't easily work because of the noisy threshold.

sebastian commented 4 years ago

I'm unsure where we stand on implementation of stats-based count(distinct col).

It shipped as part of 19.3. In other words we have stats based count(distinct col) in our product today.

yoid2000 commented 4 years ago

It shipped as part of 19.3. In other words we have stats based count(distinct col) in our product today.

Based on #3563 I presume. I'll think about what this means for LED.

sebastian commented 4 years ago

My suggestion is that we implement support for OR and LED in a set of tranches of increasing complexity. That way we get to something that has value much sooner. I fear that doing the whole shebang is going to be a project that either takes us to the end of 2020 or never finishes.

I suggest the following steps, but would love feedback and ideas. @cristianberneanu, any thoughts?

  1. CASE with "If A then B"-semantic, and nvl
  2. CASE statement explicitly only inside an aggregate
  3. Freestanding CASE-statement outside an aggregate without additional aggregates
  4. Freestanding CASE-statement outside an aggregate with additional aggregates
  5. Freestanding CASE-statement where the CASE-statement result value can also be used in a WHERE-clause
  6. Full implementation

Example queries for each of these steps are:

CASE with "If A then B"-semantic, and nvl

SELECT
  nvl(amount, 0)
FROM table

SELECT 
  trinketSize,
  sum(nvl(amount, 0))
FROM table
GROUP BY trinketSize

SELECT 
  CASE -- only single WHEN THEN ELSE
    WHEN trinketSize BETWEEN 100 and 200 THEN 'LARGE'
    ELSE 'SMALL'
  END as sizeClassification,
  trinketSize,
  sum(nvl(amount, 0))
FROM table
GROUP BY sizeClassification

SELECT
  month,
  SUM(
    CASE 
      WHEN aboutToLeave THEN 1 
      ELSE 0
    END
  ) as potentialLeavers
FROM table
GROUP BY 1

CASE statement explicitly only inside an aggregate

-- Allowed query
SELECT
  trinketSize,
  SUM(
    CASE 
      WHEN vendor = 'WHALER' THEN 0
      WHEN vendor = 'BOB' THEN 10
      WHEN vendor = 'ALICE' THEN 10
      WHEN vendor = 'Cynthia' THEN 15
      ELSE 2 
    END
  )
FROM table
GROUP BY trinketSize

-- NOT allowed because CASE outside of aggregate
-- which requires subsequent adjustments to the 
-- surrounding aggregates
SELECT 
  CASE 
    WHEN vendor = 'WHALER' THEN 0
    WHEN vendor = 'BOB' THEN 10
    WHEN vendor = 'ALICE' THEN 10
    WHEN vendor = 'Cynthia' THEN 15
    ELSE 2 
  END as vendorPoints,
  SUM(trinketSize)
FROM table
GROUP BY vendorPoints

-- NOT allowed although semantically equivalent to the allowed case in this instance
-- since suddenly someone will end up adding in an aggregate at the level of the CASE-statement
SELECT
  trinketSize,
  SUM(vendorPoints)
FROM (
  SELECT
    trinketSize,
    CASE 
      WHEN vendor = 'WHALER' THEN 0
      WHEN vendor = 'BOB' THEN 10
      WHEN vendor = 'ALICE' THEN 10
      WHEN vendor = 'Cynthia' THEN 15
      ELSE 2 
    END as vendorPoints
  FROM table
)
GROUP BY trinketSize

The extent to which aggregates require readjustment is limited here. We would need to think through what types of restrictions should exist...

Freestanding CASE-statement outside an aggregate without additional aggregates

I am not 100% sold on whether this additional step adds much value. But here goes:

SELECT 
  CASE 
    WHEN vendor = 'WHALER' THEN 0
    WHEN vendor = 'BOB' THEN 10
    WHEN vendor = 'ALICE' THEN 10
    WHEN vendor = 'Cynthia' THEN 15
    ELSE 2 
  END as vendorPoints,
  trinketSize
FROM table

Freestanding CASE-statement outside an aggregate with additional aggregates

SELECT 
  vendorPoints,
  AVG(vendorPoints),
  SUM(totalTrinketSize) as allTrinkets
FROM (
  SELECT 
    uid,
    CASE 
      WHEN vendor = 'WHALER' THEN 0
      WHEN vendor = 'BOB' THEN 10
      WHEN vendor = 'ALICE' THEN 10
      WHEN vendor = 'Cynthia' THEN 15
      ELSE 2 
    END as vendorPoints,
    SUM(trinketSize) as totalTrinketSize
  FROM table
  GROUP BY uid, vendorPoints
)
GROUP BY vendorPoints

A query like this would require a lot of aggregate readjustments if conditions are low effect.

Freestanding CASE-statement where the CASE-statement result value can also be used in a WHERE-clause

SELECT 
  vendorPoints,
  AVG(vendorPoints),
  SUM(totalTrinketSize) as allTrinkets
FROM (
  SELECT 
    uid,
    CASE 
      WHEN vendor = 'WHALER' THEN 0
      WHEN vendor = 'BOB' THEN 10
      WHEN vendor = 'ALICE' THEN 10
      WHEN vendor = 'Cynthia' THEN 15
      ELSE 2 
    END as vendorPoints,
    SUM(trinketSize) as totalTrinketSize
  FROM table
  GROUP BY uid, vendorPoints
)
-- This is the new part here...
WHERE vendorPoints IN (0, 10)
GROUP BY vendorPoints

Full implementation

SELECT 
  vendorPoints,
  AVG(vendorPoints),
  SUM(totalTrinketSize) as allTrinkets
FROM (
  SELECT 
    uid,
    CASE 
      WHEN vendor = 'WHALER' THEN 0
      WHEN vendor = 'BOB' THEN 10
      WHEN vendor = 'ALICE' THEN 10
      WHEN vendor = 'Cynthia' THEN 15
      ELSE 2 
    END as vendorPoints,
    SUM(trinketSize) as totalTrinketSize
  FROM table
  WHERE vendor <> 'BOB' OR victoryPoints BETWEEN 0 and 5
  GROUP BY uid, vendorPoints
)
-- This is the new part here...
WHERE vendorPoints IN (0, 10)
GROUP BY vendorPoints
yoid2000 commented 4 years ago

@sebastian super useful set of CASE examples! Now I better understand @cristianberneanu 's comments on why CASE in aggr() is simpler. I'll play around with what that looks like in probe1 and probe2.

cristianberneanu commented 4 years ago

I am also very worried by the complexity of this entire thing. I am having trouble wrapping my head around it. And I doubt that I can implement something that I don't understand. And even if I do, I won't be able to debug it afterwards. And even if I do, no one else will be able to understand or check the implementation. So the code will be unmaintainable in the long run.

I would rather prefer a series of small steps that leads us closer to the target instead of a giant leap there. In the worst case scenario, I suggest we throw any performance concerns out the window for the initial version and implement a more naive design at first, like issuing a separate probe for each node. Once we have that understood and checked for correctness, we can try to combine them into one large query.

I am also worried about the complexity of the generated queries. Having a list of buckets as a filter doesn't seem scalable at all.

There are also a couple of things that don't click for me at all:

I don't understand how you plan to normalize the conditions across multiple queries, joins, grouping, and different HAVING and WHERE filters.

To help minimize adjusting, we require that an LE node has certain properties across a so-called "bucket group". A bucket group is a set of buckets whereby all but one of the bucket values are the same. For example, if a query selects two columns, dept and salary, then one bucket group would be all buckets with dept='CS', another bucket group would be all buckets with dept='math', another would be all buckets with salary=10000, and so on.

So if we change the grouping order we get different results?


All in all, slowly going towards supporting different case scenarios is a lot more appealing to me. At least some cases, like case inside an aggregator, intuitively feel more approachable than supporting OR.

yoid2000 commented 4 years ago

@cristianberneanu your comments are well taken. I wonder if this isn't the case that shows we really should be implementing this inside the database engine, so we don't have to do all this complex and hard-to-scale stuff to tease out what we need from the database.

I don't think we should do anything where we ignore scalability. If you spend some time with the examples, I think you will be able to wrap your head around it. Conceptually at least it isn't really that hard to understand, once you understand it.

I'm actually not sure that CASE is going to be much easier. Even the simplest example turns out to have an easy attack that we would need to do LED to defend against. Take this example:

Left query:

SELECT frequency, sum(
    CASE
        WHEN birth_number = '706213' THEN 1
        ELSE 0
    END
    )
FROM transactions
GROUP BY 1

Right query:

SELECT frequency, sum(
    CASE
        WHEN birth_number = 'nosuchvalue' THEN 1
        ELSE 0
    END
    )
FROM transactions
GROUP BY 1

There is exactly one user with birth_number = '706213', and these two queries clearly reveal which bucket that user is in. So in other words, even a single WHEN/ELSE in an aggregate has OR semantics that we need to do LED against.

cristianberneanu commented 4 years ago

I wonder if this isn't the case that shows we really should be implementing this inside the database engine, so we don't have to do all this complex and hard-to-scale stuff to tease out what we need from the database.

At this point, this would set us back at least a year, if not two. It would also be unusable for Telefonica since it would only work on something extensible, like PostgreSQL.

There is exactly one user with birth_number = '706213', and these two queries clearly reveal which bucket that user is in. So in other words, even a single WHEN/ELSE in an aggregate has OR semantics that we need to do LED against.

We clearly need some additional processing, I wasn't suggesting allowing CASE statements without any additional checks. In this case, beside the main query buckets, the aggregator needs to account for each CASE branch. In effect, each branch creates an additional bucket for which we need to compute the required statistics. If any of the resulting buckets are low-count, we suppress them (we practically ignore the branch). We then aggregate what input remains, create a composite noise layer (details TBD) and add a single noise value to the final result, just as we do now.

yoid2000 commented 4 years ago

This is just a simple example of dropping a WHEN condition:

When analyst says this:

SELECT frequency, sum(
    CASE
        WHEN birth_number = '706213' THEN 1
        ELSE 0
    END
    )
FROM transactions
GROUP BY 1

We make a probe like this:

SELECT frequency AS c1,
       count(DISTINCT p1_uid) AS p1_full_cnt,
       min(p1_uid) AS p1_full_min,
       max(p1_uid) AS p1_full_max
FROM
  (SELECT frequency,
          CASE
              WHEN birth_number = '706213' THEN uid
              ELSE NULL
          END AS p1_uid
   FROM transactions) t
GROUP BY 1
HAVING count(DISTINCT p1_uid) < 7

From this we see that the WHEN is LE, and drop it from the analyst query to make something like this:

SELECT frequency, sum(
    CASE
        WHEN False THEN NULL
        ELSE 0
    END
    )
FROM transactions
GROUP BY 1
yoid2000 commented 4 years ago

At this point, this would set us back at least a year, if not two. It would also be unusable for Telefonica since it would only work on something extensible, like PostgreSQL.

A native database implementation would no longer be Aircloak. Rather open source, paid for out of my research budget, and a lot simpler than what we are trying to do now.

yoid2000 commented 4 years ago

In this case, beside the main query buckets, the aggregator needs to account for each CASE branch. In effect, each branch creates an additional bucket for which we need to compute the required statistics. If any of the resulting buckets are low-count, we suppress them (we practically ignore the branch). We then aggregate what input remains, create a composite noise layer (details TBD) and add a single noise value to the final result, just as we do now.

So in other words, this could be a design without a probe at all?

cristianberneanu commented 4 years ago

Duplicated from Slack, so others can see it as well:

We need to drop branches conditionally, per bucket. So the probe approach won't work.

So in other words, this could be a design without a probe at all?

Yes. We compute all possible inputs and then select the right ones inside the cloak. Probably won't scale to more than 3-4 branches though.

Something like aggregator(case when a then b else c end) would become aggregator(b) when a + aggregator(c) when not a. In effect, we extract the case outside the aggregator.

yoid2000 commented 4 years ago

I am also worried about the complexity of the generated queries. Having a list of buckets as a filter doesn't seem scalable at all.

This also worries me. There is a trade-off that can be made here, where we put fewer buckets in the filter at the expense of more buckets in the DB output.

yoid2000 commented 4 years ago

So if we change the grouping order we get different results?

That's a good question. But in any event if we change the grouping order one gets different results, so I don't think it really matters.

cristianberneanu commented 4 years ago

But in any event if we change the grouping order one gets different results, so I don't think it really matters.

I don't think we do. We should get identical results no matter the order of the groups, otherwise one can get multiple samples for the same bucket by varying the grouping.

yoid2000 commented 4 years ago

What I mean is that currently we'll get different buckets, because different columns will have * values depending on the column order.

Other than this, I don't see why we'd get different results because of order. (Or at least I don't see why it can't be implemented that way.)

yoid2000 commented 4 years ago

In this case, beside the main query buckets, the aggregator needs to account for each CASE branch. In effect, each branch creates an additional bucket for which we need to compute the required statistics. If any of the resulting buckets are low-count, we suppress them (we practically ignore the branch). We then aggregate what input remains, create a composite noise layer (details TBD) and add a single noise value to the final result, just as we do now.

I don't think this approach works, at least if by "ignore the branch" you mean "adjust the answer down to negate the effect of the branch".

The problem is this. Suppose the attacker is doing a difference attack using pairs of queries like this:

SELECT frequency, sum(
    CASE
        WHEN birth_number = '706213' THEN 1
        WHEN acct_district_id = 1 THEN 1
        ELSE 0
    END
    )
FROM transactions
GROUP BY 1

SELECT frequency, sum(
    CASE
        WHEN acct_district_id = 1 THEN 1
        ELSE 0
    END
    )
FROM transactions
GROUP BY 1

There is one user with that birth_number, and the attacker wants to learn the acct_district_id. Each pair will select a different acct_district_id, and the attacker is looking for the case where the difference between the two queries is one less than all the others.

So if we are adjusting, we want to adjust on the query where the victim matches the acct_district_id, but not the queries where the victim does not match the acct_district_id. Which means that we need to know if the victim matches the second condition or not.

If I understand what you are suggesting, you would take the first query and do something like this:

SELECT frequency, 
    sum(CASE
            WHEN birth_number = '706213' THEN 1
            ELSE 0
        END
    ),
    sum(CASE
            WHEN acct_district_id = 1 THEN 1
            ELSE 0
        END
    )
FROM transactions
GROUP BY 1

But you'd of course also have to count the number of distinct UIDs, so you'd really need this:

SELECT frequency, 
    sum(CASE
            WHEN birth_number = '706213' THEN 1
            ELSE 0
        END
    ),
    count(distinct CASE
            WHEN birth_number = '706213' THEN UID
            ELSE NULL
        END
    ),
    sum(CASE
            WHEN acct_district_id = 1 THEN 1
            ELSE 0
        END
    ),
    count(distinct CASE
            WHEN acct_district_id = 1 THEN UID
            ELSE NULL
        END
    )
FROM transactions
GROUP BY 1

Anyway, once you decide that the first branch is LCF, you still don't know whether to adjust or not, because that depends on whether the victim is matched by the second branch.

The only way to know that is to collect per-UID information, which is what the second probe is for. So I don't think we can avoid either a second probe or somehow collecting per-UID information.

cristianberneanu commented 4 years ago

This is not what I had in mind. I never mentioned adjusting, because I don't think it is needed if we compute all possible cases. If we have the right input data, we don't need to do any adjusting, since we can compute the correct result.

In the case of:

CASE
     WHEN birth_number = '706213' THEN 1
     WHEN acc_district_id = 1 THEN 1
     ELSE 0
END

lets name the conditions:

c1: birth_number = `706213` 
c2: acc_district_id = 1

For each frequency bucket, to compute the sum aggregator, we first compute the following statistics:

sum_statistics1 when c1
sum_statistics2 when not c1 and c2
sum_statistics3 when not c1 and not c2
sum_statistics4 when c2
sum_statistics5 when not c2
sum_statistics6 when not c1
sum_statistics7 when true

Then, in the cloak, for each frequency bucket, we apply the following algorithm:

if `c1` is not LC and `not c1 and c2` is not LC
      use sum_statistics1 + sum_statistics2 + sum_statistics3

if `c1` is LC and `c2` is not LC
      use sum_statistics4 + sum_statistics5

if `c1` is LC and `c2` is LC
      use sum_statistics7

if `c1` is not LC and `not c1 and c2` is LC
      use sum_statistics1 + sum_statistics6

In the above example, assuming c1 is always LC for all buckets and c2 is never LC, then we will always use sum_statistics4 + sum_statistics5, so in effect we always drop the first branch, making both queries identical.
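
For concreteness, the selection rules above restated as a small Python sketch (the statistics are represented here as plain numbers; combining real statistics tuples would need more care):

def pick_sum(stats, c1_lc, not_c1_and_c2_lc, c2_lc):
    # stats maps a condition label to the precomputed sum over matching rows.
    if not c1_lc and not not_c1_and_c2_lc:
        return stats["c1"] + stats["not c1 and c2"] + stats["not c1 and not c2"]
    if not c1_lc and not_c1_and_c2_lc:
        return stats["c1"] + stats["not c1"]
    if c1_lc and not c2_lc:
        return stats["c2"] + stats["not c2"]
    return stats["true"]   # c1 LC and c2 LC

stats = {"c1": 1, "not c1 and c2": 10, "not c1 and not c2": 5,
         "c2": 10, "not c2": 6, "not c1": 15, "true": 16}
# If c1 is LC and c2 is not, the bucket uses stats["c2"] + stats["not c2"],
# i.e. the first branch is effectively dropped from both queries of a pair.
print(pick_sum(stats, c1_lc=True, not_c1_and_c2_lc=True, c2_lc=False))  # 16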

The advantage is that we don't do any probing. The drawback is that we compute a large number of statistics for each aggregator using CASE (5 statistics columns x 7 cases = 35 columns). Some of the columns are redundant (if we know count(a) and count(not a), we know count(total), or if we know min(a) and min(not a), we also know min(total)), so maybe we could reduce the amount of selected columns, but I doubt it will scale to more than 4 branches.

cristianberneanu commented 4 years ago

Some of the columns are redundant (if we know count(a) and count(not a), we know count(total), or if we know min(a) and min(not a), we also know min(total)), so maybe we could reduce the amount of selected columns, but I doubt it will scale to more than 4 branches.

Actually, scratch that. The columns are not redundant in the general case, because each branch can return something different.

yoid2000 commented 4 years ago

So this scales exponentially with the number of conditions?

cristianberneanu commented 4 years ago

So this scales exponentially with the number of conditions?

Yes. But I don't think probing is better either.

yoid2000 commented 4 years ago

Probe 2 can indeed get nasty. But we can arbitrarily bound the size of the query (i.e. the number of SELECT column values in the conditions in the WHERE clause), and trade that off for the size of the answer (the number of SELECT column values in the answer). So the exponential growth of conditions can be much worse, I believe.

cristianberneanu commented 4 years ago

Paul, I don't think it is a completely fair comparison. CASE statements have different semantics from OR. Each branch can return something else and each condition influences the ones after it. I suspect that a probe implementation for CASE will be even more complex than what we currently have for OR.

yoid2000 commented 4 years ago

I suspect that a probe implementation for CASE will be even more complex than what we currently have for OR.

Yes, most likely. There are going to be new corner cases, no pun intended.

@sebastian and I have talked about, for the interim, an implementation of CASE that adds noise to conditions, essentially as we do now for other conditions, but does not try to do any form of LED. This might be something that TF can accept for now, and I presume it won't be wasted code because in any event we need to have the compiler etc. working for CASE.

@sebastian ?

sebastian commented 4 years ago

I still have to talk to TLF about whether they could work with such a variant whereby the protections are not strong. What I am unclear on at the moment, @yoid2000, is the extent to which the protection is not strong when just adding noise layers. I.e. what strength can I claim that we are offering if we are only doing noise?

yoid2000 commented 4 years ago

There was some discussion among the founders yesterday of doing full emulation as a means of implementing LED for CASE and OR. The idea here is that emulation would be an easier first step, and then from there we could make efficiency improvements as needed.

I'm not sure this is a good idea, so I am putting some thoughts here.

The arguments against the 2-probe design are 1) too complex, and 2) inefficient, especially probe 2. The feeling is that, while emulation is obviously inefficient, the implementation would be much less complex so we could go there faster. This is especially important for TF.

First, regarding the inefficiency of the 2-probe design: I think it should be the case that probe 2 is rarely needed, and when it is, that it is limited to very few conditions/buckets. Probe 2 is only needed when there are exactly two UID signatures for all buckets in a given bucket group. When does this "probe 2 scenario" happen? If a condition affects a single user, then almost certainly we get the probe 2 scenario. If a condition affects a few users, then the probability that all of those users happen to fall in exactly the same buckets from a given bucket group is small, and gets increasingly small with the number of users. My guess is that if a condition affects even 4 or 5 users, then the probability that they all fall in the same buckets is virtually zero. We can test this, and I suggest we do so before we reject the design.

Now the question as to whether probe 2 is expensive. The main concern is the WHERE clause of probe 2.

Recall that a condition for the WHERE clause for a single LE node has this structure:

WHERE (LE node condition) AND (bucket condition OR bucket condition OR ....)

The LE node condition is small (like col = val), while the number of bucket conditions can be quite long and expensive to compute. However, so long as the LE node condition is checked first, the bucket conditions will only be checked if the LE node conditions matches, which will only occur for one or a very small number of UIDs (2 or 3 at most).

Plus of course it is possible to exclude the bucket conditions check in the DB and do it in the cloak instead, though I really don't see that as being necessary unless perhaps the number of bucket conditions is really huge (hundreds?)

Regarding complexity, it is harder for me to comment on this. I would point out that some of the complexity exists equally for both approaches, for instance the compiler for the analyst CASE statements. The 2-probe design has one advantage from a complexity standpoint, and that is that we don't need to change the existing stats-based query (at least for queries that don't have CASE statement). By contrast, for the emulation approach, we have to float all of the columns that are involved in CASE statements. I guess this is still substantially simpler than the 2-probe design all things considered, but it will still be complex...

Anyway I think we should give this some careful thought before deciding.

sebastian commented 4 years ago

Predictably, I do not agree.

The cost of the low effect detection design (probe 2 in particular) is not my concern. It is something one could be concerned about, but it is not my primary focus. What I have against the design is the complexity of understanding the algorithm, which in turn means that any implementation is going to be very hard. As @cristianberneanu has also mentioned in the past, the problem does not end there, because there are bound to be bugs, and tracking them down in a system of this level of complexity is going to cause us a lot of grief. My concern is that this is going to be a black hole in our codebase that no-one dares touch, and that we end up building abstractions around because we are afraid of it.

The complexity of emulating OR-functionality in the cloak is not negligible. But it is something we can grasp and comprehend. We would see the actual real data in the cloak and could make decisions based on which conditions are in fact low effect and low count and which are not, and directly suppress them. This would be along the lines of how we in the past handled outliers in the min/max aggregates, i.e. based on inspection of the raw data.

The 2-probe design has one advantage from a complexity standpoint, and that is that we don't need to change the existing stats-based query (at least for queries that don't have CASE statement). By contrast, for the emulation approach, we have to float all of the columns that are involved in CASE statements.

I don't quite see it this way. As far as I am concerned, no changes would need to be made to the stats-based algorithm at all. The db emulation layer can calculate the same aggregate statistics as the database would.

Let's consider the following query hierarchy:

Unrestricted-query (1)
|--> Anonymizing-query with CASE/OR logic (2)
     |--> Per-user query (3)

In this setup we would still be able to execute the per-user query (3) in the database, and would perform the remaining execution in the cloak. The information needed to perform the OR-logic is already naturally exposed by query 3 so no additional floating is needed at all.
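As a minimal sketch of that split (with invented table and column names, and ignoring noise and other anonymization details), the cloak would send only the per-user query (3) to the database and apply the OR logic and aggregation of (2) itself:

-- (3) the per-user query, still executed in the database
SELECT uid, age, city, count(*) AS cnt
FROM accounts
GROUP BY uid, age, city

-- (2) the anonymizing query with the OR logic, emulated in the cloak over
--     the per-user rows returned by (3); per_user_rows stands for that result
SELECT city, sum(cnt)
FROM per_user_rows
WHERE age = 30 OR city = 'Berlin'
GROUP BY city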

Let's consider another example:

Anonymizing-query (4)
|--> Per-user query with CASE/OR logic (5)
     |--> Per-user query (6)

Here we could again execute the per-user query (6) in the database and would have to emulate the per-user query (5) containing the OR-logic in the emulator. Again no additional floating is needed as such. But clearly some smartness is needed here because the impact of the OR-logic only manifests itself in the aggregates at the anonymizing query (4).

Let's consider yet another example:

Anonymizing-query (7)
|--> Per-user query with CASE/OR logic (8)
|    |--> Per-user query (9)
|--> Complex per-user query without CASE/OR logic (10)

In this setup we JOIN the results of queries 8 and 10. The execution of query 10 can be done entirely in the database, but we would have to treat queries 8 and 9 the way we treated 5 and 6 in the previous example.

In all of these examples, however, we have access to the per-user data and can make real decisions in the cloak.


As an aside (not exactly an apples-to-apples comparison, but still worth mentioning): the seemingly simple feature of allowing analyst-created user tables took us 6 man-months to implement! The problem seemed simple on the surface, but turned out to have complexities that were not immediately obvious. In the case of the LED design, however, we:

I therefore have concerns about whether implementing the LED design is something we can get done within a reasonable time.


The benefit of emulation is that it increases the chances that we can provide the functionality in a way we can comprehend. Once we have a base to work from, we can selectively offload OR-functionality to the databases, either through probing approaches like in your design, or alternatively through database-specific native implementations. This does mean, however, that we have a set of core functionality we can offer across all databases from day one. It also means that we can more rapidly expand the set of OR functionality we can safely allow. The main drawback, of course, is that we run into the performance nightmare we tried to escape by introducing stats-based anonymization.

The benefit of the LED design is that it promises to allow us to handle the OR-functionality computation entirely in the database. The drawbacks, however, are that:


An honourable mention is in order: @cristianberneanu has been saying for months that an emulated solution for the OR-logic is most likely the better choice. I chose not to agree until we had investigated the alternatives, because I really did not want to get back into emulated territory. But here we are, full circle, back where @cristianberneanu had suggested we start (at least as far as I am concerned).

yoid2000 commented 4 years ago

@sebastian, that is fine. I was mostly concerned that you guys believed that the performance of the 2-probe thing was going to be very bad, leading to thinking along the lines of "Well, performance will suck no matter what, so we may as well go with emulation because it is simpler". In particular, this statement, if you mean to apply it to the 2-probe approach, is wrong: "in the worst case with exponentially growing number of aggregates".

So with that in mind, if you want to go with full emulation then of course that is fine.

yoid2000 commented 4 years ago

So for the emulation approach, the key design points are the following:

cristianberneanu commented 4 years ago

Has a final decision been made on what we attempt first, CASE or OR? It is not clear to me.

If we first go with CASE, do we implement general support or only support for aggregators with CASE statements?

What are the differences between the current LED design for OR and the one for CASE?

If we go with CASE aggregators, what about supporting at most 2-3 branches at first, by computing all possible combinations in one query (which might be doable)?
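One way to read that suggestion (purely a sketch, with invented column names) is to have the database return the aggregate and distinct-UID count for each branch and branch combination in a single query, so the cloak can see which branches are low effect:

SELECT
  city,
  count(CASE WHEN status = 'active'  THEN 1 END)                AS branch_1,
  count(CASE WHEN status = 'dormant' THEN 1 END)                AS branch_2,
  count(CASE WHEN status IN ('active', 'dormant') THEN 1 END)   AS branch_1_or_2,
  count(DISTINCT CASE WHEN status = 'active'  THEN uid END)     AS branch_1_uids,
  count(DISTINCT CASE WHEN status = 'dormant' THEN uid END)     AS branch_2_uids
FROM accounts
GROUP BY city

The number of such combinations grows exponentially with the number of branches, which is why this would only be workable for 2-3 branches.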

Regarding emulation: to be fair, I had something different in mind. I was thinking of a scenario where we could run arbitrary code directly in the database. That is because the LED algorithm requires a bunch of complicated checks that need fast access to the data and are not suitable for SQL processing engines. But this is something that I don't think would be feasible with the current state of the system.

However, so long as the LE node condition is checked first, the bucket conditions will only be checked if the LE node condition matches, which will only occur for one or a very small number of UIDs (2 or 3 at most).

In this case, the bucket-list conditions make no sense. We could just drop them, get all the buckets in the cloak and do any filtering there.

Plus of course it is possible to exclude the bucket conditions check in the DB and do it in the cloak instead, though I really don't see that as being necessary unless perhaps the number of bucket conditions is really huge (hundreds?)

The number of buckets could be huge in some edge cases. But most importantly, it would be easier to implement if we don't have to supply them. Why pay the complexity cost if we don't need it?
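In other words (again just a sketch, with invented names), probe 2 would shrink to just the LE node condition, and the per-bucket filtering would happen on the returned rows inside the cloak:

SELECT uid, city, count(*) AS cnt
FROM accounts
WHERE lastname = 'Smith'    -- only the LE node condition; bucket filtering is done in the cloak
GROUP BY uid, city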

yoid2000 commented 4 years ago

The number of buckets could be huge in some edge cases. But most importantly, it would be easier to implement if we don't have to supply them. Why pay the complexity cost if we don't need it?

Yes, makes sense.

sebastian commented 4 years ago

@cristianberneanu I am talking to Telefonica. I will give you a final decision soon so we avoid confusion and can move forward.

sebastian commented 4 years ago

I am still waiting on feedback from Arnold on whether he can live with noise-layer-only CASE statements or whether we should go the emulation route. Either way, I want the first thing we support in either of these scenarios to be limited to:

I want us to go very slowly here in the features we add so we don't shoot ourselves in our collective feet and build something that we cannot secure sufficiently.