Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform.

Avoiding probes with noise layers #1276

Closed yoid2000 closed 7 years ago

yoid2000 commented 7 years ago

Overview

This issue describes some changes to the details of anonymization. The primary reason for the change is to eliminate the costs and complexity of probing. For the purpose of this issue, I assume that we have all of our current math restrictions. A separate issue will deal with relaxing math restrictions.

The main drawback of the new approach is that it can result in higher noise levels, though most queries won't see this, and it would be surprising to see a query that even doubles the noise. (And, over time if it is important we can mitigate this drawback at the cost of more complexity.)

At the "conceptual" level, the change can be described as follows. Much of the complexity in the cloak is there to deal with the "difference attack", whereby the analyst uses the mere fact that two answers are different to conclude something about the database. Currently, the cloak defends against this by modifying answers so that two queries that would otherwise produce different answers in fact produce the same answer. In the new approach, at least for attacks that drop clauses, we do the opposite: we ensure that two queries that would otherwise produce the same answer may well in fact produce different answers.

Note that this change does not remove the need for other existing mechanisms. For ranges, we still do shrink-and-drop (SaD). For IN clauses, we still check each element in the list, check to see if it is individually low-count, and drop the corresponding rows if it is. We also still need to do something to handle LIKE clauses, either what is already proposed, or I believe we'll be able to add a noise layer instead.

The basic idea is to generate an extra layer of noise per AND'd equality clause (= only; this doesn't work for not-equal (<>) clauses). For the class of difference attacks that involves removing a clause, having this layer means that the noise changes when the clause is removed, whether or not the removal changed the UID set. As a result, we don't need to make an explicit check (probe) as to whether the UID set would change when the clause is dropped.

"Adding additional layers" literally means generating multiple noise values and summing them together. Every answer has at least one noise layer. This is the same layer we have today, i.e. seeded by the uid list of the answer. On top of this, we have zero or more layers per clause. (Later we discuss how to generate these layers, but normally the standard deviation (SD) of the summed layers is equal to the SD of the single layer we currently create.)
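
As a toy illustration of the summing (plain Python, not cloak code; the seed strings are made up), the point is only that the total noise is a deterministic function of the set of layer seeds, so removing a clause's layer changes the answer even when the UID set is unchanged:

```python
# Toy illustration (not cloak code): the total noise is a sum of per-layer
# Gaussian samples, one per seed. Dropping a clause removes its layer's
# seed, so the total changes even when the matching UID set stays the same.
import random

def total_noise(seeds, sd=1.0):
    # One independently seeded PRNG per layer, one Gaussian sample each.
    return sum(random.Random(s).gauss(0.0, sd) for s in seeds)

with_clause    = total_noise(["uids:1,2,3", "salary", "dept"])
without_clause = total_noise(["uids:1,2,3", "salary"])
assert with_clause != without_clause
```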

Whenever the cloak produces a bucket, the contents of that bucket are influenced by some column or columns (with the exception of select count(*) from table).
This can happen because of WHERE, HAVING, or GROUP BY (not including GROUP BY on uid), because these are the things that filter rows into one or another bucket. Note that Aircloak queries have an implicit GROUP BY when there is no aggregation function in the outermost SELECT.

Alert: check whether this can happen in other places that I'm forgetting.

Any time a filter occurs, we want to create a noise layer based on the values in the relevant column immediately after the filter. This is done by floating the column of the filter (or, if the SQL is simple, by deducing what all of the column values would be after the filter).

Broadly speaking, there are two ways we can try to do this. One is to use the values (constants) in the SQL itself. The other is to use the values in the column. I initially tried using the values in the SQL itself, but found this hard to do. The problem generally is that there are too many ways that the analyst can change the constants in SQL without changing the semantics of the SQL. This allows the analyst to generate a set of queries that effectively remove the noise of extra layers through averaging.

So instead we use the values in the columns themselves (which, after all, reflect the semantics of the query). To do this, we need to "float" those values to the top (cause them to appear in the SELECT call made by the cloak to the database so that the cloak may inspect the values). We then use those values to seed the noise layer. The following section describes how and when to float column values. The subsequent section describes how to generate the seed from the values.

Floating Column Values (Or Not)

The general goal is to float the values in the column itself, and not the values as modified by math and constants in the SQL. There are three basic cases:

  1. The column is aggregated (i.e. with GROUP BY uid).
  2. The column experiences math with constants.
  3. Neither of the above (native column).

These are described in the following sections.

Native Column

When a column does not go through an aggregation function and does not encounter any math before the filter, it is floated in its native form (not modified for instance by math), and used to generate a noise layer.

Note however, that if it is obvious what the value of the floated column would be, then it is not necessary to literally float the column.

Following is an example:

SELECT count(*)
FROM employees
WHERE salary = 100000

Here we want to make a noise layer based on the native values in the salary column. Therefore, salary could be floated in the query made to the database (along with uid as we currently do), something like this:

SELECT uid, 
       salary AS float_salary     -- floated column
FROM employees
WHERE salary = 100000

However, in this case, it is obvious that if we float salary, then all values would be 100000, so the cloak may skip the floating and just use the expected value.

We would also float salary if a GROUP BY were used:

SELECT salary, count(*)
FROM employees
GROUP BY salary

In this case, each bucket would have a noise layer based on salary values used for that bucket. (In this case, salary is floated naturally by the query itself.)

In the following example, salary is again floated:

SELECT count(*)
FROM employees
WHERE salary >= 100000 AND
      salary < 150000

As:

SELECT uid, 
       salary AS float_salary
FROM employees
WHERE salary >= 100000 AND
      salary < 150000

NOTE: the treatment of ranges will probably be replaced with #1332, which gets rid of SaD.

Native Column with Math

If the analyst adds some math to the column, we still want to float the native column. As an example:

SELECT age, count(*)
FROM (
      SELECT uid, age, salary * 10 AS sal10
      FROM employees
     ) t
WHERE sal10 = 1000000
GROUP BY age

Here again we want to float salary (and in this case age as well), and not salary * 10. The reason for this is that we don't want the analyst to manipulate the value that gets used to create the noise layer, since each different value would produce a different seed, and in this way the analyst could average away the noise. So the database query would be (something like):

SELECT uid, age,
       float_age, 
       float_salary
FROM (
      SELECT uid, age, salary * 10 AS sal10,
             salary AS float_salary,
             age AS float_age
      FROM employees
     ) t
WHERE sal10 = 1000000
GROUP BY age

Note that, strictly speaking, if the cloak is clever it could figure out that this really means salary = 100000, and avoid floating. But floating is the more general approach.

Following is an example with a range:

SELECT age, sal10, count(*)
FROM (
      SELECT uid, age, salary * 10 AS sal10
      FROM employees
      WHERE salary >= 100000 AND
            salary < 150000
     ) t
GROUP BY age, sal10

Here salary is floated and used in a single noise layer in spite of the fact that it encounters two filters. The query from the database is as follows:

SELECT age, sal10,
       float_age,
       float_salary
FROM (
      SELECT uid, age, salary * 10 AS sal10,
      age AS float_age,
      salary AS float_salary
      FROM employees
      WHERE salary >= 100000 AND
            salary < 150000
     ) t

There are two extra noise layers here. Note that the floated salary is also needed for SaD.

Note that we never want to make a noise layer from something in the SQL that does not encounter a filter. So in the following:

SELECT age, count(*)
FROM (
      SELECT uid, age, salary
      FROM employees
     ) t
GROUP BY age

We would not float or make a noise layer from salary, because it never encounters a filter of any kind. Not making a layer prevents an attacker from adding arbitrary SQL that doesn't affect the output per se, but causes a new seed and therefore allows the attacker to average away the noise layer.

Native Column with Safe Discontinuous Functions (Discon)

For the purposes of this issue, we define safe discons the same as we currently do.

As with math, we want to float the native column.

The following query:

SELECT month(date), count(*)
FROM games
GROUP BY month(date)

Would be floated like this:

SELECT player, month(date), date AS float_date
FROM games

and a noise layer formed from float_date.

(Note that, strictly speaking, it might be safe in this case to generate the layer purely from the month(date) value, primarily because there is no math or constant here that would allow the analyst to manipulate values and therefore average away the noise layer. However, to keep things uniform, it seems simpler to always float the column for discons. We can revisit this later though.)

The following query is the same, but with the addition of a filter preceding the discon:

SELECT month(date), count(*)
FROM games
WHERE date >= '2000-01-01' AND
      date < '2001-01-01'
GROUP BY month(date)

Once again, we float date and use as a noise layer:

SELECT uid, 
       month(date), 
       date AS float_date
FROM games
WHERE date >= '2000-01-01' AND
      date < '2001-01-01'

In the following example there are two discons:

SELECT trimmed_fn, count(*)
FROM (
      SELECT uid, age,
      btrim(lower(firstname)) AS trimmed_fn
      FROM employees
     ) t
WHERE age = 50
GROUP BY trimmed_fn

These are safe (no constants), so we float firstname and age, and make two noise layers:

SELECT uid, trimmed_fn, 
       age,
       float_age,
       float_firstname
FROM (
      SELECT uid, age,
      btrim(lower(firstname)) AS trimmed_fn,
      age AS float_age,
      firstname AS float_firstname
      FROM employees
     ) t
WHERE age = 50

The following example uses bucket.

SELECT bucket(age BY 5) AS agegroups, count(*)
FROM demographics
GROUP BY agegroups

Here we want to float age, as follows:

SELECT uid, age,
       age AS float_age
FROM demographics

Note that SaD still needs to be done on each bucket.

The following example has extract_match:

SELECT extract_match(name, 'paul|john|george|ringo') as beatle
FROM rock_stars
GROUP BY beatle

Here we float name:

SELECT name,
       name AS float_name
FROM rock_stars

Aggregate Function

When a column goes through an aggregate function (GROUP BY uid), then of course it is no longer possible to float the native values. Rather, only aggregated values can be floated. When this happens, we float three aggregate values, min, max, and count, and use these values to generate the seed.

Alert: Need more thought as to whether these three values provide enough visibility into each user's values, or if we need more (e.g. sum).

Note that a column can pass through two or even more aggregations. In the first aggregation, three values are generated (min, max, and count). In principle we could at the next aggregation take min, max, and count of the three previous values, resulting in 9 values. This is both expensive and unnecessary. Instead we take the min of the mins, the max of the maxes, and the sum of the counts. Later on we have an example of this.
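
The folding rule can be sketched as follows (hypothetical helper names; plain Python rather than the SQL the cloak would actually emit):

```python
# Sketch (not cloak code): folding floated aggregates across two
# aggregation levels. The first level reduces the raw column values of a
# group to a (min, max, count) triple; each subsequent level folds the
# triples by taking the min of mins, max of maxes, and sum of counts.

def first_level(values):
    """Aggregate raw column values into a (min, max, count) triple."""
    return (min(values), max(values), len(values))

def fold(triples):
    """Combine per-group triples at the next aggregation level."""
    return (min(t[0] for t in triples),
            max(t[1] for t in triples),
            sum(t[2] for t in triples))

# Two groups of raw values (e.g. one user's winpoints per month):
g1 = first_level([40, 45, 48])   # (40, 48, 3)
g2 = first_level([41, 49])       # (41, 49, 2)
assert fold([g1, g2]) == (40, 49, 5)
```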

Here is a simple example:

SELECT sw, count(*) 
FROM
  ( SELECT player,
           sum(winpoints) AS sw
    FROM games
    GROUP BY player
  ) t
GROUP BY sw

Because winpoints is aggregated, the following request is made from the database:

SELECT player, sum(winpoints),
       min(winpoints),          -- floated
       max(winpoints),          -- floated
       count(winpoints)         -- floated
FROM games
GROUP BY player

and the min, max, and count values are used to generate the seed.

Note that from a privacy point of view, it would be fine in this case to float the sum itself, but in other cases it is necessary to generate the three values. Therefore from an implementation point of view, it may be easier to always generate and use these rather than code up when they are and are not needed.

Alert: There may be column types where min and max are not computable. I haven't given this any thought yet. If so, then we need to float some other aggregate.

Another example is this:

SELECT count(*)  
FROM
   ( SELECT player,
     sum(winpoints) AS sw
     FROM games
     GROUP BY player
   ) t 
WHERE sw = 500

The min, max, and count values are floated as follows:

SELECT player, 
       float_min_winpoints,         -- floated
       float_max_winpoints,         -- floated
       float_cnt_winpoints          -- floated
FROM
   ( SELECT player,
     sum(winpoints) AS sw,
     min(winpoints) AS float_min_winpoints,
     max(winpoints) AS float_max_winpoints,
     count(winpoints) AS float_cnt_winpoints
     FROM games
     GROUP BY player
   ) t 
WHERE sw = 500

In the following example, the aggregation occurs after the filter:

SELECT count(*)  
FROM
   ( SELECT player,
     sum(winpoints) AS sw
     FROM games
     WHERE winpoints >= 40 AND winpoints < 50
     GROUP BY player
   ) t 

Here again, the min, max, and count values are floated, as follows:

SELECT player, 
       float_min_winpoints,         -- floated
       float_max_winpoints,         -- floated
       float_cnt_winpoints          -- floated
FROM
   ( SELECT player,
     sum(winpoints) AS sw,
     min(winpoints) AS float_min_winpoints,
     max(winpoints) AS float_max_winpoints,
     count(winpoints) AS float_cnt_winpoints
     FROM games
     WHERE winpoints >= 40 AND winpoints < 50
     GROUP BY player
   ) t 

As a reminder, let me point out again that the purpose of the noise layer is to deal with the case where the filter (in this case WHERE winpoints >= 40 AND winpoints < 50) is removed. However, for shrink-and-drop (SaD), it is in any event necessary to float min and max. The noise layer does not eliminate the need for SaD.

In the following query, there are two filters for the same column, one before the aggregation, and one after:

SELECT count(*)  
FROM
   ( SELECT player,
     sum(winpoints) AS sw
     FROM games
     WHERE winpoints >= 40 AND winpoints < 50
     GROUP BY player
   ) t 
WHERE sw = 500

Here again we float min, max, and count.

We need to defend against the case where the analyst removes one of the two conditions. If we were to naively put a single noise layer on the winpoints column, then the analyst could remove one condition without changing the noise layer. It would be nice to have one noise layer and then, for instance, use the number of filters as part of the seed. That however opens up an averaging attack, where the analyst can add chaff filters to his heart's content. So instead we have a noise layer per filter encountered. To prevent each such noise layer from having the same seed, a counter is added to the seed. The counter is incremented every time the same noise value would otherwise be generated.

The following is an example of an aggregation function occurring before the filter, but with some math involved:

SELECT sw, count(*) 
FROM
  ( SELECT player,
           sum(winpoints * 2) AS sw
    FROM games
    GROUP BY player
  ) t
GROUP BY sw

As with prior examples, we float min, max, and count of winpoints, without the math. So this:

SELECT player, sw,
       float_min_winpoints,         -- floated
       float_max_winpoints,         -- floated
       float_cnt_winpoints          -- floated
FROM
   ( SELECT player,
     sum(winpoints * 2) AS sw,
     min(winpoints) AS float_min_winpoints,
     max(winpoints) AS float_max_winpoints,
     count(winpoints) AS float_cnt_winpoints
     FROM games
     GROUP BY player
   ) t 

As a result, the math doesn't affect the noise layer itself, even though it does affect the values in the sw column.

In the following query, the math on the column occurs after the aggregation rather than before. Regardless, the cloak floats min, max, and count as above.

SELECT sw, count(*) 
FROM
  ( SELECT player,
           sum(winpoints) * 2 AS sw
    FROM games
    GROUP BY player
  ) t
GROUP BY sw

Following is an example with HAVING:

SELECT count(*)  
FROM
   ( SELECT player,
     sum(winpoints * 2) / 3 AS sw
     FROM games
     GROUP BY player
     HAVING sw = 50
   ) t 

We float as follows:

SELECT player, sw,
       float_min_winpoints,         -- floated
       float_max_winpoints,         -- floated
       float_cnt_winpoints          -- floated
FROM
   ( SELECT player,
     sum(winpoints * 2) / 3 AS sw,
     min(winpoints) AS float_min_winpoints,
     max(winpoints) AS float_max_winpoints,
     count(winpoints) AS float_cnt_winpoints
     FROM games
     GROUP BY player
     HAVING sw = 50
   ) t 

The following query is also legitimate, where we would float min(winpoints) etc:

SELECT count(*)  
FROM
   ( SELECT player,
     sum(winpoints) AS sw
     FROM games
     WHERE winpoints >= 40 AND winpoints < 50
     GROUP BY player
   ) t 
WHERE sw + 32 = 500

Noise

Generating the Seed per Noise Layer

Each noise layer has a seed. For columns, the seed is generated by a hash of the following items concatenated together:

  1. The column name (must always be the same for a given column no matter what the analyst writes). If the cloak aggregation function is count(*), then the column name is appended with '*'.
  2. The XOR of the hashes of the distinct values in the column. When min, max, and count have been floated, then the values consist of the concatenation of these three values.
  3. A cryptographically strong random salt.
  4. A counter that starts at zero, and is incremented every time a layer is identical to a previous layer (which happens when a column encounters more than one filter).

The hash should be cryptographically strong (SHA256).
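
Putting the four ingredients together, seed generation might look roughly like this (a sketch only; the function names, byte layout, and separator are assumptions, but the inputs are the four items listed above):

```python
# Sketch of per-layer seed generation (not the actual cloak
# implementation). Ingredients: column name (with '*' appended for
# count(*)), XOR of hashes of the distinct column values, a secret salt,
# and a per-layer counter.
import hashlib

def value_hash(v):
    """Hash a single column value to a 64-bit integer."""
    digest = hashlib.sha256(repr(v).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def layer_seed(column_name, values, salt, counter=0, count_star=False):
    name = column_name + "*" if count_star else column_name
    xored = 0
    for v in set(values):          # distinct values only
        xored ^= value_hash(v)
    material = f"{name}|{xored}|{salt}|{counter}".encode()
    return hashlib.sha256(material).digest()

# Duplicate values do not change the seed, but the counter does:
s1 = layer_seed("salary", [100000, 100000], salt="secret")
s2 = layer_seed("salary", [100000], salt="secret")
assert s1 == s2
assert layer_seed("salary", [100000], "secret", counter=1) != s1
```

Note how using the XOR of distinct-value hashes makes the seed independent of row order and row multiplicity, which matches the intent that the seed reflect the semantics of the filtered column rather than its representation.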

The noise layer should be computed after SaD or any other mechanism that drops rows (i.e. dropping low-count values from an IN clause).

Generating the Noise

In each cloak aggregation function (count, sum, etc.), there is a point where mean zero noise is added to the true answer with some computed standard deviation SD. This is described in https://github.com/Aircloak/aircloak/blob/master/cloak/docs/anonymization.md.

In the case of both sum() and count(), there is a point where a noisy number Nv is generated with mean 0 and SD 1 (and then multiplied by something). In max(), min(), and median() there is a point where noise is added to the "top average" with a certain SD. We can regard this SD as an SD of 1 multiplied by "a quarter of the standard deviation of ....".

So in other words, in the current implementation, there is a point in the process where a noise value is generated with SD = 1. Let's call that noise value Nv. With layered noise, we are going to produce Nv differently. Normally the new "layered" Nv will also have SD = 1, but sometimes it may be more, as explained below.

In the current cloak, the noise is a single Gaussian sample pulled from a PRNG seeded with the distinct uid's in the bucket. With layered noise, this noise is the sum of one or more Gaussian samples. (Note that all other noisy values, i.e. those used for noisy thresholds, are computed exactly the same as today.)

The sum of Gaussian samples is still Gaussian. The standard deviation (SD) of a sum of N Gaussian samples, each with SD=s is sqrt(N)*s. So for example, the sum of 4 Gaussian samples, each with SD=0.5 has cumulative SD=1. I mention this because as long as there are four layers or less, the SD of each layer is set so that the cumulative SD is 1.

Specifically, the SD of each individual noise layer, the sum of which produces Nv, is set according to this table:

number of layers    SD per layer    cumulative SD
        1               1.0             1.0
        2               0.71            1.0
        3               0.58            1.0
        4               0.5             1.0
        5               0.5             1.12
        6               0.5             1.22
        8               0.5             1.41
       10               0.5             1.58

Alert: still need to validate that these values provide adequate protection. Preliminary simulations suggest that they do.

The noise value Nv, then, is composed of the sum of N layers, where one layer is seeded as the current single layer is (with uid's), and subsequent layers are seeded with distinct column values as described above.
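
A minimal sketch of the whole rule (assumed helper names; the 1/sqrt(N) scaling with a floor of 0.5 reproduces the per-layer SD values given above):

```python
# Sketch (assumed helper names) of producing the layered noise value Nv:
# each layer gets SD = 1/sqrt(N), floored at 0.5, and the layers are
# independent Gaussian samples drawn from PRNGs seeded per layer.
import math
import random

def per_layer_sd(n_layers):
    """SD of each layer so that the cumulative SD is 1, floored at 0.5."""
    return max(1.0 / math.sqrt(n_layers), 0.5)

def layered_noise(seeds):
    """Sum one Gaussian sample per layer, one PRNG per seed."""
    sd = per_layer_sd(len(seeds))
    return sum(random.Random(seed).gauss(0.0, sd) for seed in seeds)

# Cumulative SD = sqrt(N) * per-layer SD, matching the table above:
for n, cum in [(1, 1.0), (4, 1.0), (5, 1.12), (10, 1.58)]:
    assert abs(math.sqrt(n) * per_layer_sd(n) - cum) < 0.01
```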

Rounding

TBD

Examples

Let's look at some examples.

Example 1

The following query might be part of an attack where the analyst knows that there is one woman in the CS dept., and wants to learn her salary.

Q1:

SELECT gender, salary, dept, count(*)
FROM employees
GROUP BY gender, salary, dept

In this query, gender, salary, and dept are filtered by the GROUP BY, and so for each bucket, noise layers for gender, salary, and dept are produced based on the native values in the three columns.

No explicit floating is needed, because the columns already appear at the top. Because we are counting NULLs here, the actual column strings used in the seeds for each layer are gender*, salary*, and dept*.

Note that the other query in the attack would be this:

Q2:

SELECT salary, dept, count(*)
FROM employees
GROUP BY salary, dept

Since there is only one woman in the CS dept, the only buckets that would appear in the first query Q1 for dept = CS are buckets containing men (all the woman buckets would be LCF). The true answers for every bucket with dept = CS are the same except one of them: that matching the woman's salary.

However, the noisy answer for all buckets differs between Q1 and Q2 (or not, due to rounding) for two reasons:

  1. There is a noise layer based on gender in Q1 but not Q2.
  2. Each noise layer in Q1 has a different SD from the corresponding layer in Q2.

Because any given answer may differ for the above reasons, the analyst can't tell which bucket also differs because of the presence of the woman.

Example 2

SELECT gender, count(*)
FROM employees
WHERE salary = 100000 AND
      dept = CS
GROUP BY gender

Likewise, in this query, gender, salary, and dept all act as filters on the produced buckets (again seeded with gender*, salary*, and dept*). However, here salary and dept need to be floated. Therefore, the query that the cloak makes to the database is (something like) this:

SELECT uid, gender, salary, dept
FROM employees
WHERE salary = 100000 AND
      dept = CS

The cloak then makes one noise layer each for gender, salary, and dept.

The corresponding attack query would exclude any mention of gender, and would be protected by the same mechanisms as described in Example 1.

Example 3

Suppose here that all players have level = 5 except one, and the analyst wants to learn the sum of wins for that one. This query could be run as the first query in the attack:

Q1:

SELECT age, winsum, count(*)
FROM (
  SELECT player,
         sum(winpoints) AS winsum
         FROM games
         GROUP BY player
  ) g
JOIN (
  SELECT uid, age
  FROM players
  WHERE level = 5
  ) p
ON g.player = p.uid
GROUP BY age, winsum

In this example, age, level, and winpoints all need to be floated because each encounters a filter (GROUP BY for age and winpoints, WHERE for level).

Because of the aggregation function sum, we need to float the min, max, and count for winpoints instead of the native values.

So the query made to the database would be something like this:

SELECT g.player, age, winsum,
       float_min_winpoints,
       float_max_winpoints,
       float_cnt_winpoints,
       float_age,
       float_level
FROM (
  SELECT player,
         sum(winpoints) AS winsum,
         min(winpoints) AS float_min_winpoints,
         max(winpoints) AS float_max_winpoints,
         count(winpoints) AS float_cnt_winpoints
         FROM games
         GROUP BY player
  ) g
JOIN (
  SELECT uid, age,
         age AS float_age,
         level AS float_level
  FROM players
  WHERE level = 5
  ) p
ON g.player = p.uid

Here there are three noise layers (in addition to the uid layer): one with min/max/cnt of winpoints concatenated together, and one each for age and level.

The second query in the attack might be this (the WHERE clause removed):

Q2:

SELECT age, winsum, count(*)
FROM (
  SELECT player,
         sum(winpoints) AS winsum
         FROM games
         GROUP BY player
  ) g
JOIN (
  SELECT uid, age
  FROM players
  ) p
ON g.player = p.uid
GROUP BY age, winsum

Compared to Q1, Q2 is missing one layer (for level) and so noise layers will be different.

Example 4

This is just like the above, only the analyst has tossed in a little math.

SELECT age, winsum, count(*)
FROM (
  SELECT player,
         sum(winpoints*2) AS winsum
         FROM games
         GROUP BY player
  ) g
JOIN (
  SELECT uid, age
  FROM players
  WHERE level = 5
  ) p
ON g.player = p.uid
GROUP BY age, winsum

The floated columns and resulting noise layers would be exactly as in the example above. The math is ignored for the purposes of floating.

Example 5:

In this query, there is no additional noise layer at all, because there is no GROUP BY or other filter.

SELECT sum(winpoints), avg(winpoints)
FROM games

Example 6:

This query has two noise layers, one for winpoints* and one for losepoints*. The winpoints and losepoints columns need to be floated, obviously.

SELECT count(*)
FROM games
GROUP BY winpoints, losepoints

Example 7:

This query has two noise layers, one for winpoints and one for date.

SELECT avg(losepoints)
FROM games
GROUP BY winpoints, date

Note that the fact that an average is computed, and that it is computed over losepoints, is irrelevant: losepoints never encounters a filter, so it is not used for a noise layer.

Example 8:

This query taken from Teambank samples.

SELECT
  left(cast(buchungsDatum as text), 7) as month,
  extract_match(buchungstext, 'EC|Visa|Maestro|Master') as card,
  count(*)
FROM umsatz
GROUP BY month, card

Here buchungsDatum* and buchungstext* constitute the 2 extra noise layers, so we float these to the top:

SELECT buchungsDatum, 
       buchungstext,
       buchungsDatum AS float_buchungsDatum, 
       buchungstext AS float_buchungstext
FROM umsatz

The reason that the two raw columns are floated, rather than using the output of left(cast()) and extract_match(), is that I want to avoid the situation where an analyst can make multiple buckets with the same uid-list but with different values. If analysts could do this, then they could take multiple noise samples and remove the effect of the extra layers. It is not clear in these particular cases whether the analyst could do that or not, but in general we don't know, so it is safer to float the raw columns.

Example 9:

SELECT 
  bucket(income by 200 align middle) as income_class, 
  count(*)
FROM (
  SELECT inhaberId, median(monthly_income) as income FROM (
    SELECT 
      inhaberId,
      left(cast(buchungsDatum as text), 7) as year_month,
      sum(betrag) as monthly_income
    FROM umsatz
    WHERE betrag >= 0 and betrag < 1000000000
    GROUP BY inhaberId, year_month
  ) as income_by_month
  GROUP BY inhaberId
) as user_incomes
GROUP BY income_class

This query has three filters: year_month (acting on buchungsDatum), betrag, and income_class (acting ultimately on betrag). We need to float buchungsDatum and betrag, and pass them through aggregation functions because of sum and median. The resulting query to the database is this:

SELECT income,
      float_min_min_buchungsDatum,
      float_max_max_buchungsDatum,
      float_cnt_cnt_buchungsDatum,
      float_min_min_betrag,
      float_max_max_betrag,
      float_cnt_cnt_betrag

FROM (
  SELECT inhaberId, median(monthly_income) as income,

      min(float_min_buchungsDatum) AS float_min_min_buchungsDatum,
      max(float_max_buchungsDatum) AS float_max_max_buchungsDatum,
      sum(float_cnt_buchungsDatum) AS float_cnt_cnt_buchungsDatum,
      min(float_min_betrag) AS float_min_min_betrag,
      max(float_max_betrag) AS float_max_max_betrag,
      sum(float_cnt_betrag) AS float_cnt_cnt_betrag
  FROM (
    SELECT 
      inhaberId,
      left(cast(buchungsDatum as text), 7) as year_month,
      sum(betrag) as monthly_income,

      min(buchungsDatum) AS float_min_buchungsDatum,
      max(buchungsDatum) AS float_max_buchungsDatum,
      count(buchungsDatum) AS float_cnt_buchungsDatum,
      min(betrag) AS float_min_betrag,
      max(betrag) AS float_max_betrag,
      count(betrag) AS float_cnt_betrag
    FROM umsatz
    WHERE betrag >= 0 and betrag < 1000000000
    GROUP BY inhaberId, year_month
  ) as income_by_month
  GROUP BY inhaberId
) as user_incomes

Here we have nested aggregation functions, so, as described above, we take the min of the mins, the max of the maxes, and the sum of the counts.

In total we have two extra noise layers, one consisting of the concatenation of the buchungsDatum min, max, and count, and one consisting of the concatenation of the betrag min, max, and count. Note finally that we should be doing SaD on the betrag min/max values.

Example 10:

This example demonstrates why we include the * in the column name for count(*). Here, the query is designed to generate a NULL value when divide-by-zero occurs, and then the difference between count(*) and count(col) is used to detect whether or not a NULL value was generated:

Q1:

SELECT count(*) FROM
  (SELECT z/(6003614.195712-z+d+y+m) FROM
     (SELECT zipcode * 100 AS z,
             year(bday) / 10000 AS y,
             month(bday) / 1000000 AS m,
             day(bday) as d
      FROM user_info
      WHERE salary = 100000
     ) t1
  ) t2

Q2:

SELECT count(thing) FROM
  (SELECT z/(6003614.195712-z+d+y+m) AS thing FROM
     (SELECT zipcode * 100 AS z,
             year(bday) / 10000 AS y,
             month(bday) / 1000000 AS m,
             day(bday) as d
      FROM user_info
      WHERE salary = 100000
     ) t1
  ) t2

Of course as it now stands math restrictions would prevent these queries, but we expect to fix that.

In these queries, only salary would be floated. (Though if I think about it, the mere fact that bday and zipcode are used but never encounter a filter is by itself suspicious and could be used as the basis for rejecting the query!)

Anyway, assuming that the query was allowed, then the noise layer in the first query would use salary* in the seed, and the second would use salary. This would result in different seeds, and therefore the analyst would not know if a difference in answer is due to the seed or the presence or absence of the victim.

yoid2000 commented 7 years ago

@sebastian @obrok @cristianberneanu @sasa1977

Gang, here is a new issue that is designed to prevent the need for doing probing with inverse negative clauses. The technique is generally referred to as "layered noise".

Layered noise can also be used to allow us to get rid of most of the math restrictions, and I'll present that in another issue. Layered noise will also remove the need to do checks for attacks with LIKE, and in many cases remove the need for gathering extra data to check negative clauses. I'll discuss both of these in future issues as well.

I guess Sebastian asked one or more of you (Yapee at least) to have a look at this stuff. Whoever this is, please have a read and let's plan to talk once you have digested it. In particular, please read it very critically: assume that there are bugs and attacks that I haven't considered...

obrok commented 7 years ago

Overall this approach seems promising, because there is an intuitive high-level reason why it should cover all cases (we'll see about that, but I'm hopeful). Floating is similar to what we already do for SaD ranges so I like it from an implementor's perspective as well.

Some notes, questions:

  1. You say:

When a column goes through an aggregate function (GROUP BY uid)

A column going through an aggregate function implies a GROUP BY, and we require any GROUP BY in a subquery to include uid, so in that case this specification is redundant. What about something like:

select avg(a)
from foo
group by b

do we float a or min(a), max(a), count(a)? Seems like a?

  2. If min/max are uncomputable then most likely only count is. Maybe we can just float count in that case.

  3. Why does the column name need to be included in the seed?

  4. Why does the * need to be included in the seed?

  5. You say

  1. The number of layers.

The total number of layers is included in the seed for each layer? Is that intentional?

  6. Including the type of clause seems a bit underdefined. I guess we would need to include all clauses used? Or just the last one?

  7. Related - At some point you mention that the seed should include "number of filters" - but that seems to expose us to an averaging attack because you can include a bunch of redundant filters on the same column to build many queries which only differ by noise.

  8. Related - There seems to be at least some room for an averaging attack because the type of filter is included. For example you can usually form two queries, one with GROUP BY and another with =, that do the same thing, giving you 2 samples. More options may exist, especially if we need to include all types of filters used?

  9. In general in this scheme it seems we need to be on the lookout for noise averaging attacks the most, as opposed to difference attacks previously.

  10. In Example 6:

The grants and gender columns need to be floated, obviously.

Is this a typo? The query does not mention anything about grants/gender.

yoid2000 commented 7 years ago

What about something like:

select avg(a)
from foo
group by b

do we float a or min(a), max(a), count(a)? Seems like a?

The filter here is acting on b, so you float b. (You only do min/max/count when a GROUP BY uid is encountered.)

(My example 7 is like this one.)

yoid2000 commented 7 years ago
If min/max are uncomputable then most likely only count is. Maybe we can just float count in that case.

When are min/max uncomputable?

yoid2000 commented 7 years ago
Why does the * need to be included in the seed?

This is to deal with attacks that exploit generation of a NULL value, and use count(*) versus count(col) to include or exclude that NULL value. I've added an example 10 to describe.

yoid2000 commented 7 years ago
Why does the column name need to be included in the seed?

A good question. One of my fears is that the analyst has some clever way of learning what the noise value is for certain column values. For instance, if he can learn that any column consisting of nothing but value 5 produces a noise of -1.63, then maybe this can be exploited in some attack. (I can't think of a way the analyst can learn this, nor is there an immediately obvious way the knowledge can be exploited if it is learned, but I'm nervous just the same.)

So, by adding the column name, I reduce the attack surface considerably. Maybe he can figure out that column level and value 5 have noise -1.36, but this doesn't help him with column winpoints and value 5.

By the way, the purpose of adding the number of noise layers to the seed is in part for the same reason.

I've also considered doing more rounding of answers than we do now, so that it is harder to learn noise values, but not sure we need it so haven't written it down.

yoid2000 commented 7 years ago
You say

The number of layers.

The total number of layers is included in the seed for each layer? Is that intentional?

Yes, as at least partly explained in the previous post. To expand on this a bit... what I want to do is to make the noise dependent on what the analyst types as much as possible, but without giving the analyst the ability to add substantial "chaff" --- essentially the ability to create a lot of noise samples and average the noise away. It is pretty hard for the analyst to play with the number of layers, because each layer adds noise, so he can't just append an arbitrary number of "chaff" clauses or something. Also, the simplest way to make a difference attack is to remove a clause, and that immediately changes all of the seed values.

yoid2000 commented 7 years ago
Including the type of clause seems a bit underdefined. I guess we would need to include all clauses used? Or just the last one?

This is yet another of these things I throw in to make noise harder to predict while limiting the added flexibility it gives the analyst in making noise samples. But it is probably the least important of such things.

If it is underspecified, then we should fix. My inclination would be to use the first filter that the column encounters as the clause type label.

obrok commented 7 years ago

When are min/max uncomputable?

It's you who brought this up:

Alert: There may be column types where min and max are not computable. I haven't given this any thought yet. If so, then we need to float some other aggregate.

I'm only saying that if you can't compute min then you're unlikely to be able to compute avg for example.

yoid2000 commented 7 years ago
Related - At some point you mention that the seed should include "number of filters" - but that seems to expose us to an averaging attack because you can include a bunch of redundant filters on the same column to build many queries which only differ by noise.

Ha ha, this is the thing I just said would be hard for the analyst to attack. I literally forgot exactly how the thing worked (I was saying "number of layers", but indeed I have the optimization where we don't have one layer per filter).

I think you are right, and probably we should give up that optimization and have one layer per filter. Then this attack goes away. I would think it relatively rare that the same column goes through a lot of filters.

yoid2000 commented 7 years ago
Related - There seems to be at least some room for an averaging attack because the type of filter is included. For example you can usually form two queries, one with GROUP BY and another with =, that do the same thing, giving you 2 samples. More options may exist, especially if we need to include all types of filters used?

You are right... it is stupid to have a label that can be trivially bypassed. Let's assume that we'll drop this part of the seed.

yoid2000 commented 7 years ago
In general in this scheme it seems we need to be on the lookout for noise averaging attacks the most, as opposed to difference attacks previously.

I think the new attacks we need to watch for are:

  1. Averaging attacks.
  2. Attacks that learn the noise value and somehow exploit.
  3. Attacks that find clever ways to bypass the mechanism that adds a noise layer (cause no noise layer where there should be one).
  4. Attacks that result in identical noise layers on two queries, so that the only remaining difference is the uid layer.

In short, we may have expanded the attack surface considerably...

yoid2000 commented 7 years ago
In Example 6:

The grants and gender columns need to be floated, obviously.

Is this a typo? The query does not mention anything about grants/gender.

Yes, typo. I had changed the example without changing the explanation. Fixed now.

cristianberneanu commented 7 years ago

This looks to me more computationally expensive than condition probing. Increasing the amount of selected data will greatly reduce the performance of the query. Was the previous solution infeasible, or was this selected mainly for performance reasons?

cristianberneanu commented 7 years ago

The XOR of the hashes of the distinct values in the column.

Since this is done only for = clauses, won't all the values be the same?

obrok commented 7 years ago

Since this is done only for = clauses, won't all the values be the same?

In the case of GROUP BY and =, yes. In the case of ranges and negative conditions, no.

yoid2000 commented 7 years ago

@cristianberneanu

This looks to me more computationally expensive than condition probing. Increasing the amount of selected data will greatly reduce the performance of the query. Was the previous solution infeasible, or was this selected mainly for performance reasons?

Yes. I take it as a given that this approach is far cheaper computationally. With the previous approach, you had to remove AND clauses, which would often greatly increase the number of rows that the database needs to look at. Sometimes you could benefit from the LIMIT, but by no means always.

By contrast, this approach increases the number of columns that need to be conveyed, but it never increases the number of rows that anything has to deal with, and in fact the database needs to look at all those columns anyway. So there is some extra work, but I don't think that much. I could be wrong...

yoid2000 commented 7 years ago

The XOR of the hashes of the distinct values in the column.

Since this is done only for = clauses, won't all the values be the same?

In the case of GROUP BY and =, yes. In the case of ranges and negative conditions, no.

It is not really a problem if all of the values in the column are the same. It just means that the seed will be based on a single value (plus stuff like column name and number of layers) rather than multiple values...
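A Python sketch of the XOR-of-distinct-hashes idea may help here (the hash truncation and value encoding are assumptions for illustration). Because only distinct values are hashed and XOR is commutative, the result does not depend on row order or row count, and a constant column still yields a well-defined seed contribution:

```python
import hashlib
from functools import reduce

def xor_of_value_hashes(values):
    # Sketch: XOR the (truncated) SHA256 hashes of the *distinct* values.
    # XOR makes the result independent of row order and of duplicates.
    hashes = (int.from_bytes(hashlib.sha256(repr(v).encode()).digest()[:8], "big")
              for v in set(values))
    return reduce(lambda a, b: a ^ b, hashes, 0)

# A column where every row holds the same value (as after `salary = 100000`)
# still contributes a single-value component to the seed:
assert xor_of_value_hashes([100000] * 1000) == xor_of_value_hashes([100000])
# Row order is irrelevant:
assert xor_of_value_hashes([1, 2, 3]) == xor_of_value_hashes([3, 1, 2])
```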

cristianberneanu commented 7 years ago

I take it as a given that this approach is far cheaper computationally.

I think it will be more expensive. For one, the cloak is a lot slower, and the transfer of the data from the database to the cloak is expensive in itself.

It is not really a problem if all of the values in the column are the same.

It will be a performance problem if we select extra data for each row, even if that data is the same.

obrok commented 7 years ago

I take it as a given that this approach is far cheaper computationally.

I think it will be more expensive. For one, the cloak is a lot slower, and the transfer of the data from the database to the cloak is expensive in itself.

Intuitively I agree with what @yoid2000 is saying... Could we somehow get a better answer to this question without implementing both approaches?

cristianberneanu commented 7 years ago

There are many variables involved, so we won't be able to get a definitive answer in any case. If the columns involved are indexed and the number of filters is low, the probing should be faster. If the columns are not indexed and there are lots of filters, maybe floating the values will be faster. In between these, the results will be mixed.

yoid2000 commented 7 years ago

Maybe I answered the original question too quickly. This work was initiated because of performance reasons, but it has two other benefits:

  1. It can (probably) allow us to get rid of most math restrictions. (That will come in another issue.)
  2. We learned of a new attack a couple weeks back that is fixed by this work (though we had already started on this approach before learning that).

The attack is this. Suppose I want to learn the exact number of users with some attribute A (or some combination of attributes A1 & A2 & A3). Learning an exact number is dangerous, because then a lot can be deduced from an exact number. (If I know the number of all people in the CS dept, and the number of men, then I can deduce that there is for instance one woman.)

To learn the exact number, I form a bunch of queries of the form:

A & B
A & NOT B
A & C
A & NOT C
etc.

Since (A & B) + (A & NOT B) = A, as long as many of these answers are not low-count filtered, I can derive the exact value of A.

With noise layers, there would always be a fixed noise layer attached to A (which might be something like dept = 'CS'), so an exact answer could not be produced.
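A toy simulation of this attack may clarify why the fixed A-layer blocks it. The seed strings and noise magnitudes below are made up, and `layer_noise` is just a stand-in for a deterministically seeded noise layer:

```python
import random
import statistics

TRUE_A = 1000  # the exact count the attacker is after

def layer_noise(seed_material):
    # Deterministic noise for a given seed (sketch of a seeded layer).
    return random.Random(seed_material).gauss(0, 2)

# Each complement pair (A & B_i) + (A & NOT B_i) equals A plus the noise of
# the two queries. With only per-query uid-layer noise, averaging many pairs
# converges on the exact count:
pairs = [TRUE_A + layer_noise("uid|A&B%d" % i) + layer_noise("uid|A&!B%d" % i)
         for i in range(5000)]
assert abs(statistics.mean(pairs) - TRUE_A) < 1.0

# With a noise layer seeded by condition A itself, every pair carries the
# same offset twice, so the average converges on TRUE_A + 2*offset and the
# exact count stays hidden:
offset = layer_noise("dept = 'CS'")
pairs = [TRUE_A + 2 * offset
         + layer_noise("uid|A&B%d" % i) + layer_noise("uid|A&!B%d" % i)
         for i in range(5000)]
assert abs(statistics.mean(pairs) - (TRUE_A + 2 * offset)) < 1.0
```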

yoid2000 commented 7 years ago

There are many variables involved, so we won't be able to get a definitive answer in any case. If the columns involved are indexed and the number of filters is low, the probing should be faster. If the columns are not indexed and there are lots of filters, maybe floating the values will be faster. In between these, the results will be mixed.

Yes, I agree. With the caveat that the number of columns that have to be floated is also based on the number of filters.

Perhaps what we'd find is that average performance for floating is worse, but worst-case performance for probing could be much, much worse. Also, in cases like Teambank, where the query has to be emulated anyway, the performance for probing can be very bad.

obrok commented 7 years ago

Anyway, assuming that the query was allowed, then the noise layer in the first query would use salary* in the seed, and the second would use salary. This would result in different seeds, and therefore the analyst would not know if a difference in answer is due to the seed or the presence or absence of the victim.

That's a bit problematic. What about a query like:

SELECT count(*), count(thing) FROM
  (SELECT z/(6003614.195712-z+d+y+m) AS thing FROM
     (SELECT zipcode * 100 AS z,
             year(bday) / 10000 AS y,
             month(bday) / 1000000 AS m,
             day(bday) as d
      FROM user_info
      WHERE salary = 100000
     ) t1
  ) t2

maybe the seed should be * in the case of only count(*) at the top level, salary in the case of only count(salary), and salary* in this case? This gives another query with the same semantics though, so more room for averaging?

(Though if I think about it, the mere fact that bday and zipcode are used but never encounter a filter is by itself suspicious and could be used as the basis for rejecting the query!)

But it is used in an expression that ultimately ends up influencing the result (at least in the count(thing) case) - I suspect deciding when a usage is "legitimate" in this way will be very difficult.

yoid2000 commented 7 years ago

SELECT count(*), count(thing) FROM

Grrrrr. I don't want to obsess over this, but I don't want to ignore it either.

Probably we need to accept the fact that the analyst may be able to do a small amount of averaging. As long as it isn't too much, it is probably acceptable. Still, this gives the analyst one free sample for any column that has no NULL values...

yoid2000 commented 7 years ago

By the way, as an optimization if it is clear from the query what the condition is, then it is not necessary to float the column. Rather, the compiler can just compute what the value in the column would be if it were floated. For instance, in the following query:

SELECT count(*)
FROM table
WHERE col1 = 20

it is clear that if we were to float col1, then the resulting values would all be 20.

sebastian commented 7 years ago

@yoid2000

Any time a filter occurs, then we want to create a noise layer based on the values in the relevant column prior to the filter.

This is unclear to me. It seems to signify that the query:

SELECT age, count(*)
FROM players
WHERE age = 10
GROUP BY age

should have a noise layer based on all the values of age (i.e. prior to the filter), but this contradicts all the later examples, and my understanding from reading the rest. Therefore, what is actually meant?


Detail: bucket, extract_match, and extract_matches don't need extra floating. They are all only allowed in the top-level query, i.e. all the data required is already present. All the same they do need extra layers of noise to be added, based on the raw values going into the functions. This for example seems to be missing from example 9?


For instance, if he can learn that any column consisting of nothing but value 5 produces a noise of -1.63, then maybe this can be exploited in some attack. (I can't think of a way the analyst can learn this, nor is there an immediately obvious way the knowledge can be exploited if it is learned, but I'm nervous just the same.)

Isn't this something the analyst would learn by using a GROUP BY clause? Like SELECT age, count(*) FROM patients GROUP BY age? Then he will know for each value of age that there was a noise component based only on the particular value? Or am I misunderstanding, and the seed for any of the noise values across all the GROUP BY clauses are based on the complete set of distinct reportable age-values? I.e. same seed is used across all rows (I don't think that to be the case...)?

sebastian commented 7 years ago

The hash and the token should be cryptographically strong (SHA256).

There is a mention of a token here which isn't further defined. Is this a remnant from some past iteration? The negative condition checks issue also makes use of some tokens based on the filter type?

Can the tokens go/be removed?

yoid2000 commented 7 years ago

Any time a filter occurs, then we want to create a noise layer based on the values in the relevant column prior to the filter.

This is unclear to me. It seems to signify that the query:

SELECT age, count(*)
FROM players
WHERE age = 10
GROUP BY age

should have a noise layer based on all the values of age (i.e. prior to the filter), but this contradicts all the later examples, and my understanding from reading the rest. Therefore, what is actually meant?

The wording is clumsy. What I mean is what you deduced from the examples: that I want to float the column as it existed before the filter so that any changes to the column after the filter don't influence what we float. But the filter of course removes rows, which is what we want.

I cleaned up the text.

yoid2000 commented 7 years ago

Detail: bucket, extract_match, and extract_matches don't need extra floating. They are all only allowed in the top-level query, i.e. all the data required is already present. All the same they do need extra layers of noise to be added, based on the raw values going into the functions. This for example seems to be missing from example 9?

I think example 9 is ok, cause we are floating income, which is what is used in the bucket(). Then the cloak has to itself execute the bucket() function, and add the corresponding noise layer (which is mentioned in the text as income_class).

yoid2000 commented 7 years ago

For instance, if he can learn that any column consisting of nothing but value 5 produces a noise of -1.63, then maybe this can be exploited in some attack. (I can't think of a way the analyst can learn this, nor is there an immediately obvious way the knowledge can be exploited if it is learned, but I'm nervous just the same.)

Isn't this something the analyst would learn by using a GROUP BY clause? Like SELECT age, count(*) FROM patients GROUP BY age? Then he will know for each value of age that there was a noise component based only on the particular value? Or am I misunderstanding, and the seed for any of the noise values across all the GROUP BY clauses are based on the complete set of distinct reportable age-values? I.e. same seed is used across all rows (I don't think that to be the case...)?

You are right in that we don't use the same seed for all rows...every bucket has a different seed for the age layer, because every bucket has a different value of age. However, the analyst still can't learn any of the noise values.

If the analyst does SELECT age, count(*) FROM patients GROUP BY age, then each bucket will have two layers, one for the GROUP BY, and one from the uid (the "classic" noise that we've always had). So even if the analyst knows the exact count for some reason, the age layer is obscured by the presence of the uid layer. (And furthermore, even if the analyst could learn the noise values for age for this query, they would change in a query that had a different number of layers, so still not helpful).

yoid2000 commented 7 years ago

There is a mention of a token here which isn't further defined. Is this a remnant from some past iteration? The negative condition checks issue also makes use of some tokens based on the filter type?

Can the tokens go/be removed?

Yes, I fixed the text.

sebastian commented 7 years ago

I think example 9 is ok, cause we are floating income, which is what is used in the bucket(). Then the cloak has to itself execute the bucket() function, and add the corresponding noise layer (which is mentioned in the text as income_class).

income_class is producing a noise layer due to being used in GROUP BY. But we still need an additional layer based on the income value going into the bucket function!? Hence it should have 4 layers of noise, not 3?

obrok commented 7 years ago

income_class is producing a noise layer due to being used in GROUP BY. But we still need an additional layer based on the income value going into the bucket function!? Hence it should have 4 layers of noise, not 3?

But income_class is a function of income, no?

sebastian commented 7 years ago

Exactly, "it's a function of", but a function that does a many-to-one mapping from income to income_class.

Hence the income noise layer of bucket(income by 5) as income_class => 0 would have a noise layer influenced by any combination of the numbers 0 through 5, whereas income_class (because of the GROUP BY) produces a noise layer only based on 0.

obrok commented 7 years ago

Hence the income noise layer of bucket(income by 5) as income_class => 0 would have a noise layer influenced by any combination of the numbers 0 through 5, whereas income_class (because of the GROUP BY) produces a noise layer only based on 0.

OK, but why is it needed?

yoid2000 commented 7 years ago

I had another look. There should indeed be three noise layers derived from the three conditions (and a fourth noise layer from the uid). So @sebastian is right about that. betrag goes through two conditionals, one with the range, and one with the GROUP BY, so two of the layers should be based on betrag. Since two of the noise layers are on the same column, there should be a counter associated with the seed for each.

One question is, should the betrag layer derived from the range be based on all the rows (from the combined buckets), or just on the per-bucket rows? It could probably be either, but I don't see a problem with basing it on just the per-bucket rows. Done this way, the only difference between the seeds for the two betrag layers is the counter, but this indeed produces different seeds.

Note that if there were an attack that involved dropping the range, then this would remove the noise layer and force the resulting answer to be different whether or not a victim was included. So this is the right behavior. It does give the analyst a free extra sample, but this is something we already decided we can live with as long as the analyst can't get very many such samples...
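A sketch of the counter idea (the function name, salt, and combining scheme are illustrative assumptions): two layers on the same column with identical floated values differ only in the counter, and that is enough to decorrelate the seeds:

```python
import hashlib

def layer_seed(column, value_digest, counter, salt=b"cloak-salt"):
    # Hypothetical: a counter distinguishes multiple layers on one column.
    h = hashlib.sha256()
    h.update(column.encode())
    h.update(value_digest)
    h.update(bytes([counter]))
    h.update(salt)
    return h.hexdigest()

# Same column, same per-bucket values, different counters -> different seeds:
digest = hashlib.sha256(b"per-bucket betrag values").digest()
assert layer_seed("betrag", digest, 0) != layer_seed("betrag", digest, 1)
```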

sebastian commented 7 years ago

We currently also allow WHERE-clauses of the type columnA = columnB and columnA <> columnB. The question then becomes which values are going to be used in the seed? The distinct set of values in the intersection of columns columnA and columnB?

yoid2000 commented 7 years ago

You don't need to create a noise layer for col = col clauses. The analyst can't attack with those clauses, so they don't need to be protected.

sebastian commented 7 years ago

Ah, excellent!

yoid2000 commented 7 years ago

@obrok discovered an attack that exploits the fact that we use "number of layers" as part of the seed. The attack is to keep increasing the number of layers with "chaff" layers like age <> 1000, then use different chaff expressions so that the chaff noise is averaged away. With each increased number of layers, the analyst gets a different sample for the (non-chaff) layer he is trying to attack.

The reason I had number of layers as part of the seed in the first place was because I was concerned that an analyst might be able to reverse engineer the noise associated with a given condition, but in fact I don't have an explicit way of doing that...I was only concerned.

But it seems that we need to remove the number of layers from the seed.
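A toy simulation of the chaff attack (seed strings and noise magnitudes are made up; `layer_noise` stands in for a seeded noise layer):

```python
import random
import statistics

TRUE_COUNT = 500

def layer_noise(seed_material):
    # Deterministic per-layer noise: same seed material -> same noise value.
    return random.Random(seed_material).gauss(0, 2)

# (a) Number of layers in the seed: each chaff condition changes the layer
# count, which re-seeds the *target* layer too, so its noise becomes a fresh
# sample per query and averages away, exposing the exact count:
answers = [TRUE_COUNT
           + layer_noise("age <> %d" % (1000 + i))           # chaff layer
           + layer_noise("dept = 'CS'|layers=%d" % (2 + i))  # target layer, re-seeded
           for i in range(5000)]
assert abs(statistics.mean(answers) - TRUE_COUNT) < 1.0

# (b) Layer count removed from the seed: the target layer's noise is fixed,
# so averaging converges on TRUE_COUNT + offset, never the exact count:
offset = layer_noise("dept = 'CS'")
answers = [TRUE_COUNT + layer_noise("age <> %d" % (1000 + i)) + offset
           for i in range(5000)]
assert abs(statistics.mean(answers) - (TRUE_COUNT + offset)) < 1.0
```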

@obrok if you are ok with this, I can modify the description accordingly?

obrok commented 7 years ago

@yoid2000 please do

yoid2000 commented 7 years ago

done.....

But @obrok , as we discussed, please see if you can find attacks that let the analyst reverse engineer the noise value for a given condition.

obrok commented 7 years ago

@yoid2000 can the salt used for the extra layers and the UID layer be the same?

obrok commented 7 years ago

To clarify - I think the neatest way to code this is to actually treat the UID layer uniformly with the other layers, so it would be nice if the salt could be the same.

yoid2000 commented 7 years ago

I can think of no reason why the salt can't be the same.
