
no-uid approach #2561

Open yoid2000 opened 6 years ago

yoid2000 commented 6 years ago

This issue discusses the possibility of removing the need to float UIDs (the no-uid approach). Being able to do so could allow the cloak to request aggregates from the backend DB rather than request every row.

The enabling concept is that we change the way we create so-called UID layers (which after this I'll call dynamic layers). Instead of seeding (in part) based on the values of the UIDs in the bucket, we seed on the number of DISTINCT UIDs that comprise the bucket. As a result, we no longer need to see the individual UIDs. Rather we can get per-bucket aggregates.

A second use of UIDs is in determining how much noise to add to an answer, for instance because of flattening etc. The intent here is to use other hints about the effect of extreme users to approximate the way we flatten and add noise today.

The cost of all this is increased complexity of the SQL we produce, and increased computation in the DB. The latter, however, is a good tradeoff as long as the cost of doing the work in the DB is less than the cost of transmitting the data to the cloak and doing the work there.

The mechanisms in here aren't fully worked out, but so far I don't see any major issues.

As you go through this, please try to think of better ways to accomplish the basic goals. This is my first whack at it, but the first whack is almost never the best, so let's try to find a good way before committing too much code.

Basic Mechanisms

Obtaining counts of distinct UIDs

As I said, instead of obtaining the UIDs themselves, we obtain counts of distinct UIDs per bucket. This means that in the outer SELECT there is always something like:

SELECT ...
       count(DISTINCT uid),
       ...
       FROM ...

Obtaining statistics about user contributions (composite functions)

Because we can't use individual rows to decide how much noise to add or how much flattening to do, we need to gather certain statistics about how much users contribute to the answer. In the examples I gather max, min, avg, and stddev. What this means in practice is that the first (topmost) inner select is grouped by UID:

SELECT aggregate_function(column),
  ....
  max(column), min(column), avg(column), stddev(column),
  ....
  FROM (
    SELECT uid, column
    ...
    GROUP BY uid) t

Floating columns for the purpose of seeding

In the current approach, when we float columns we gather all column values, and use all of the distinct values combined as a seed component. With the no-uid approach, we can no longer do that. Nevertheless, we still want to produce a seed component based on the combined distinct values. In the examples I use an average of the distinct values to do this, on the basis of the average being influenced by all such values.

This means that, where floating is necessary, we'll see in the outer SELECT something like:

SELECT ...
       avg(DISTINCT floated_column),
       ...
       FROM ...

In the case of text, my examples do the following (which works for postgres), though this is just one of what I'm sure are many ways to do this.

SELECT ...
       avg(DISTINCT('x' || right(md5(floated_column),4))::bit(16)::int),
       ...
       FROM ...

Multi-query jobs

It appears unavoidable that some analyst queries will require multiple queries at the database (i.e. for SQL, multiple semicolon-separated statements if not multiple trips to the DB).

One such case is standard deviation, which requires two passes through the data: one to compute the global average, and then another to compute the actual standard deviation. I'm sure we'll see more of this as we get into more complex statistical or machine learning functions.

Another is the mixing of composite and single-point aggregation functions in a single analyst query. Composite aggregation functions are those that use all values in the computation, and include count, sum, average, and standard deviation. Single-point aggregation functions are those whose result is taken from a single value in the data, like min, max, and median. These are treated quite differently, and so such queries need to be broken up into multiple queries.

Examples

Following are some examples. They are designed to work on the banking database. In all cases (except where mentioned), the correctness of the examples has been checked against the expected answers from the database.

Count total number of rows

Almost the simplest possible query is this:

SELECT count(*)
FROM accounts

Today the cloak would modify this query to be something like this:

SELECT client_id
FROM accounts

and then compute the counts itself, using client_id to seed the dynamic noise layer and to flatten (and in principle to do low count filtering, though that wouldn't trigger in this case).

With the new no-uid approach, the cloak could do something like this:

SELECT sum(cnt),                   -- the true answer
       count(DISTINCT client_id) AS duids,         -- used for LCF and seed
       max(cnt), min(cnt), avg(cnt), stddev(cnt)   -- used to compute noise
FROM
   (SELECT client_id, count(*) AS cnt
    FROM accounts
    GROUP BY client_id) t

duids (count of DISTINCT UIDs) is used as part of the seed computation, in place of the hash of actual UIDs today.

sum(cnt) produces the same values as count(*) from the original query. We'll add noise to sum(cnt) for each bucket.

duids is also used to low-count filter. The noisy threshold is computed from the various seed components, including duids.

max(cnt), min(cnt), avg(cnt), and stddev(cnt) are used to decide how much noise to add. (Exact details to be determined if we decide to go ahead with this approach.) I'm not sure we need all four. max(cnt) and min(cnt) give us the most extreme values in the bucket. We can see how extreme it is relative to the average (avg(cnt)). We can also compute what it contributes to the standard deviation (stddev(cnt)). In general, the more extreme, the more flattening and the more noise. I think this will turn out to be a reasonable approach, but more work needed obviously.
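Purely to illustrate the kind of hint these statistics make possible (this is not the actual algorithm, just a sketch; the two-standard-deviation threshold and the use of greatest() are my own assumptions):

SELECT sum(cnt),
       count(DISTINCT client_id) AS duids,
       -- hypothetical hint: how far the heaviest user sticks out beyond avg + 2*stddev
       greatest(max(cnt) - (avg(cnt) + 2 * stddev(cnt)), 0) AS flatten_hint
FROM
   (SELECT client_id, count(*) AS cnt
    FROM accounts
    GROUP BY client_id) t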

Note as a side point that, for the banking table, this reduces the number of rows returned from the DB from 5369 to 1. In this case, the query times at the DB are about the same.

The answer to the above query is:

sum   duids  max  min  avg         stddev
5369  5369   1    1    1.00000000  0

The max, avg, and stddev show not only that there are no extreme values, but that every user contributes the same amount: both the fact that max == avg and the fact that stddev = 0 tell us this. So we can add the minimum noise here (SD=1 per layer).

If however we do the same query on the transactions table:

SELECT sum(cnt),
       count(DISTINCT client_id) AS duids,          -- used for LCF and seed
       max(cnt), min(cnt), avg(cnt), stddev(cnt)    -- used to compute noise
FROM
   (SELECT client_id, count(*) AS cnt
    FROM transactions
    GROUP BY client_id) t

we get this answer:

sum      duids  max  min  avg           stddev
1262625  5369   675  9    235.16949152  127.676758457762

Here we see that some UIDs are contributing more than others. So the question is, how much flattening should we do, and how much noise should we add. For the first question, we would like to know how extreme the max is compared to the subsequent few users.

As a thought experiment, one way to get this avg and stddev would be for half of the values (other than our max and min) to be 107 and the other half to be 363, i.e. roughly avg - stddev and avg + stddev. Were this the case, then indeed our max would be extreme compared to the next highest values. But one can suppose this to be very unlikely. Even in this worst case, we only need to flatten the extreme value by 312 (675 - 363) to effectively hide it among the others. (And presumably we could flatten the min by adding 98 to bring it up to 107.) In this worst case, we'd want the baseline noise to be 363. The cloak currently reports noise of 340 for this query.

Note finally that we can probably use the distribution of values in the shadow as a hint, for instance to increase our confidence that we don't have some corner case.

Count total number of DISTINCT UIDs in the table

Analyst query:

SELECT count(DISTINCT client_id)
FROM accounts

With the new no-uid approach, the cloak could do something like this:

SELECT sum(cnt),                   -- the true answer
       count(DISTINCT client_id) AS duids,         -- used for LCF and seed
       max(cnt), min(cnt), avg(cnt), stddev(cnt)   -- used to compute noise
FROM
   (SELECT client_id, count(DISTINCT client_id) AS cnt
    FROM accounts
    GROUP BY client_id) t

The answer shows that only minimum noise is needed:

sum   duids  max  min  avg            stddev
5369  5369   1    1    1.00000000000  0

Count DISTINCT values for a given column

Analyst query:

SELECT count(DISTINCT frequency)
FROM accounts

I'll write this as two queries. Someone with more SQL skills than me can perhaps figure out how to do this in one query efficiently.

The two queries are these:

SELECT max(cnt), min(cnt), avg(cnt), stddev(cnt)   -- used to compute noise
FROM
   (SELECT client_id, count(DISTINCT frequency) AS cnt
    FROM accounts
    GROUP BY 1) t
SELECT count(DISTINCT client_id) AS duid,     -- used for seed and LCF
       count(DISTINCT frequency) AS dcnt      -- true answer
FROM
    (SELECT client_id, frequency
     FROM accounts) t1 

Though note that if the table has one row per user, then the first query is not necessary because we'll already know the answer. That is the case above.
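As one possible answer to the single-query question above, here is a hedged sketch (postgres-specific, not checked against the expected answers) that pulls the true distinct count in via a scalar subquery while the per-user statistics come from the grouped subquery:

SELECT max(cnt), min(cnt), avg(cnt), stddev(cnt),   -- used to compute noise
       count(*) AS duid,            -- one row per client_id, so this is the distinct UID count
       (SELECT count(DISTINCT frequency)
        FROM accounts) AS dcnt      -- true answer
FROM
   (SELECT client_id, count(DISTINCT frequency) AS cnt
    FROM accounts
    GROUP BY 1) t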

Suppose the analyst wants to do this on transactions, for instance:

SELECT count(DISTINCT trans_date)
FROM transactions

The no-uid queries are:

SELECT max(cnt), min(cnt), avg(cnt), stddev(cnt)   -- used to compute noise
FROM
   (SELECT client_id, count(DISTINCT trans_date) AS cnt
    FROM transactions
    GROUP BY 1) t
SELECT count(DISTINCT trans_date) AS dcnt,    -- true answer
       count(DISTINCT client_id) AS duid      -- used for seed and LCF
FROM
    (SELECT client_id, trans_date
     FROM transactions) t1 

The answers to the two queries are:

dcnt  duid
2191  5369

max  min  avg        stddev
548  9    181.48947  97.5453997

Here we see that some additional noise and maybe flattening is needed, but again not too much for the same arguments as I made above.

Sum of a column

Analyst query:

SELECT sum(amount)
FROM transactions

Equivalent no-uid query:

SELECT sum(sm),                     -- the true answer
       count(DISTINCT client_id) AS duids,      -- used for LCF and seed
       max(sm), min(sm), avg(sm), stddev(sm)    -- used to compute noise
FROM
   (SELECT client_id, sum(amount) AS sm
    FROM transactions
    GROUP BY client_id) t

Average of a column

Analyst query:

SELECT avg(amount)
FROM transactions

Equivalent no-uid query:

SELECT sum(sm)/sum(cnt) AS average,
       count(DISTINCT client_id) AS duids,
       max(sm/cnt), min(sm/cnt), avg(sm/cnt), stddev(sm/cnt)
FROM
   (SELECT client_id, sum(amount) AS sm, count(*) AS cnt
    FROM transactions
    GROUP BY client_id) t

The answer is:

average    duids  max         min       avg        stddev
5873.1954  5369   21062.6132  843.4355  5690.1697  4018.0798

Note here that for the purpose of deciding the amount of noise, we would divide the values here (max, min, and avg) by duids. So max=3.92 and avg=1.06.

Standard Deviation of a column

Analyst query:

SELECT stddev(amount)
FROM transactions

The no-uid query is:

SELECT sqrt(sum(sm)/sum(cnt)) AS sd,
       count(DISTINCT client_id) AS duids,
       max(sqrt(sm/cnt)), min(sqrt(sm/cnt)),
       avg(sqrt(sm/cnt)), stddev(sqrt(sm/cnt))
FROM
   (SELECT client_id, sum(diff) AS sm, count(*) AS cnt
    FROM
      (SELECT client_id, pow(abs(amount - (SELECT avg(amount)
                                           FROM transactions)), 2) AS diff
       FROM transactions
      ) t1
    GROUP BY client_id
   ) t3

Note that the average needed for the standard deviation computation is computed as a SELECT embedded in the abs function. My postgres is smart enough to compute this value once and re-use it, but not every implementation might be.

There is a small difference in the answers of the analyst and no-uid queries here. The analyst query returns 9470.0803, while the no-uid query returns 9470.0766. So I'm not sure I'm really doing the computation right here.

The answer is this:

sd         duids  max         min        avg        stddev
9470.0765  5369   27470.2953  3618.6551  8036.1940  4500.4135

Note that the amount of noise added here would be proportional to the max or avg divided by the number of distinct users (so max 5.1 and avg 1.5).

Max of a column

Analyst query:

SELECT max(amount)
FROM transactions

I think the easiest way to deal with this is to just get the top 100 or even 500 rows from the table, and more-or-less do the same kind of computation we've been doing:

SELECT client_id, amount
FROM transactions
ORDER BY 2 desc
LIMIT 100

The reason for getting so many is to minimize the probability that everything we gather belongs to just one or two distinct users. Also, if the total number of rows is small, then with this LIMIT we could obtain all rows and then avoid the problem of max being less than min.

Min of a column

Same as max, but in reverse.
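For concreteness, the fetch mirroring the max query above would presumably be:

SELECT client_id, amount
FROM transactions
ORDER BY 2 asc
LIMIT 100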

Median of a column

Analyst query:

SELECT median(amount)
FROM transactions

Ideally we'd like to sample a set of values above and below the true median, and again do more-or-less what we do today. This seems to me to require two queries (or a UNION or JOIN of essentially two queries).

Possibly we could consider just sampling above or below the true median, and assuming that the distribution of values above is pretty close to what it is below.
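A hedged sketch of the two-sided sampling (postgres-specific and illustrative only; percentile_cont(0.5) stands in for the true median, and the LIMIT of 100 mirrors the max/min examples):

(SELECT client_id, amount
 FROM transactions
 WHERE amount >= (SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY amount)
                  FROM transactions)
 ORDER BY amount ASC
 LIMIT 100)
UNION ALL
(SELECT client_id, amount
 FROM transactions
 WHERE amount < (SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY amount)
                 FROM transactions)
 ORDER BY amount DESC
 LIMIT 100)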

Average and Standard Deviation of a column

Analyst query:

SELECT avg(amount), stddev(amount)
FROM transactions

The no-uid query is:

SELECT sqrt(sum(sd_sm)/sum(cnt)) AS sd,
       sum(sm)/sum(cnt) AS avg,
       count(DISTINCT client_id) AS duids,
       max(sqrt(sd_sm/cnt)), min(sqrt(sd_sm/cnt)),
       avg(sqrt(sd_sm/cnt)), stddev(sqrt(sd_sm/cnt)),
       max(sm/cnt), min(sm/cnt), avg(sm/cnt), stddev(sm/cnt)
FROM
   (SELECT client_id, sum(diff) AS sd_sm, 
           count(*) AS cnt, sum(amount) as sm
    FROM
      (SELECT client_id, amount,
              pow(abs(amount - (SELECT avg(amount)
                                FROM transactions)), 2) AS diff
       FROM transactions
      ) t1
    GROUP BY client_id
   ) t3

Here it is just a matter of including all of the needed values. Other than counting duids, we double the required computations.

Average and Max of a column

Analyst query:

SELECT avg(amount), max(amount)
FROM transactions

This would just require two separate queries, one for avg and one for max. I don't see a way to avoid that.

Histogram row count per value

Analyst's query:

SELECT frequency, count(*)
FROM accounts
GROUP BY 1

No-uid query:

SELECT frequency, sum(cnt),                      -- the true answer
       count(distinct client_id) AS duids,       -- used for LCF and seed
       max(cnt), min(cnt), avg(cnt), stddev(cnt) -- used to compute noise
FROM
   (SELECT client_id, frequency, count(*) AS cnt
    FROM accounts
    GROUP BY 1, 2) t
GROUP BY 1

This is just like the simple counting all rows query, except that we additionally group by the column in both SELECTs.

The answer to the above query is this:

frequency             sum   duids  max  min  avg         stddev
'POPLATEK MESICNE'    4980  4980   1    1    1.00000000  0
'POPLATEK PO OBRATU'  107   107    1    1    1.00000000  0
'POPLATEK TYDNE'      282   282    1    1    1.00000000  0

A histogram query can of course return a lot of rows, if for instance the values tend to be unique. In this case, one could prevent having to process all these rows with a pair of queries like this:

SELECT amount, sum(cnt),                     -- the true answer
       count(distinct client_id) AS duids,       -- used for LCF and seed
       max(cnt), min(cnt), avg(cnt), stddev(cnt) -- used to compute noise
FROM
   (SELECT client_id, amount, count(*) AS cnt
    FROM transactions
    GROUP BY 1, 2) t
GROUP BY 1
HAVING sum(cnt) > 1
SELECT count(*)
FROM
   (SELECT amount, count(*)
    FROM transactions
    GROUP BY amount
    HAVING count(*) = 1) t

The first query will, presumably, produce some number of values that are LCF and therefore go into the * bucket. We need to think a little about how to determine the amount of noise for the * bucket in this case, because strictly speaking we don't have the min/max/avg/sd for the UIDs that went into the * bucket. Probably we can make the assumption that the statistics for the UIDs over the complete set more-or-less hold for the values that are LCF, and set noise accordingly.

The second query tells us the count of the * bucket for those values that have only a single user. I think that these can just be added to the * bucket computed from the first query, perhaps with a little more noise.

Note that the shadow, if we have one, could be a hint that we can benefit by making two queries instead of one.

Histogram of row counts by user (number of transactions)

This query computes a histogram of how many users have how many transactions.

SELECT num_trans, count(*) FROM
   (SELECT client_id, count(*) AS num_trans
    FROM transactions
    GROUP BY client_id) t
GROUP BY num_trans

No-uid query:

SELECT cnt AS num_trans, count(*),             -- the true answer
       count(distinct client_id) AS duids,         -- used for LCF and seed
       max(cnt), min(cnt), avg(cnt), stddev(cnt)   -- used to compute noise
FROM
   (SELECT client_id, count(*) AS cnt
    FROM transactions
    GROUP BY client_id) t
GROUP BY cnt

Though note that since in any event this query groups by client_id, there is no reason to separately compute duids or max/min/avg/stddev. In other words, num_trans is always the same as min and max, count(*) is the same as duids, and stddev is always 0.
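So a simplified sketch of the no-uid query would presumably be:

SELECT cnt AS num_trans, count(*)     -- count(*) doubles as duids here
FROM
   (SELECT client_id, count(*) AS cnt
    FROM transactions
    GROUP BY client_id) t
GROUP BY cnt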

As with above, if we think there might be a long LCF tail, we could break it into two queries:

SELECT cnt AS num_trans, count(*),             -- the true answer
       count(distinct client_id) AS duids,         -- used for LCF and seed
       max(cnt), min(cnt), avg(cnt), stddev(cnt)   -- used to compute noise
FROM
   (SELECT client_id, count(*) AS cnt
    FROM transactions
    GROUP BY client_id) t
GROUP BY cnt
HAVING count(*) > 1
SELECT count(*) 
FROM
   (SELECT num_trans, count(*) FROM
      (SELECT client_id, count(*) AS num_trans
       FROM transactions
       GROUP BY client_id) t
    GROUP BY num_trans
    HAVING count(*) = 1) t2

Histogram of averages

Analyst query:

SELECT cli_district_id, avg(amount)
FROM transactions
GROUP BY cli_district_id

No-uid query:

SELECT cli_district_id, sum(sm)/sum(cnt) AS average,
       count(DISTINCT client_id) AS duids,
       max(sm/cnt), min(sm/cnt), avg(sm/cnt), stddev(sm/cnt)
FROM
   (SELECT client_id, cli_district_id, sum(amount) AS sm, count(*) AS cnt
    FROM transactions
    GROUP BY 1,2) t
GROUP BY 1

Histogram of Standard Deviations

Analyst query:

SELECT operation, stddev(amount)
FROM transactions
GROUP BY operation

Note that with standard deviations, we can't simply add a GROUP BY for the histogram column like the examples above. This is because the avg for each histogram value needs to be computed first. Maybe there is a way to do that with a single SQL query, but I don't know what it is (or even if it is a good idea if there is a way). I think if we are going to go with no-uid, then we need to be prepared to deal with compound queries (a string of ; separated queries all at once) or even multiple queries (query, reply, query, reply).

A (composite) no-uid query that works is:

CREATE TEMP TABLE foo AS
  SELECT operation, avg(amount) AS avg1
  FROM transactions
  GROUP BY 1;

SELECT operation,
       sqrt(sum(sm)/sum(cnt)) AS sd,
       count(DISTINCT client_id) AS duids,
       max(sqrt(sm/cnt)), min(sqrt(sm/cnt)),
       avg(sqrt(sm/cnt)), stddev(sqrt(sm/cnt))
FROM
   (SELECT client_id, operation, sum(diff) AS sm, count(*) AS cnt
    FROM
      (SELECT client_id, operation,
              pow(abs(amount - avg1), 2) AS diff
       FROM 
          ( SELECT client_id, foo.operation AS operation, amount, avg1
            FROM
            foo JOIN transactions
            ON foo.operation = transactions.operation ) t
      ) t1
    GROUP BY 1,2
   ) t3
GROUP BY 1;

DROP TABLE foo;
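As a possible single-query alternative to the temp table (a hedged sketch, postgres-specific and not checked against the expected answers): the per-operation average can be computed inline with a window function, which would avoid the compound query.

SELECT operation,
       sqrt(sum(sm)/sum(cnt)) AS sd,
       count(DISTINCT client_id) AS duids,
       max(sqrt(sm/cnt)), min(sqrt(sm/cnt)),
       avg(sqrt(sm/cnt)), stddev(sqrt(sm/cnt))
FROM
   (SELECT client_id, operation, sum(diff) AS sm, count(*) AS cnt
    FROM
      (SELECT client_id, operation,
              -- per-operation average computed inline via a window function
              pow(abs(amount - avg(amount) OVER (PARTITION BY operation)), 2) AS diff
       FROM transactions
      ) t1
    GROUP BY 1,2
   ) t3
GROUP BY 1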

Count total number of rows with posand

Query:

SELECT count(*)
FROM accounts
WHERE cli_district_id = 68

No-uid modified:

SELECT sum(cnt),                                   -- the true answer
       avg(DISTINCT cli_district_id),          -- for seeding posand
       count(DISTINCT client_id) AS duids,         -- used for LCF and seed
       max(cnt), min(cnt), avg(cnt), stddev(cnt)   -- used to compute noise
FROM
   (SELECT client_id, cli_district_id, count(*) AS cnt
    FROM accounts
    WHERE cli_district_id = 68
    GROUP BY 1,2) t

Here we are floating the posand so that we can seed the corresponding noise layers. We take the average of the posand because otherwise we'd have to do a GROUP BY and that, in some cases, could explode into a lot of rows.

For instance, take the following query:

SELECT count(*)
FROM accounts
WHERE lastname LIKE 'A%'

If we were to modify it like this:

SELECT lastname, sum(cnt),   -- the true answers, grouped by lastname
       count(DISTINCT client_id) AS duids,         -- used for LCF and seed
       max(cnt), min(cnt), avg(cnt), stddev(cnt)   -- used to compute noise
FROM
   (SELECT client_id, lastname, count(*) AS cnt
    FROM accounts
    WHERE lastname LIKE 'A%'
    GROUP BY 1,2) t
GROUP BY 1

Then it would return almost one row per user, so virtually no savings.

So instead we can take the average of the lastnames (after conversion to int) and use that as a seed component for the LIKE posand:

SELECT sum(cnt),                   -- true answer
       avg(DISTINCT('x' || right(md5(lastname),4))::bit(16)::int) AS seed,
       count(DISTINCT client_id) AS duids,         -- used for LCF and seed
       max(cnt), min(cnt), avg(cnt), stddev(cnt)   -- used to compute noise
FROM
   (SELECT client_id, lastname, count(*) AS cnt
    FROM accounts
    WHERE lastname LIKE 'A%'
    GROUP BY 1,2) t

There is almost certainly a better way to produce some composite value out of strings. And the above is postgres specific.

If we implement #2459, its datetime equivalent, and some shadow-based approach for strings, then we can avoid most floating.

Count total number of DISTINCT UIDs in the table with posand

Analyst query:

SELECT count(DISTINCT client_id)
FROM accounts
WHERE cli_district_id = 68

With the new no-uid approach, the cloak could do something like this:

SELECT sum(cnt),                   -- the true answer
       avg(cli_district_id),
       count(DISTINCT client_id) AS duids,         -- used for LCF and seed
       max(cnt), min(cnt), avg(cnt), stddev(cnt)   -- used to compute noise
FROM
   (SELECT client_id, cli_district_id, count(DISTINCT client_id) AS cnt
    FROM accounts
    WHERE cli_district_id = 68
    GROUP BY 1,2) t

The posand could also be placed on the outer select, but I would assume it is generally more efficient to do it in the inner select.
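For illustration, the outer-select variant might look like this (hedged sketch):

SELECT sum(cnt),
       avg(cli_district_id),
       count(DISTINCT client_id) AS duids,
       max(cnt), min(cnt), avg(cnt), stddev(cnt)
FROM
   (SELECT client_id, cli_district_id, count(DISTINCT client_id) AS cnt
    FROM accounts
    GROUP BY 1,2) t
WHERE cli_district_id = 68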

Count DISTINCT values for a given column with posand

Analyst query:

SELECT count(DISTINCT trans_date)
FROM transactions
WHERE k_symbol = 'DUCHOD'

I'll write this as two queries. Someone with more SQL skills than me can perhaps figure out how to do this in one query efficiently.

The two queries are these:

SELECT max(cnt), min(cnt), avg(cnt), stddev(cnt)   -- used to compute noise
FROM
   (SELECT client_id, count(DISTINCT trans_date) AS cnt
    FROM transactions
    WHERE k_symbol = 'DUCHOD'
    GROUP BY 1) t
SELECT count(DISTINCT trans_date) AS dcnt,    -- true answer
       count(DISTINCT client_id) AS duid,     -- uid seed and lcf
       avg(DISTINCT('x' || right(md5(k_symbol),4))::bit(16)::int) AS seed
FROM
    (SELECT client_id, trans_date, k_symbol
     FROM transactions
     WHERE k_symbol = 'DUCHOD') t1 

Standard Deviation of a column with posand

Analyst query:

SELECT stddev(amount)
FROM transactions
WHERE cli_district_id = 42

The no-uid query is:

SELECT sqrt(sum(sm)/sum(cnt)) AS sd,
       count(DISTINCT client_id) AS duids,
       avg(cli_district_id) AS seed,
       max(sqrt(sm/cnt)), min(sqrt(sm/cnt)),
       avg(sqrt(sm/cnt)), stddev(sqrt(sm/cnt))
FROM
   (SELECT client_id, cli_district_id, sum(diff) AS sm, count(*) AS cnt
    FROM
      (SELECT client_id, cli_district_id,
              pow(abs(amount - (SELECT avg(amount) 
                                FROM transactions
                                WHERE cli_district_id = 42)), 2) AS diff
       FROM transactions
       WHERE cli_district_id = 42
      ) t1
    GROUP BY 1,2
   ) t3

Note that the SELECT embedded in the abs function also needs the WHERE clause.

The answers to the two above queries are not quite identical, so I wonder if something is wrong. The first query returns 8242.9778 while the second returns 8242.6026.

Here is the same thing with two posands.

Analyst query:

SELECT stddev(amount)
FROM transactions
WHERE cli_district_id = 42 AND
      k_symbol = 'DUCHOD' 

The no-uid query is:

SELECT sqrt(sum(sm)/sum(cnt)) AS sd,
       count(DISTINCT client_id) AS duids,
       avg(cli_district_id) AS cli_seed,
       avg(DISTINCT('x' || right(md5(k_symbol),4))::bit(16)::int) AS k_seed,
       max(sqrt(sm/cnt)), min(sqrt(sm/cnt)),
       avg(sqrt(sm/cnt)), stddev(sqrt(sm/cnt))
FROM
   (SELECT client_id, cli_district_id, k_symbol,
           sum(diff) AS sm, count(*) AS cnt
    FROM
      (SELECT client_id, cli_district_id, k_symbol,
              pow(abs(amount - (SELECT avg(amount) 
                                FROM transactions
                                WHERE cli_district_id = 42 AND
                                k_symbol = 'DUCHOD')), 2) AS diff
       FROM transactions
       WHERE cli_district_id = 42 AND
             k_symbol = 'DUCHOD'
      ) t1
    GROUP BY 1,2,3
   ) t3

Note that the SELECT embedded in the abs function also needs the WHERE clause.

Here the difference in answers is even bigger: 844.0798 versus 843.1372. Hmmmm.

Histogram with HAVING in inner select

Analyst query (noting first that the cloak probably requires that the analyst put the uid in the inner select, which is not done here, and second that the analyst would use the bucket function, not floor):

SELECT floor(sm/10000)*10000 AS buckets, count(*)
FROM (
   SELECT sum(amount) AS sm
   FROM transactions
   GROUP BY client_id
   HAVING max(cast(left(trans_date, 2) AS int)) = 98 ) t
GROUP BY 1
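For reference, the version with the uid in the inner select (which the cloak would presumably require, still using floor in place of the bucket function) might look like:

SELECT floor(sm/10000)*10000 AS buckets, count(*)
FROM (
   SELECT client_id, sum(amount) AS sm
   FROM transactions
   GROUP BY client_id
   HAVING max(cast(left(trans_date, 2) AS int)) = 98 ) t
GROUP BY 1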

No-uid query:

SELECT floor(sm/10000)*10000 AS buckets, count(*) as cnt,
       count(DISTINCT client_id) AS duids,
       max(sm), min(sm), avg(sm), stddev(sm),
       avg(DISTINCT('x' || right(md5(mx_seed),4))::bit(16)::int) AS seed_mx,
       avg(DISTINCT('x' || right(md5(mn_seed),4))::bit(16)::int) AS seed_mn,
       avg(ct_seed) as seed_ct
FROM (
   SELECT client_id, sum(amount) AS sm,
          max(trans_date) as mx_seed,
          min(trans_date) as mn_seed,
          count(trans_date) as ct_seed
   FROM transactions
   GROUP BY client_id
   HAVING max(cast(left(trans_date, 2) AS int)) = 98 ) t
GROUP BY 1

Here I'm pulling up the min, max, and count of trans_date (the column in the HAVING clause) for the purpose of seeding the HAVING noise layer. This reflects what we do today for aggregated inner selects.

Histogram with WHERE in aggregated inner select

Analyst query:

SELECT floor(sm/10000)*10000 AS buckets, count(*)
FROM (
   SELECT sum(amount) AS sm
   FROM transactions
   WHERE cast(left(trans_date, 2) AS int) = 98
   GROUP BY client_id ) t
GROUP BY 1

No-uid query:

SELECT floor(sm/10000)*10000 AS buckets, count(*) as cnt,
       count(DISTINCT client_id) AS duids,
       max(sm), min(sm), avg(sm), stddev(sm),
       avg(DISTINCT('x' || right(md5(mx_seed),4))::bit(16)::int) AS seed_mx,
       avg(DISTINCT('x' || right(md5(mn_seed),4))::bit(16)::int) AS seed_mn,
       avg(ct_seed) as seed_ct
FROM (
   SELECT client_id, sum(amount) AS sm,
          max(trans_date) as mx_seed,
          min(trans_date) as mn_seed,
          count(trans_date) as ct_seed
   FROM transactions
   WHERE cast(left(trans_date, 2) AS int) = 98
   GROUP BY client_id ) t
GROUP BY 1

Count total number of rows with negand

Query:

SELECT count(*)
FROM accounts
WHERE cli_district_id <> 68

The new plan is to drop low-effect negands. From the shadow, we can decide one of three things:

  1. The negand is high-effect, so we leave it in and seed from inspection of the SQL (not floating).
  2. The negand is low-effect, so we drop the negand.
  3. We aren't sure.

Here we discuss only the third case. In the third case, we need to determine from querying the database whether or not the negand is low effect.

In the current setup, where the cloak pulls every row from the DB, the cloak can drop the negand in its query to the DB, and inspect the returned rows to see how many would have been excluded by the negand. Then if the cloak decides that the negand is low effect, it can in principle act as though the negand were never there (i.e. leave the rows in the answer). It can also choose to keep the noise layers associated with the negand, or exclude them.

Here is one possible no-uid query:

SELECT sum(cnt_with) + sum(correction) AS true_with,
                                           -- the true answer with negand
       count(DISTINCT uid_with) AS duid_with,  -- duid with negand
       max(cnt_with), min(cnt_with), avg(cnt_with), stddev(cnt_with),
       sum(cnt_wo) AS true_wo,          -- the true answer without negand
       count(DISTINCT uid_wo) AS duid_wo,      -- duid wo negand
       max(cnt_wo), min(cnt_wo), avg(cnt_wo), stddev(cnt_wo),
       count(DISTINCT uid_neg) AS duid_neg     -- duid of negand match rows
FROM
    (SELECT client_id AS uid_wo, 
            count(*) AS cnt_wo,
            max(uid_w) AS uid_with,
            CASE WHEN count(col_w) = 0 THEN NULL
                 ELSE count(col_w)
            END AS cnt_with,
            sum(corrections) AS correction,
            max(uid_neg) AS uid_neg
     FROM
       (SELECT client_id, 
        CASE WHEN cli_district_id <> 68 THEN client_id
             ELSE NULL
        END AS uid_w,       -- to count UIDs with negand
        CASE WHEN cli_district_id <> 68 THEN cli_district_id
             ELSE NULL
        END AS col_w,       -- to get count(*) with negand
        CASE WHEN cli_district_id <> 68 THEN NULL
             ELSE client_id
        END AS uid_neg,         -- count removed UIDs
        CASE WHEN cli_district_id IS NULL THEN 1
             ELSE 0
        END AS corrections
        FROM accounts) t
     GROUP BY 1) t

The answer to the above query is:

true_with  duid_with  max  min  avg  stddev
5283       5283       1    1    1.0  0

true_wo  duid_wo  max  min  avg  stddev  duid_neg
5369     5369     1    1    1.0  0       86

The basic idea of this query is that we compute three things:

  1. The result we'd get if the negand were included in the query (true_with).
  2. The result we'd get if the negand were dropped from the query (true_wo).
  3. The number of distinct UIDs for rows that the negand would have removed from the answer (duid_neg).

Depending on the value of duid_neg, and whether or not it passes LCF, we select either true_with or true_wo as the result we report to the analyst (after noise). In this particular case, duid_neg=86, so it would clearly pass LCF and we'd use the true_with value when reporting.

What I'm doing here is including the rows where cli_district_id = 68 in the computation. This allows me to compute true_wo and duid_neg. In order to compute true_with, what I'm doing is generating new columns where the value is NULL if cli_district_id = 68 (to reflect the fact that the row would have been excluded from the original query).

Since in this case we are computing count(*), which normally would count NULL rows, I need to correct for the case where the original column value is in fact NULL. Otherwise we wouldn't be counting those rows when we should. Thus the corrections column and subsequent handling. (Note: the correction mechanism isn't properly tested.)

The reason for the CASE statement when computing cnt_with is that otherwise the min, avg, and stddev values are messed up. Without the CASE, count(col_w) returns 0 for users whose rows are all excluded by the negand, and that 0 then gets incorrectly incorporated into the min, avg, and stddev computations.

Another approach one could take with this is to tag columns where cli_district_id = 68, and then use the tag when computing true_with and true_wo:

-- old approach
SELECT sum(cnt_with) AS true_with,             -- the true answer with negand
       count(DISTINCT uid_with)-1 AS duid_with,  -- duid with negand
       max(cnt_with) AS max_with, sum(cnt_with)/sum(tg) AS avg_with,
       sum(cnt_wo) AS true_wo,             -- the true answer without negand
       count(DISTINCT uid_wo) AS duid_wo,  -- duid with out negand
       max(cnt_wo) AS max_wo, min(cnt_wo) AS min_wo, 
       avg(cnt_wo) AS avg_wo, stddev(cnt_wo) AS stddev_wo,
       count(DISTINCT uid_eff)-1 AS duid_eff  -- duid for only rows hitting negand
FROM
    (SELECT client_id AS uid_wo, sum(cnt) AS cnt_wo,
            max(client_id*tag) AS uid_with, sum(cnt*tag) AS cnt_with, 
            max(client_id*abs(tag-1)) AS uid_eff, max(tag) AS tg
     FROM
       (SELECT client_id, cli_district_id,
        CASE WHEN cli_district_id <> 68 THEN 1
             ELSE 0
        END AS tag,
        count(*) AS cnt
        FROM accounts
        GROUP BY 1,2) t
     GROUP BY client_id) t

The answer is this:

true_with  duid_with  max_with  avg_with
5283       5283       1         1.0000000

true_wo  duid_wo  max_wo  min_wo  avg_wo     stddev_wo  duid_eff
5369     5369     1       1       1.0000000  0          86

This was in fact my original approach. It got me into trouble in cases where we have multiple negands. I mention it here just for completeness and to see if it gives you any other good ideas.

As another example, assume the analyst query is this:

SELECT count(*)
FROM transactions
WHERE trans_date <> '930118'

The corresponding no-uid query is:

SELECT sum(cnt_with) + sum(correction) AS true_with,
                                           -- the true answer with negand
       count(DISTINCT uid_with) AS duid_with,  -- duid with negand
       max(cnt_with), min(cnt_with), avg(cnt_with), stddev(cnt_with),
       sum(cnt_wo) AS true_wo,          -- the true answer without negand
       count(DISTINCT uid_wo) AS duid_wo,      -- duid wo negand
       max(cnt_wo), min(cnt_wo), avg(cnt_wo), stddev(cnt_wo),
       count(DISTINCT uid_neg) AS duid_neg     -- duid of negand match rows
FROM
    (SELECT client_id AS uid_wo, 
            count(*) AS cnt_wo,
            max(uid_w) AS uid_with,
            CASE WHEN count(col_w) = 0 THEN NULL
                 ELSE count(col_w)
            END AS cnt_with,
            sum(corrections) AS correction,
            max(uid_neg) AS uid_neg
     FROM
       (SELECT client_id, trans_date,
        CASE WHEN trans_date <> '930118' THEN client_id
             ELSE NULL
        END AS uid_w,       -- to count UIDs with negand
        CASE WHEN trans_date <> '930118' THEN trans_date
             ELSE NULL
        END AS col_w,       -- to get count(*) with negand
        CASE WHEN trans_date <> '930118' THEN NULL
             ELSE client_id
        END AS uid_neg,         -- count removed UIDs
        CASE WHEN trans_date IS NULL THEN 1
             ELSE 0
        END AS corrections
        FROM transactions) t
     GROUP BY 1) t

The answer is:

true_with  duid_with  max  min  avg       stddev
1262622    5369       675  9    235.1689  127.6760

true_wo  duid_wo  max  min  avg       stddev    duid_neg
1262625  5369     675  9    235.1694  127.6767  3

Sum of a column with negand

Analyst query:

SELECT sum(amount)
FROM transactions
WHERE trans_date <> '930118'

(True answer 7415632162.8918, true answer without negand, 7415643454.89165, difference of 11291.99985.)

The no-uid query is:

SELECT sum(sum_with) AS true_with,         -- the true answer with negand
       count(DISTINCT uid_with) AS duid_with,  -- duid with negand
       max(sum_with), min(sum_with), avg(sum_with), stddev(sum_with),
       sum(sum_wo) AS true_wo,          -- the true answer without negand
       count(DISTINCT uid_wo) AS duid_wo,      -- duid wo negand
       max(sum_wo), min(sum_wo), avg(sum_wo), stddev(sum_wo),
       count(DISTINCT uid_neg) AS duid_neg     -- duid of negand match rows
FROM
    (SELECT client_id AS uid_wo, 
            sum(amount) AS sum_wo,
            max(uid_w) AS uid_with,
            sum(col_w) AS sum_with,
            max(uid_neg) AS uid_neg
     FROM
       (SELECT client_id, amount,
        CASE WHEN trans_date <> '930118' THEN client_id
             ELSE NULL
        END AS uid_w,       -- to count UIDs with negand
        CASE WHEN trans_date <> '930118' THEN amount
             ELSE NULL
        END AS col_w,       -- to get count(*) with negand
        CASE WHEN trans_date <> '930118' THEN NULL
             ELSE client_id
        END AS uid_neg          -- count removed UIDs
        FROM transactions) t
     GROUP BY 1) t

The answer is:

true_with      duid_with  max        min    avg          stddev
7415632162.90  5369       7619102.4  29400  1381194.293  1335574.331

true_wo        duid_wo  max        min    avg          stddev       duid_neg
7415643454.90  5369     7619102.4  29400  1381196.396  1335580.445  3

Standard Deviation of a column with negand

Analyst query:

SELECT stddev(amount)
FROM transactions
WHERE cli_district_id <> 68

The no-uid query is:

SELECT sqrt(sum(sum_with)/sum(cnt_with)) AS true_with,
       count(DISTINCT uid_with) AS duid_with,
       max(sqrt(sum_with/cnt_with)),
       min(sqrt(sum_with/cnt_with)),
       avg(sqrt(sum_with/cnt_with)),
       stddev(sqrt(sum_with/cnt_with)),
       sqrt(sum(sum_wo)/sum(cnt_wo)) AS true_wo,
       count(DISTINCT uid_wo) AS duid_wo,
       max(sqrt(sum_wo/cnt_wo)),
       min(sqrt(sum_wo/cnt_wo)),
       avg(sqrt(sum_wo/cnt_wo)),
       stddev(sqrt(sum_wo/cnt_wo)),
       count(DISTINCT uid_neg) AS duid_neg
FROM
  (SELECT client_id AS uid_wo, 
          sum(diff_wo) AS sum_wo,
          count(*) AS cnt_wo,
          max(uid_w) AS uid_with,
          sum(diff_with) AS sum_with,
          CASE WHEN count(diff_with) = 0 THEN NULL
               ELSE count(diff_with)
          END AS cnt_with,
          max(uid_neg) AS uid_neg
    FROM 
        (SELECT client_id,
            CASE WHEN cli_district_id <> 68 THEN client_id
                 ELSE NULL
            END AS uid_w,
            CASE WHEN cli_district_id <> 68 THEN 
                       pow(abs(amount - (SELECT avg(amount)
                                         FROM transactions
                                         WHERE cli_district_id <> 68)), 2)
                 ELSE NULL
            END AS diff_with,
            pow(abs(amount - (SELECT avg(amount)
                              FROM transactions)), 2) AS diff_wo,
            CASE WHEN cli_district_id <> 68 THEN NULL
                 ELSE client_id
            END AS uid_neg
         FROM transactions
        ) t
   GROUP BY 1) t

The answer is:

true_with  duid_with  max         min        avg        stddev
9464.2763  5283       27475.2111  3611.3832  8036.1940  4500.4135

true_wo    duid_wo  max         min        avg        stddev     duid_neg
9470.0765  5369     27470.2953  3618.6551  8036.1940  4500.4135  86

Histogram of column sums with negand

Analyst query:

SELECT cli_district_id, sum(amount)
FROM transactions
WHERE trans_date <> '930118'
GROUP BY 1

Equivalent no-uid query:

SELECT cli_district_id,
       sum(sum_with) AS true_with,         -- the true answer with negand
       count(DISTINCT uid_with) AS duid_with,  -- duid with negand
       max(sum_with), min(sum_with), avg(sum_with), stddev(sum_with),
       sum(sum_wo) AS true_wo,          -- the true answer without negand
       count(DISTINCT uid_wo) AS duid_wo,      -- duid wo negand
       max(sum_wo), min(sum_wo), avg(sum_wo), stddev(sum_wo),
       count(DISTINCT uid_neg) AS duid_neg     -- duid of negand match rows
FROM
    (SELECT client_id AS uid_wo, cli_district_id,
            sum(amount) AS sum_wo,
            max(uid_w) AS uid_with,
            sum(col_w) AS sum_with,
            max(uid_neg) AS uid_neg
     FROM
       (SELECT client_id, amount, cli_district_id,
        CASE WHEN trans_date <> '930118' THEN client_id
             ELSE NULL
        END AS uid_w,       -- to count UIDs with negand
        CASE WHEN trans_date <> '930118' THEN amount
             ELSE NULL
        END AS col_w,       -- to get count(*) with negand
        CASE WHEN trans_date <> '930118' THEN NULL
             ELSE client_id
        END AS uid_neg          -- count removed UIDs
        FROM transactions) t
     GROUP BY 1,2) t
GROUP BY 1

Histogram of a column Standard Deviation with negand

Analyst query:

SELECT operation, stddev(amount)
FROM transactions
WHERE cli_district_id <> 68
GROUP BY 1

The no-uid query is:

CREATE TEMP TABLE foo_wo AS
  SELECT operation, avg(amount) AS avg_wo
  FROM transactions
  GROUP BY 1;

CREATE TEMP TABLE foo_with AS
  SELECT operation, avg(amount) AS avg_with
  FROM transactions
  WHERE cli_district_id <> 68
  GROUP BY 1;

SELECT operation,
       sqrt(sum(sum_with)/sum(cnt_with)) AS true_with,
       count(DISTINCT uid_with) AS duid_with,
       max(sqrt(sum_with/cnt_with)),
       min(sqrt(sum_with/cnt_with)),
       avg(sqrt(sum_with/cnt_with)),
       stddev(sqrt(sum_with/cnt_with)),
       sqrt(sum(sum_wo)/sum(cnt_wo)) AS true_wo,
       count(DISTINCT uid_wo) AS duid_wo,
       max(sqrt(sum_wo/cnt_wo)),
       min(sqrt(sum_wo/cnt_wo)),
       avg(sqrt(sum_wo/cnt_wo)),
       stddev(sqrt(sum_wo/cnt_wo)),
       count(DISTINCT uid_neg) AS duid_neg
FROM
  (SELECT client_id AS uid_wo, operation,
          sum(diff_wo) AS sum_wo,
          count(*) AS cnt_wo,
          max(uid_w) AS uid_with,
          sum(diff_with) AS sum_with,
          CASE WHEN count(diff_with) = 0 THEN NULL
               ELSE count(diff_with)
          END AS cnt_with,
          max(uid_neg) AS uid_neg
    FROM 
        (SELECT client_id, operation,
            CASE WHEN cli_district_id <> 68 THEN client_id
                 ELSE NULL
            END AS uid_w,
            CASE WHEN cli_district_id <> 68 THEN 
                       pow(abs(amount - avg_with), 2)
                 ELSE NULL
            END AS diff_with,
            pow(abs(amount - avg_wo), 2) AS diff_wo,
            CASE WHEN cli_district_id <> 68 THEN NULL
                 ELSE client_id
            END AS uid_neg
         FROM 
            ( SELECT client_id, foo_wo.operation AS operation, amount,
                     cli_district_id, avg_wo, avg_with
              FROM
              foo_wo JOIN transactions
              ON foo_wo.operation = transactions.operation
              JOIN foo_with
              ON foo_wo.operation = foo_with.operation ) t
        ) t
   GROUP BY 1,2) t
GROUP BY 1;

DROP TABLE foo_wo;
DROP TABLE foo_with;

Sum of column with two negands

Query:

SELECT sum(amount)
FROM transactions
WHERE cli_district_id <> 68 AND
      trans_date <> '930118'

No-uid Query:

SELECT count(DISTINCT uid_neg_tr) AS duid_neg_tr,
       count(DISTINCT uid_neg_cl) AS duid_neg_cl,
        -- w_no means with none (both negands dropped)
       sum(sum_w_no) AS true_w_no,
       count(DISTINCT uid_w_no) AS duid_w_no,
       max(sum_w_no), min(sum_w_no), avg(sum_w_no), stddev(sum_w_no),
        -- w_bo means with both (both negands present)
       sum(sum_w_bo) AS true_w_bo,
       count(DISTINCT uid_w_bo) AS duid_w_bo,
       max(sum_w_bo), min(sum_w_bo), avg(sum_w_bo), stddev(sum_w_bo),
        -- w_tr means with trans_date negand only
       sum(sum_w_tr) AS true_w_tr,
       count(DISTINCT uid_w_tr) AS duid_w_tr,
       max(sum_w_tr), min(sum_w_tr), avg(sum_w_tr), stddev(sum_w_tr),
        -- w_cl means with cli_district_id negand only
       sum(sum_w_cl) AS true_w_cl,
       count(DISTINCT uid_w_cl) AS duid_w_cl,
       max(sum_w_cl), min(sum_w_cl), avg(sum_w_cl), stddev(sum_w_cl)
FROM
    (SELECT client_id AS uid_w_no,
            sum(amount) AS sum_w_no,
            max(uid_w_tr) AS uid_w_tr,
            sum(col_w_tr) AS sum_w_tr,
            max(uid_w_cl) AS uid_w_cl,
            sum(col_w_cl) AS sum_w_cl,
            max(uid_w_bo) AS uid_w_bo,
            sum(col_w_bo) AS sum_w_bo,
            max(uid_neg_tr) AS uid_neg_tr,
            max(uid_neg_cl) AS uid_neg_cl
     FROM
       (SELECT client_id, amount,
        CASE WHEN trans_date <> '930118' THEN client_id
             ELSE NULL
        END AS uid_w_tr,
        CASE WHEN cli_district_id <> 68 THEN client_id
             ELSE NULL
        END AS uid_w_cl,
        CASE WHEN cli_district_id <> 68 AND 
                  trans_date <> '930118' THEN client_id
             ELSE NULL
        END AS uid_w_bo,

        CASE WHEN trans_date <> '930118' THEN amount
             ELSE NULL
        END AS col_w_tr,        -- to get count(*) with negand
        CASE WHEN cli_district_id <> 68 THEN amount
             ELSE NULL
        END AS col_w_cl,
        CASE WHEN cli_district_id <> 68 AND 
                  trans_date <> '930118' THEN amount
             ELSE NULL
        END AS col_w_bo,

        CASE WHEN cli_district_id <> 68 THEN NULL
             ELSE client_id
        END AS uid_neg_cl,
        CASE WHEN trans_date <> '930118' THEN NULL
             ELSE client_id
        END AS uid_neg_tr
        FROM transactions) t
     GROUP BY 1) t

Answer:

duid_neg_tr  duid_neg_cl
3            86

true_w_no         duid_w_no  max        min    avg               stddev
7415643454.90002  5369       7619102.4  29400  1381196.39688955  1335580.44522145

true_w_bo         duid_w_bo  max        min    avg               stddev
7291824062.70001  5283       7619102.4  29400  1380243.05559342  1335465.71014778

true_w_tr         duid_w_tr  max        min    avg               stddev
7415632162.90002  5369       7619102.4  29400  1381194.2937046   1335574.33107147

true_w_cl         duid_w_cl  max        min    avg               stddev
7291835354.70001  5283       7619102.4  29400  1380245.19301533  1335471.92587446

Here I've pre-computed all possible cases (either, neither, or both negands dropped). With still more negands, the number of possible low-effect combinations grows combinatorially. So beyond two negands, I think we should just compute the case where all negands are dropped (which happens "naturally" without additional CASE statements), the expected case where no negands are dropped, and then check each individual negand (but not combinations of negands). If it turns out that more than one negand is low effect, then we'll just have to query again.

Count total rows with posors

Query:

SELECT count(*)
FROM transactions
WHERE cli_district_id IN (58,59,60)

No-uid Query:

SELECT count(DISTINCT uid_in_58) AS duid_in_58,
       count(DISTINCT uid_in_59) AS duid_in_59,
       count(DISTINCT uid_in_60) AS duid_in_60,

       sum(cnt_w_all) AS true_w_all,
       count(DISTINCT uid_w_all) AS duid_w_all,
       max(cnt_w_all), min(cnt_w_all), avg(cnt_w_all), stddev(cnt_w_all),
       sum(cnt_not_58) AS true_not_58,
       count(DISTINCT uid_not_58) AS duid_not_58,
       max(cnt_not_58), min(cnt_not_58), avg(cnt_not_58), stddev(cnt_not_58),
       sum(cnt_not_59) AS true_not_59,
       count(DISTINCT uid_not_59) AS duid_not_59,
       max(cnt_not_59), min(cnt_not_59), avg(cnt_not_59), stddev(cnt_not_59),
       sum(cnt_not_60) AS true_not_60,
       count(DISTINCT uid_not_60) AS duid_not_60,
       max(cnt_not_60), min(cnt_not_60), avg(cnt_not_60), stddev(cnt_not_60)
FROM
    (SELECT client_id AS uid_w_all,
            count(*) AS cnt_w_all,
            max(uid_not_58) AS uid_not_58,
            NULLIF(count(uid_not_58),0) AS cnt_not_58,
            max(uid_not_59) AS uid_not_59,
            NULLIF(count(uid_not_59),0) AS cnt_not_59,
            max(uid_not_60) AS uid_not_60,
            NULLIF(count(uid_not_60),0) AS cnt_not_60,
            max(uid_in_58) AS uid_in_58,
            max(uid_in_59) AS uid_in_59,
            max(uid_in_60) AS uid_in_60
     FROM
       (SELECT client_id,
        CASE WHEN cli_district_id = 58 THEN NULL
             ELSE client_id
        END AS uid_not_58,
        CASE WHEN cli_district_id = 59 THEN NULL
             ELSE client_id
        END AS uid_not_59,
        CASE WHEN cli_district_id = 60 THEN NULL
             ELSE client_id
        END AS uid_not_60,

        CASE WHEN cli_district_id = 58 THEN client_id
             ELSE NULL
        END AS uid_in_58,
        CASE WHEN cli_district_id = 59 THEN client_id
             ELSE NULL
        END AS uid_in_59,
        CASE WHEN cli_district_id = 60 THEN client_id
             ELSE NULL
        END AS uid_in_60
        FROM transactions
        WHERE cli_district_id IN (58,59,60)) t
     GROUP BY 1) t

Answer:

duid_in_58  duid_in_59  duid_in_60
44          64          61

true_w_all  duid_w_all  max  min  avg                   stddev
39051       169         580  43   231.0710059171597633  121.238920650630

true_not_58  duid_not_58  max  min  avg                   stddev
27999        125          534  43   223.9920000000000000  124.563204322014

true_not_59  duid_not_59  max  min  avg                   stddev
24887        105          580  46   237.0190476190476190  115.4759366335339208

true_not_60  duid_not_60  max  min  avg                   stddev
25216        108          580  43   233.4814814814814815  122.9476358015336940

Here again we do not compute all low-effect combinations, but only those where a single one of the IN components is low-effect. If more than one is low-effect, then we need to re-query with those components removed.

Note the need for NULLIF.

Sum of column with posors

Query:

SELECT sum(amount)
FROM transactions
WHERE cli_district_id IN (58,59,60)

No-uid Query:

SELECT count(DISTINCT uid_in_58) AS duid_in_58,
       count(DISTINCT uid_in_59) AS duid_in_59,
       count(DISTINCT uid_in_60) AS duid_in_60,

       sum(sum_w_all) AS true_w_all,
       count(DISTINCT uid_w_all) AS duid_w_all,
       max(sum_w_all), min(sum_w_all), avg(sum_w_all), stddev(sum_w_all),
       sum(sum_not_58) AS true_not_58,
       count(DISTINCT uid_not_58) AS duid_not_58,
       max(sum_not_58), min(sum_not_58), avg(sum_not_58), stddev(sum_not_58),
       sum(sum_not_59) AS true_not_59,
       count(DISTINCT uid_not_59) AS duid_not_59,
       max(sum_not_59), min(sum_not_59), avg(sum_not_59), stddev(sum_not_59),
       sum(sum_not_60) AS true_not_60,
       count(DISTINCT uid_not_60) AS duid_not_60,
       max(sum_not_60), min(sum_not_60), avg(sum_not_60), stddev(sum_not_60)
FROM
    (SELECT client_id AS uid_w_all,
            sum(amount) AS sum_w_all,
            max(uid_not_58) AS uid_not_58,
            sum(col_not_58) AS sum_not_58,
            max(uid_not_59) AS uid_not_59,
            sum(col_not_59) AS sum_not_59,
            max(uid_not_60) AS uid_not_60,
            sum(col_not_60) AS sum_not_60,
            max(uid_in_58) AS uid_in_58,
            max(uid_in_59) AS uid_in_59,
            max(uid_in_60) AS uid_in_60
     FROM
       (SELECT client_id, amount,
        CASE WHEN cli_district_id = 58 THEN NULL
             ELSE client_id
        END AS uid_not_58,
        CASE WHEN cli_district_id = 59 THEN NULL
             ELSE client_id
        END AS uid_not_59,
        CASE WHEN cli_district_id = 60 THEN NULL
             ELSE client_id
        END AS uid_not_60,

        CASE WHEN cli_district_id = 58 THEN NULL
             ELSE amount
        END AS col_not_58,
        CASE WHEN cli_district_id = 59 THEN NULL
             ELSE amount
        END AS col_not_59,
        CASE WHEN cli_district_id = 60 THEN NULL
             ELSE amount
        END AS col_not_60,

        CASE WHEN cli_district_id = 58 THEN client_id
             ELSE NULL
        END AS uid_in_58,
        CASE WHEN cli_district_id = 59 THEN client_id
             ELSE NULL
        END AS uid_in_59,
        CASE WHEN cli_district_id = 60 THEN client_id
             ELSE NULL
        END AS uid_in_60
        FROM transactions
        WHERE cli_district_id IN (58,59,60)) t
     GROUP BY 1) t

Answer:

duid_in_58  duid_in_59  duid_in_60
44          64          61

true_w_all   duid_w_all  max        min      avg               stddev
220728483.1  169         7333024.6  65874.1  1306085.69881657  1220263.70128945

true_not_58  duid_not_58  max        min      avg           stddev
156400926.6  125          7333024.6  65874.1  1251207.4128  1256838.11208

true_not_59  duid_not_59  max        min      avg              stddev
141104905.7  105          7333024.6  70196.3  1343856.2447619  1259614.14152724

true_not_60  duid_not_60  max        min      avg               stddev
143951133.9  108          5552105.3  65874.1  1332880.86944444  1139068.00664129
obrok commented 6 years ago

As a cautionary note - there is no guarantee that this will in fact result in a speedup. We should carefully implement this on the side and compare if it's faster.

yoid2000 commented 6 years ago

Here are a few quick-and-dirty benchmarks though. These numbers are for the original analyst query run on a cloak (attack.aircloak.com) versus the equivalent no-uid query run on my laptop's postgres. The latter of course does not include the processing that would need to be done by the cloak, but this should be negligible because it is crunching just a few numbers. Note also that the cloak queries don't include checking for low-effect while the laptop queries do. Cloak query times were measured by a stop-watch; laptop query times are as reported by postgres.

Query                                                 table         cloak   my laptop
Sum of column with posors                             transactions  5 sec   430ms
Sum of column with two negands                        transactions  80 sec  480ms
Histogram of a column Standard Deviation with negand  transactions  70 sec  4 sec
Count total number of rows                            accounts      500ms   70ms
obrok commented 6 years ago

The latter of course does not include the processing that would need to be done by the cloak, but this should be negligible cause it is crunching just a few numbers

I'll believe it when I see it running side-by-side :P

yoid2000 commented 6 years ago

I'll believe it when I see it running side-by-side :P

Should I take that as you just having volunteered?

PF

obrok commented 6 years ago

Should I take that as you just having volunteered?

Negative

cristianberneanu commented 6 years ago
  1. It seems to me that the security of this approach is lower. For example, buckets with the same number of distinct users will have the same noise. Do we want to allow this decrease in security? Will this technique be a full replacement for the current design or will it run in parallel with the current anonymization method (and be limited only to some queries)?

  2. Text values are hashed, then the average of the hashed values is taken. What meaning does this information have? It seems completely random to me. A good hash function would distribute the input uniformly over the output space, so the standard deviation in this case should be constant, right?

  3. Why does computing the stddev require two passes? Most databases have built-in functions for it.

  4. What happens if we need to compute multiple aggregates? If my understanding is correct, we need to split the processing into multiple steps, one for each type of aggregate, right?

  5. The true median can not be offloaded to the database.

yoid2000 commented 6 years ago

It seems to me that the security of this approach is lower. For example, buckets with the same number of distinct users will have the same noise. Do we want to allow this decrease in security?

I don't think that there is much of a decrease in security. The noise seed uses several pieces of information, for instance the table and column name, and the values filtered by the condition, and of course the salt, all hashed. Thus two buckets with the same number of distinct users will still have different noise because of all the other seed components.

I don't see an obvious way that an attacker can exploit the new approach.

Will this technique be a full replacement for the current design or will it run in parallel with the current anonymization method (and be limited only to some queries)?

I don't know. It seems it would be a substantial advantage if we didn't need to worry any more about huge dumps from DB to cloak. My main concern, actually, is that there are future functions that simply cannot be executed on the DB, and therefore require a full dump of all rows.

Text values are hashed, then average of hashed values is taken. What meaning does this information have? It seems completely random to me.

Fundamentally what I'm looking for is a function that 1) returns the same value with the same inputs, and 2) very likely returns a different value with different inputs. (Where the inputs are the distinct values allowed by positive conditions, or disallowed by negative conditions.) I don't care what value is returned, as long as it has these two properties. (Also of course it should be a function that the database performs.) Average seems like a good candidate for numbers, but is not available for text.

The purpose of the hash is only to convert a string into a number, where different strings are quite likely to produce different numbers, so that we can take an average.

A good hash function would distribute the input uniformly over the output space, so the standard deviation in this case should be constant, right?

You mean average, right? (I suggest using average, not stddev.) Still, the point holds: the expected average is constant. But in practice the actual average will vary, especially if based on many significant digits.

Why does computing the stddev requires two passes? most databases have built-in functions for it.

The reason I'm using two passes now is because I want to gather information about how much individual users contribute to the computed stddev. My current thinking is that to do this I want to compute the sum of the squared differences for each user. But to do that, I need first to compute the global average, so that is the first pass.

But keeping in mind that the purpose of all this is to figure out how much noise to add, maybe there is a better way. This business of using min/max/avg/stddev of individual user contributions is just an idea that needs to be validated and maybe simplified. For that I need to do a bunch of experimentation.

What happens if we need to compute multiple aggregates? If my understanding is correct, we need to split the processing into multiple steps, one for each type of aggregate, right?

I suspect that for count, sum, average, and stddev, we can combine it all into one query. But for min, max, and median I don't think so (well, barring using UNION to do it in some brute force sort of way). But maybe there is a way that I haven't thought of.

The true median can not be offloaded to the database.

Why not?

cristianberneanu commented 6 years ago

I don't think that there is much of a decrease in security. The noise seed uses several pieces of information, for instance the table and column name, and the values filtered by the condition, and of course the salt, all hashed. Thus two buckets with the same number of distinct users will still have different noise because of all the other seed components. I don't see an obvious way that an attacker can exploit the new approach.

I am thinking that it is now easier to guess the input for the noise layers. The table, column name and the condition are static and could be easily guessed. The salt is also static. The number of distinct users in a set could also be guessed more easily than the hash of all users in the set. Since the algorithm is deterministic, it seems to me that a reduction in the input complexity leads to a reduction in the possible outputs for the seed.

I don't know. It seems it would be a substantial advantage if we didn't need to worry any more about huge dumps from DB to cloak.

It would be a pretty big improvement if the average number of rows per user is large. I suspect that this holds more true for large data sets.

You mean average, right? (I suggest using average, not stddev.) Still, the point holds: the expected average is constant. But in practice the actual average will vary, especially if based on many significant digits.

I think it holds for all the computed statistics. From what I see, you are using min, max, avg and stddev to compute the noise. If the output of the hash is in the range [0, max_value], the average will tend towards max_value/2, the stddev will tend towards max_value/4, min will tend towards 0 and max will tend towards max_value.

Fundamentally what I'm looking for is a function that 1) returns the same value with the same inputs, and 2) very likely returns a different value with different inputs. (Where the inputs are the distinct values allowed by positive conditions, or disallowed by negative conditions.) I don't care what value is returned, as long as it has these two properties. (Also of course it should be a function that the database performs.) Average seems like a good candidate for numbers, but is not available for text.

It seems to me that the avg and stddev functions are not oblivious to duplicates present in the input, but instead will be affected by them. An alternative would be to xor the hashed values together (as we do now, but do it in the database). I am not sure if xor is an available aggregate in many cases (it seems bit and / bit or are, but not bit xor).

If we could control the data store, we could define our own aggregates, but that doesn't seem feasible.

If the values are made distinct by some other means, adding the hashes together would also work.

The reason I'm using two passes now is because I want to gather information about how much individual users contribute to the computed stddev. My current thinking is that to do this I want to compute the sum of the squared differences for each user. But to do that, I need first to compute the global average, so that is the first pass.

I still don't understand. Why isn't the builtin stddev function able to do that as well?

Also, as opposed to the median, the standard deviation can be computed in a single pass. If one knows the sum, sum of squares and count, then:

mean = sum / count
variance = sum_squares / count - mean ^ 2
stddev = sqrt(variance)

or stddev = sqrt(sum_squares / count - sum ^ 2 / count ^ 2)
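
To make the single-pass version concrete, here is a minimal sketch (Postgres-flavored, against the transactions.amount column used in the examples further down; the float8 casts are only there to avoid integer division):

SELECT count(amount)                        AS c,
       sum(amount)                          AS s,
       sum(amount * amount)                 AS ss,
       -- population stddev derived from the three accumulators of a single scan
       sqrt(sum(amount * amount)::float8 / count(amount)
            - (sum(amount)::float8 / count(amount))^2) AS sd
FROM transactions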

The true median can not be offloaded to the database.

Why not?

Most databases don't provide a builtin function for median, as it requires that all values are present in memory during processing. The alternatives such as percentile_cont/percentile_disc are not deterministic (input-order matters).
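
(For reference, I assume the built-in alternative meant here is Postgres's ordered-set aggregate, which would look something like this:)

SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY amount) AS median
FROM transactions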

cristianberneanu commented 6 years ago

An alternative would be to xor the hashed values together (as we do now, but do it in the database)

Actually, this is not true and has the same issue as the others. We need to make the values distinct before aggregating them, no matter which aggregation method is used.
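
As an illustration only (not a settled choice of aggregate), the distinct-then-sum idea could look roughly like this in Postgres, using a hash merely to turn text into a number; some_table and text_col are placeholders, not real schema:

SELECT sum(DISTINCT ('x' || right(md5(text_col),4))::bit(16)::int) AS seed_component
FROM some_table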

yoid2000 commented 6 years ago

I don't think that there is much of a decrease in security. The noise seed uses several pieces of information, for instance the table and column name, and the values filtered by the condition, and of course the salt, all hashed. Thus two buckets with the same number of distinct users will still have different noise because of all the other seed components. I don't see an obvious way that an attacker can exploit the new approach.

I am thinking that it is now easier to guess the input for the noise layers. The table, column name and the condition are static and could be easily guessed. The salt is also static. The number of distinct users in a set could also be easier guessed than the hash of all users in the set. Since the algorithm is deterministic, it seems to me that a reduction in the input complexity leads to a reduction in the possible outputs for the seed.

You are certainly right in that there is less input complexity, but given the salt, I'm not sure this much matters. Even if the analyst knows the names and the number of distinct users, the salt prevents the analyst from knowing how those things contribute to the noise. And certainly the whole thing depends on the analyst never learning the salt. If that happens, we are screwed regardless...

yoid2000 commented 6 years ago

If the values are made distinct by some other means, adding the hashes together would also work.

That's an interesting idea.

I'll try to think of a way to test whether average is giving us the properties we need here.

yoid2000 commented 6 years ago

Also, as opposed to the median, the standard deviation can be computed in a single pass. If one knows the sum, sum of squares and count, then:

mean = sum / count
variance = sum_squares / count - mean ^ 2
stddev = sqrt(variance)

Yes, but one doesn't know the sum and count. That's the point: the first pass is to compute the mean, and the second pass is to compute everything else.

Am I missing something???

yoid2000 commented 6 years ago

Most databases don't provide a builtin function for median as it requires that all values are present in memory during processing. The alternatives such as percentile cont/disc are not deterministic (input-order matters).

If most databases don't have such a function, why do we? ;)

obrok commented 6 years ago

I was browsing wiki on Variance and stumbled upon this:

A mnemonic for the above expression is "mean of square minus square of mean". This equation should not be used for computations using floating point arithmetic because it suffers from catastrophic cancellation if the two components of the equation are similar in magnitude. There exist numerically stable alternatives.

May be relevant, given what you're discussing.

cristianberneanu commented 6 years ago

Yes, but one doesn't know the sum and count. That's the point: the first pass is to compute the mean, and the second pass is to compute everything else.

The sum, count and sum of squares can all be computed in a single pass (holding 3 different accumulators). There is no need for two passes (which means going through the data twice).

If most databases don't have such a function, why do we? ;)

Because it seems useful? :)

yoid2000 commented 6 years ago

The sum, count and sum of squares can all be computed in a single pass (holding 3 different accumulators). There is no need for two passes (which means going through the data twice).

Well, cool then. Can you show me an example SQL query to replace one of the ones in the issue?

yoid2000 commented 6 years ago

Well, I found this: https://www.strchr.com/standard_deviation_in_one_pass

But it says this:

"Unfortunately, the result will be inaccurate when the array contains large numbers (see the comments below)."

So .....

cristianberneanu commented 6 years ago

This equation should not be used for computations using floating point arithmetic because it suffers from catastrophic cancellation if the two components of the equation are similar in magnitude. There exist numerically stable alternatives.

This would be a reasonable concern normally. But since we are adding noise to the result anyway, maybe we shouldn't worry too much about it?

yoid2000 commented 6 years ago

This would be a reasonable concern normally. But since we are adding noise to the result anyway, maybe we shouldn't worry too much about it?

I wondered about that too. It could be that the inaccuracies more-or-less reflect the noise we would anyway add!

cristianberneanu commented 6 years ago

Well, cool then. Can you show me an example SQL query to replace one of the ones in the issue?

Query:

SELECT sqrt(sum(sm)/sum(cnt)) AS sd,
       count(DISTINCT client_id) AS duids,
       max(sqrt(sm/cnt)), min(sqrt(sm/cnt)),
       avg(sqrt(sm/cnt)), stddev(sqrt(sm/cnt))
FROM
   (SELECT client_id, sum(diff) AS sm, count(*) AS cnt
    FROM
      (SELECT client_id, pow(abs(amount - (SELECT avg(amount)
                                           FROM transactions)), 2) AS diff
       FROM transactions
      ) t1
    GROUP BY client_id
   ) t3

would become:

SELECT sqrt(sum(ss)*sum(c) - sum(s)^2)/sum(c) AS sd,
       count(DISTINCT client_id) AS duids,
       max(sd), min(sd), avg(sd), stddev(sd)
FROM
  (SELECT client_id, s, c, ss, sqrt(ss*c - s*s) / c as sd FROM
     (SELECT client_id, sum(amount) as s, sum(amount*amount) as ss, count(amount) AS c
       FROM transactions GROUP BY client_id) t
  ) t

cristianberneanu commented 6 years ago

corrected the above post

yoid2000 commented 6 years ago

corrected the above post

Fails to run. Even without the FROM FROM I get this error:

ERROR: column "sd" does not exist
LINE 1: SELECT sd,
               ^
HINT: Perhaps you meant to reference the column "t.s" or the column "t.ss".
SQL state: 42703
Character: 8

cristianberneanu commented 6 years ago

Can you please try again?

If I understand things correctly, you want the global SD and then min, max, avg, stddev values of the per-user SD, right?

cristianberneanu commented 6 years ago

I measured the two queries on the banking data set and these are the results I got:

Two-pass query (columns: sd | duids | max | min | avg | stddev):

1076.9590672994160468 |  2500 | 5084.5108525469063452 | 5.2687018236584354 | 1237.7821228878516319 | 1183.56562761568154477421638421962745

Time: 1627.668 ms

Single-pass query (columns: sd | duids | max | min | avg | stddev):

 1076.9590672994160470 |  2500 | 5074.24270211822 | 1.73205080756888 | 1237.06637298607 | 1182.79869763656

Time: 514.699 ms

The single-pass query is about 3 times as fast. There are discrepancies in the results, especially in the min/max values.

yoid2000 commented 6 years ago

If I understand things correctly, you want the global SD and then min, max, avg, stddev values of the per-user SD, right?

I was not trying to get the per-user SD, but rather the amount each user contributes to the global SD. The difference is this. In computing per-user SD, I would use the average for that user. To compute what each user contributes, I use the global average in computing each user's value.

Is this difference important? I'm not sure at this point. But that could anyway explain why min and max are off.
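
To make the distinction concrete, here is a rough sketch against the same transactions table: the first query computes each user's own SD, the second computes each user's squared deviations from the global mean (the "contribution" variant used in the two-pass query above).

-- per-user SD: deviations measured from each user's OWN mean
SELECT client_id, stddev_pop(amount) AS per_user_sd
FROM transactions
GROUP BY client_id;

-- contribution to the global SD: squared deviations from the GLOBAL mean
SELECT client_id,
       sum((amount - (SELECT avg(amount) FROM transactions))^2) AS contribution
FROM transactions
GROUP BY client_id;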

cristianberneanu commented 6 years ago

Yes, that might explain the difference. I didn't notice this detail.

PS: This expression: pow(abs(amount - (SELECT avg(amount) FROM transactions)), 2) can be simplified to: (amount - (SELECT avg(amount) FROM transactions)) ^ 2.

yoid2000 commented 6 years ago

By the way, one other thought regarding weakening the security. If you think about the seeding of noise in the context of message encryption, we can imagine that the seed components of table.column name, floated column values, and number of unique UIDs are the contents of a message, and that the salt is the encryption key. Assuming that the encryption key is large and random, and that the hash algorithm is cryptographically strong, then knowledge of the message contents won't help an attacker learn the key, and knowledge that the message contents can only be one of a small number of values doesn't help the attacker know which of that small number the values are.

If this way of viewing the situation is correct, then our security isn't much weakened by the change....

sebastian commented 5 years ago

We still need to implement stddev, median, and count(distinct column). Those are however moved out of the current milestone.

cristianberneanu commented 5 years ago

stddev is probably easy to do; count(distinct) & friends are somewhat harder as they require multiple trips to the database; median has no proposed solution so far.

sebastian commented 5 years ago

@yoid2000 reported offline that he has a means of doing count(distinct) without separate queries now. Still leaves median.

Let's postpone all of this to release 19.2 in any case.