Open yoid2000 opened 6 years ago
As a cautionary note - there is no guarantee that this will in fact result in a speedup. We should carefully implement this on the side and compare if it's faster.
Here are a few quick-and-dirty benchmarks though. These numbers are for the original analyst query run on a cloak (attack.aircloak.com) versus the equivalent no-uid query run on my laptop's postgres. The latter of course does not include the processing that would need to be done by the cloak, but this should be negligible because it is crunching just a few numbers. Note also that the cloak queries don't include checking for low-effect while the laptop queries do. Cloak query times were measured with a stopwatch; laptop query times are as reported by postgres.
Query | table | cloak | my laptop |
---|---|---|---|
Sum of column with posors | transactions | 5 sec | 430ms |
Sum of column with two negands | transactions | 80 sec | 480ms |
Histogram of a column Standard Deviation with negand | transactions | 70 sec | 4 sec |
Count total number of rows | accounts | 500ms | 70ms |
The latter of course does not include the processing that would need to be done by the cloak, but this should be negligible because it is crunching just a few numbers
I'll believe it when I see it running side-by-side :P
I'll believe it when I see it running side-by-side :P
Should I take that as you just having volunteered?
PF
Should I take that as you just having volunteered?
Negative
It seems to me that the security of this approach is lower. For example, buckets with the same number of distinct users will have the same noise. Do we want to allow this decrease in security? Will this technique be a full replacement for the current design or will it run in parallel with the current anonymization method (and be limited only to some queries)?
Text values are hashed, then average of hashed values is taken. What meaning does this information have? It seems completely random to me. A good hash function would distribute the input uniformly over the output space, so the standard deviation in this case should be constant, right?
Why does computing the stddev require two passes? Most databases have built-in functions for it.
What happens if we need to compute multiple aggregates? If my understanding is correct, we need to split the processing into multiple steps, one for each type of aggregate, right?
The true median can not be offloaded to the database.
It seems to me that the security of this approach is lower. For example, buckets with the same number of distinct users will have the same noise. Do we want to allow this decrease in security?
I don't think that there is much of a decrease in security. The noise seed uses several pieces of information, for instance the table and column name, and the values filtered by the condition, and of course the salt, all hashed. Thus two buckets with the same number of distinct users will still have different noise because of all the other seed components.
I don't see an obvious way that an attacker can exploit the new approach.
Will this technique be a full replacement for the current design or will it run in parallel with the current anonymization method (and be limited only to some queries)?
I don't know. It seems it would be a substantial advantage if we didn't need to worry any more about huge dumps from DB to cloak. My main concern, actually, is that there are future functions that simply cannot be executed on the DB, and therefore require a full dump of all rows.
Text values are hashed, then average of hashed values is taken. What meaning does this information have? It seems completely random to me.
Fundamentally what I'm looking for is a function that 1) returns the same value with the same inputs, and 2) very likely returns a different value with different inputs. (Where the inputs are the distinct values allowed by positive conditions, or disallowed by negative conditions.) I don't care what value is returned, as long as it has these two properties. (Also of course it should be a function that the database performs.) Average seems like a good candidate for numbers, but is not available for text.
The purpose of the hash is only to convert a string into a number, where different strings are quite likely to produce different numbers, so that we can take an average.
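For illustration (this sketch is mine, not from the discussion), one postgres-specific way to turn text into such a number is to hash it and reinterpret part of the hash as an integer; the table and column names here are placeholders:

```sql
-- Hypothetical sketch: hash each text value to an integer so that avg() can be
-- used as the combining function for text columns. Names are placeholders.
SELECT avg(('x' || substr(md5(lastname), 1, 8))::bit(32)::int) AS seed_component
FROM accounts;
```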
A good hash function would distribute the input uniformly over the output space, so the standard deviation in this case should be constant, right?
You mean average, right? (I suggest using average, not stddev.) Still, the point holds: the expected average is constant. But in practice the actual average will vary, especially if based on many significant digits.
Why does computing the stddev require two passes? Most databases have built-in functions for it.
The reason I'm using two passes now is because I want to gather information about how much individual users contribute to the computed stddev. My current thinking is that to do this I want to compute the sum of the squared differences for each user. But to do that, I need first to compute the global average, so that is the first pass.
But keeping in mind that the purpose of all this is to figure out how much noise to add, maybe there is a better way. This business of using min/max/avg/stddev of individual user contributions is just an idea that needs to be validated and maybe simplified. For that I need to do a bunch of experimentation.
What happens if we need to compute multiple aggregates? If my understanding is correct, we need to split the processing into multiple steps, one for each type of aggregate, right?
I suspect that for count, sum, average, and stddev, we can combine it all into one query. But for min, max, and median I don't think so (well, barring using UNION to do it in some brute force sort of way). But maybe there is a way that I haven't thought of.
The true median can not be offloaded to the database.
Why not?
I don't think that there is much of a decrease in security. The noise seed uses several pieces of information, for instance the table and column name, and the values filtered by the condition, and of course the salt, all hashed. Thus two buckets with the same number of distinct users will still have different noise because of all the other seed components. I don't see an obvious way that an attacker can exploit the new approach.
I am thinking that it is now easier to guess the input for the noise layers. The table, column name and the condition are static and could be easily guessed. The salt is also static. The number of distinct users in a set could also be easier guessed than the hash of all users in the set. Since the algorithm is deterministic, it seems to me that a reduction in the input complexity leads to a reduction in the possible outputs for the seed.
I don't know. It seems it would be a substantial advantage if we didn't need to worry any more about huge dumps from DB to cloak.
It would be a pretty big improvement if the average number of rows per user is large. I suspect that this holds more true for large data sets.
You mean average, right? (I suggest using average, not stddev.) Still, the point holds: the expected average is constant. But in practice the actual average will vary, especially if based on many significant digits.
I think it holds for all the computed statistics. From what I see, you are using min, max, avg and stddev to compute the noise. If the output of the hash is in the range [0, max_value], the average will tend towards max_value/2, the stddev will tend towards max_value/4, min will tend towards 0 and max will tend towards max_value.
Fundamentally what I'm looking for is a function that 1) returns the same value with the same inputs, and 2) very likely returns a different value with different inputs. (Where the inputs are the distinct values allowed by positive conditions, or disallowed by negative conditions.) I don't care what value is returned, as long as it has these two properties. (Also of course it should be a function that the database performs.) Average seems like a good candidate for numbers, but is not available for text.
It seems to me that the avg and stddev functions are not oblivious to duplicates present in the input, but instead will be affected by them. An alternative would be to xor the hashed values together (as we do now, but do it in the database). I am not sure if xor is an available aggregate in many cases (it seems bit and / bit or are, but not bit xor).
If we could control the data store, we could define our own aggregates, but that doesn't seem feasible.
If the values are made distinct by some other means, adding the hashes together would also work.
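As a hypothetical illustration of the custom-aggregate idea above (only workable if we controlled the data store), a postgres xor aggregate could be defined from the built-in int8xor function:

```sql
-- Hypothetical sketch: a user-defined xor aggregate over bigint, assuming we
-- could install it on the data store. int8xor is the function behind the
-- built-in # (xor) operator.
CREATE AGGREGATE xor_agg(bigint) (
  SFUNC    = int8xor,
  STYPE    = bigint,
  INITCOND = '0'
);
```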
The reason I'm using two passes now is because I want to gather information about how much individual users contribute to the computed stddev. My current thinking is that to do this I want to compute the sum of the squared differences for each user. But to do that, I need first to compute the global average, so that is the first pass.
I still don't understand. Why isn't the builtin stddev function able to do that as well?
Also, as opposed to the median, the standard deviation can be computed in a single pass. If one knows the sum, sum of squares and count, then:
mean = sum / count
variance = sum_squares / count - mean ^ 2
stddev = sqrt(variance)
or
stddev = sqrt(sum_squares / count - sum ^ 2 / count ^ 2)
The true median can not be offloaded to the database.
Why not?
Most databases don't provide a builtin function for median as it requires that all values are present in memory during processing. The alternatives such as percentile cont/disc are not deterministic (input-order matters).
An alternative would be to xor the hashed values together (as we do now, but do it in the database)
Actually, this is not true and has the same issue as the others. We need to make the values distinct before aggregating them, no matter which aggregation method is used.
I don't think that there is much of a decrease in security. The noise seed uses several pieces of information, for instance the table and column name, and the values filtered by the condition, and of course the salt, all hashed. Thus two buckets with the same number of distinct users will still have different noise because of all the other seed components. I don't see an obvious way that an attacker can exploit the new approach.
I am thinking that it is now easier to guess the input for the noise layers. The table, column name and the condition are static and could be easily guessed. The salt is also static. The number of distinct users in a set could also be easier guessed than the hash of all users in the set. Since the algorithm is deterministic, it seems to me that a reduction in the input complexity leads to a reduction in the possible outputs for the seed.
You are certainly right in that there is less input complexity, but given the salt, I'm not sure this much matters. Even if the analyst knows the names and the number of distinct users, the salt prevents the analyst from knowing how those things contribute to the noise. And certainly the whole thing depends on the analyst never learning the salt. If that happens, we are screwed regardless...
If the values are made distinct by some other means, adding the hashes together would also work.
That's an interesting idea.
I'll try to think of a way to test whether average is giving us the properties we need here.
Also, as opposed to the median, the standard deviation can be computed in a single pass. If one knows the sum, sum of squares and count, then:
mean = sum / count
variance = sum_squares / count - mean ^ 2
stddev = sqrt(variance)
Yes, but one doesn't know the sum and count. That's the point: the first pass is to compute the mean, and the second pass is to compute everything else.
Am I missing something???
Most databases don't provide a builtin function for median as it requires that all values are present in memory during processing. The alternatives such as percentile cont/disc are not deterministic (input-order matters).
If most databases don't have such a function, why do we? ;)
I was browsing wiki on Variance and stumbled upon this:
A mnemonic for the above expression is "mean of square minus square of mean". This equation should not be used for computations using floating point arithmetic because it suffers from catastrophic cancellation if the two components of the equation are similar in magnitude. There exist numerically stable alternatives.
May be relevant, given what you're discussing.
Yes, but one doesn't know the sum and count. That's the point: the first pass is to compute the mean, and the second pass is to compute everything else.
The sum, count and sum of squares can all be computed in a single pass (holding 3 different accumulators). There is no need for two passes (which means going through the data twice).
If most databases don't have such a function, why do we? ;)
Because it seems useful? :)
The sum, count and sum of squares can all be computed in a single pass (holding 3 different accumulators). There is no need for two passes (which means going through the data twice).
Well, cool then. Can you show me an example SQL query to replace one of the ones in the issue?
Well, I found this: https://www.strchr.com/standard_deviation_in_one_pass
But it says this:
"Unfortunately, the result will be inaccurate when the array contains large numbers (see the comments below)."
So .....
This equation should not be used for computations using floating point arithmetic because it suffers from catastrophic cancellation if the two components of the equation are similar in magnitude. There exist numerically stable alternatives.
This would be a reasonable concern normally. But since we are adding noise to the result anyway, maybe we shouldn't worry too much about it?
This would be a reasonable concern normally. But since we are adding noise to the result anyway, maybe we shouldn't worry too much about it?
I wondered about that too. It could be that the inaccuracies more-or-less reflect the noise we would anyway add!
Well, cool then. Can you show me an example SQL query to replace one of the ones in the issue?
Query:
SELECT sqrt(sum(sm)/sum(cnt)) AS sd,
count(DISTINCT client_id) AS duids,
max(sqrt(sm/cnt)), min(sqrt(sm/cnt)),
avg(sqrt(sm/cnt)), stddev(sqrt(sm/cnt))
FROM
(SELECT client_id, sum(diff) AS sm, count(*) AS cnt
FROM
(SELECT client_id, pow(abs(amount - (SELECT avg(amount)
FROM transactions)), 2) AS diff
FROM transactions
) t1
GROUP BY client_id
) t3
would become:
SELECT sqrt(sum(ss)*sum(c) - sum(s)^2)/sum(c) AS sd,
count(DISTINCT client_id) AS duids,
max(sd), min(sd), avg(sd), stddev(sd)
FROM
(SELECT client_id, s, c, ss, sqrt(ss*c - s*s) / c as sd FROM
(SELECT client_id, sum(amount) as s, sum(amount*amount) as ss, count(amount) AS c
FROM transactions GROUP BY client_id) t
) t
corrected the above post
Fails to run. Even without the FROM FROM
I get this error:
ERROR: column "sd" does not exist LINE 1: SELECT sd, ^ HINT: Perhaps you meant to reference the column "t.s" or the column "t.ss". SQL state: 42703 Character: 8
Can you please try again?
If I understand things correctly, you want the global SD and then min, max, avg, stddev values of the per-user SD, right?
I measured the two queries on the banking data set and these are the results I got (columns as in the SELECT lists above):

Query | sd | duids | max | min | avg | stddev | time |
---|---|---|---|---|---|---|---|
Two-pass | 1076.9590672994160468 | 2500 | 5084.5108525469063452 | 5.2687018236584354 | 1237.7821228878516319 | 1183.56562761568154477421638421962745 | 1627.668 ms |
Single-pass | 1076.9590672994160470 | 2500 | 5074.24270211822 | 1.73205080756888 | 1237.06637298607 | 1182.79869763656 | 514.699 ms |
The single pass query is almost 3 times as fast. There are discrepancies in the resulting min/max values especially.
If I understand things correctly, you want the global SD and then min, max, avg, stddev values of the per-user SD, right?
I was not trying to get the per-user SD, but rather the amount each user contributes to the global SD. The difference is this. In computing per-user SD, I would use the average for that user. To compute what each user contributes, I use the global average in computing each user's value.
Is this difference important? I'm not sure at this point. But that could anyway explain why min and max are off.
Yes, that might explain the difference. I didn't notice this detail.
PS: This expression: pow(abs(amount - (SELECT avg(amount) FROM transactions)), 2) can be simplified to: (amount - (SELECT avg(amount) FROM transactions)) ^ 2.
By the way, one other thought regarding weakening the security. If you think about the seeding of noise in the context of message encryption, we can imagine that the seed components of table.column name, floated column values, and number of unique UIDs are the contents of a message, and that the salt is the encryption key. Assuming that the encryption key is large and random, and the hash algorithm is cryptographically strong, then knowledge of the message contents won't help an attacker learn the key, and knowledge that the message contents can only be one of a small number of values doesn't help the attacker know which of that small number the values are.
If this way of viewing the situation is correct, then our security isn't much weakened by the change....
We still need to implement stddev, median, and count(distinct column). Those are however moved out of the current milestone.
stddev is probably easy to do; count(distinct) & friends is somewhat harder, as it requires multiple trips to the database. median has no proposed solution until now.
@yoid2000 reported offline that he has a means of doing count(distinct) without separate queries now. Still leaves median.
Let's postpone all of this to release 19.2 in any case.
This issue discusses the possibility of removing the need to float UIDs (the no-uid approach). Being able to do so could allow the cloak to request aggregates from the backend DB rather than request every row.
The enabling concept is that we change the way we create so-called UID layers (which after this I'll call dynamic layers). Instead of seeding (in part) based on the values of the UIDs in the bucket, we seed on the number of DISTINCT UIDs that comprise the bucket. As a result, we no longer need to see the individual UIDs. Rather we can get per-bucket aggregates.
A second use of UIDs is in determining how much noise to add to an answer, for instance because of flattening etc. The intent here is to use other hints about the effect of extreme users to approximate the way we flatten and add noise today.
The cost of all this is increased complexity of the SQL we produce, and increased computation in the DB. The latter, however, is a good tradeoff as long as the cost of doing it in the DB is less than the cost of transmitting the data to the cloak and doing the work there.
The mechanisms in here aren't fully worked out, but so far I don't see any major issues.
As you go through this, please try to think of better ways to accomplish the basic goals. This is my first whack at it, but the first whack is almost never the best, so let's try to find a good way before committing too much code.
Basic Mechanisms
Obtaining counts of distinct UIDs
As I said, instead of obtaining the UIDs themselves, we obtain counts of distinct UIDs per bucket. This means that in the outer SELECT there is always something like count(DISTINCT client_id) AS duids.
Obtaining statistics about user contributions (composite functions)
Because we can't use individual rows to decide how much noise to add or how much flattening to do, we need to gather certain statistics about how much users contribute to the answer. In the examples I gather max, min, avg, and stddev. What this means in practice is that the first (topmost) inner select is grouped by UID.
Floating columns for the purpose of seeding
In the current approach, when we float columns we gather all column values, and use all of the distinct values combined as a seed component. With the no-uid approach, we can no longer do that. Nevertheless, we still want to produce a seed component based on the combined distinct values. In the examples I use an average of the distinct values to do this, on the basis of the average being influenced by all such values.
This means that, where floating is necessary, we'll see in the outer SELECT something like the sketch below. In the case of text, my examples first convert the text to a number (in a postgres-specific way) so that the same averaging can be used, though this is just one of I'm sure many ways to do this.
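The original example isn't reproduced here, but a hypothetical sketch of such a floated seed component (placeholder names) could look like this:

```sql
-- Hypothetical sketch: a floated numeric condition column combined into one
-- composite seed value per bucket (text columns would be hashed to a number
-- first, as sketched earlier in the thread). Names are placeholders.
SELECT count(DISTINCT client_id) AS duids,
       avg(DISTINCT cli_district_id) AS floated_seed
FROM accounts;
```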
Multi-query jobs
It appears unavoidable to have queries that require multiple queries at the database (i.e. for SQL, multiple semi-colons if not multiple trips to the DB).
One such case is standard deviation, which requires two passes through the data: one to compute the global average, and then another to compute the actual standard deviation. I'm sure we'll see more of this as we get into more complex statistical or machine learning functions.
Another is the mixing of composite and single-point aggregation functions in a single analyst query. Composite aggregation functions are those that use all values in the computation, and include count, sum, average, and standard deviation. Single-point aggregation functions are those that produce a single value, like min, max, and median. These are treated quite differently, and so such queries need to be broken up into multiple queries.
Examples
Following are some examples. They are designed to work on the banking database. In all cases (except where mentioned), the correctness of the examples has been checked against the expected answers from the database.
Count total number of rows
Almost the simplest possible query is this:
Today the cloak would modify this query to be something like this:
and then compute the counts itself, using client_id to seed the dynamic noise layer and to flatten (and in principle to do low count filtering, though that wouldn't trigger in this case).
With the new no-uid approach, the cloak could do something like this:
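The original query isn't reproduced here; a hypothetical sketch matching the description below (with client_id as the UID) would be:

```sql
-- Hypothetical sketch: per-user row counts in the inner select; the outer
-- select returns the bucket total, the distinct-UID count, and the per-user
-- contribution statistics.
SELECT sum(cnt)                  AS total_rows,
       count(DISTINCT client_id) AS duids,
       max(cnt), min(cnt), avg(cnt), stddev(cnt)
FROM (
  SELECT client_id, count(*) AS cnt
  FROM accounts
  GROUP BY client_id
) t;
```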
- duids (count of DISTINCT UIDs) is used as part of the seed computation, in place of the hash of actual UIDs today.
- sum(cnt) produces the same values as count(*) from the original query. We'll add noise to sum(cnt) for each bucket.
- duids is also used to low-count filter. The noisy threshold is computed from the various seed components, including duids.
- max(cnt), min(cnt), avg(cnt), and stddev(cnt) are used to decide how much noise to add. (Exact details to be determined if we decide to go ahead with this approach.) I'm not sure we need all four. max(cnt) and min(cnt) give us the most extreme values in the bucket. We can see how extreme it is relative to the average (avg(cnt)). We can also compute what it contributes to the standard deviation (stddev(cnt)). In general, the more extreme, the more flattening and the more noise. I think this will turn out to be a reasonable approach, but more work needed obviously.

Note as a side point that, for the banking table, this reduces the number of rows returned from the DB from 5639 to 1. In this case, the query times at the DB are about the same.
The answer to the above query is:
The max, avg, and stddev show not only that there are no extreme values, but that every user contributes the same amount. Both the fact that max == avg and stddev = 0 tell us this. So we can add the minimum noise here (SD=1 per layer).
If however we do the same query on the transactions table:
we get this answer:
Here we see that some UIDs are contributing more than others. So the question is, how much flattening should we do, and how much noise should we add. For the first question, we would like to know how extreme the max is compared to the subsequent few users.
As a thought experiment, one way to get this avg and stddev would be for 1/2 of the values to be 107, and the other half to be 363 (other than our max and min). Were this the case, then indeed our max would be extreme compared to the next highest values. But one can suppose this to be very unlikely. Even in this worst case, we only need to flatten the extreme value by 312 to effectively hide it among the others. (And presumably we could flatten the min by adding 98.) In this worst case, we'd want the baseline noise to be 363. The cloak currently reports noise of 340 to this query.
Note finally that we can probably use the distribution of values in the shadow as a hint, for instance to increase our confidence that we don't have some corner case.
Count total number of DISTINCT UIDs in the table
Analyst query:
With the new no-uid approach, the cloak could do something like this:
The answer shows that only minimum noise is needed:
Count DISTINCT values for a given column
Analyst query:
I'll write this as two queries. Someone with more SQL skills than me can perhaps figure out how to do this in one query efficiently.
The two queries are these:
Though note that if the table has one row per user, then the first query is not necessary because we'll know the answer. That is the case above.
Suppose the analyst wants to do this on transactions, for instance:
The no-uid queries are:
The answers to the two queries are:
Here we see that some additional noise and maybe flattening is needed, but again not too much for the same arguments as I made above.
Sum of a column
Analyst query:
Equivalent no-uid query:
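The original query isn't reproduced here; a hypothetical sketch following the same per-user pattern (banking names assumed) would be:

```sql
-- Hypothetical sketch: per-user sums inside, the total plus per-user
-- contribution statistics outside.
SELECT sum(sm)                   AS total,
       count(DISTINCT client_id) AS duids,
       max(sm), min(sm), avg(sm), stddev(sm)
FROM (
  SELECT client_id, sum(amount) AS sm
  FROM transactions
  GROUP BY client_id
) t;
```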
Average of a column
Analyst query:
Equivalent no-uid query:
Note here that for the purpose of deciding the amount of noise, we would divide the values here (max, min, and avg) by duids. So max=3.92 and avg=1.06.
Standard Deviation of a column
Analyst query:
The no-uid query is:
Note that the average needed for the standard deviation computation is computed as a SELECT embedded in the abs function. My postgres is smart enough to store the computed value and re-use it, but not every implementation might.
There is a small difference in the answers of the analyst and no-uid queries here. The analyst query returns 9470.0803, while the no-uid query returns 9470.0766. So not sure I'm really doing the computation right here.
The answer is this:
Note that the amount of noise added here would be proportional to the max or avg divided by the number of distinct users (so max 5.1 and avg 1.5).
Max of a column
Analyst query:
I think the easiest way to deal with this is to just get the top 100 or even 500 rows from the table, and more-or-less do the same kind of computation we've been doing:
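For example, a hypothetical sketch (not the original) that just pulls the top rows together with their UIDs:

```sql
-- Hypothetical sketch: fetch the top rows (with UIDs) so the cloak can run
-- its usual max/flattening logic over this small sample.
SELECT client_id, amount
FROM transactions
ORDER BY amount DESC
LIMIT 500;
```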
The reason for getting so many is to minimize the probability that everything we gather belongs to just one or two distinct users. Also, if the total number of rows is small, then with this LIMIT we could obtain all rows and then avoid the problem of max being less than min.
Min of a column
Same as max, but in reverse.
Median of a column
Analyst query:
Ideally we'd like to sample a set of values above and below the true median, and again do more-or-less what we do today. This seems to me to require two queries (or a UNION or JOIN of essentially two queries).
Possibly we could consider just sampling above or below the true median, and assuming that the distribution of values above is pretty close to what it is below.
Average and Standard Deviation of a column
Analyst query:
The no-uid query is:
Here it is just a matter of including all of the needed values. Other than counting duids, we double the required computations.
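A hypothetical sketch (not the original) of how both aggregates could come out of one pass, using the sum / sum-of-squares / count accumulators discussed earlier:

```sql
-- Hypothetical sketch: one pass computes count, sum and sum of squares per
-- user; the outer select derives both the average and the standard deviation.
SELECT sum(s) / sum(c)                              AS global_avg,
       sqrt(sum(ss) * sum(c) - sum(s) ^ 2) / sum(c) AS global_sd,
       count(DISTINCT client_id)                    AS duids,
       max(s / c), min(s / c), avg(s / c), stddev(s / c)
FROM (
  SELECT client_id, count(amount) AS c, sum(amount) AS s, sum(amount * amount) AS ss
  FROM transactions
  GROUP BY client_id
) t;
```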
Average and Median of a column
Analyst query:
This would just require two separate queries, one for avg and one for median. I don't see a way to avoid that.
Histogram row count per value
Analyst's query:
This is just like the simple counting all rows query, except that we additionally group by the column in both SELECTs.
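The original queries aren't reproduced here; a hypothetical sketch of the no-uid shape, with col standing for the analyst's histogram column, would be:

```sql
-- Hypothetical sketch: group by the histogram column in both the inner and
-- outer selects; "col" is a placeholder for the analyst's column.
SELECT col,
       sum(cnt)                  AS num_rows,
       count(DISTINCT client_id) AS duids,
       max(cnt), min(cnt), avg(cnt), stddev(cnt)
FROM (
  SELECT client_id, col, count(*) AS cnt
  FROM transactions
  GROUP BY client_id, col
) t
GROUP BY col;
```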
The answer to the above query is this:
A histogram query can of course return a lot of rows, if for instance the values tend to be unique. In this case, one could prevent having to process all these rows with a pair of queries like this:
The first query will, presumably, produce some number of values that are LCF and therefore go into the * bucket. We need to think a little about how to determine the amount of noise for the * bucket in this case, because strictly speaking we don't have the min/max/avg/sd for the UIDs that went into the * bucket. Probably we can make an assumption that the statistics for the UIDs for the complete set more-or-less hold for those of the values that are LCF, and set noise accordingly.
The second query tells us the count of the * bucket for those values that have only a single user. I think that these can just be added to the * bucket computed from the first query, perhaps with a little more noise.
Note that the shadow, if we have one, could be a hint that we can benefit by making two queries instead of one.
Histogram of row counts by user (number of transactions)
This query computes a histogram of how many users have how many transactions.
No-uid query:
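The original query isn't reproduced here; a hypothetical sketch would be:

```sql
-- Hypothetical sketch: count transactions per user, then count how many users
-- fall into each transaction-count bucket.
SELECT num_trans, count(*) AS num_users
FROM (
  SELECT client_id, count(*) AS num_trans
  FROM transactions
  GROUP BY client_id
) t
GROUP BY num_trans;
```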
Though note that since in any event this query groups by client_id, there is no reason to separately compute duids or max/min/avg/sd. In other words, num_trans is always the same as min and max, and stddev is always 0.
As with above, if we think there might be a long LCF tail, we could break it into two queries:
Histogram of averages
Analyst query:
No-uid query:
Histogram of Standard Deviations
Analyst query:
Note that with standard deviations, we can't simply add a GROUP BY for the histogram column like the examples above. This is because the avg for each histogram value needs to be computed first. Maybe there is a way to do that with a single SQL query, but I don't know what it is (or even if it is a good idea if there is a way). I think if we are going to go with no-uid, then we need to be prepared to deal with compound queries (a string of ;-separated queries all at once) or even multiple queries (query, reply, query, reply).
A (composite) no-uid query that works is:
Count total number of rows with posand
Query:
No-uid modified:
Here we are floating the posand so that we can seed the corresponding noise layers. We take the average of the posand because otherwise we'd have to do a GROUP BY and that, in some cases, could explode into a lot of rows.
For instance, take the following query:
If we were to modify it like this:
Then it would return almost one row per user, so virtually no savings.
So instead we can take the average of the lastnames (after conversion to int) and use that as a seed component for the LIKE posand.
There is almost certainly a better way to produce some composite value out of strings. And the above is postgres specific.
If we implement #2459, its datetime equivalent, and some shadow-based approach for strings, then we can avoid most floating.
Count total number of DISTINCT UIDs in the table with posand
Analyst query:
With the new no-uid approach, the cloak could do something like this:
The posand could also be placed on the outer select, but I would assume it is generally more efficient to do it in the inner select.
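A hypothetical sketch with the posand pushed into the inner select (the condition and names are placeholders):

```sql
-- Hypothetical sketch: the posand is applied in the inner select so fewer rows
-- are aggregated; the outer select only sees per-user counts.
SELECT count(DISTINCT client_id) AS duids,
       max(cnt), min(cnt), avg(cnt), stddev(cnt)
FROM (
  SELECT client_id, count(*) AS cnt
  FROM transactions
  WHERE cli_district_id = 68
  GROUP BY client_id
) t;
```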
Count DISTINCT values for a given column with posand
Analyst query:
I'll write this as two queries. Someone with more SQL skills than me can perhaps figure out how to do this in one query efficiently.
The two queries are these:
Standard Deviation of a column with posand
Analyst query:
The no-uid query is:
Note that the SELECT embedded in the abs function also needs the WHERE clause.
The answers to the two above queries are not quite identical, so I wonder if something is wrong. The first query returns 8242.9778 while the second returns 8242.6026.
Here is the same thing with two posands.
Analyst query:
The no-uid query is:
Note that the SELECT embedded in the abs function also needs the WHERE clause.
Here the difference in answer is even bigger: 844.0798 versus 843.1372. Hmmmm.
Histogram with HAVING in inner select
Analyst query (noting first that the cloak probably requires that the analyst put the uid in the inner select, which is not done here, and second that the analyst would use the bucket function, not floor):
No-uid query:
Here I'm pulling up the min, max, and count of trans_date, the column in the HAVING clause, for the purpose of seeding the HAVING noise layer. This reflects what we do today for aggregated inner selects.
Histogram with WHERE in aggregated inner select
Analyst query:
No-uid query:
Count total number of rows with negand
Query:
The new plan is to drop low-effect negands. From the shadow, we can decide one of three things: that the negand is certainly low effect, that it is certainly not low effect, or that we can't tell from the shadow alone.
Here we discuss only the third case. In the third case, we need to determine from querying the database whether or not the negand is low effect.
In the current setup, where the cloak pulls every row from the DB, the cloak can drop the negand in its query to the DB, and inspect the returned rows to see how many would have been excluded by the negand. Then if the cloak decides that the negand is low effect, it can in principle act as though the negand were never there (i.e. leave the rows in the answer). It can also choose to keep the noise layers associated with the negand, or exclude them.
Here is one possible no-uid query:
The answer to the above query is:
The basic idea of this query is that we compute three things:
- the true answer with the negand applied (true_with),
- the true answer without the negand (true_wo), and
- the number of distinct UIDs affected by the negand (duid_neg).
Depending on the value of duid_neg, and whether or not it passes LCF, we select either true_with or true_wo as the result we report to the analyst (after noise). In this particular case duid_neg=86, which would clearly pass LCF, so we'd use the true_with value when reporting.
What I'm doing here is including the rows where cli_district_id = 68 in the computation. This allows me to compute true_wo and duid_neg. In order to compute true_with, what I'm doing is generating new columns where the value is NULL if cli_district_id = 68 (to reflect the fact that the row would have been excluded from the original query).
Since in this case we are computing count(*), which normally would count NULL rows, I need to correct for the case where the original column value is in fact NULL. Otherwise we wouldn't be counting those rows when we should. Thus the corrections column and subsequent handling. (Note: the correction mechanism isn't properly tested.)
The reason for the CASE statement when computing cnt_with is because otherwise the min, avg, and stddev values are messed up. This is because without the CASE, count(col) returns 0, which then gets incorrectly incorporated into the min, avg, and stddev computations.
Another approach one could take with this is to tag columns where cli_district_id = 68, and then use the tag when computing true_with and true_wo:
The answer is this:
This was in fact my original approach. It got me into trouble in cases where we have multiple negands. I mention it here just for completeness and to see if it gives you any other good ideas.
As another example, assume the analyst query is this:
The corresponding no-uid query is:
Sum of a column with negand
Analyst query:
(True answer 7415632162.8918, true answer without negand, 7415643454.89165, difference of 11291.99985.)
Standard Deviation of a column with negand
Analyst query:
The no-uid query is:
Histogram of column sums with negand
Analyst query:
Equivalent no-uid query:
Histogram of a column Standard Deviation with negand
Analyst query:
The no-uid query is:
Sum of column with two negands
Query:
No-uid Query:
Answer:
Here I've pre-computed all possible cases (either, neither, or both negands dropped). With still more negands, the number of possible low-effect combinations grows combinatorially. So beyond two negands, I think we should just compute the case where all negands are dropped (which happens "naturally" without additional CASE statements), the expected case where no negands are dropped, and then check for each individual negand (but not combinations of negands). If it turns out that more than one negand is low effect, then we'll just have to query again.
Count total rows with posors
Query:
No-uid Query:
Answer:
Here again we do not compute all low-effect combinations, but only those where a single one of the IN components is low-effect. If more than one is low-effect, then we need to re-query with those components removed.
Note the need for NULLIF.
Sum of column with posors
Query:
No-uid Query:
Answer: