Closed yoid2000 closed 7 years ago
@sebastian @obrok @cristianberneanu @sasa1977
Gang, here is a new issue that is designed to prevent the need for doing probing with inverse negative clauses. The technique is generally referred to as "layered noise".
Layered noise can also be used to allow us to get rid of most of the math restrictions, and I'll present that in another issue. Layered noise will also remove the need to do checks for attacks with `LIKE`, and in many cases remove the need for gathering extra data to check negative clauses. I'll discuss both of these in future issues as well.
I guess Sebastian asked one or more of you (Yapee at least) to have a look at this stuff. Whoever this is, please have a read and let's plan to talk once you have digested it. In particular, please read it very critically: assume that there are bugs and attacks that I haven't considered...
Overall this approach seems promising, because there is an intuitive high-level reason why it should cover all cases (we'll see about that, but I'm hopeful). Floating is similar to what we already do for SaD ranges so I like it from an implementor's perspective as well.
Some notes, questions:
When a column goes through an aggregate function (GROUP BY uid)
A column going through an aggregate function implies a `GROUP BY`, and we require any `GROUP BY` in a subquery to include `uid`, so in that case this specification is redundant. What about something like:
select avg(a)
from foo
group by b
do we float `a`, or `min(a)`, `max(a)`, `count(a)`? Seems like `a`?
If `min`/`max` are uncomputable then most likely only `count` is. Maybe we can just float `count` in that case.
Why does the column name need to be included in the seed?
Why does the `*` need to be included in the seed?
You say
- The number of layers.
The total number of layers is included in the seed for each layer? Is that intentional?
Including the type of clause seems a bit underdefined. I guess we would need to include all clauses used? Or just the last one?
Related - At some point you mention that the seed should include "number of filters" - but that seems to expose us to an averaging attack because you can include a bunch of redundant filters on the same column to build many queries which only differ by noise.
Related - There seems to be at least some room for an averaging attack because the type of filter is included. For example, you can usually form two queries, one with `GROUP BY` and another with `=`, that do the same thing, giving you 2 samples. More options may exist, especially if we need to include all types of filters used?
In general, it seems that in this scheme we mostly need to be on the lookout for noise-averaging attacks, as opposed to the difference attacks we watched for previously.
In Example 6:
The grants and gender columns need to be floated, obviously.
Is this a typo? The query does not mention anything about grants/gender.
What about something like:
select avg(a)
from foo
group by b
do we float `a`, or `min(a)`, `max(a)`, `count(a)`? Seems like `a`?
The filter here is acting on `b`, so you float `b`. (You only do `min`/`max`/`count` when a `GROUP BY uid` is encountered.)
(My example 7 is like this one.)
- If min/max are uncomputable then most likely only count is. Maybe we can just float count in that case.
When are min/max uncomputable?
- Why does the * need to be included in the seed?
This is to deal with attacks that exploit generation of a `NULL` value, and use `count(*)` versus `count(col)` to include or exclude that `NULL` value. I've added an example 10 to describe this.
- Why does the column name need to be included in the seed?
A good question. One of my fears is that the analyst has some clever way of learning what the noise value is for certain column values. For instance, if he can learn that any column consisting of nothing but value `5` produces a noise of -1.63, then maybe this can be exploited in some attack. (I can't think of a way the analyst can learn this, nor is there an immediately obvious way the knowledge can be exploited if it is learned, but I'm nervous just the same.)
So, by adding the column name, I reduce the attack surface considerably. Maybe he can figure out that column `level` and value `5` have noise -1.63, but this doesn't help him with column `winpoints` and value `5`.
By the way, the purpose of adding the number of noise layers to the seed is in part for the same reason.
I've also considered doing more rounding of answers than we do now, so that it is harder to learn noise values, but not sure we need it so haven't written it down.
- You say
- The number of layers.
The total number of layers is included in the seed for each layer? Is that intentional?
Yes, as at least partly explained in the previous post. To expand on this a bit...what I want to do is to make the noise dependent on what the analyst types as much as possible, but without giving the analyst the ability to add substantial "chaff" --- essentially the ability to create a lot of noise samples and average the noise away. It is pretty hard for the analyst to play with the number of layers because each layer adds noise, so he can't just append an arbitrary number of "chaff" clauses or something. Also, the simplest way to make a difference attack is to remove a clause, and that immediately changes all of the seed values.
- Including the type of clause seems a bit underdefined. I guess we would need to include all clauses used? Or just the last one?
This is yet another of these things I throw in to make noise harder to predict while limiting the added flexibility it gives the analyst in making noise samples. But it is probably the least important of such things.
If it is underspecified, then we should fix. My inclination would be to use the first filter that the column encounters as the clause type label.
When are min/max uncomputable?
It's you who brought this up:
Alert: There may be column types where min and max are not computable. I haven't given this any thought yet. If so, then we need to float some other aggregate.
I'm only saying that if you can't compute `min` then you're unlikely to be able to compute `avg`, for example.
- Related - At some point you mention that the seed should include "number of filters" - but that seems to expose us to an averaging attack because you can include a bunch of redundant filters on the same column to build many queries which only differ by noise.
Ha ha, this is the thing I just said would be hard for the analyst to attack. I literally forgot exactly how the thing worked (I was saying "number of layers", but indeed I have the optimization where we don't have one layer per filter).
I think you are right, and probably we should give up that optimization and have one layer per filter. Then this attack goes away. I would think it relatively rare that the same column goes through a lot of filters.
- Related - There seems to be at least some room for an averaging attack because the type of filter is included. For example you can usually form two queries one with GROUP BY and another with = that do the same thing, giving you 2 samples. More options may exist, especially if we need to include all types of filters used?
You are right...it is stupid to have a label that can be trivially bypassed. Let's assume that we'll drop this part of the seed.
- In general in this scheme it seems we need to be on the lookout for noise averaging attacks the most as opposed to difference attacks previously.
I think the new attacks we need to watch for are noise-averaging attacks against every layer other than the `uid` layer. In short, we may have expanded the attack surface considerably...
- In Example 6:
The grants and gender columns need to be floated, obviously.
Is this a typo? The query does not mention anything about grants/gender.
Yes, typo. I had changed the example without changing the explanation. Fixed now.
This looks to me more computationally expensive than condition probing. Increasing the amount of selected data will greatly reduce the performance of the query. Was the previous solution unfeasible, or was this selected mainly for performance reasons?
The XOR of the hashes of the distinct values in the column.
Since this is done only for `=` clauses, won't all the values be the same?
Since this is done only for = clauses, won't all the values be the same?
In case of `GROUP BY` and `=` - yes. In case of ranges and negative conditions - no.
@cristianberneanu
This looks to me more computationally expensive than condition probing. Increasing the amount of selected data will greatly reduce the performance of the query. Was the previous solution unfeasible, or was this selected mainly for performance reasons?
Yes. I take it as a given that this approach is far cheaper computationally. With the previous approach, you had to remove AND clauses, which would often greatly increase the number of rows that the database needs to look at. Sometimes you could benefit from the `LIMIT`, but by no means always.
By contrast, this approach increases the number of columns that need to be conveyed, but it never increases the number of rows that anything has to deal with, and in fact the database needs to look at all those columns anyway. So there is some extra work, but I don't think that much. I could be wrong...
The XOR of the hashes of the distinct values in the column.
Since this is done only for = clauses, won't all the values be the same?
In case of `GROUP BY`, `=` - yes. In case of ranges, negative conditions - no.
It is not really a problem if all of the values in the column are the same. It just means that the seed will be based on a single value (plus stuff like column name and number of layers) rather than multiple values...
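As a concrete illustration, here is a minimal Python sketch of seeding a layer from the XOR of the hashes of the distinct column values. The salt, the 64-bit truncation, and the function names are my assumptions for the sketch, not the actual cloak implementation:

```python
import hashlib

SALT = b"cloak-secret-salt"  # hypothetical per-installation salt

def value_hash(value) -> int:
    # Hash one value to a 64-bit integer using SHA256.
    digest = hashlib.sha256(SALT + repr(value).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def layer_seed(column_name: str, floated_values) -> int:
    # XOR the hashes of the distinct floated values. XOR is order-independent,
    # and mixing in the column name gives different columns different seeds
    # even for identical value sets, per the discussion above.
    acc = value_hash(column_name)
    for v in set(floated_values):
        acc ^= value_hash(v)
    return acc

# A column where every row holds the same value still yields a seed; it is
# simply based on that single value (plus the column name, salt, etc.):
assert layer_seed("salary", [100000] * 1000) == layer_seed("salary", [100000])
assert layer_seed("level", [5]) != layer_seed("winpoints", [5])
```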
I take it as a given that this approach is far cheaper computationally.
I think it will be more expensive. The cloak is a lot slower, for one, and the transfer of the data from the database to the cloak is expensive in itself.
It is not really a problem if all of the values in the column are the same.
It will be a performance problem if we select extra data for each row, even if that data is the same.
I take it as a given that this approach is far cheaper computationally.
I think it will be more expensive. The cloak is a lot slower, for one, and the transfer of the data from the database to the cloak is expensive in itself.
Intuitively I agree with what @yoid2000 is saying... Could we somehow get a better answer to this question without implementing both approaches?
There are many variables involved, so we won't be able to get a definitive answer in any case. If the columns involved are indexed and the number of filters is low, the probing should be faster. If the columns are not indexed and there are lots of filters, maybe floating the values will be faster. In between these, the results will be mixed.
Maybe I answered the original question too quickly. This work was initiated because of performance reasons, but it has two other benefits:
The attack is this. Suppose I want to learn the exact number of users with some attribute `A` (or some combination of attributes `A1 & A2 & A3`). Learning an exact number is dangerous, because then a lot can be deduced from it. (If I know the exact number of all people in the CS dept, and the exact number of men, then I can deduce, for instance, that there is exactly one woman.)
To learn the exact number, I form a bunch of queries of the form:
A & B
A & NOT B
A & C
A & NOT C
etc.
Since `(A & B) + (A & NOT B) = A`, then as long as many of these answers are not low-count filtered, I can derive the exact value of `A`.
With layered noise, there would always be a fixed noise layer attached to `A` (which might be something like `dept = 'CS'`), so an exact answer could not be produced.
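A toy numerical model of this attack may help. It ignores low-count filtering, uses arbitrary numbers and plain Gaussian noise, and is only meant to show why the fixed `A` layer blocks the averaging:

```python
import random

random.seed(0)
TRUE_A = 1000        # exact count of users with attribute A (what the attacker wants)
A_LAYER = 1.7        # noise layer seeded by the A clause: identical in every query

def noisy_answer(true_count, use_layer):
    fresh = random.gauss(0, 1)               # per-answer "uid" noise, fresh each query
    return true_count + fresh + (A_LAYER if use_layer else 0.0)

def averaging_attack(use_layer, pairs=20000):
    total = 0.0
    for _ in range(pairs):
        a_and_b = random.randint(300, 700)   # size of A & B for a throwaway clause B
        total += noisy_answer(a_and_b, use_layer)
        total += noisy_answer(TRUE_A - a_and_b, use_layer)  # A & NOT B
    return total / pairs

no_layer = averaging_attack(False)   # converges to TRUE_A: the exact count leaks
layered = averaging_attack(True)     # converges to TRUE_A + 2*A_LAYER: offset remains
assert abs(no_layer - TRUE_A) < 0.5
assert abs(layered - TRUE_A) > 2     # the fixed layer keeps the answer off the true value
```

The per-answer noise averages away either way; the difference is that the clause-seeded layer is the same value in every query, so it survives the averaging as a constant offset.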
There are many variables involved, so we won't be able to get a definitive answer in any case. If the columns involved are indexed and the number of filters is low, the probing should be faster. If the columns are not indexed and there are lots of filters, maybe floating the values will be faster. In between these, the results will be mixed.
Yes, I agree. With the caveat that the number of columns that have to be floated is also based on the number of filters.
Perhaps what we'd find is that average performance for floating is worse, but worst case performance for probing could be much much worse. Also, in cases like Teambank where anyway the query has to be emulated, the performance for probing can be very bad.
Anyway, assuming that the query was allowed, then the noise layer in the first query would use salary* in the seed, and the second would use salary. This would result in different seeds, and therefore the analyst would not know if a difference in answer is due to the seed or the presence or absence of the victim.
That's a bit problematic. What about a query like:
```sql
SELECT count(*), count(thing) FROM
  (SELECT z/(6003614.195712-z+d+y+m) AS thing FROM
    (SELECT zipcode * 100 AS z,
            year(bday) / 10000 AS y,
            month(bday) / 1000000 AS m,
            day(bday) as d
     FROM user_info
     WHERE salary = 100000
    ) t1
  ) t2
```
maybe the seed should be `*` in case of only `count(*)` at the top level, `salary` in case of only `count(salary)`, and `salary*` in this case? This gives another query with the same semantics though, so more room for averaging?
(Though if I think about it, the mere fact that bday and zipcode are used but never encounter a filter is by itself suspicious and could be used as the basis for rejecting the query!)
But it is used in an expression that ultimately ends up influencing the result (at least in the `count(thing)` case) - I suspect deciding when a usage is "legitimate" in this way will be very difficult.
SELECT count(*), count(thing) FROM
Grrrrr. I don't want to obsess over this, but I don't want to ignore it either.
Probably we need to accept the fact that the analyst may be able to do a small amount of averaging. As long as it isn't too much, it is probably acceptable. Still, this gives the analyst one free sample for any column that has no `NULL` values...
By the way, as an optimization: if it is clear from the query what the floated values would be, then it is not necessary to float the column. Rather, the compiler can just compute what the value in the column would be if it were floated. For instance, in the following query:
```sql
SELECT count(*)
FROM table
WHERE col1 = 20
```
it is clear that if we were to float `col1`, then the resulting values would all be 20.
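A sketch of that shortcut, with hypothetical helper names and a toy seed construction (assuming the seed is built from the distinct floated values, as discussed above): when the filter pins the column to one constant, the distinct set is known without querying, so the seed can be computed directly:

```python
import hashlib

def seed(column: str, distinct_values) -> int:
    # Toy seed: hash of the column name XORed with hashes of the distinct values.
    acc = int.from_bytes(hashlib.sha256(column.encode()).digest()[:8], "big")
    for v in set(distinct_values):
        acc ^= int.from_bytes(hashlib.sha256(str(v).encode()).digest()[:8], "big")
    return acc

# Floating col1 under `WHERE col1 = 20` would return 20 for every row, so the
# database round trip and the compiler's deduced constant produce the same seed:
floated = seed("col1", [20] * 10000)  # what floating would yield
deduced = seed("col1", [20])          # computed without touching the database
assert floated == deduced
```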
@yoid2000
Any time a filter occurs, then we want to create a noise layer based on the values in the relevant column prior to the filter.
This is unclear to me. It seems to signify that the query:
```sql
SELECT age, count(*)
FROM players
WHERE age = 10
GROUP BY age
```
should have a noise layer based on all the values of `age` (i.e. prior to the filter), but this contradicts all the later examples, and my understanding from reading the rest. Therefore, what is actually meant?
Detail: `bucket`, `extract_match`, and `extract_matches` don't need extra floating. They are all only allowed in the top-level query, i.e. all the data required is already present. All the same, they do need extra layers of noise to be added, based on the raw values going into the functions. This, for example, seems to be missing from example 9?
For instance, if he can learn that any column consisting of nothing but value 5 produces a noise of -1.63, then maybe this can be exploited in some attack. (I can't think of a way the analyst can learn this, nor is there an immediately obvious way the knowledge can be exploited if it is learned, but I'm nervous just the same.)
Isn't this something the analyst would learn by using a `GROUP BY` clause? Like `SELECT age, count(*) FROM patients GROUP BY age`? Then he will know for each value of `age` that there was a noise component based only on the particular value? Or am I misunderstanding, and the seed for any of the noise values across all the `GROUP BY` clauses is based on the complete set of distinct reportable `age` values? I.e. the same seed is used across all rows (I don't think that to be the case...)?
The hash and the token should be cryptographically strong (SHA256).
There is a mention of a `token` here which isn't further defined. Is this a remnant from some past iteration? The negative condition checks issue also makes use of some tokens based on the filter type?
Can the tokens go/be removed?
Any time a filter occurs, then we want to create a noise layer based on the values in the relevant column prior to the filter.
This is unclear to me. It seems to signify that the query:
```sql
SELECT age, count(*)
FROM players
WHERE age = 10
GROUP BY age
```
should have a noise layer based on all the values of `age` (i.e. prior to the filter), but this contradicts all the later examples, and my understanding from reading the rest. Therefore, what is actually meant?
The wording is clumsy. What I mean is what you deduced from the examples: that I want to float the column as it existed before the filter so that any changes to the column after the filter don't influence what we float. But the filter of course removes rows, which is what we want.
I cleaned up the text.
Detail: bucket, extract_match, and extract_matches don't need extra floating. They are all only allowed in the top-level query, i.e. all the data required is already present. All the same they do need extra layers of noise to be added, based on the raw values going into the functions. This for example seems to be missing from example 9?
I think example 9 is OK, because we are floating `income`, which is what is used in the `bucket()`. Then the cloak has to itself execute the `bucket()` function, and add the corresponding noise layer (which is mentioned in the text as `income_class`).
For instance, if he can learn that any column consisting of nothing but value 5 produces a noise of -1.63, then maybe this can be exploited in some attack. (I can't think of a way the analyst can learn this, nor is there an immediately obvious way the knowledge can be exploited if it is learned, but I'm nervous just the same.)
Isn't this something the analyst would learn by using a `GROUP BY` clause? Like `SELECT age, count(*) FROM patients GROUP BY age`? Then he will know for each value of `age` that there was a noise component based only on the particular value? Or am I misunderstanding, and the seed for any of the noise values across all the `GROUP BY` clauses is based on the complete set of distinct reportable `age` values? I.e. the same seed is used across all rows (I don't think that to be the case...)?
You are right in that we don't use the same seed for all rows...every bucket has a different seed for the `age` layer, because every bucket has a different value of `age`. However, the analyst still can't learn any of the noise values.
If the analyst does `SELECT age, count(*) FROM patients GROUP BY age`, then each bucket will have two layers: one for the `GROUP BY`, and one from the `uid` (the "classic" noise that we've always had). So even if the analyst knows the exact count for some reason, the `age` layer is obscured by the presence of the `uid` layer. (And furthermore, even if the analyst could learn the noise values for `age` for this query, they would change in a query that had a different number of layers, so still not helpful.)
There is a mention of a token here which isn't further defined. Is this a remnant from some past iteration? The negative condition checks issue also makes use of some tokens based on the filter type?
Can the tokens go/be removed?
Yes, I fixed the text.
I think example 9 is ok, cause we are floating income, which is what is used in the bucket(). Then the cloak has to itself execute the bucket() function, and add the corresponding noise layer (which is mentioned in the text as income_class).
`income_class` is producing a noise layer due to being used in `GROUP BY`. But we still need an additional layer based on the `income` value going into the `bucket` function!? Hence it should have 4 layers of noise, not 3?
income_class is producing a noise layer due to being used in GROUP BY. But we still need an additional layer based on the income value going into the bucket function!? Hence it should have 4 layers of noise, not 3?
But `income_class` is a function of `income`, no?
Exactly, "it's a function of", but a function that does a many-to-one mapping from `income` to `income_class`.
Hence the `income` noise layer of `bucket(income by 5) as income_class => 0` would have a noise layer influenced by any combination of the numbers 0 through 5, whereas `income_class` (because of the `GROUP BY`) produces a noise layer only based on 0.
Hence the income noise layer of bucket(income by 5) as income_class => 0 would have a noise layer influenced by any combination of the numbers 0 through 5, whereas income_class (because of the GROUP BY) produces a noise layer only based on 0.
OK, but why is it needed?
I had another look. There should indeed be three noise layers derived from the three conditions (and a fourth noise layer from the uid). So @sebastian is right about that. `betrag` goes through two conditionals, one with the range and one with the `GROUP BY`, so two of the layers should be based on `betrag`. Since two of the noise layers are on the same column, there should be a counter associated with the seed for each.
One question is, should the `betrag` layer derived from the range be based on all the rows (from the combined buckets), or just on the per-bucket rows? It could probably be either, but I don't see a problem with basing it on just the per-bucket rows. Done this way, the only difference between the seeds for the two `betrag` layers is the counter, but this indeed produces different seeds.
Note that if there were an attack that involved dropping the range, then this would remove the noise layer and force the resulting answer to be different whether or not a victim was included. So this is the right behavior. It does give the analyst a free extra sample, but this is something we already decided we can live with as long as the analyst can't get very many such samples...
We currently also allow `WHERE` clauses of the type `columnA = columnB` and `columnA <> columnB`. The question then becomes: which values are going to be used in the seed? The distinct set of values in the intersection of columns `columnA` and `columnB`?
You don't need to create a noise layer for `col=col` clauses. The analyst can't attack with those clauses, so they don't need to be protected.
Ah, excellent!
@obrok discovered an attack that exploits the fact that we use "number of layers" as part of the seed. The attack is to keep increasing the number of layers with "chaff" layers like `age <> 1000`, using different chaff expressions so that the chaff noise is averaged away. With each increased number of layers, the analyst gets a different sample for the (non-chaff) layer he is trying to attack.
The reason I had number of layers as part of the seed in the first place was because I was concerned that an analyst might be able to reverse engineer the noise associated with a given condition, but in fact I don't have an explicit way of doing that...I was only concerned.
But it seems that we need to remove the number of layers from the seed.
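To make the fix concrete, here is a toy model of seeded layers (hash-seeded pseudo-noise; the seed format and names are my own assumptions). With the layer count in the seed, every extra chaff clause re-rolls the noise of the layer under attack; with it removed, that layer is one fixed value that chaff cannot average away:

```python
import hashlib, random

def layer_noise(seed_material: str) -> float:
    # Deterministic pseudo-noise: hash the seed material, seed one Gaussian draw.
    h = int.from_bytes(hashlib.sha256(seed_material.encode()).digest()[:8], "big")
    return random.Random(h).gauss(0, 1)

# Old scheme: the layer count is part of the seed, so appending chaff clauses
# like `age <> 1000` changes the target layer's seed and yields fresh samples.
old_samples = {layer_noise(f"age = 10|layers={n}") for n in range(2, 7)}
assert len(old_samples) > 1   # distinct samples -> averaging attack possible

# Fixed scheme: the target layer's seed ignores the layer count, so its noise
# is a single fixed value no matter how much chaff is appended.
new_samples = {layer_noise("age = 10") for _ in range(5)}
assert len(new_samples) == 1
```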
@obrok if you are ok with this, I can modify the description accordingly?
@yoid2000 please do
done.....
But @obrok , as we discussed, please see if you can find attacks that let the analyst reverse engineer the noise value for a given condition.
@yoid2000 can the salt used for the extra layers and the UID layer be the same?
To clarify - I think the neatest way to code this is to actually treat the UID layer uniformly with the other layers, so it would be nice if the salt could be the same.
I can think of no reason why the salt can't be the same.
PF
On Fri, Apr 21, 2017 at 5:57 PM Paweł Obrok notifications@github.com wrote:
To clarify - I think the neatest way to code this is to actually treat the UID layer uniformly with the other layers, so it would be nice if the salt could be the same.
Overview
This issue describes some changes to the details of anonymization. The primary reason for the change is to eliminate the costs and complexity of probing. For the purpose of this issue, I assume that we have all of our current math restrictions. A separate issue will deal with relaxing math restrictions.
The main drawback of the new approach is that it can result in higher noise levels, though most queries won't see this, and it would be surprising to see a query that even doubles the noise. (And, over time if it is important we can mitigate this drawback at the cost of more complexity.)
At the "conceptual" level, the change can be described as follows. Much of the complexity in the cloak is there to deal with the "difference attack", whereby the analyst uses the mere fact that two answers are different to conclude something about the database. Currently, the cloak defends against this by modifying answers so that two queries that would otherwise produce different answers in fact produce the same answer. In the new approach, at least so far as attacks that drop clauses are concerned, we do the opposite: we ensure that two queries that would otherwise produce the same answer may well in fact have different answers.
Note that this change does not remove the need for other existing mechanisms. For ranges, we still do shrink-and-drop (SaD). For `IN` clauses, we still check each element in the list to see if it is individually low-count, and drop the corresponding rows if it is. We also still need to do something to handle `LIKE` clauses: either what is already proposed, or I believe we'll be able to add a noise layer instead.

The basic idea is to generate an extra layer of noise per `AND`'d equality clause (`=` only; this doesn't work for not-equal `<>` clauses). For the class of difference attacks that involves removing a clause, having this layer results in the noise changing when the clause is removed, whether or not the removal changed the UID set. As a result, we don't need to make an explicit check (probe) as to whether the UID set would change or not when the clause is dropped.

"Adding additional layers" literally means generating multiple noise values and summing them together. Every answer has at least one noise layer. This is the same layer we have today, i.e. seeded by the uid list of the answer. On top of this, we have zero or more layers per clause. (Later we discuss how to generate these layers, but normally the standard deviation (SD) of the summed layers is equal to the SD of the single layer we currently create.)
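The SD bookkeeping in that parenthetical can be checked numerically. Since independent variances add, each of n layers gets SD sigma/sqrt(n), so that the summed noise keeps SD sigma (a toy simulation; real layers are seeded deterministically, not drawn at random):

```python
import math, random

random.seed(1)
TARGET_SD = 1.0  # total SD we want, matching today's single layer

def layered_noise(n_layers: int) -> float:
    # Independent variances add: n layers of SD s sum to SD s*sqrt(n),
    # so a per-layer SD of TARGET_SD / sqrt(n) keeps the total at TARGET_SD.
    per_layer = TARGET_SD / math.sqrt(n_layers)
    return sum(random.gauss(0, per_layer) for _ in range(n_layers))

# Empirical check: four layers still give total SD ~= 1.
samples = [layered_noise(4) for _ in range(100000)]
mean = sum(samples) / len(samples)
sd = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
assert abs(sd - TARGET_SD) < 0.02
```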
Whenever the cloak produces a bucket, the contents of that bucket is influenced by some column or columns (with the exception of `select count(*) from table`). This can happen because of `WHERE`, `HAVING`, or `GROUP BY` (not including `GROUP BY` on `uid`), because these are the things that filter rows into one or another bucket. Note that Aircloak queries have an implicit `GROUP BY` when there is no aggregation function in the outermost `SELECT`.

Any time a filter occurs, we want to create a noise layer based on the values in the relevant column immediately after the filter. This is done by floating the column of the filter (or, if the SQL is simple, by deducing what the column values would be after the filter).
Broadly speaking, there are two ways we can try to do this. One is to use the values (constants) in the SQL itself. The other is to use the values in the column. I initially tried using the values in the SQL itself, but found this hard to do. The problem generally is that there are too many ways that the analyst can change the constants in SQL without changing the semantics of the SQL. This allows the analyst to generate a set of queries that effectively remove the noise of extra layers through averaging.
So instead we use the values in the columns themselves (which, after all, reflect the semantics of the query). To do this, we need to "float" those values to the top (cause them to appear in the `SELECT` call made by the cloak to the database, so that the cloak may inspect the values). We then use those values to seed the noise layer. The following section describes how and when to float column values. The subsequent section describes how to generate the seed from the values.

Floating Column Values (Or Not)
The general goal is to float the values in the column itself, and not the values as modified by math and constants in the SQL. There are three basic cases:

- a native column
- a native column with math or safe discontinuous functions
- a column that goes through an aggregate function (`GROUP BY uid`)

These are described in the following sections.
Native Column
When a column does not go through an aggregation function and does not encounter any math before the filter, it is floated in its native form (not modified for instance by math), and used to generate a noise layer.
Note however, that if it is obvious what the value of the floated column would be, then it is not necessary to literally float the column.
Following is an example:
Here we want to make a noise layer based on the native values in the `salary` column. Therefore, `salary` could be floated in the query made to the database (along with `uid`, as we currently do), something like this:

However, in this case it is obvious that if we float `salary`, then all values would be `100000`, so the cloak may skip the floating and just use the expected value.

We would also float `salary` if a `GROUP BY` were used:

In this case, each bucket would have a noise layer based on the `salary` values used for that bucket. (In this case, salary is floated naturally by the query itself.)

In the following example, salary is again floated:
As:
Native Column with Math
If the analyst adds some math to the column, we still want to float the native column. As an example:
Here again we want to float `salary` (and in this case `age` as well), and not `salary * 10`. The reason for this is that we don't want the analyst to manipulate the value that gets used to create the noise layer, since each different value would produce a different seed, and in this way the analyst could average away the noise. So the database query would be (something like):

Following is an example with a range:
Here salary is floated and used in a single noise layer, in spite of the fact that it encounters two filters. The query from the database is as follows:
There are two extra noise layers here. Note that the floated salary is also needed for SaD.
Note that we never want to make a noise layer from something in the SQL that does not encounter a filter. So in the following:
We would not float or make a noise layer from `salary`, because it never encounters a filter of any kind. Not making a layer prevents an attacker from adding arbitrary SQL that doesn't affect the output per se, but causes a new seed and therefore allows the attacker to average away the noise layer.

Native Column with Safe Discontinuous Functions (Discon)
For the purposes of this issue, we define safe discons the same as we currently do.
As with math, we want to float the native column.
The following query:
Would be floated like this:
and a noise layer formed from `float_date`.

(Note that, strictly speaking, it might be safe in this case to generate the layer purely from the `month(date)` value, primarily because there is no math or constant here that would allow the analyst to manipulate values and therefore average away the noise layer. However, to keep things uniform, it seems simpler to always form a tagged column for discons. We can revisit this later though.)

The following query is the same, but with the addition of a filter preceding the discon:
Once again, we float `date` and use it as a noise layer:

In the following example there are two discons:
These are safe (no constants), so we float `firstname` and `age`, and make two noise layers:

The following example has `bucket`.

Here we want to float `age`, as follows:

Note that SaD still needs to be done on each bucket.
The following example has `extract_match`:

Here we float `name`:

Aggregate Function
When a column goes through an aggregate function (`GROUP BY uid`), then of course it is no longer possible to float the native values. Rather, only aggregated values can be floated. When this happens, we float three aggregate values, `min`, `max`, and `count`, and use these values to generate the seed.

Note that a column can pass through two or even more aggregations. In the first aggregation, three values are generated (`min`, `max`, and `count`). In principle we could at the next aggregation take the `min`, `max`, and `count` of the three previous values, resulting in 9 values. This is both expensive and unnecessary. Instead we take the `min` of the `min`s, the `max` of the `max`es, and the `sum` of the `count`s. Later on we have an example of this.

Here is a simple example:
Because `winpoints` is aggregated, the following request is made from the database:

and the `min`, `max`, and `count` values are used to generate the seed.

Note that from a privacy point of view, it would be fine in this case to float the `sum` itself, but in other cases it is necessary to generate the three values. Therefore, from an implementation point of view, it may be easier to always generate and use these rather than code up when they are and are not needed.

Another example is this:
The `min`, `max`, and `count` values are floated as follows:

In the following example, the aggregation occurs after the filter:
Here again, the `min`, `max`, and `count` values are floated, as follows:

As a reminder, let me point out again that the purpose of the noise layer is to deal with the case where the filter (in this case `WHERE winpoints >= 40 AND winpoints < 50`) is removed. However, for shrink-and-drop (SaD), it is in any event necessary to float `min` and `max`. The noise layer does not eliminate the need for SaD.

In the following query, there are two filters for the same column, one before the aggregation and one after:
Here again we float `min`, `max`, and `count`.

We need to defend against the case where the analyst removes one of the two conditions. If we were to naively put a single noise layer for the `winpoints` column, then the analyst could remove one condition without changing the noise layer. It would be nice to have one noise layer, and then for instance use the number of filters as part of the seed. This however opens up an averaging attack where the analyst can just add chaff filters to his heart's content. So instead we have a noise layer per filter encountered. To prevent each noise layer from having the same seed, a counter is added to the seed. The counter is incremented every time the same noise value would otherwise be generated.

The following is an example of an aggregation function occurring before the filter, but with some math involved:
As with prior examples, we float `min`, `max`, and `count` of `winpoints` minus the math. So this:

As a result, the math doesn't affect the noise layer itself, even though it does affect the values in the `sw` column.

In the following query, the math on the column occurs after the aggregation rather than before. Regardless, the cloak floats `min`, `max`, and `count` as above.

Following is an example with `HAVING`:

We float as follows:
The following query is also legitimate, where we would float `min(winpoints)` etc.:

Noise
Generating the Seed per Noise Layer
Each noise layer has a seed. For columns, the seed is generated by a hash of the following items concatenated together:

- The column name. If the count is `count(*)`, then the column name is appended with '*'.
- The floated value(s). If `min`, `max`, and `count` have been floated, then the values consist of the concatenation of these three values.

The hash should be cryptographically strong (SHA256).
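The seed construction might be sketched as follows. This is an illustrative assumption only: the exact item set and separator format are not pinned down here (the full spec also mentions items such as the number of layers and the clause type), and the counter shown is the one described earlier for disambiguating per-filter layers:

```python
import hashlib

# Illustrative sketch of per-layer seed generation (not the cloak's actual
# code): SHA256 over the concatenated seed items.
def layer_seed(column, floated_values, count_star=False, counter=0):
    # For count(*), '*' is appended to the column name.
    name = column + "*" if count_star else column
    material = "|".join([name] + [str(v) for v in floated_values] + [str(counter)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

# min, max, and count floated for an aggregated column:
seed_a = layer_seed("winpoints", [12, 55, 310])
# A second filter on the same column would produce the same material, so the
# incremented counter yields a distinct layer seed:
seed_b = layer_seed("winpoints", [12, 55, 310], counter=1)
assert seed_a != seed_b
```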
The noise layer should be computed after SaD or any other mechanism that drops rows (i.e. dropping low-count values from an `IN` clause).

Generating the Noise
In each cloak aggregation function (`count`, `sum`, etc.), there is a point where mean-zero noise is added to the true answer with some computed standard deviation SD. This is described in https://github.com/Aircloak/aircloak/blob/master/cloak/docs/anonymization.md.

In the case of both `sum()` and `count()`, there is a point where a noisy number Nv is generated with mean 0 and SD 1 (and then multiplied by something). In `max()`, `min()`, and `median()` there is a point where noise is added to the "top average" with a certain SD. We can regard this SD as an SD of 1 multiplied by "a quarter of the standard deviation of ....".

So in other words, in the current implementation, there is a point in the process where a noise value is generated with SD = 1. Let's call that noise value Nv. With layered noise, we are going to produce Nv differently. Normally the new "layered" Nv will also have SD = 1, but sometimes it may be more, as explained below.
In the current cloak, the noise is a single Gaussian sample pulled from a PRNG seeded with the distinct `uid`s in the bucket. With layered noise, this noise is the sum of one or more Gaussian samples. (Note that all other noisy values, i.e. those used for noisy thresholds, are computed exactly the same as today.)

The sum of Gaussian samples is still Gaussian. The standard deviation (SD) of the sum of N Gaussian samples, each with SD = s, is sqrt(N)*s. So for example, the sum of 4 Gaussian samples, each with SD = 0.5, has cumulative SD = 1. I mention this because as long as there are four layers or fewer, the SD of each layer is set so that the cumulative SD is 1.
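The SD arithmetic above can be checked with a small sketch (a minimal illustration, not cloak code; the function names are mine):

```python
import math
import random

# Sum of N independent Gaussian samples, each with SD = s, has SD sqrt(N)*s.
def cumulative_sd(n_layers, s):
    return math.sqrt(n_layers) * s

# So with N layers (N <= 4), setting each layer's SD to 1/sqrt(N) keeps the
# cumulative SD at 1.
def per_layer_sd(n_layers):
    return 1.0 / math.sqrt(n_layers)

assert cumulative_sd(4, 0.5) == 1.0  # the example from the text

# Empirical check with a seeded PRNG: variance of the 4-layer sum is ~1.
rng = random.Random(0)
sums = [sum(rng.gauss(0.0, per_layer_sd(4)) for _ in range(4))
        for _ in range(100_000)]
mean = sum(sums) / len(sums)
var = sum((x - mean) ** 2 for x in sums) / len(sums)
assert abs(var - 1.0) < 0.05
```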
Specifically, the SD of each individual noise layer, the sum of which produces Nv, is set according to this table:
The noise layer Nv, then, is composed of the sum of N layers, where one layer is seeded as we seed the current layer (with `uid`s), and subsequent layers are seeded with distinct column values as described above.

Rounding
TBD
Examples
Let's look at some examples.
Example 1
The following query might be part of an attack where the analyst knows that there is one woman in the CS dept., and wants to learn her salary.
Q1:
In this query, gender, salary, and dept are filtered by the `GROUP BY`, and so for each bucket, noise layers for gender, salary, and dept are produced based on the native values in the three columns.

No explicit floating is needed, because the columns already appear at the top. Because we are counting NULLs here, the actual column strings used in the seeds for each layer are `gender*`, `salary*`, and `dept*`.

Note that the other query in the attack would be this:
Q2:
Since there is only one woman in the CS dept, the only buckets that would appear in the first query Q1 for `dept = CS` are buckets containing men (all the woman buckets would be LCF). The true answers for every bucket with `dept = CS` are the same in both queries except one: that matching the woman's salary.

However, the noisy answer for all buckets differs between Q1 and Q2 (or not, due to rounding) for two reasons:
Because any given answer may differ for the above reasons, the analyst can't tell which bucket also differs because of the presence of the woman.
Example 2
Likewise, in this query, gender, salary, and dept all act as filters on the produced buckets (again seeded with `gender*`, `salary*`, and `dept*`). However, here salary and dept need to be floated. Therefore, the query that the cloak makes to the database is (something like) this:

The cloak then makes one noise layer each for gender, salary, and dept.
The corresponding attack query would exclude any mention of gender, and would be protected by the same mechanisms as described in Example 1.
Example 3
Suppose here that all players have level = 5 except one, and the analyst wants to learn the sum of wins for that one. This query could be run as the first query in the attack:
Q1:
In this example, age, level, and winpoints all need to be floated because each encounters a filter (`GROUP BY` for age and winpoints, `WHERE` for level).

Because of the aggregation function `sum`, we need to float the `min`, `max`, and `count` for winpoints instead of the native values.

So the query made to the database would be something like this:
Here there are three noise layers (in addition to the `uid` layer): one with the `min/max/cnt` of `winpoints` concatenated together, and one each for age and level.

The second query in the attack might be this (the `WHERE` clause removed):

Q2:
Compared to Q1, Q2 is missing one layer (for level) and so noise layers will be different.
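To make the min/max/count floating in this example concrete, here is a small sketch (illustrative only; the function names and data are hypothetical, not the cloak's code). It computes the per-bucket floated triple for an aggregated column, and also shows the combination rule from earlier for nested aggregations: the min of the mins, the max of the maxes, and the sum of the counts.

```python
from collections import defaultdict

# Per-bucket (min, max, count) of the aggregated column.
def float_aggregates(rows, group_key, value_key):
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[group_key]].append(row[value_key])
    return {k: (min(v), max(v), len(v)) for k, v in buckets.items()}

# Next aggregation level: min of mins, max of maxes, sum of counts.
def combine(triples):
    mins, maxes, counts = zip(*triples)
    return (min(mins), max(maxes), sum(counts))

rows = [
    {"age": 20, "winpoints": 10}, {"age": 20, "winpoints": 40},
    {"age": 30, "winpoints": 25}, {"age": 30, "winpoints": 5},
]
per_bucket = float_aggregates(rows, "age", "winpoints")
# per_bucket == {20: (10, 40, 2), 30: (5, 25, 2)}
overall = combine(per_bucket.values())
# overall == (5, 40, 4)
```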
Example 4
This is just like the above, only the analyst has tossed in a little math.
The floated columns and resulting noise layers would be exactly as in the example above. The math is ignored for the purposes of floating.
Example 5:
In this query, there is no additional noise layer at all, because there is no `GROUP BY` or other filter.

Example 6:
This query has two noise layers, one for `winpoints*` and one for `losepoints*`. The winpoints and losepoints columns need to be floated, obviously.

Example 7:
This query has two noise layers, one for `winpoints` and one for `date`.

Note that the fact that an average is computed, or that losepoints is aggregated over, is irrelevant: losepoints is not used as a noise layer.
Example 8:
This query is taken from the Teambank samples.
Here `buchungsDatum*` and `buchungstext*` constitute the two extra noise layers, so we float these to the top:

The reason that the two raw columns are floated, rather than using the output of `left(cast())` and `extract_match()`, is that I want to avoid the situation where an analyst can make multiple buckets with the same uid-list, but with different values. If analysts could do this, then they could take multiple noise samples and remove the effect of the extra layers. It is not clear in these particular cases whether the analyst could do that or not, but in general we don't know, so it is safer to float the raw columns.

Example 9:
This query has three filters: `year_month` (acting on `buchungsDatum`), `betrag`, and `income_class` (acting ultimately on betrag). We need to float `buchungsDatum` and `betrag`, and pass them through aggregation functions because of `sum` and `median`. The resulting query to the database is this:

Here is where we have nested aggregation functions, so as described above, we take the `min` of the `min`s, the `max` of the `max`es, and the `sum` of the `count`s.

In total we have two extra noise layers, one consisting of the concatenation of the buchungsDatum `min`, `max`, and `count`, and one consisting of the concatenation of the betrag `min`, `max`, and `count`. Note finally that we should be doing SaD on the betrag `min/max` values.

Example 10:
This example demonstrates why we include the `*` in the column name for `count(*)`. Here, the query is designed to generate a `NULL` value when divide-by-zero occurs, and then the difference between `count(*)` and `count(col)` is used to detect whether or not a `NULL` value was generated:

Q1:
Q2:
Of course, as it now stands, math restrictions would prevent these queries, but we expect to fix that.
In these queries, only salary would be floated. (Though if I think about it, the mere fact that `bday` and `zipcode` are used but never encounter a filter is by itself suspicious and could be used as the basis for rejecting the query!)

Anyway, assuming that the query was allowed, then the noise layer in the first query would use `salary*` in the seed, and the second would use `salary`. This would result in different seeds, and therefore the analyst would not know if a difference in answer is due to the seed or to the presence or absence of the victim.
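The point of this example can be sketched directly (illustrative only; the seed function and floated values are hypothetical assumptions, not the cloak's actual scheme):

```python
import hashlib

# Illustrative sketch of Example 10: count(*) seeds with "salary*" while
# count(salary) seeds with "salary", so the two attack queries get
# independent noise even over the same rows.
def seed(column_string, floated):
    return hashlib.sha256((column_string + "|" + floated).encode("utf-8")).hexdigest()

floated = "30000|90000|118"    # hypothetical min|max|count of salary
q1 = seed("salary*", floated)  # count(*)
q2 = seed("salary", floated)   # count(salary)
assert q1 != q2  # different seeds, so the answers differ for reasons other than the victim
```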