obrok opened this issue 5 years ago
Hi @obrok. I'm missing something. Can you be more specific about how the attack works? (Among other things, I don't know where the N of `{N, infinity}` comes from...)
Well, let's say that 99 is the maximum int for simplicity. Then you issue this query:
SELECT COUNT(*), (winpoints + 50) % 1
FROM games
GROUP BY 2
You will get two rows, maybe like so:
count | %
-------+------
1000 | 0
1000 | NULL
Next you issue this query:
SELECT COUNT(*), (winpoints + 51) % 1
FROM games
GROUP BY 2
And you get something like this:
count | %
-------+------
970 | 0
1033 | NULL
The counts won't necessarily add up to the same total, because of noise. However, this way you have obtained the noisy count of `winpoints BETWEEN 0 AND 49` and the noisy count of `winpoints BETWEEN 0 AND 48` (the first row in both cases), which should be forbidden.
@yoid2000 I made a more concrete version of the attack against the actual implementation we have. Ignoring a bug I found, I'm still pretty convinced that one can do what I claimed.
We look at this query (currently allowed by the cloak):
SELECT COUNT(*), (value * 99999999999999999 + 999999999999999999 + 923372036854775881) % 1
FROM integers
GROUP BY 2
count | %
--------+---
751 | 0
108243 |
(2 rows)
The maximum long integer is `2^63 - 1 = 9223372036854775807`. So the count in the first row is effectively `COUNT(*) WHERE value <= 73`, because `73 * 99999999999999999 + 999999999999999999 + 923372036854775881 = 9223372036854775807 = 2^63 - 1`. The count in the second row is `COUNT(*) WHERE value > 73`. Both of these queries should be disallowed.
To compare, we can look at this query (note the last additive constant is one bigger):
SELECT COUNT(*), (value * 99999999999999999 + 999999999999999999 + 923372036854775882) % 1
FROM integers
GROUP BY 2
count | %
--------+---
701 | 0
115687 |
(2 rows)
In this case we have `COUNT(*) WHERE value < 73` and `COUNT(*) WHERE value >= 73`, because `73 * 99999999999999999 + 999999999999999999 + 923372036854775882 = 9223372036854775808 = 2^63`. This is equivalent to `COUNT(*) WHERE value <= 72` and `COUNT(*) WHERE value > 72`, because the underlying type is integer.
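The boundary arithmetic in the two queries can be checked directly in Python:

```python
# Verify the boundary arithmetic from the two queries above.
MAX_LONG = 2 ** 63 - 1   # 9223372036854775807

v1 = 73 * 99999999999999999 + 999999999999999999 + 923372036854775881
v2 = 73 * 99999999999999999 + 999999999999999999 + 923372036854775882

print(v1 == MAX_LONG)       # True: value = 73 is the largest non-overflowing input
print(v2 == MAX_LONG + 1)   # True: the +1 constant moves the cut-off down to 72
```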
I'm kinda worried that this invalidates the whole approach we just spent upwards of a month on.
@obrok what is the reason for having sums of large numbers instead of a single larger number? I gather that the larger number would be rejected somewhere in the pipeline? (Either by the cloak or by the DB?)
@yoid2000 Yes, there's a limit of `10^18` on numeric constants, including on constant expressions, so this is how the attack has to be built up at the moment. Note that `column + 123 + 123` is equivalent to `(column + 123) + 123` and not to `column + (123 + 123)`, so it's not treated as a constant expression (`123 + 123`) and is therefore allowed even if the sum of the constants exceeds the `10^18` limit.
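A minimal sketch of why left-association slips past such a limit (a hypothetical validator written for illustration, not the cloak's actual code): it folds and range-checks only fully-constant subtrees, so constants split across operations involving the column are never summed.

```python
# Hypothetical validator: folds and range-checks only fully-constant subtrees.
LIMIT = 10 ** 18
BIG = 999999999999999999          # individually within the limit

def fold(expr):
    """Evaluate a constant subtree; expr is an int or ('+', left, right)."""
    if isinstance(expr, int):
        return expr
    _, l, r = expr
    return fold(l) + fold(r)

def check(expr):
    """Return True if expr is fully constant; raise ValueError if a
    constant subtree exceeds the limit. 'col' marks the column reference."""
    if expr == 'col':
        return False
    if isinstance(expr, int):
        if abs(expr) > LIMIT:
            raise ValueError("constant out of range")
        return True
    _, l, r = expr
    lc, rc = check(l), check(r)
    if lc and rc and abs(fold(expr)) > LIMIT:
        raise ValueError("constant expression out of range")
    return lc and rc

# (col + BIG) + BIG: no constant-only subtree sums past the limit -> accepted.
check(('+', ('+', 'col', BIG), BIG))

# col + (BIG + BIG): the right subtree is constant-only and folds past the limit.
try:
    check(('+', 'col', ('+', BIG, BIG)))
except ValueError as e:
    print(e)
```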
I'm missing something. Is there some circumstance where `(column + 123) + 123` and `column + (123 + 123)` return different values?
No, math should work as expected :). However:
SELECT (value + 999999999999999999) + 999999999999999999 FROM integers;
is allowed by the cloak, while
SELECT value + (999999999999999999 + 999999999999999999) FROM integers;
gives
ERROR: Constant expression is out of valid range: numeric values have to be inside the interval [-10^18, 10^18].
The error was detected at line 1, column 36:
1: SELECT value + (999999999999999999 + 999999999999999999) FROM integers;
Is that an inconsistency in the cloak that should be fixed? I'm still missing the importance of why you mention the difference in execution order....
> Is that an inconsistency in the cloak that should be fixed? I'm still missing the importance of why you mention the difference in execution order....
It's just a sidenote, ignore it. It's in case you encountered this case while playing with the attack.
This looks to me like a somewhat different issue and not one invalidating the overflow protection changes. If I understand correctly, the problem here is that using this trick, one can bypass the range restrictions, right? Basically, you can issue queries that separate data into unbounded and unsnapped ranges.
I think there are similar tricks possible with other math functions, `sqrt` for example:
SELECT sqrt(column - 5), count(*) FROM table GROUP BY 1
All the rows with `NULL` values in `column`, or with values less than 5, will be grouped in the same bucket.
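A quick simulation of that grouping (assuming a safe `sqrt` variant that returns `NULL` for `NULL` or negative input):

```python
import math
from collections import Counter

def safe_sqrt(x):
    """Assumed safe variant: None (SQL NULL) for NULL or negative input."""
    return None if x is None or x < 0 else math.sqrt(x)

# GROUP BY sqrt(column - 5): NULLs and every value below 5 share one bucket.
rows = [None, 1, 3, 4, 5, 9, 14, 30]
buckets = Counter(safe_sqrt(None if v is None else v - 5) for v in rows)
print(buckets[None])   # 4: the rows None, 1, 3, 4 all collapse into the NULL bucket
```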
We already have some restrictions for grouping by implied ranges. Maybe we should review them and modify them so these queries are not allowed anymore? For example, we can forbid grouping by any value that can overflow or underflow. Or we can snap the constants used in a grouping expression. Or restrict the amount of math possible in grouping expressions.
@cristianberneanu I understood @obrok's attack to be general to the case where safe functions return NULL, while I understand what you just wrote to be specific to columns that have NULL values. If so, then these are two different things and I'm not sure we should worry about the latter...
No, they are identical. `sqrt(x)` returns `NULL` when `x` is `NULL` or negative. `(x + 9999999999999999999) % 1` returns `NULL` when `x` is `NULL` or the operation overflows.
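The equivalence can be made explicit in a small sketch (assumed semantics for the safe variants, not the cloak's actual implementation):

```python
import math

MAX_LONG = 2 ** 63 - 1

def safe_sqrt(x):
    # NULL for NULL input or a domain error (negative argument)
    return None if x is None or x < 0 else math.sqrt(x)

def safe_add_mod1(x, c):
    # NULL for NULL input or a range error (64-bit overflow)
    return None if x is None or x + c > MAX_LONG else (x + c) % 1

# Both discontinuities surface as the same value: NULL.
print(safe_sqrt(-1))                # None (domain edge)
print(safe_add_mod1(1, MAX_LONG))   # None (overflow edge)
print(safe_add_mod1(1, 41))         # 0   (well inside the range)
```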
@cristianberneanu ok thanks for the explanation.
Hmmmmm .... let me think about it .... sheesh, this never ends :(
So it seems to me that a safe function can be categorized as a discontinuous function. If we simply classify it that way, and then apply the same restrictions to safe functions that we apply to discontinuous functions, does that take care of it?
> So it seems to me that a safe function can be categorized as a discontinuous function. If we simply classify it that way, and then apply the same restrictions to safe functions that we apply to discontinuous functions, does that take care of it?
I think @obrok can properly answer this, as he has the most experience with the math restrictions, but it seems reasonable to me, since a discontinuity in the output of the expression is what causes the problem.
> So it seems to me that a safe function can be categorized as a discontinuous function. If we simply classify it that way, and then apply the same restrictions to safe functions that we apply to discontinuous functions, does that take care of it?
Did a quick check of this, doesn't seem to help (with the restrictions as-is, just adding the classification). Details:

- There is no `discontinuous` function category, but we have a `restricted` category; functions like `%`, `round`, `trunc`, etc. are in this category.
- I added the `restricted` category to all the math operators.
- The only restriction from the `restricted` category that would matter here is that an expression can have at most 5 restricted operations in it.
- That is not enough to prevent `(column + constant) % 1`. If `column` is big enough so that `(column * constant) % 1` or `(column ^ constant) % 1` works here, then at least some bad ranges can be extracted.

Furthermore, it feels to me like the cap on constant size of `10^18` is a bit arbitrary, and something we would like to remove at some point, so I can see us accidentally enabling even the full `(column + constant) % 1` in the future.

Ok. Any thoughts on how to address this?
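To illustrate the gap, here is a sketch (using a hypothetical expression representation, not the cloak's) that counts restricted operations in the attack expression:

```python
# Expressions are leaves (column name or int) or (op, arg, ...) tuples.
RESTRICTED = {'%', '/', 'round', 'trunc'}

def count_restricted(expr):
    """Count restricted operations anywhere in the expression tree."""
    if not isinstance(expr, tuple):
        return 0
    op, *args = expr
    return int(op in RESTRICTED) + sum(count_restricted(a) for a in args)

# (value * c1 + c2 + c3) % 1 -- the shape of the attack -- contains a single
# restricted operation, so a cap of 5 never triggers.
attack = ('%', ('+', ('+', ('*', 'value', 99999999999999999),
                      999999999999999999), 923372036854775881), 1)
print(count_restricted(attack))   # 1
```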
> Any thoughts on how to address this?
None yet, but allow me to insert a broader rant into this discussion, because I think it is relevant. Note that the rant is not angry, I'm just wrestling with what a solution might look like and this is where my thoughts go.
Over and over we bump into the problem that we basically want to give the user a "pixelated" or "snapped" view of the data, so that exact values cannot be observed, but if there are enough distinct users in a bucket, then that bucket can be observed. However, we're continuously trying to achieve this "from the top". What I mean by that is that all the operations in our pipeline (like math, let's say) act on the raw data, but we try to impose limits on how the pipeline is constructed so that somehow the result is appropriately bucketized. The opposite approach would be "bottom up": bucketize the raw data first, and only then apply any operations, so that the results are guaranteed to be bucketized at every stage.
I realize I've already raised this. I forgot the name of the anonymization technique that basically only does this, but I remember you pointed out that it's out there, it's bad, and it's not what we're doing. Nevertheless, it seems to me that this is the root of the problem we're facing in this issue and a couple similar ones, and unless we directly address that, there will always be some hole, we just won't always be clever enough to find it.
First regarding the rant, you are of course correct: if we could figure out how to produce bucketized raw data first, then I guess we could just publish the bucketized data and the analyst could do whatever they want with it. But I don't know how to even approach this.
Regarding this issue, I assume that we need to do something along the lines of what @cristianberneanu suggested: we identify when safe functions are required, and then put additional restrictions to prevent their use in attacks.
My general understanding was that the use of safe functions is relatively rare, at least for the overflow case? In other words, we only use safe functions when we think there is some danger of overflow?
As for `sqrt`, wasn't it the case before that we simply disallowed any math within the `sqrt`? Something like that?
> As for `sqrt`, wasn't it the case before that we simply disallowed any math within the `sqrt`? Something like that?
Doesn't seem we have any such restriction anymore. For example this is allowed by the cloak (but currently crashes due to a bug in overflow protection):
SELECT COUNT(*), sqrt(age * 99999999999999999 * 2 % 1) + 1
FROM demographics
GROUP BY 2
Which means, so far as I know, that we used the new safe feature to give us more utility on `sqrt`, but that it's backfired and we need to roll back to the earlier restrictions.
Ok, so we used to have a "potentially crashing" category for `/`, `%`, and `sqrt`, where no math would be allowed in the arguments to such operations (just the second argument in the case of `/` and `%`). However, if this restriction were to be applied to all the mathematical operators, then it would basically mean a limit of one mathematical operator per expression. Something like `column * 2 + 1` would be rejected.
I guess we could apply this restriction only if the analysis determines the expression is unsafe. Although "only" is a big word here, difficult to say how often that would happen in practical applications.
In any case I don't think it makes sense to go back to the restriction on `sqrt` et al. unless we entirely scrap the bound analysis mechanism. My reasoning is that if bound analysis (with some extension we make) cannot prevent this case, then it's useless.
My understanding of how things proceeded here is as follows:

1. We had "potentially crashing" restrictions on `sqrt`, `/`, and `%`.
2. We implemented bound analysis and safe variants to protect against overflow.
3. We used the safe variants to relax some restrictions.

This third step, however, is strictly speaking unnecessary for the purpose of protecting against overflow. We could just as well have kept the "potentially crashing" restrictions. This doesn't mean that the bound analysis is "useless" per se, because it can still be used against certain types of overflow attacks (those that use multiplication or exponents, etc.).
Anyway, now we find that the safe variants can be attacked. So we have the following choices:

1. Go back to the old "potentially crashing" restrictions.
2. Apply those restrictions selectively, only when bound analysis determines that an expression is unsafe.
3. Design some new mechanism to prevent the attack.
You guys know better than me, but my guess/hope would be that doing number 2 would not be too hard, since we already have the crashing restrictions code and we just need to apply it selectively. But maybe testing becomes a bitch or something...I don't know.
Yapee you must have some intuition as to how frequently the restrictions would have to be applied, even understanding that that intuition could be way off.
In any event, at the moment I'd much rather do 1 than 3, because 1 simply restores the old status quo, and 3 will take time and more effort when we really need to move on to the Telefonica-requested features. I'd rather do 2 than 1 assuming that applying the restrictions would be relatively rare. The main downside I see is that the restrictions become even more confusing because sometimes they apply and sometimes they don't...
Thoughts?
> Thoughts?
The overall approach of going back to the restrictions and relaxing them when bound analysis indicates it's OK seems reasonable enough...
However, it doesn't seem that the restrictions as they were before removal would prevent this particular attack. This is because the expression used is of the form `(a * b + c) % 1`, while the restriction only rejects expressions where the divisor contains both a constant and a column. We would probably need to tighten the restriction to disallow any math in both arguments of `/` and `%`, but that seems very oppressive.
Perhaps the problem is that both integer division and modulo are in fact implicit ranges, because they create buckets in a non-continuous manner?
Maybe the rule should be that no discontinuous operations can be nested in other discontinuous operations, and the safe variants are discontinuous? So `a % b % c` would always be rejected, `a * b / c` would be rejected if the `/` divides integers and the `*` can overflow, etc.
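An unconditional version of that rule can be sketched as follows (hypothetical AST representation; the integer-division and can-overflow conditions from the rule are left out for brevity):

```python
# Assume the safe math variants are classified as discontinuous too.
DISCONTINUOUS = {'%', '/'}

def nests_discontinuous(expr, inside=False):
    """True if a discontinuous op occurs inside another discontinuous op.
    expr is a leaf (str/int) or an (op, arg, ...) tuple."""
    if not isinstance(expr, tuple):
        return False
    op, *args = expr
    if op in DISCONTINUOUS and inside:
        return True
    return any(nests_discontinuous(a, inside or op in DISCONTINUOUS) for a in args)

print(nests_discontinuous(('%', ('%', 'a', 'b'), 'c')))   # True:  a % b % c rejected
print(nests_discontinuous(('+', ('*', 'a', 2), 1)))       # False: a * 2 + 1 allowed
```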
> Perhaps the problem is that both integer division and modulo are in fact implicit ranges, because they create buckets in a non-continuous manner?
My understanding is somewhat similar: at the core of the issue is the bypass of range restrictions (since groups and ranges are conceptually similar) using discontinuous functions, something against which we already have partial protection, since we don't allow functions like `trunc` or `floor` either in filters or in groups. Until now, we ignored the fact that math operators are discontinuous at the boundaries of the input/output space. Before, the discontinuity manifested as a crash; now it manifests as a `NULL` output, since the result is either not defined or out of range.
> Maybe the rule should be that no discontinuous operations can be nested in other discontinuous operations and the safe variants are discontinuous? So `a % b % c` would always be rejected, `a * b / c` would be rejected if the `/` divides integers and the `*` can overflow, etc.
It looks to me like the attack can be implemented with a single discontinuous function (i.e. `sqrt(column - constant)`). Also, not all inputs can result in discontinuities: `1/x` or `1%x` is discontinuous when `x <= 0`, while `x/10` and `x%10` are never discontinuous.
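That observation suggests an interval-based check on the divisor; a minimal sketch (assumed bounds representation, and covering only the division-by-zero case):

```python
def divisor_can_be_zero(lo, hi):
    """A division or modulo is discontinuous if the divisor's
    interval [lo, hi] can contain zero."""
    return lo <= 0 <= hi

print(divisor_can_be_zero(-10**6, 10**6))   # True:  1/x with x ranging over 0
print(divisor_can_be_zero(10, 10))          # False: x/10 is never discontinuous
```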
Some potential approaches:

- We could allow the query, but then censor the buckets that contain discontinuities: for example, in the case of `sqrt(x)` the `NULL` bucket would always be censored.
- We could reject expressions with discontinuities in filters and groups, unless we can guarantee that the input range leads to continuous output. We could request additional filters to be added, or we could add the extra filters automatically.
Hi, back from holiday.
> We could allow the query, but then censor the buckets that contain discontinuities: for example, in the case of `sqrt(x)` the `NULL` bucket would always be censored.
This is a really interesting idea. The way safe functions are currently implemented, can we distinguish between an output that is `NULL` because the underlying data is `NULL`, versus because the function triggered the safety output?
> This is a really interesting idea. The way safe functions are currently implemented, can we distinguish between an output that is `NULL` because the underlying data is `NULL`, versus because the function triggered the safety output?
No, we can't. The only special value in a database column is `NULL`; we don't have different placeholders for raised exceptions, etc.
Ok, but presumably we would filter NULL output only when safe functions are used, and not filter otherwise?
Yes, either we censor the NULL bucket if a safe function is aggregated over, or we need to add extra filters that exclude data that might raise exceptions. I don't have the details figured out.
> We could allow the query, but then censor the buckets that contain discontinuities: for example, in the case of `sqrt(x)` the `NULL` bucket would always be censored.
I don't think this works against the original attack described. Notice that both buckets in the attack description are problematic: the one with `NULL` and the one with `0`.
While thinking about the implementation of the `safe_*` functions that will return `NULL` instead of overflowing, I started thinking about the following attack sketch:

Note that the noise in this function doesn't depend on the constant `9999999999999999999`. Note also that this is currently disallowed on a technicality (constants above `10^18` are disallowed); however, I'm assuming there is some way to work around that, perhaps by simply using `*` instead of `+`, depending on the data. Now let's assume we have the `safe_mul` function implemented, so that the raw data from the database will have 0 if `winpoints + large_constant` is below the overflow bound and `NULL` otherwise. It seems to me that one can run this query, varying the large constant by 1, and, by looking at the proportion of `NULL` to `0` in the result, get counts for all ranges of `winpoints` of the form `{N, infinity}`, which should be prohibited by the range restrictions.

@yoid2000 WDYT?