Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform.

Report error in returned results #267

Closed · sebastian closed this issue 8 years ago

sebastian commented 8 years ago

We should have a secondary, out-of-band channel on which we can report error information relevant to the user's query.

This is something that can be ignored by the user when they only want pure SQL-style results.

We should report an approximate estimate of the error of the returned results. The errors might differ on a row-by-row basis, so in the general info box we'll just have to produce more generic error bounds.

In a subsequent iteration I would like it to be possible to select the error as well. Exactly how this would look is up for discussion.

I see two distinct options for inline error reporting. I'll use the avg function as a placeholder for the examples.

The separate error function would look like this:

SELECT item, avg(*), avg_err(*) FROM purchases GROUP BY item
item | avg  | error
-----+------+------
ice  |  100 |  5.2
shoe | 2400 | 10.6
...

The error as part of the result would look like this:

SELECT item, avg_with_err(*) FROM purchases GROUP BY item
item | avg+err
-----+--------------
ice  | (100, 5.2)
shoe | (2400, 10.6)
...

The former is potentially cleaner and produces data that is easier to work with. The latter is potentially less ambiguous, as there is no doubt about which value the error relates to.


https://trello.com/c/iknVuOfr/6893-report-error-in-returned-results

cristianberneanu commented 8 years ago

The only aggregate case in which we can reliably compute error estimates (the standard deviation) is the count. For that, I suggest we have a call that results in either the SD or the (value, SD) tuple. Something like COUNT_SD(*) resulting in either 2.5 or (1000, 2.5).

For the other aggregates I don't think we can produce reliable error estimates, as we post-process already-anonymized values, which makes the final result unpredictable.

sebastian commented 8 years ago

That is unfortunately my feeling as well. Even so, I am hoping we can get some feel for the error, given a stable implementation. Of course this isn't trivial, but maybe we can still get an upper bound that is somewhat useful.

We should definitely give this additional thought.

cristianberneanu commented 8 years ago

It would be a lot easier if we had a custom anonymization method for each aggregate function instead of doing post-processing on anonymized values.

sebastian commented 8 years ago

Yes, absolutely. Since we control the whole chain now (both pre- and post-processing), there is in theory nothing preventing us from doing exactly that. But we have to be very careful to avoid strange interactions between anonymization steps that somehow leak state.

sebastian commented 8 years ago

We are now in fact performing custom anonymization. While this gives us more control, I am not sure to what extent we will be able to produce tight mathematical bounds (since the error is so data-dependent).

See note in: https://github.com/Aircloak/aircloak/issues/302

cristianberneanu commented 8 years ago

@sebastian We can create a separate function for getting the error, but it might be more usable if we always pack the estimated error with the anonymized value.

Then, in the UI, we can either show both or put the error in a tooltip or just change the color of the aggregator to show the amount of trust in the value (green for low errors, yellow for large errors and so on).

sebastian commented 8 years ago

I like the idea of showing errors by default in systems we control, but for systems that expect the output to be normal SQL output, we need some way of making the error value available on demand only.

cristianberneanu commented 8 years ago

> I like the idea of showing errors by default in systems we control, but for systems that expect the output to be normal SQL output, we need some way of making the error value available on demand only.

OK, in that case we need separate functions.

@yoid2000 The generic way to implement this would be to compute the real value and then compare it with the anonymized value to get the difference, but that will probably reveal too much information.

For sums and counts, I am thinking of computing the difference between the outlier values and the average of the remaining top values, and from that computing the maximum possible error, taking into consideration the maximum effect of the noise. To get absolute guarantees, we have to limit the noise to some boundaries (for example, 5 SD). Something similar would work for min and max (but without the noise).

Then I will put the maximum value of the error in some predefined bins, like: <0.1%, <1%, <5% or >5%. Do you think this method is ok?
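As a minimal illustration of this binning, here is a sketch in Python (the function name and the fractional thresholds are my rendering of the bins above, not code from the cloak):

def error_bin(max_error, value):
    # Place the maximum possible error, relative to the anonymized
    # value, into one of the predefined bins.
    relative = abs(max_error) / abs(value)
    if relative < 0.001:
        return "<0.1%"
    if relative < 0.01:
        return "<1%"
    if relative < 0.05:
        return "<5%"
    return ">5%"

print(error_bin(5.2, 1000))  # 0.52% of the value -> "<1%"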

I have no idea yet about how to tackle the rest of the functions, like avg, stddev and median.

yoid2000 commented 8 years ago

> For sums and counts, I am thinking of computing the difference between the outlier values and the average of the remaining top values, and from that computing the maximum possible error, taking into consideration the maximum effect of the noise. To get absolute guarantees, we have to limit the noise to some boundaries (for example, 5 SD). Something similar would work for min and max (but without the noise).

An alarm goes off in my head when I see any computation based on outliers. We've thrown those away for a reason, and by bringing them back into the picture, there may be attacks that become possible. For instance, in some corner cases, the presence or absence of an outlier might make the difference between reporting <0.1% and <1%.

@sebastian Is it possible to put work on this error reporting on hold for the time being? This is a hard problem for a non-critical requirement (important, I know, but non-critical). I would like to give this some thought, but I'm still taken up with the CNIL anonymity stuff...

sebastian commented 8 years ago

Absolutely possible (and fine to put on ice). I'll move it to another milestone.

cristianberneanu commented 8 years ago

> An alarm goes off in my head when I see any computation based on outliers. We've thrown those away for a reason, and by bringing them back into the picture, there may be attacks that become possible. For instance, in some corner cases, the presence or absence of an outlier might make the difference between reporting <0.1% and <1%.

Yes, that is why I wanted to check with you. But the number of outliers is noisy. We can also use the absolute value of the noise (instead of the maximum possible value) to compute the error bounds. That would reduce the amount of information leaked.

> Is it possible to put work on this error reporting on hold for the time being?

OK, I will find something else to work on in the meantime.

yoid2000 commented 8 years ago

It occurs to me that there is after all a way to provide the analyst with an indication of how much noise has been added. This is based on the IIOTTTAAHCHFOWAQ principle (it-is-ok-to-tell-the-analyst-anything-he-could-have-figured-out-with-another-query), which is how we are dealing with min/max now, thanks to @seb.

The key observation is that for both sum() and count(), the noise is based on the average value (after outliers are removed). Since the analyst can compute avg() anyway, then as long as the analyst knows how the noise is computed, they can work out (roughly) how much noise was added. And of course we don't hide how we compute noise.

So here is what I propose:


For sum() and count()

In both of these cases, the noise value is based on the average sum or count respectively, after outliers have been removed. So let's release a rounded version of this noise value. I think the resulting number should be pretty accurate, so I suggest the following rounding algorithm, which keeps us within around 5% to 10% accuracy:

Compute X = noise_value * 0.05

Find the next higher value in the following sequence of values:

0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, ...

Use that to round.

Example: noise_value = 3.7. Then X = 0.185. The next higher value in the sequence is 0.2. So we round to the nearest multiple of 0.2 (i.e. 3.7 rounds to 3.8).

Note, 3.8 is the value we release to the analyst. 3.7 is the actual noise value used.

Call the released noise value for sum() the rounded_sum_noise.
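To make the rounding concrete, here is a minimal Python sketch of the scheme above (the function names are mine, and this is an illustration rather than the cloak's implementation):

import math
from decimal import Decimal, ROUND_HALF_UP

def rounding_step(noise_value):
    # Smallest value in the 0.1, 0.2, 0.5, 1, 2, 5, ... sequence
    # that is >= X = 0.05 * noise_value.
    x = noise_value * Decimal("0.05")
    exponent = math.floor(math.log10(x))  # decade containing X
    for mantissa in (1, 2, 5, 10):
        step = Decimal(mantissa) * (Decimal(10) ** exponent)
        if step >= x:
            return step

def rounded_noise(noise_value):
    # Round the actual noise value to the nearest multiple of the step.
    value = Decimal(str(noise_value))
    step = rounding_step(value)
    multiples = (value / step).quantize(Decimal(1), rounding=ROUND_HALF_UP)
    return multiples * step

print(rounded_noise(3.7))  # X = 0.185 -> step 0.2 -> releases 3.8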


For avg()

avg() is computed as sum() / count(). So the noise value should be:

rounded_sum_noise / count()
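For example, if rounded_sum_noise is 3.8 and count() returns 1000, the released noise value for avg() would be 3.8 / 1000 = 0.0038 (numbers illustrative).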


For stddev()

stddev() is computed as the anonymized average of the real variances, so we can re-use the computation for avg() above (i.e. rounded_sum_noise / count()).


For min(), max(), and median()

There is no noise value per se here, so I don't think we can release anything. As long as the analyst knows how these are computed, he/she can understand that the inaccuracy might be arbitrarily high, depending on the outliers or the distribution of values around the true median.

cristianberneanu commented 8 years ago

@yoid2000 I have some issues with this approach:

> The key observation is that for both sum() and count(), the noise is based on the average value (after outliers are removed). Since the analyst can compute avg() anyway, then as long as the analyst knows how the noise is computed, they can work out (roughly) how much noise was added. And of course we don't hide how we compute noise.

I don't see how the analyst can know how much noise was added. The noise is random and it is scaled by the average over an unknown number of users. The computed avg() is also noisy, so you can't use it to reliably estimate the error.

> In both of these cases, the noise value is based on the average sum or count respectively, after outliers have been removed. So let's release a rounded version of this noise value.

Anyway, the noise is only a small part of the final error, and one for which we already have a pretty good bound (the SD is fixed). Most of the time, especially for small datasets, the final error should be dominated by the values of the removed outliers.

Without taking the outliers into consideration, we can't produce guaranteed error bounds. Without that, I don't see the value in this feature.

yoid2000 commented 8 years ago

What I'm trying to reveal here is roughly what standard deviation was used to compute the noise, that's all. But that is a very useful thing. In getting ready to do a demo earlier in the week, I was comparing the results of cloak queries with raw queries. Sometimes it was clear that the noise had SD=2, but sometimes it was clearly somewhat more. This all depends on what the average value is across users. So I really felt that, as an analyst, the SD of the noise is something I want to know.

So I realized that the average would make for a good estimate of the noise I was seeing, and indeed in this case the noisy average was very close to the true average, and a good indication of SD.

Let me give an example.

Say I do this query on cloak and raw db:

SELECT age, count(*) FROM (SELECT datediff(year, date_of_birth, getdate()) AS age FROM patients) t GROUP BY age ORDER BY age ASC

I get these results for the cloak:

[image: cloak results for the age counts]

And these for the raw db:

[image: raw db results for the age counts]

I see that the results are pretty close, so I'm pretty sure we have SD=2 here.

And indeed if I look at the average number of rows per user with this query:

SELECT avg(num_rows) FROM (SELECT count(date_of_birth) AS num_rows FROM patients GROUP BY id) t

Then I get an answer of exactly 1 from both cloak and raw db.

Now, say I look at the home city of patients:

SELECT address_city, count(*) FROM (SELECT p.address3 AS address_city FROM pt_addresses p) t GROUP BY address_city ORDER BY count(*) DESC

For the cloak I get this:

[image: cloak results for the city counts]

And for the raw db:

[image: raw db results for the city counts]

Now if I compare answers, I see that the values are off by substantially more. If I look at the average number of rows per user, say for Toronto, I see that the average is more than one, so I can deduce that the noise level is higher:

SELECT avg(num_rows) FROM (SELECT count(p.address3) as num_rows FROM pt_addresses p WHERE p.address3 = 'Toronto' GROUP BY patient_id) t

For the cloak I get 2.481, and for the raw db, 2.523
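Assuming the noise SD scales linearly with this per-user average, that would put the noise here at roughly 2.5 * 2 = 5, rather than the baseline SD=2.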

Note that the averages for the two are pretty close, so indeed the cloak average is fairly accurate (in this case), and so my estimate of the noise is probably pretty good. But the analyst should certainly not have to do this check every time, and there are cases (like this one) where it would not occur to the analyst to bother making the check, because the analyst might not expect the same patient to have multiple addresses in the table.

I fully understand that we cannot possibly guarantee error bounds (even without outliers, a normal distribution has no hard bounds). But the analyst can know this, and understand that what he is getting is not an error bound, but rather how much noise was added by the cloak after outliers are removed. This is very important information for the analyst.

cristianberneanu commented 8 years ago

OK, I will provide the noise estimation with the result of the aggregation functions.

How would you prefer this information to be delivered to the analyst? If I return this value directly from the count, sum, avg or stddev functions, I will change their semantics and lose compatibility with the normal SQL aggregators.

I can provide the estimate via a side channel so it can be shown in the user interface, but that means it won't be available from external tools. Another approach would be to add separate functions like count_with_noise, sum_with_noise, ....