Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform.

No-uid approach for simple queries #2807

Closed sebastian closed 5 years ago

sebastian commented 6 years ago

We should experiment with supporting a no-uid approach to noise generation for simple queries that do not require probing or floating, i.e. those where we can generate the necessary state without much hassle.

@yoid2000 has provided statistical means of generating the noise value without the user ids. Paul, can you please expand?

Prime initial candidates are:

yoid2000 commented 6 years ago

I'm still playing around with understanding noise generation (both for the current uid approach as well as the no-uid approach). I'm separately documenting exactly how noise and flattening will be done. There is more work to be done in coming up with good no-uid noise generation, but what I have now looks satisfactory, including for min and max. We can improve more later.

From what I have so far, in lieu of uids, it looks like we'll need to collect min, max, avg, stddev, and count. These stats are computed over a table that contains one row per distinct UID. So in other words, no-uid queries require an inner select that does a group by on uid, and then computes the stats on the resulting table. For now, I'll just assume that there is some function that takes these stats as input and computes noise and flattening.

Besides this, we also need to compute the number of distinct UIDs. This is both to determine if we should do LCF, and to seed the noise function (to seed, we'll use distinct count instead of a hash of UIDs).

The function for computing noise can be found at #3013.

I don't have anything for median right now. For average and standard deviation, we can derive them from noisy sum and count as we currently do (though I plan to explore computing them directly).

In what follows, I'll generate individual comments for the specific simple queries that we may want to allow soon.

yoid2000 commented 6 years ago

Note that the uid and nouid approaches generate different noise values. As long as we are doing nouid for some queries, and uid for the rest of the queries, it will almost certainly be possible for an analyst to generate two samples for the same query. My inclination is to not do anything about this, in part because in general attacks are not generated with simple queries, so getting an extra noise sample for the queries we allow won't be so bad.

yoid2000 commented 6 years ago

This comment looks at simple queries using count(), sum(), min() and max() where there is no sub-query and no conditions.

These look relatively easy to implement, and since they would otherwise involve transferring all rows to the cloak, they are a big win as well. Hopefully the examples below are self-explanatory with regard to how to implement them.


SELECT count(*)
FROM accounts

The above analyst query becomes:

SELECT sum(cnt),                               -- the true answer
       count(DISTINCT client_id) AS duids,                     -- used for LCF and seed
       count(cnt), max(cnt), min(cnt), avg(cnt), stddev(cnt)   -- "stats" used to compute noise
FROM
   (SELECT client_id, count(*) AS cnt
    FROM accounts
    GROUP BY client_id) t

The functions are called like this:

baseNoise = computeNoiseFromSeed(seed_materials, duids)
suppressDecision = computeLCF(baseNoise, duids)
(noise, flatten) = computeNoiseForSum(baseNoise, stats)

Note that the role of the sub-query is to get the per-user contribution (in this case, the number of rows).
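
To make the flow concrete, here is a minimal Python sketch of how the cloak might consume the single row returned by the rewritten query. The function bodies are stand-ins I invented for illustration; the actual noise and flattening computation is specified in #3013, and only the call shape follows the pseudocode above.

import hashlib
import random
from typing import NamedTuple

class Stats(NamedTuple):
    count: int      # count(cnt): number of distinct users contributing
    max: float      # max(cnt)
    min: float      # min(cnt)
    avg: float      # avg(cnt)
    stddev: float   # stddev(cnt)

def compute_noise_from_seed(seed_materials: bytes, duids: int) -> random.Random:
    # Seed a PRNG from the static seed materials plus count(distinct uid),
    # which stands in for the hash of the uid list used today.
    digest = hashlib.sha256(seed_materials + str(duids).encode()).digest()
    return random.Random(digest)

def compute_lcf(rng: random.Random, duids: int, threshold: float = 4.0) -> bool:
    # Suppress the bucket if the (noisy) distinct-user count is too low.
    return duids + rng.gauss(0, 1) < threshold

def compute_noise_for_sum(rng: random.Random, stats: Stats) -> tuple:
    # Toy scaling only: noise proportional to the largest per-user
    # contribution, no flattening. The real rules live in #3013.
    noise = rng.gauss(0, max(abs(stats.max), abs(stats.min)))
    return noise, 0.0

# One result row of the rewritten query: sum(cnt), duids, and the stats.
true_sum, duids = 1021, 498
stats = Stats(count=498, max=12, min=1, avg=2.05, stddev=1.3)

rng = compute_noise_from_seed(b"accounts;count(*)", duids)
if not compute_lcf(rng, duids):
    noise, flatten = compute_noise_for_sum(rng, stats)
    print(round(true_sum - flatten + noise))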


SELECT count(DISTINCT client_id)
FROM accounts

The above query could become:

SELECT sum(cnt),                               -- the true answer
       count(DISTINCT client_id) AS duids,                     -- used for LCF and seed
       count(cnt), max(cnt), min(cnt), avg(cnt), stddev(cnt)   -- used to compute noise
FROM
   (SELECT client_id, count(DISTINCT client_id) AS cnt
    FROM accounts
    GROUP BY client_id) t

Note this is just like the above, but in the sub-query we substitute count(DISTINCT client_id) for count(*).

Because we know that the per-user contribution is 1, however, we could just submit the original analyst query, and then assume the following:

duids = count(DISTINCT client_id)
count(cnt) = count(DISTINCT client_id)
max(cnt) = min(cnt) = avg(cnt) = 1
stddev(cnt) = 0

Since counting distinct UIDs is probably pretty common, it might make sense to optimize for the case of count(DISTINCT uid).
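
A sketch of that shortcut: run the analyst's query unchanged and synthesize the statistics the noise function expects from the single returned count (the helper is hypothetical):

def stats_for_distinct_uid_count(distinct_uids: int) -> dict:
    return {
        "duids": distinct_uids,        # used for LCF and seed
        "count": distinct_uids,        # count(cnt)
        "max": 1, "min": 1, "avg": 1,  # per-user contribution is always 1
        "stddev": 0,
    }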


SELECT sum(amount)
FROM transactions

Above analyst query becomes:

SELECT sum(sm),                                -- the true answer
       count(DISTINCT client_id) AS duids,                 -- used for LCF and seed
       count(sm), max(sm), min(sm), avg(sm), stddev(sm)    -- used to compute noise
FROM
   (SELECT client_id, sum(amount) AS sm
    FROM transactions
    GROUP BY client_id) t

Note that the behavior here is exactly as with SELECT count(*), except that in the sub-query, the per-user contribution is the sum of values, not the count of rows.


SELECT max(amount)
FROM transactions

Above analyst query becomes:

SELECT count(DISTINCT client_id) AS duids,                 -- used for LCF and seed
       count(mx), max(mx), min(mx), avg(mx), stddev(mx)    -- used to compute noise
FROM
   (SELECT client_id, max(amount) AS mx
    FROM transactions
    GROUP BY client_id) t

Note first that the sub-query computes the per-uid max (rather than sum or count as in the above examples). Note second that the true answer can be found in max(mx), so it isn't necessary to separately ask for it.

The reported max would be computed with the function max = computeMax(baseNoise, column_stats).


SELECT min(amount)
FROM transactions

Above analyst query becomes:

SELECT count(DISTINCT client_id) AS duids,                 -- used for LCF and seed
       count(mn), max(mn), min(mn), avg(mn), stddev(mn)    -- used to compute noise
FROM
   (SELECT client_id, min(amount) AS mn
    FROM transactions
    GROUP BY client_id) t

Note that for min, we just substitute min for max in the sub-query.

The reported min would be computed with the function min = computeMin(baseNoise, column_stats).


yoid2000 commented 6 years ago

This comment looks at simple histograms of count(), sum(), min(), and max() with no sub-queries, no conditions, and only a single reported column. This looks very doable.


SELECT frequency, count(*)
FROM accounts
GROUP BY 1

The above analyst query becomes:

SELECT frequency, sum(cnt),                                  -- the true answer
       count(distinct client_id) AS duids,                   -- used for LCF and seed
       count(cnt), max(cnt), min(cnt), avg(cnt), stddev(cnt) -- used to compute noise
FROM
   (SELECT client_id, frequency, count(*) AS cnt
    FROM accounts
    GROUP BY 1, 2) t
GROUP BY 1

This is very similar to the simple queries of the last comment. In terms of the query construction, the only difference is that the column being displayed (here frequency) is included in the sub-query GROUP BY and, of course, in the outer select.

In terms of seeding, the value returned is of course used as one of the seed components.

In terms of LCF, duids can be used as normal.


In the case of SELECT col, count(DISTINCT uid), a similar simplification as above can be made.


SELECT acct_district_id, sum(amount)
FROM transactions
GROUP BY 1

The above analyst query becomes:

SELECT acct_district_id, sum(sm),                      -- the true answer
       count(DISTINCT client_id) AS duids,                 -- used for LCF and seed
       count(sm), max(sm), min(sm), avg(sm), stddev(sm)    -- used to compute noise
FROM
   (SELECT client_id, acct_district_id, sum(amount) AS sm
    FROM transactions
    GROUP BY 1,2) t
GROUP BY 1

So in other words, the construction is very similar to the above simple case.


Same idea for min and max.


yoid2000 commented 6 years ago

This comment also looks at simple histograms of count(), sum(), min(), and max() with no sub-queries and no conditions, but with multiple reported columns.

Unfortunately, the histogram with a single column doesn't generalize to multiple columns. This is because of the way we currently do LCF, where we can suppress some but not all column values. As a result, the queries get a bit more complex, but only a little. I think we can implement multi-column histograms sooner rather than later.

SELECT col1, col2, count(*)
FROM accounts
GROUP BY 1,2

So, for instance, suppose we were to take the above analyst query and change it to this:

SELECT col1, col2, sum(cnt),                                 -- the true answer
       count(distinct client_id) AS duids,                   -- used for LCF and seed
       count(cnt), max(cnt), min(cnt), avg(cnt), stddev(cnt) -- used to compute noise
FROM
   (SELECT client_id, col1, col2, count(*) AS cnt
    FROM accounts
    GROUP BY 1, 2, 3) t
GROUP BY 1, 2

The problem is that, for each column value combination, we know the number of distinct UIDs, but not the actual UIDs. So if we were to get an output like this:

col1  col2  duids
a     b     1
a     c     1
a     d     1
...   ...   ...
a     z     1

We would not know if or to what extent we could combine the rows into an a | * row.

One alternative would be instead to build a query like this (shown here for a two-column sum histogram over transactions):

SELECT frequency, acct_district_id, sum(sm),               -- the true answer
       count(DISTINCT client_id) AS duids,                 -- used for LCF and seed
       max(client_id), min(client_id),                     -- also for LCF
       count(sm), max(sm), min(sm), avg(sm), stddev(sm)    -- used to compute noise
FROM
   (SELECT client_id, frequency, acct_district_id, sum(amount) AS sm
    FROM transactions
    GROUP BY 1,2,3) t
GROUP BY 1,2

Note here we additionally report the min and max uid for each row. With this output, you would learn up to two distinct UIDs for each value combination. Then you could count the minimum possible number of distinct UIDs per possible * column, and run LCF on these. In this case, you could get false negatives, whereby you don't make a * row when you could have, because you underestimated the number of distinct UIDs.

Another shortcoming of supporting multi-column histograms is that the more columns are added, the more value combinations there will be, and therefore the more rows returned and the smaller the performance savings. One way to mitigate this would be to add a condition that only collects value combinations with 2 or more distinct UIDs:

SELECT frequency, acct_district_id, sum(sm),               -- the true answer
       count(DISTINCT client_id) AS duids,                 -- used for LCF and seed
       max(client_id), min(client_id),                     -- also for LCF
       count(sm), max(sm), min(sm), avg(sm), stddev(sm)    -- used to compute noise
FROM
   (SELECT client_id, frequency, acct_district_id, sum(amount) AS sm
    FROM transactions
    GROUP BY 1,2,3) t
GROUP BY 1,2
HAVING count(distinct client_id) > 1

But in doing this, we'd lose some information, potentially a lot, and so would not be able to do LCF reporting well in some cases. In #2651 I describe a way to get back some of the lost information, but it involves making a second query. There is probably a way to manage this in one query, but it would probably involve CASE statements, so will get complex and we should deal with it later.


yoid2000 commented 6 years ago

This comment looks at simple queries with bucketization (histogram from bucket()). No sub-queries and no conditions, and a single bucket.

This is essentially no different from simple queries with a single-column histogram. I'm assuming for now that any bucket can be treated the same as any column. In other words, SELECT bucket(col), count(*) is handled essentially the same as SELECT col, count(*); SELECT bucket(col1), sum(col2) is handled as SELECT col1, sum(col2); etc.

By extension I'm assuming that simple queries with histograms from multiple buckets, or histograms from combinations of buckets and columns, can be handled as described for simple queries with histograms from multiple columns.


SELECT bucket(acct_district_id BY 5), count(*)
FROM accounts
GROUP BY 1

The above analyst query becomes:

SELECT floor(acct_district_id/5)*5, sum(cnt),                   -- the true answer
       count(DISTINCT client_id) AS duids,                      -- used for LCF and seed
       count(cnt), max(cnt), min(cnt), avg(cnt), stddev(cnt)    -- used to compute noise
FROM
   (SELECT client_id, acct_district_id, count(*) AS cnt
    FROM accounts
    GROUP BY 1,2) t
GROUP BY 1
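
The bucket-to-column mapping in the outer select is mechanical lower-boundary alignment. A one-line Python equivalent of the floor(acct_district_id/5)*5 rewrite above (the helper name is mine):

def bucket(value: float, size: float) -> float:
    # bucket(col BY size) maps a value to the lower boundary of its
    # bucket, mirroring the SQL rewrite floor(col/size)*size.
    return (value // size) * size

assert bucket(68, 5) == 65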

yoid2000 commented 6 years ago

This comment looks at simple queries with a statically-analyzable posand. By "statically-analyzable", I mean a posand where the seed component related to the posand can be determined just from inspection of the posand.

In this case, all of the prior examples apply directly. The only difference is that the seed generation needs to account for the condition's seed component.

Note that I'm including here BETWEEN posands, which I assume to also be statically analyzable.

SELECT count(*)
FROM accounts
WHERE cli_district_id = 68

The above analyst query becomes:

SELECT sum(cnt),                               -- the true answer
       count(DISTINCT client_id) AS duids,                     -- used for LCF and seed
       count(cnt), max(cnt), min(cnt), avg(cnt), stddev(cnt)   -- "stats" used to compute noise
FROM
   (SELECT client_id, count(*) AS cnt
    FROM accounts
    WHERE cli_district_id = 68
    GROUP BY client_id) t

Note that the posand is simply moved into the sub-query. Otherwise the query handling is the same as in other examples. If there are multiple statically-analyzable posands, they are likewise all moved into the sub-query.

Personally, I am still of the opinion that we should use something along the lines of the "binary search" thingy I played with, so that all numeric posands become statically analyzable.

yoid2000 commented 6 years ago

This comment looks at multiple aggregation functions in a query.


SELECT count(*), sum(balance), min(amount), max(amount)
FROM transactions

The above analyst query becomes:

SELECT sum(cnt), sum(sm),                          -- the true answers
       count(DISTINCT client_id) AS duids,                     -- used for LCF and seed
       count(cnt), max(cnt), min(cnt), avg(cnt), stddev(cnt),  -- "stats" for count(*)
       count(sm), max(sm), min(sm), avg(sm), stddev(sm),       -- "stats" for sum(balance)
       count(mn), max(mn), min(mn), avg(mn), stddev(mn),       -- "stats" for min(amount)
       count(mx), max(mx), min(mx), avg(mx), stddev(mx)        -- "stats" for max(amount)
FROM
   (SELECT client_id, count(*) AS cnt, sum(balance) AS sm,
           min(amount) AS mn, max(amount) as mx
    FROM transactions
    GROUP BY client_id) t

This query is of course the combination of everything needed for each of the four aggregation functions.


yoid2000 commented 6 years ago

@sebastian this is ready for you to look at

sebastian commented 6 years ago

This is very interesting. Thank you for the writeup @yoid2000. I have created an issue to have a prototype made.

cristianberneanu commented 6 years ago

My impression is still that the privacy guarantees aren't equivalent to the ones provided by the current solution. If we decide to lower them, we could probably simplify and improve the current design as well (though maybe not to the same degree).

An alternative path would be to push the current design towards this new model incrementally:

  1. We stop storing hash sets for all of the floated values and instead we keep statistics.
  2. We implement offloading of grouping by user id (maybe with a threshold for the average number of rows per user).
  3. We implement offloading of statistics per-user.
  4. We implement offloading of statistics globally.

Another thing to keep in mind is that not all of the statistics on text columns can be offloaded on all of the platforms: for example, Mongo has no API for hashing strings (SAP HANA also has limited capabilities).

cristianberneanu commented 6 years ago

@yoid2000 Paul, to make sure I understand things correctly:

The base noise is generated only from the static noise layers and the count of distinct user ids, right? The dynamic noise layers are dropped. How is the count(distinct uid) value used for generating the noise? What is the exact algorithm? Is the hashed value just added like an additional noise layer?

yoid2000 commented 6 years ago

@cristianberneanu

No, we still have both static and dynamic noise layers. The static noise layers are generated just as today. The dynamic noise layers are done just as today, but with one exception: hash(count(distinct uid)) is used in place of hash(uid_list) that we have today.

In other words, something in the current code computes a value that is derived from a hash of a list of UIDs (or maybe a list of hashed UIDs XOR'd together) and a secret salt.

Wherever that value is used today, we use instead a value taken from:

hash(count(distinct uid), salt)
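
A minimal sketch of that substitution; sha256 and the salt handling are illustrative assumptions, not the cloak's actual primitives:

import hashlib

def dynamic_layer_seed(duids: int, salt: bytes) -> bytes:
    # New scheme: hash(count(distinct uid), salt) replaces hash(uid_list, salt).
    return hashlib.sha256(str(duids).encode() + salt).digest()

seed = dynamic_layer_seed(498, b"secret-salt")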

yoid2000 commented 6 years ago

@cristianberneanu I can see how the functions I wrote are confusing.

You should look at #3013 to see how I plan to compute the noise.

My hope is that generating the seed for the noise requires very little change (count(distinct uid) instead of list of uids), and that the decisions for how big the SD and flattening should be for sum, and likewise how to modify the min and max, should be relatively easy to fit into the current code.

If it doesn't look like this is the case, then probably one or the other of us misunderstands something, so let's coordinate closely.

Of course, the changes for manipulating the SQL are quite disruptive....

sebastian commented 6 years ago

@cristianberneanu: Let me just re-iterate that I want this prototyped before we do a full implementation! I.e. all the work that would go into checking whether a given query is valid for the new approach etc. should be ignored. I would like to see something quick and dirty for count.

obrok commented 6 years ago

The static noise layers are generated just as today. The dynamic noise layers are done just as today, but with one exception: hash(count(distinct uid)) is used in place of hash(uid_list) that we have today.

Why do you think hash(count(distinct uid)) is going to be faster than hash(uid_list)?

sebastian commented 6 years ago

I think it's on the assumption that count(distinct uid) is executed in the DB, whereas hash(uid_list) requires the entire list of uids to be transferred from the DB to the cloak in order to be hashed.

yoid2000 commented 6 years ago

I would like to see something quick and dirty for count.

I guess you just want some hard-coded SQL (what the cloak would have generated), and some code that computes noise based on no-uid, and compare performance with the cloak?

If you are willing to have the prototype in perl, I could have this in a day or two

cristianberneanu commented 6 years ago

Currently, for each bucket, we have a static noise layer and also a list of dynamic noise layers, one for each uid in that bucket. It is not clear to me if you want to keep this dynamic list of noise layers for each {bucket,uid} pair.

If we can drop it and have the dynamic noise layer be based on the count of distinct uids and the statistics for the column per bucket, then that would be a significant gain, one that can apply to the current algorithm as well.

If not, then the change won't matter that much from a performance POV, since we need to gather all the noise layer data per {bucket, uid} pair anyway. At most, we can drop the uid column from the list of retrieved columns. This could be applied to the current algorithm as well.

sebastian commented 6 years ago

I guess you just want some hard-coded SQL (what the cloak would have generated), and some code that computes noise based on no-uid, and compare performance with the cloak?

If you are willing to have the prototype in perl, I could have this in a day or two

I did something like that in my prototype 0 already. It seems to have given quite decent performance when the db part can be scaled. At this point I think I would rather see the end-to-end path in the cloak and get a sense of how far-reaching the changes would have to be.

yoid2000 commented 6 years ago

We have a terminology mis-match.

When I say "noise layer", I mean "one of the Gaussian noise values that is summed together to make the final noise value". By this definition, a condition like col = val has only two noise layers, the static one and the dynamic one.

A given bucket can have multiple static and multiple dynamic noise layers, and often it has the same number of static and dynamic noise layers.

I'm not sure what you mean when you say there is a dynamic noise layer for each uid. I think you are probably referring to what I call "seed components".

Anyway, we are certainly dropping the list of UIDs. That is the whole point of the no-uid approach.

As described earlier in this issue, the following analyst query:

SELECT count(*)
FROM accounts

results in the following request to the cloak:

SELECT sum(cnt),                               -- the true answer
       count(DISTINCT client_id) AS duids,                     -- used for LCF and seed
       count(cnt), max(cnt), min(cnt), avg(cnt), stddev(cnt)   -- "stats" used to compute noise
FROM
   (SELECT client_id, count(*) AS cnt
    FROM accounts
    GROUP BY client_id) t

As you can see from this, no UIDs are collected. Rather you get a bunch of statistical computations per bucket.

Does this make sense????

cristianberneanu commented 6 years ago

I'm not sure what you mean when you say there is a dynamic noise layer for each uid. I think you are probably referring to what I call "seed components".

Correct.

Anyway, we are certainly dropping the list of UIDs. That is the whole point of the no-uid approach.

Great. Can we also drop the per-uid "seed components"? If yes, can we do that for the current approach as well? We could get a speed boost a lot sooner if we do that.

As you can see from this, no UIDs are collected. Rather you get a bunch of statistical computations per bucket. Does this make sense????

It does in this simple case.

yoid2000 commented 6 years ago

Great. Can we also drop the per-uid "seed components"?

Yes. The per-uid "seed components" are replaced by the count(distinct UID) value.

If yes, can we do that for the current approach as well? We could get a speed boost a lot sooner if we do that.

I don't think so, because in the current approach we do other UID-specific stuff as well, like identify the outlier-group and the top-group, and flatten, etc. To avoid all that work in the no-uid approach, we need to get the per-bucket statistics, so a rather big change.

It does in this simple case.

heh

cristianberneanu commented 6 years ago

@yoid2000 How do you propose that we merge the low-count buckets into the censored bucket?

Each LC bucket will have a set of labels, the count of distinct user_ids and a set of statistics for each aggregate (count, min, max, avg, stddev).

Combining the counts of distinct uids properly would require the complete list of user ids.

Combining the stats for average and stddev is also problematic. One idea would be to return the sum of counts, min of mins, max of maxes, avg of avgs, and the weighted avg of stddevs, although the last two will vary from the actual results. Another method would be to store the count, sum, and sum of squares per bucket, but that has the known accuracy issues.
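
A sketch of that second method: if each bucket carries count, sum, and sum of squares of the per-user values, then count, sum, avg, and stddev merge exactly, subject to the floating-point cancellation behind the known accuracy issues. The BucketStats shape is hypothetical, and note this merges the value statistics only; combining distinct-uid counts still runs into the problem described above.

from math import sqrt
from typing import NamedTuple

class BucketStats(NamedTuple):
    count: int      # number of per-user rows in the bucket
    sum: float
    sum_sq: float   # sum of squares, kept to allow exact merging
    min: float
    max: float

def merge(buckets: list) -> dict:
    n = sum(b.count for b in buckets)
    s = sum(b.sum for b in buckets)
    ss = sum(b.sum_sq for b in buckets)
    # E[x^2] - E[x]^2 can cancel catastrophically when the two terms are
    # close -- these are the known accuracy issues.
    var = ss / n - (s / n) ** 2
    return {
        "count": n,
        "sum": s,
        "min": min(b.min for b in buckets),
        "max": max(b.max for b in buckets),
        "avg": s / n,
        "stddev": sqrt(max(var, 0.0)),
    }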

yoid2000 commented 6 years ago

How do you propose that we merge the low-count buckets into the censored bucket?

I thought I had written something about this in #2561, but I don't see it so it must have been my imagination.

Let's think about this from a requirements PoV first (@sebastian you are needed for this, maybe @fjab as well).

The original purpose of the censored column was just to give the analyst some idea of what is missing. (At least that is how I remember it.) In other words, the reported value didn't have to be so accurate, as long as it is in the right ballpark. Of course, we've grown accustomed to having an accurate value, so it might be painful to fall back on this (maybe more painful for us than our customer ....)

@sebastian ? @fjab ?

In the meantime, let me think a bit. I doubt this will be a big problem ....

cristianberneanu commented 6 years ago

In other words, the reported value didn't have to be so accurate, as long as it is in the right ballpark.

My main worry is not about accuracy, it is about security. The censored bucket has to be low-count filtered as well, and for that we need the true count of distinct user ids. If we use some pseudo-count instead, that means it will have lower protection, which might result in data leaks.

yoid2000 commented 6 years ago

I'm not too worried about the LCF of the censored bucket. We can be a little bit conservative here and make an estimate based on what we've learned. If 1 in 1000 times we let slip something we would not otherwise have, that will be ok...

sebastian commented 6 years ago

I think @cristianberneanu refers to the partial aggregations we do. As in for the column pair name and salary we might only have to redact the salary part, if there are enough users for a given name. However if we don't know who the users were for the different salary values it's unclear if we can reveal the name or not. If for example I were to appear in the database with 20 different salaries, but furthermore be the only Sebastian, then it's clearly not OK to reveal my name (however without knowing who the users for the name-salary-pairs were, we cannot know).

yoid2000 commented 6 years ago

Yes, I know. I'm just saying that if, based on what we see, we are pretty confident that we pass LCF, and pretty confident means 999 of 1000 or maybe 9999 of 10000, then this isn't a real problem.

cristianberneanu commented 6 years ago

I think @cristianberneanu refers to the partial aggregations we do.

No, I am talking about the simple case. The partial censoring is a generalization of the simple case that uses the same steps iteratively.

Yes, I know. I'm just saying that if, based on what we see, we are pretty confident that we pass LCF, and pretty confident means 999 of 1000 or maybe 9999 of 10000, then this isn't a real problem.

I can't see how we can be confident of anything from using the counts only. We can do another query to get the proper count, but that would be somewhat slow. We can also select the first/last user_id per-bucket so that we have partial user information when combining the buckets, but we don't support those functions yet.

cristianberneanu commented 6 years ago

We can also select the first/last user_id per-bucket so that we have partial user information when combining the buckets, but we don't support those functions yet.

Although we can use min/max instead, as we do for isolator queries.

yoid2000 commented 6 years ago

I don't mind using min/max at all, but I just want to say something about how we should be viewing risk.

The question is not "can we ever display a row that fails LCF?"

Rather the question is something like, "can we avoid displaying rows that fail LCF often enough so that an analyst cannot reasonably exploit it with an attack, or recognize when it happens with high confidence?"

So for instance, say I have 10 buckets that fail LCF, each with a count of distinct UIDs of 2 or 3. I can reasonably ask "what is the probability that all of those distinct UIDs are the same?" I can also use the overall number of distinct UIDs in other (displayed) buckets to get an idea of how many overall distinct UIDs there are, and therefore how likely it is that all the distinct UIDs from failed buckets are the same.

If the probability is very low (1/1000 or 1/10000), then I feel comfortable displaying the censored bucket even though we can't be 100% sure, because that would be very hard for an analyst to exploit.

yoid2000 commented 6 years ago

Regarding the LCF decision for the censored buckets, I think using min/max is a good solution. In any event, if we are computing number of distinct UIDs, then it should not be much in the way of extra computation for the DB to also compute min and max UID.

yoid2000 commented 6 years ago

Regarding the noise added to censored buckets, how about if, rather than trying to derive it from the statistics of the buckets to be censored, we derive it from the noise of the reported buckets? So something like this:

Say we are computing sum. Take cen_true_sum as the true sum of the censored bucket (i.e. the sum of sums from the buckets that failed LCF). Take rep_true_sum as the true sum of one or more reported buckets. Take rep_noisy_sum as the noisy sum from the same one or more reported buckets.

Compute snr = rep_true_sum / rep_noisy_sum.

Then set cen_noisy_sum, the reported noisy value for the censored bucket, as cen_noisy_sum = cen_true_sum / snr.

Essentially what we are doing here is assuming that the distribution of values in the censored bucket is the same as the distribution of values in the reported buckets, and therefore that the generated noise would be proportionally the same. Obviously this is not strictly true, but if we buy that the purpose of the censored buckets is to convey roughly how much data is lost, and not a precise value of the lost data, then this should be ok. We could also round the value in order to convey the sense that this is not a fully accurate value.

So in essence we are trading accuracy for simplicity and cost of computation.
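
In code, the proposal is a direct transcription of the formulas above; the rounding reflects the suggestion to convey that the value is not fully accurate:

def censored_noisy_sum(cen_true_sum: float,
                       rep_true_sum: float,
                       rep_noisy_sum: float) -> float:
    # Assume the censored bucket's values distribute like the reported
    # buckets', so the same true/noisy ratio applies.
    snr = rep_true_sum / rep_noisy_sum
    return round(cen_true_sum / snr)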

What do you think?

yoid2000 commented 6 years ago

One other thing that @sebastian pointed out to me yesterday.

In a query like the following:

SELECT sum(sm),                                -- the true answer
       count(DISTINCT client_id) AS duids,                 -- used for LCF and seed
       count(sm), max(sm), min(sm), avg(sm), stddev(sm)    -- used to compute noise
FROM
   (SELECT client_id, sum(amount) AS sm
    FROM transactions
    GROUP BY client_id) t

duids and count(sm) are the same value. Both simply count the number of distinct UIDs. So you only need one of these. (The duplicate comes from the fact that initially I had only four statistics, max, min, avg, and stddev, so when I added count I failed to notice that I already had it.)

cristianberneanu commented 6 years ago

I can also use the overall number of distinct UIDs in other (displayed) buckets to get an idea of how many overall distinct UIDs there are, and therefore how likely it is that all the distinct UIDs from failed buckets are the same.

Essentially what we are doing here is assuming that the distribution of values in the censored bucket is the same as the distribution of values in the reported buckets, and therefore that the generated noise would be proportionally the same.

I don't think this holds in the general case. The buckets failing LCF will most likely have their own distribution of values.

Obviously this is not strictly true, but if we buy that the purpose of the censored buckets is to convey roughly how much data is lost, and not a precise value of the lost data, then this should be ok. We could also round the value in order to convey the sense that this is not a fully accurate value.

At some point, we will need to explain to clients how the censored buckets work. Having a probabilistic algorithm makes for an uncomfortable discussion.

There are also edge cases that we need to consider. What if there are one or two users that have many buckets no one else has? Can someone attack the cloak using this scenario?


If we get the two uids per bucket, we can then count the different uids when merging buckets. If it passes the LCF, we know for sure that it is safe to display it. Of course, there will be false negatives, when many buckets have identical min/max user ids even if they have other user ids in between, but I can't think of anything better than this at the moment.

cristianberneanu commented 6 years ago

BTW, we can have an algorithm for merging bucket statistics that gives good results for count(x), sum(x) etc. But it won't work well at all when using count(distinct x) and sum(distinct x). We run into the same issue we have for count(distinct uid).

yoid2000 commented 6 years ago

I suggest we move forward on the assumption that we'll be able to generate a reasonable estimate for the censored buckets with the data we collect. (And assuming that we'll use max/min uid to do LCF on censored buckets).

In the meantime, let's discuss the requirements for censored bucket accuracy, @sebastian and @fjab

sebastian commented 6 years ago

Of course, there will be false negatives, when many buckets have identical min/max user ids even if they have other user ids in between

How can the min and max of the user ids be the same if there are other user ids in between?

cristianberneanu commented 6 years ago

How can the min and max of the user ids be the same if there are other user ids in between?

For example, take the following buckets with the associated user ids: [1, 2, 3, 7], [1, 3, 4, 7], [1, 5, 6], [6, 7], [6], [1, 3, 5, 6, 7]. When merged using min/max values, they result in a bucket with the user ids [1, 6, 7], which should fail LCF, while the actual user ids are [1, 2, 3, 4, 5, 6, 7], which should pass LCF.
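
The example can be checked mechanically; a tiny sketch of what the merge sees when only per-bucket min/max survive:

buckets = [[1, 2, 3, 7], [1, 3, 4, 7], [1, 5, 6], [6, 7], [6], [1, 3, 5, 6, 7]]

seen = {uid for b in buckets for uid in (min(b), max(b))}
actual = {uid for b in buckets for uid in b}

print(sorted(seen))    # [1, 6, 7]             -> would fail LCF
print(sorted(actual))  # [1, 2, 3, 4, 5, 6, 7] -> would pass LCF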

sebastian commented 6 years ago

gotcha. I thought you meant min = max.

cristianberneanu commented 6 years ago

@sebastian should I implement the algorithm for aggregation of lcf buckets here or is the current prototype enough for now?

sebastian commented 6 years ago

Let's postpone it a bit until we have more experience with the no-uid prototype. In particular I want to have the measure of accuracy progress further first: https://github.com/Aircloak/aircloak/issues/3048

I want to understand whether our approach to noise adding in the no-uid algorithm used by Paul creates values that are better or worse than what we are producing today, and furthermore how well they represent the underlying data.

Following that we can discuss how to progress the most productively from there.

cristianberneanu commented 6 years ago

I need a few clarifications here before implementing this algorithm into production:

yoid2000 commented 6 years ago

@sebastian are you satisfied with the noise levels you measured, or do you have issues and/or want to make more measurements?

yoid2000 commented 6 years ago

for the LCF we will do a sampling of users using min/max uid per-bucket; this means the results of the LCF in the no-uid case will differ from the normal uid list scenario. Is that ok?

It is ok if the min/max approach is different, so long as it works fairly well. You gave this example:

For example, take the following buckets with the associated user ids: [1, 2, 3, 7], [1, 3, 4, 7], [1, 5, 6], [6, 7], [6], [1, 3, 5, 6, 7]. When merged using min/max values, they result in a bucket with the user ids [1, 6, 7], which should fail LCF, while the actual user ids are [1, 2, 3, 4, 5, 6, 7], which should pass LCF.

In this example, although the only distinct UIDs that you see are [1,6,7], you can in fact know that in total there are at least 5 distinct UIDs, because the last bucket has a count of 5. There might also be some heuristics you can use to increase the number of distinct UIDs. For instance, say you have the following:

min = 5,  max = 12, count = 3
min = 12, max = 21, count = 4

Then you know that there are at least 6 distinct UIDs, because you know that the 3 UIDs in the second bucket greater than 12 cannot be in the first bucket.

I also think it would be ok to make some statistical assumptions about the likelihood of collisions in the buckets. For instance, suppose you have the following:

min = 5,  max = 982,  count = 4
min = 12, max = 1002, count = 6

And suppose also that you know that the UID space is well populated (by which I mean that most of the values between 5 and 1002 are valid UIDs for one user or another). Then statistically it is highly probable that the unknown UIDs are different, so you can increase the count above that which you strictly know to be true. So instead of saying that there are 4 distinct UIDs in the above case, you are probably safe to say there are 8 or 9. The fact that this will sometimes be wrong is not really a problem. It only becomes a problem if the attacker can somehow exploit this, and I think that would be really hard.
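
The first of these heuristics can be sketched for a pair of buckets as follows; a conservative lower bound only, with the statistical assumption about a well-populated UID space left as a further refinement (hypothetical helper, not cloak code):

def lower_bound_two(a, b):
    # Each bucket is described only by (min_uid, max_uid, distinct_count).
    a_min, a_max, a_count = a
    b_min, b_max, b_count = b
    if b_max <= a_min or b_min >= a_max:
        # Disjoint or touching ranges: at most the shared endpoint, if
        # any, can belong to both buckets.
        shared = 1 if b_max == a_min or b_min == a_max else 0
        return a_count + b_count - shared
    # Overlapping ranges: we are sure only of the larger bucket's count.
    # (Tighter bounds exist, e.g. when one bucket's min or max falls
    # outside the other's range.)
    return max(a_count, b_count)

print(lower_bound_two((5, 12, 3), (12, 21, 4)))  # 6, as argued above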

With these kinds of heuristics, I think the min/max approach will work well. In other words, I think it will be rare that the min/max approach suppresses a * bucket that the old approach would not have suppressed.

yoid2000 commented 6 years ago

in the case of running aggregators over text values, we will need to use a hashing function to gather the statistics; that hashing function will imply emulation on certain backends, like mongodb. Is that ok?

Could you clarify this? I'm not sure what you are referring to.

obrok commented 6 years ago

in the case of running aggregators over text values, we will need to use a hashing function to gather the statistics; that hashing function will imply emulation on certain backends, like mongodb. Is that ok?

It seems that if doing a no-uid query triggers emulation, then we might as well perform the query normally. Or am I misunderstanding something?

sebastian commented 6 years ago

It seems that if doing a no-uid query triggers emulation, then we might as well perform the query normally. Or am I misunderstanding something?

I wonder about that too.

in the case of running aggregators over text values, we will need to use a hashing function to gather the statistics; that hashing function will imply emulation on certain backends, like mongodb. Is that ok?

Could you clarify this? I'm not sure what you are referring to.

In a very high number (maybe even majority) of cases, the uid-column isn't numerical, but rather a string (or arbitrary) value of some sort. If we want to calculate the min/max uid value, we would, if I understand @cristianberneanu right, have to hash the values first, which makes the min/max values for all intents and purposes useless?

sebastian commented 6 years ago

It seems that if doing a no-uid query triggers emulation, then we might as well perform the query normally. Or am I misunderstanding something?

I wonder about that too.

Well, I guess that is still to be seen... It might be faster to do the emulated no-uid aggregation in the cloak rather than running the full traditional anon.