Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform

anonymization / Shrink-and-drop for forced alignment #688

Closed · yoid2000 closed this issue 7 years ago

yoid2000 commented 8 years ago

sorry, will add proper text later...

yoid2000 commented 8 years ago

This is the second part of the implementation of fixed alignment (issues #676 and #681).

Even with fixed-alignment, there are possible cases where an analyst can isolate a victim by changing the range. For instance, suppose that there is a user with an outlying high salary of 1.5M, and the nearest lower salary is 900K. Then the analyst could make the following two queries:

SELECT count(*) FROM table WHERE disease = 'AIDS' AND salary >= 0 AND salary < 1000000

SELECT count(*) FROM table WHERE disease = 'AIDS' AND salary >= 0 AND salary < 2000000

The latter query includes the high salary user while the former does not.

The conceptual solution to this attack is essentially to check for other queries that the analyst could make in the future (or may have made in the past), and to modify the rows in the query so that they are identical to the rows that will (did) occur in the other query. This is of course done only if the two queries are nearly the same. This ensures that there is never a difference between a pair of queries whose distinct users differ by only a small amount.

To give a concrete example, suppose that the cloak is given the following range:

salary >= 0 AND salary < 2000000

Assuming "125" alignment with displacement by 1/2, the possible other ranges that could constitute the other query in an attack pair are these (a partial list):

range           requirement for success
0 -> 1M         victim alone in 1M -> 2M
1M -> 2M        victim alone in 0 -> 1M
1.5M -> 2M      victim alone in 0 -> 1.5M
0 -> 0.5M       victim alone in 0.5M -> 2M
0 -> 0.2M       victim alone in 0.2M -> 2M
0.2M -> 0.4M    victim alone in 0 -> 0.2M and in 0.4M -> 2M
...             ...
1.8M -> 2M      victim alone in 0 -> 1.8M
0 -> 0.1M       victim alone in 0.1M -> 2M
etc.            etc.

The shrink-and-drop algorithm is this (conceptually):

  1. For each potential range that 1) has overlap with the stated range, and 2) does not extend to the right of the stated range:
  2. Compute a seed from the set of distinct users in the overlapping range between the stated and potential ranges.
  3. Using that seed, run low-count on all the rows that are in the stated range but not the potential range (the non-overlapping part).
  4. If not low-count, do nothing.
  5. Else if low-count, then suppress the rows that are in the non-overlapping part.

Note that in practice, not all of these non-overlapping parts need to be checked individually. This is because if a smaller non-overlapping part is not low-count, then the larger non-overlapping parts, which are supersets of it, will also not be low-count. For instance, if the 1M -> 2M range is not low-count, then any larger range (e.g. 0.5M -> 2M) will also not be low-count. Therefore, the cloak can check the potential ranges with the smallest non-overlap first, and only needs to continue with larger non-overlapping parts if the smaller parts are low-count.
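
For concreteness, here is a minimal Elixir sketch of the conceptual steps above, leaving out the ordering optimization just described. The row shape, low_count?/2 and the seed helper are illustrative stand-ins, not the cloak's actual code:

```elixir
defmodule ShrinkAndDropSketch do
  import Bitwise

  # rows: a list of %{user_id: id, value: v} maps; ranges are {lo, hi} tuples,
  # meaning lo <= value < hi. `low_count?` is a 2-arity function (rows, seed)
  # standing in for the cloak's noisy low-count check.
  def rows_to_suppress(rows, {s_lo, s_hi} = stated, candidate_ranges, low_count?) do
    candidate_ranges
    # step 1: only candidates that overlap the stated range and do not extend to its right
    |> Enum.filter(fn {lo, hi} -> hi <= s_hi and max(lo, s_lo) < min(hi, s_hi) end)
    |> Enum.flat_map(fn candidate ->
      overlap = in_range(rows, intersect(stated, candidate))
      non_overlap = in_range(rows, stated) -- overlap

      # steps 2-3: seed from the overlapping part, low-count the non-overlapping part
      # steps 4-5: suppress the non-overlapping part only if it is low-count
      if low_count?.(non_overlap, seed(overlap)), do: non_overlap, else: []
    end)
    |> Enum.uniq()
  end

  defp intersect({a_lo, a_hi}, {b_lo, b_hi}), do: {max(a_lo, b_lo), min(a_hi, b_hi)}

  defp in_range(rows, {lo, hi}), do: Enum.filter(rows, fn %{value: v} -> v >= lo and v < hi end)

  # placeholder seed: XOR of hashed distinct user ids (the cloak's real seeding is linked later)
  defp seed(rows) do
    rows
    |> Enum.map(& &1.user_id)
    |> Enum.uniq()
    |> Enum.reduce(0, &bxor(:erlang.phash2(&1), &2))
  end
end
```

In practice the candidates would be checked smallest non-overlap first, stopping at the first one that is not low-count, as described above.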

As an example, assume that the stated range is 0 -> 2M. The cloak would take the following steps:

  1. Check the potential range 1M -> 2M.
    1. Make a seed from the distinct users in 1M -> 2M (overlapping part).
    2. Using that seed, check the rows in the range 0 -> 1M for low-count (non-overlapping part).
  2. If not low-count (the normal case), then check the potential range 0 -> 1M.
  3. If not low-count, then we are done; nothing is suppressed.
  4. Let's say that the potential range 1M -> 2M is low-count. Then check the potential range 1.5M -> 2M. (Here the overlapping part is 1.5M -> 2M, and the non-overlapping part is 0 -> 1.5M.)
  5. If not low-count, then suppress the rows in the range 0 -> 1M.
  6. If low-count, then check the potential range 1.8M -> 2M, and so on until the non-overlapping part is not low-count, and then suppress the largest range that was low-count.

Note that it is possible that both the upper half and the lower half are low-count. In this case, all rows are suppressed, even if the complete range is itself otherwise not low-count.

Note also that the reason we don't need to look at potential ranges that lie to the right of the stated range is that those potential ranges, were they given as stated ranges, would check for potential ranges on their left. In other words, if 0 -> 2M was a stated range, but later 1.5M -> 2.5M was a stated range, then the latter would check for the former. It would not be necessary for the former to check for the latter.

yoid2000 commented 8 years ago

Note that next I will write an issue on how to do safe ranges without fixed alignment. Given the complexity of implementing fixed alignment, including that for date-times and shrink-and-drop, we might want to hold off on finishing this implementation until we are sure we need it.

obrok commented 8 years ago

@yoid2000 Isn't there a problem of combinatorial explosion when someone attaches multiple ranges to their query?

So for example given the query

SELECT COUNT(*) FROM foo
WHERE a >= 10 AND a <= 20
AND b >= 10 AND b <= 20
AND c >= 10 AND c <= 20

isn't it true that we need to perform ~8 times more of these shrink checks instead of just ~3 times more?

yoid2000 commented 8 years ago

@obrok

That is a good observation, but I think it is adequate if we just do shrink and drop on each one. This is in part because the probability of having more than one range on which you can attack a user is vanishingly small. But also, I'm not even sure the attack works even when we only look at one at a time. But I need to think about it, and I'm too tired at the moment...

yoid2000 commented 7 years ago

@obrok

I thought a bit more about combinations of ranges. It is not necessary to deal with combinations (though if we allowed OR and/or NOT, then it would be).

obrok commented 7 years ago

It seems to me that as written it doesn't work in the following scenario:

yoid2000 commented 7 years ago

What should happen for the 0-2M query is this:

  1. You compare the query 0-1M against 0-2M. The difference is not LCF, so do nothing more with the 0-1M sub-range.
  2. You compare the query 1M-2M against 0-2M. The difference is LCF (because no users in the 0-1M sub-range), so you further explore the 1M-2M sub-range.
  3. You compare the query 1M-1.5M against 0-2M. The difference is LCF, because only one user in the 1.5M-2M sub-range.
  4. You compare the query 1.5M-2M against 0-2M. Difference is not LCF, so don't explore that subrange any more.
  5. You explore sub-ranges under 1M-1.5M, but you don't find any LCF cases. So this is the final sub-range that you use for the query.

obrok commented 7 years ago

It makes sense now - I originally thought we only investigate smaller and smaller intervals that are at or touching the left or right border of the original interval. Having a fully recursive routine makes more sense.

obrok commented 7 years ago

@yoid2000 I came up with the following description from what I understand about this process. Could you read it and let me know if you think it's accurate?

* Notation:
  * x n y - range intersection of x and y
  * select(d, r) - a subset of d where the appropriate parameter falls in range r
  * low_count?(d, s) - true if there are too few items in d according to the low-count algorithm seeded with seed s

* shrink_and_drop(d, R):
  * Input: data (d), range on one of the parameters (R)
  * If low_count?(d, seed(d)):
    * Mark all data in d for suppression
    * return
  * any_low_count = false
  * Generate subranges one level distant, these include the original range shifted by half
    * Examples:
      * {0, 10} -> {-2.5, 2.5}, {0, 5}, {2.5, 7.5}, {5, 10}, {7.5, 12.5}
      * {0, 5} -> {-2.5, 2.5}, {-1, 1}, {0, 2}, {1, 3}, {2, 4}, {3, 5}, {4, 6}, {2.5, 7.5}
      * {0, 2} -> {-1, 1}, {-0.5, 0.5}, {0, 1}, {0.5, 1.5}, {1, 2}, {1.5, 2.5}, {1, 3}
  * For each subrange r:
    * Generate a seed (S) from R n r
    * If low_count?(select(d, R - (R n r)), S)
      * Mark all items in select(d, R - (R n r)) for suppression
      * any_low_count = true
  * If any_low_count:
    * For each subrange r:
      * shrink_and_drop(d, r)

yoid2000 commented 7 years ago

I'm not clear about one thing from your description.

From what I understand of your description, if you started SaD with a range of 0-10, you would test a number of sub-ranges (the ones you list).

Now suppose you are testing 2.5-7.5 (i.e. you call SaD with that sub-range). What you want to check for low-count are not the datapoints in that range, but rather the datapoints outside of that range. In other words, if the combined ranges of 0-2.5 and 7.5-10 are low-count, then you want to suppress the rows in those two ranges, and then iterate on sub-ranges of 2.5-7.5.

From your description, I think instead that you are actually testing the datapoints in 2.5-7.5 for low count.

yoid2000 commented 7 years ago

The point being, the goal of this is to make sure there isn't a future (or past) query the analyst could make that differs from this query by one user....

obrok commented 7 years ago

You're talking about this line:

low_count?(select(d, R - (R n r)), S)

What I meant to convey is:

yoid2000 commented 7 years ago

Ok, in that case it looks fine to me...

(functionally...there might be some nice optimizations that handle the whole thing in one pass with high probability, on the assumption that low-count is relatively rare...)

obrok commented 7 years ago

To avoid having to perform arbitrarily many steps in shrink_and_drop I came up with the following approximation:

* approximate_shrink_and_drop(d, R):
  * Let d' = d except for the 2 smallest items with regards to R
  * If empty?(d')
    * return d
  * Else
    * Let s = generate seed from d'
    * If min(d') == max(d')
      * return select(d, {min(d'), max(d')})
    * Else if low_count?(select(d, R - {min(d'), max(d')}), s)
      * return select(d, {min(d'), max(d')})
    * Else
      * return d
  * Repeat the procedure above for 2 largest items with regards to R

The overall effect of this is to suppress up to the 2 largest and smallest items if an aligned range can be found that includes all the other items and excludes the small/large items. The problem I see with this is that there may be another range that does the job but will not be found this way, for example a larger range, shifted by half. Example:

Let's assume alignment is 1-2 instead of 1-2-5, then:

[1, 5, 6, 7, 8, 9]
0 <= x < 10 - includes all items
5 <= x < 15 - includes all but one item

But the procedure will try to consider the range fix_align({6, 9}) - the ranges {0, 10} and {5, 15} are equally good approximations of {6, 9} so one will be chosen depending on the implementation of fix_align. So it's possible that we'll only check {0, 10} and miss the fact that {5, 15} leads to an attack.

yoid2000 commented 7 years ago

I'm a bit tired, and having trouble understanding what the above really does.

But I have a question. Why not keep the recursive approach, but add a pre-step where you do this:

  1. Define R' as the range from the lowest to the highest value within the analyst-defined range R.
  2. Define R" as the smallest fixed alignment range that encompasses R'.
  3. Start the recursive algorithm from R" (essentially as though you had already been running it up to this point).

So for your example, where you have 7,7,7,7,7,7,7,7,7.001, and the analyst supplies the range 0-10, you would identify R' = 7 - 7.001. Then you would identify R" as, well, the same in this case, R" = 7 - 7.001. Then start the recursive algorithm from the range R", essentially as though you had been running it many times from the original 0-10.
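
A rough Elixir sketch of this pre-step, assuming the "125" sizes with start offsets at multiples of half the size; the module and helper names are made up for illustration:

```elixir
defmodule SnapAlign do
  @mantissas [1, 2, 5]

  # R' is the tightest range around the values; returns R", the smallest
  # snapped range {start, start + size} that encompasses R'
  def snap(values) when values != [] do
    {lo, hi} = {Enum.min(values), Enum.max(values)}

    size =
      candidate_sizes(hi - lo)
      |> Enum.find(fn s -> covering_start(lo, s) + s >= hi end)

    start = covering_start(lo, size)
    {start, start + size}
  end

  # snapped sizes: 1, 2, 5 times powers of ten, smallest first
  defp candidate_sizes(width) do
    exp = if width <= 0, do: 0, else: floor(:math.log10(width))
    for e <- (exp - 1)..(exp + 3), m <- @mantissas, do: m * :math.pow(10, e)
  end

  # the largest allowed start (a multiple of size/2) that is still <= lo
  defp covering_start(lo, size), do: Float.floor(lo / (size / 2)) * (size / 2)
end
```

Under these assumptions 1.6 - 9.5 snaps to 0 - 10 and 3.2 - 4.7 snaps to 3 - 5, matching the ranges that come up later in the thread.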

obrok commented 7 years ago

That seems like a good idea, and it seems to me we don't need any recursive calls with that? We just compact the range, align it, and do one iteration of generating and checking subranges.

yoid2000 commented 7 years ago

I think it still needs to be recursive after that...for the same reason we needed recursive before...

obrok commented 7 years ago

I'm not actually sure why we would need it to be - could you provide an example where the recursive calls result in suppressing something?

yoid2000 commented 7 years ago

It is the same as the example from before: [image: example data points]

In this case, R' would be say 1.6 - 9.5 (just eyeballing the data points here). The nearest 125 range encompassing this would be R" = 0-10, which is the original range anyway. We need to recurse here for the same reason we needed to recurse before...to find the smallest possible snapped range where the rest is low-count (in this case, 3-5).

obrok commented 7 years ago

But in this case doing what I said also results in suppressing the two extreme points. I'm not trying to be flippant, just to understand what needs to happen.

So in my case we would indeed see that R" = 0-10. We would then generate possible attack ranges: -5 - 5, 5 - 15, -2.5 - 2.5, 0 - 5, 2.5 - 7.5, 5 - 10, and 7.5 - 12.5. Inspecting these ranges one after the other, we would find that 2.5 - 7.5 makes the whole outside be low-count and suppress the outlying points.

Another point - it seems to me that we always suppress 0 or more leftmost and 0 or more rightmost points. Never anything in the middle, right?

obrok commented 7 years ago

Another point - it seems to me that we always suppress 0 or more leftmost and 0 or more rightmost points. Never anything in the middle, right?

If this is true and we are only worried about protecting individuals it suggests a very different, but I think much simpler to implement, approach. We simply remove one value from each side and make the smallest, aligned range that contains the remaining values. Then we check if the outside of that range is low count and suppress anything in that outside if it is. Won't this work?
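
For reference, a sketch of this simpler approach might look like the following, reusing the hypothetical SnapAlign.snap/1 from the earlier sketch; low_count? and seed are again stand-in functions:

```elixir
defmodule SimpleShrinkSketch do
  # rows: [%{user_id: id, value: v}]; `low_count?` and `seed` are passed in
  # as stand-ins for the cloak's noisy check and its seeding
  def simple_shrink(rows, low_count?, seed) when length(rows) > 2 do
    sorted = Enum.sort_by(rows, & &1.value)

    # drop one value from each side, then find the smallest snapped range
    # containing what remains
    trimmed = sorted |> Enum.drop(1) |> Enum.drop(-1)
    {lo, hi} = SnapAlign.snap(Enum.map(trimmed, & &1.value))

    {inside, outside} =
      Enum.split_with(sorted, fn %{value: v} -> v >= lo and v < hi end)

    # suppress the outside only if it is low-count; which rows to seed from
    # is a guess here and is picked up further down the thread
    if low_count?.(outside, seed.(inside)), do: inside, else: rows
  end

  def simple_shrink(rows, _low_count?, _seed), do: rows
end
```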

yoid2000 commented 7 years ago

@obrok You may be right. I tried the following as a counter example:

Let's say the right-most data point is 7.4, and not 9.5.

Then R' = 1.6 - 7.4.

The next larger snapped range is R" = 0 - 10.

Then your approach would produce an adjusted range of 2.5-7.5. Now the analyst could try the range 3-5. This would indeed exclude the user at 7.4, whereas you had included it. On the other hand, the answer to 3-5 is going to be different from the answer to 0-10 regardless, since 0-10 includes the user at 1.6, and 3-5 excludes the user at 1.6.

So I don't have a counter-example...

Another point - it seems to me that we always suppress 0 or more leftmost and 0 or more rightmost points. Never anything in the middle, right?

This is my understanding, but I'm open to counter-examples...

yoid2000 commented 7 years ago

If this is true and we are only worried about protecting individuals it suggests a very different, but I think much simpler to implement, approach. We simply remove one value from each side and make the smallest, aligned range that contains the remaining values. Then we check if the outside of that range is low count and suppress anything in that outside if it is. Won't this work?

My issue with approaches that don't use a noisy fixed threshold at every step is that, in those cases where the analyst has some additional knowledge, he can account for the single user that you exclude. For example, say that the analyst happens to know that there is a user at location 9.8, but what he wants to test is whether there is a user at 9.6 or not. This approach might result in a case where you exclude the user at 9.8, but somehow include the user at 9.6. (Maybe this specific example doesn't work, but hopefully you get the picture.) Since the analyst knows you will exclude the 9.8 user, he can then attack the 9.6 user.

If on the other hand you use a noisy fixed threshold at every step, the analyst can never exploit special knowledge that he may have with any certainty.

yoid2000 commented 7 years ago

Say we have this case:

[image: example data points with R' and R" marked]

The analyst knows that 0-4 is empty, knows there is a user at 9.5, and wants to test for the presence of a victim at 6.6.

R' and R" are as shown. Your approach would select 2.5-7.5, which includes the victim. Analyst could later try 4-6, which excludes the victim.

yoid2000 commented 7 years ago

Nevertheless, the question is whether this is enough of a corner case. The counter-example requires a certain layout of the data, and certain knowledge on the part of the attacker. Personally I'm comfortable with this.

Also, arguably, you could add the next layer down in your check. That is, in addition to checking the size-5 sliding window, you could also check the size-2 sliding window, but then stop...

yoid2000 commented 7 years ago

@obrok having slept on this, I'm not after all comfortable with this corner case.

But here is another thought:

  1. Select a fixed-threshold value T (say bound 2, mean 5, sd 1).
  2. Start removing users from either end such that with each removal, the most new empty space is created.
  3. Select the smallest snapped alignment that encompasses the remaining users.

Working from this example:

[image: example data points]

say you select T = 4 in step 1. You'd first remove 9.5, because that opens the whole space between 4.9 and 9.5. Next you'd remove 1.6, because that opens the space between 1.6 and 3.2. Eyeballing the picture, you'd then remove 4.9 and 4.7.

The resulting range would be 3.2 - 4.7. The nearest larger snapped-aligned range is 3-5, and you're done.

Likewise with this case (again assuming T=4):

[image: second example data points]

You'd remove the four points on the right, leaving 4.1 - 5.4. The smallest encompassing snapped-aligned range would then be 4-6.
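
A sketch of this trimming step, again using the hypothetical SnapAlign.snap/1 helper; here t is taken to be the number of removals drawn from the noisy threshold in step 1:

```elixir
defmodule GapTrimSketch do
  # values: the column values within the analyst-supplied range
  def trim_and_snap(values, t) do
    values
    |> Enum.sort()
    |> remove_extremes(t)
    |> SnapAlign.snap()
  end

  # repeatedly drop whichever end value opens the most new empty space
  defp remove_extremes(sorted, 0), do: sorted
  defp remove_extremes(sorted, _t) when length(sorted) <= 2, do: sorted

  defp remove_extremes([lo, next_lo | _] = sorted, t) do
    [hi, next_hi | _] = Enum.reverse(sorted)

    rest =
      if hi - next_hi >= next_lo - lo do
        # dropping the highest value opens the bigger gap
        Enum.drop(sorted, -1)
      else
        # dropping the lowest value opens the bigger gap
        Enum.drop(sorted, 1)
      end

    remove_extremes(rest, t - 1)
  end
end
```

Under that reading it reproduces the two walk-throughs above: with t = 4 it removes 9.5, then 1.6, then 4.9 and 4.7, and snaps 3.2 - 4.7 to 3-5 in the first case, and it snaps 4.1 - 5.4 to 4-6 in the second.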

sebastian commented 7 years ago

As a side note to @obrok: I hope you are implementing this (whatever the strategy) in a way that doesn't require you to have/hold all the data at the same time?

yoid2000 commented 7 years ago

This last approach I wrote down, which still looks good to me, doesn't require loading all the data. You only need to remember the highest and lowest 15 or so users, and then at the end make your computation.

sebastian commented 7 years ago

This last approach I wrote down, which still looks good to me, doesn't require loading all the data. You only need to remember the highest and lowest 15 or so users, and then at the end make your computation.

Exactly, that's what makes me like it :)

obrok commented 7 years ago

I am in fact requiring all the data. I took a hint from the next step in the pipeline (aggregator), which forces the enumeration of the data.

This last approach I wrote down, which still looks good to me, doesn't require loading all the data. You only need to remember the highest and lowest 15 or so users, and then at the end make your computation.

That doesn't seem correct - to get highest and lowest 15 users you need all users to compare against them. Especially since you need highest and lowest with respect to multiple criteria. So far I fail to see how any of these approaches can be done without access to the full data set, as they deal with relationships in the data.

yoid2000 commented 7 years ago

I don't see this, unless I'm missing something basic here. The only criteria you care about is the value of the column over which the range is specified.

To keep the highest and lowest 15, you just keep two buffers of 15 values each, and when you get a new value higher (lower) than one of the stored 15, you throw out the lowest (highest) value in that buffer and insert the new one...

No???
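
As a sketch of the buffer being described (hypothetical names, tracking only the value of the range column):

```elixir
defmodule ExtremesBuffer do
  # keeps only the n lowest and n highest values seen so far
  def new(n \\ 15), do: %{n: n, lows: [], highs: []}

  # called once per streamed value
  def push(%{n: n, lows: lows, highs: highs} = buf, value) do
    %{buf |
      lows: [value | lows] |> Enum.sort() |> Enum.take(n),
      highs: [value | highs] |> Enum.sort(:desc) |> Enum.take(n)}
  end
end

# usage: fold the stream of column values into the buffer
# Enum.reduce(values, ExtremesBuffer.new(), fn v, buf -> ExtremesBuffer.push(buf, v) end)
```

Memory stays bounded at 2n values; what to do with the rows that merely pass through is exactly the question discussed next.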

obrok commented 7 years ago

Hm, what do you mean by "throw out"? Where does it go? Those rows are still needed for processing steps further down the line. Yes, for the purposes of only finding min and max 15 one traversal through the data will be enough. But that traversal will force all the data into the memory of the cloak? Or is there another option I'm not seeing?

sebastian commented 7 years ago

Couldn't you use the low-count filter conditions module and approach? I.e. you buffer the top N and bottom N rows. The other ones you let through immediately. At the end you check what the real aligned range should be (based on the algorithm above) and then let through those of the buffered rows falling within the bounds, and drop the rest?

yoid2000 commented 7 years ago

By "throw out", I meant throw out of the buffer of 15x2 stored values.

obrok commented 7 years ago

At the end you check what the real aligned range should be (based on the algorithm above) and then let through those of the buffered rows falling within the bounds, and drop the rest?

Possible but seems to me like the size of the buffer will be around the same as the size of all the data, no?

sebastian commented 7 years ago

Wasn't the latest algorithm:

obrok commented 7 years ago

Wasn't the latest algorithm:

Yes

sebastian commented 7 years ago

Well in that case, yes, you will need to look at all the data prior to making the final decision, but you don't need to hold the data until the end. And you can have a constant sized buffer the size of top n + bottom n?

obrok commented 7 years ago

Hm, when you put it like this that makes sense in that you can treat every range as another transformation. OK, let me try.

obrok commented 7 years ago

Last question to @yoid2000 -

Select a fixed-threshold value T (say bound 2, mean 5, sd 1).

Am I to understand this is a random value? What's the seed?

sebastian commented 7 years ago

I am not @yoid2000, but I am answering a bit all the same:

Yes, random value.

Interesting point about the seed! You haven't seen any users at the point of determining the random threshold, so you cannot seed it by the users... What you could do is use a temporary fixed threshold of say 10 (which is a value that is highly unlikely to be exceeded by a random value generated with mean 5 and sd 1) and then replace it with an actual noisy threshold once you have seen all the users, and have the chance of seeding the anonymizer?

There is a subtlety here though: if you are seeding the PRNG based on all users, then you are also seeding it based on the value of the victim (which might end up filtered), and hence the threshold itself ends up depending on whether the victim is present.

Not sure how to solve this pickle.

yoid2000 commented 7 years ago

The seed question is always tricky....

In SaD, we are doing two fixed-threshold computations. The first is to decide what to ignore when determining R'. The second is to decide if the rows outside of the R" snapped-alignment are low-count or not.

For the first fixed-threshold computation, you can use a seed based on the entire set of rows from the original range.

For the second fixed-threshold computation, you can use the set of rows in the R" snapped-alignment.

yoid2000 commented 7 years ago

@sebastian is right that you should start with a temporary non-random threshold. 10 is pretty safe. 15 is safer (which is why I used that number earlier). But 10 is safe enough.

obrok commented 7 years ago

For the first fixed-threshold computation, you can use a seed based on the entire set of rows from the original range.

Can one prepare the seed in a streaming manner (with a reasonable-sized buffer)? I'm going to venture a "no" - the seed is made from unique users and for that you'd need to process them all first.

For the first fixed-threshold computation, you can use a seed based on the entire set of rows from the original range.

We don't have those users at that point in the computation - we only have the 2 * N + 2 top- and bottom-most ones.

yoid2000 commented 7 years ago

The question of computing the set of distinct users must already have been dealt with by others, no?

sebastian commented 7 years ago

Can one prepare the seed in a streaming manner (with a reasonable-sized buffer)? I'm going to venture a "no" - the seed is made from unique users and for that you'd need to process them all first.

The seed is made by XOR-ing a hash of the user-id, so you can prep multiple seeds with constant memory. I.e. you can have one seed for all users, and a second seed for all users you have let pass through.
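
As a sketch of that incremental idea (placeholder hashing, not the cloak's actual seeding code, which is linked in the next comment):

```elixir
defmodule SeedSketch do
  import Bitwise

  # fold one user id into an accumulator seed, in constant memory
  def update(acc, user_id), do: bxor(acc, :erlang.phash2(user_id))

  def seed(user_ids), do: Enum.reduce(user_ids, 0, fn uid, acc -> update(acc, uid) end)
end

# caveat raised just below: XOR-ing the same user id twice cancels it out,
# so duplicate records must be de-duplicated before (or while) folding
SeedSketch.seed(["alice", "bob", "alice"]) == SeedSketch.seed(["bob"])  # => true
```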

sebastian commented 7 years ago

More specifically, here is how the seed is calculated: https://github.com/Aircloak/aircloak/blob/master/cloak/lib/cloak/query/anonymizer.ex#L197-L207

It lends itself to both incremental and distributed computation.

obrok commented 7 years ago

The seed is made by XOR-ing a hash of the user-id, so you can prep multiple seeds with constant memory. I.e. you can have one seed for all users, and a second seed for all users you have let pass through.

What if there are two records for a given user? They will get removed from the seed

sebastian commented 7 years ago

What if there are two records for a given user? They will get removed from the seed

Excellent point! Didn't think about that

obrok commented 7 years ago

The question of computing the set of distinct users must already have been dealt with by others, no?

It has been, but as I said above I don't think it can be made streaming.