Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform

Cloned natural persons (multiple UID per person) #3588

Open yoid2000 opened 5 years ago

yoid2000 commented 5 years ago

It may be the case that sometimes there are multiple UIDs associated with the same natural person. This can happen, for instance, if a column like account_number or contract_number is used as the UID, but a given natural person has the ability to create multiple accounts or contracts.

Let's refer to this as a 'clone'.

This is the case with Telefonica. It also happened with the Kia Clinic dataset.

The main problem with this arises when there is personally identifying information (PII) like name or address in the dataset. Suppose, for instance, that there is a column email. If the analyst does:

SELECT email, count(*)
FROM table
GROUP BY 1

and the cloned natural person has used the same email in each account, then the cloak will happily display the email.
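
A minimal sketch of why this happens, assuming a simplified low-count filter that shows a bucket only when it has more than a fixed number of distinct UIDs. The fixed `LCF_THRESHOLD`, the function name, and the sample rows are illustrative assumptions; the real filter uses noisy thresholds.

```python
from collections import defaultdict

# Hypothetical fixed threshold; the real low-count filter is noisy.
LCF_THRESHOLD = 2


def visible_buckets(rows, threshold=LCF_THRESHOLD):
    """Return {email: distinct UID count} for buckets that pass the simplified filter."""
    uids_per_email = defaultdict(set)
    for uid, email in rows:
        uids_per_email[email].add(uid)
    return {
        email: len(uids)
        for email, uids in uids_per_email.items()
        if len(uids) > threshold
    }


# One natural person holds three accounts (three UIDs) with the same email.
rows = [
    ("acct-1", "victim@clone.com"),
    ("acct-2", "victim@clone.com"),
    ("acct-3", "victim@clone.com"),
    ("acct-4", "alice@example.com"),
    ("acct-5", "bob@example.com"),
]

print(visible_buckets(rows))
```

Only the cloned person's email survives the filter, because the filter counts UIDs rather than natural persons.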

A second problem comes when the analyst uses the PII in a where clause:

SELECT salary
FROM table
WHERE email = 'victim@clone.com'

The current solution to this is to remove such PII columns, or (if the customer is willing to do the work), find the clones and individually remove or mask PII for the clones.

When cloning exists, we can basically solve the problem in the following way:

  1. The operator determines the worst-case clone count (the maximum number of UIDs associated with a given natural person). Call this the clone_threshold.
  2. Label PII columns as such.
  3. Whenever a PII column is involved in a query, raise the two LCF threshold values by clone_threshold. (I'm unclear how these are currently set, but basically there is a min_hard_thresh and a mean_soft_thresh of some sort, and both of these would be increased by clone_threshold.)

Note that it isn't really necessary to raise the amount of noise. When we do low-effect detection, we can likewise raise the threshold when dealing with PII columns.
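
The three steps above can be sketched roughly as follows. This is only an illustration: the `LcfThresholds` class, the `effective_thresholds` function, and the base values 2 and 4.0 are assumptions, since the thread itself leaves open how the thresholds are actually set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LcfThresholds:
    min_hard_thresh: int
    mean_soft_thresh: float


def effective_thresholds(base, query_columns, pii_columns, clone_threshold):
    """Raise both LCF thresholds by clone_threshold if the query touches a PII column."""
    if set(query_columns) & set(pii_columns):
        return LcfThresholds(
            base.min_hard_thresh + clone_threshold,
            base.mean_soft_thresh + clone_threshold,
        )
    return base


# Placeholder base values, not the cloak's real configuration.
base = LcfThresholds(min_hard_thresh=2, mean_soft_thresh=4.0)
pii_columns = {"email", "name"}

print(effective_thresholds(base, ["email"], pii_columns, clone_threshold=3))
print(effective_thresholds(base, ["salary"], pii_columns, clone_threshold=3))
```

Queries that do not involve a labeled PII column keep the base thresholds, so utility is only reduced where clones can actually leak PII.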

yoid2000 commented 5 years ago

@sebastian

Another way to solve this would be to label all PII columns, and the cloak would ensure that no PII data is displayed that does not derive from at least X distinct values. This would be done on a per-PII column basis. X is a small number like 3 or 4.

Compared to the above idea, it has a key advantage that it does not require the operator to figure out a clone_threshold.

It would also give an analyst a warm-and-fuzzy feeling about PII protection (can't literally view individual last names, just portions of last names).
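
A rough sketch of this check, assuming the analyst's bucketing expression is available as a function (here a 3-character prefix, standing in for something like left(col, 3)). The function name and sample values are made up for illustration.

```python
from collections import defaultdict


def displayable_buckets(values, bucket_fn, min_distinct=3):
    """Return bucket labels that derive from at least min_distinct distinct raw values."""
    distinct_values = defaultdict(set)
    for value in values:
        distinct_values[bucket_fn(value)].add(value)
    return sorted(
        label
        for label, raw in distinct_values.items()
        if len(raw) >= min_distinct
    )


names = ["Bob", "Bobs", "Bobby", "Bob", "Alice", "Alice"]

# Prefix buckets: 'Bob' derives from three distinct values, 'Ali' from one.
print(displayable_buckets(names, lambda n: n[:3]))
# Raw values: every bucket derives from exactly one value, so nothing is shown.
print(displayable_buckets(names, lambda n: n))
```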

sebastian commented 5 years ago

> Another way to solve this would be to label all PII columns, and the cloak would ensure that no PII data is displayed that does not derive from at least X distinct values. This would be done on a per-PII column basis. X is a small number like 3 or 4.

Just to clarify what you are suggesting: Assuming X=3, we would only show "Bob" if there was also Bobs and Bobby in the result-set?

> Compared to the above idea, it has a key advantage that it does not require the operator to figure out a clone_threshold.

It does however seem to require more of a global view in the cloak again. I.e. it needs to look at the entire result-set as a whole and potentially merge rows... For example consider the query:

SELECT age, name, avg(salary)
FROM users
GROUP BY 1, 2
ORDER BY age asc

Here the bobX's are not co-located in the answer at all. Would this mean we drop the bobX's which don't share an age? Or would we merge age too?

yoid2000 commented 5 years ago

> Just to clarify what you are suggesting: Assuming X=3, we would only show "Bob" if there was also Bobs and Bobby in the result-set?

In this case it would mean that you only get to see anything at all if you write SELECT left(col,3) ..., in which case the bucket would be labeled 'Bob', and we'd be allowed to show it because there are three different values.

But in retrospect this idea doesn't make a lot of sense. Just ignore it.

yoid2000 commented 5 years ago

Probably the proper fix for clones is to properly deal with multiple user_ids. Essentially, use both account_id and customer_id as UIDs (i.e. add noise according to the worst case of the two, etc.)
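
One possible reading of "worst case of the two" for low-count filtering is to count distinct values per UID column and use the smallest count. The sketch below assumes that reading, and assumes customer_id identifies the natural person; the function name and rows are hypothetical.

```python
def worst_case_distinct(rows, uid_columns):
    """Return the smallest distinct-entity count across the given UID columns."""
    return min(len({row[column] for row in rows}) for column in uid_columns)


# One bucket (same email) made up of three accounts held by one customer.
bucket = [
    {"account_id": "a1", "customer_id": "c1", "email": "victim@clone.com"},
    {"account_id": "a2", "customer_id": "c1", "email": "victim@clone.com"},
    {"account_id": "a3", "customer_id": "c1", "email": "victim@clone.com"},
]

print(worst_case_distinct(bucket, ["account_id"]))
print(worst_case_distinct(bucket, ["account_id", "customer_id"]))
```

With account_id alone the bucket looks like three users; with both columns it is treated as a single entity and would be suppressed by any threshold above one.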

sebastian commented 5 years ago

But what about the case where no clear secondary ID exists? My feeling is that there is not necessarily a clear overlap between the cases where there are clones and those where there are multiple uid-column candidates.

yoid2000 commented 5 years ago

Yeah won't work everywhere, but will often work and sooner or later we need multi-uid anyway....

I'm realizing more and more that simply protecting an individual isn't enough. There are cases where you need to also protect say households, or partners (joint bank account), etc. I think GDPR actually comes up short in this regard, and it would be very cool if we had this capability, and then could talk about how we go beyond GDPR....

cristianberneanu commented 5 years ago

The question that comes to my mind is what is the utility of these PII columns? Why do we ever want them included in the result set? If a specific PII column has some non-PII information in it (like for example email address or SSN from which we can infer the email provider or the sex) why not write a decoder that reduces the column to the non-PII information only?

Another issue that I see is that the account id/number/etc. is not the actual user id. The user id is the email or SSN or something else that is always related to the user in a unique way.

As to detecting these cases: can't we use the results from the isolating queries to automatically make a decision? If a column is 99% isolating, then we should at least warn about it (as some values can leak from it) or maybe automatically censor it always (like we do when user ids are selected).
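
As a rough sketch of such a check: treat a value as isolating when it appears for exactly one UID, and flag a column when the fraction of isolating values reaches a cutoff. The function names, the 0.99 default, and the sample rows are assumptions for illustration, not the cloak's actual isolating-column logic.

```python
from collections import defaultdict


def isolating_fraction(rows, uid_column, column):
    """Fraction of distinct values in `column` that belong to exactly one UID."""
    uids_per_value = defaultdict(set)
    for row in rows:
        uids_per_value[row[column]].add(row[uid_column])
    if not uids_per_value:
        return 0.0
    isolating = sum(1 for uids in uids_per_value.values() if len(uids) == 1)
    return isolating / len(uids_per_value)


def should_warn(rows, uid_column, column, cutoff=0.99):
    """Flag a column whose isolating fraction reaches the (hypothetical) cutoff."""
    return isolating_fraction(rows, uid_column, column) >= cutoff


rows = [
    {"uid": "u1", "email": "a@example.com", "sex": "M"},
    {"uid": "u2", "email": "b@example.com", "sex": "F"},
    {"uid": "u3", "email": "c@example.com", "sex": "M"},
    {"uid": "u4", "email": "d@example.com", "sex": "F"},
]

print(should_warn(rows, "uid", "email"))
print(should_warn(rows, "uid", "sex"))
```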

sebastian commented 5 years ago

I am not sure the isolating cache is enough by itself. Say we have firstname, lastname, sex, and age in our dataset. None of these might be seen as isolating by our isolating-column cache, but collectively they most certainly would be. When we have a cloned natural person, the combination of these attributes would still make it past the low count filter.
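
To make this concrete, here is a small made-up dataset in which every single-column value is shared by several UIDs, while the four-column combination pins down one natural person who holds three accounts. The helper name and the data are purely illustrative.

```python
from collections import defaultdict


def uid_counts(rows, uid_column, columns):
    """Map each combination of values in `columns` to its number of distinct UIDs."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[c] for c in columns)].add(row[uid_column])
    return {key: len(uids) for key, uids in groups.items()}


def person(uid, firstname, lastname, sex, age):
    return {"uid": uid, "firstname": firstname, "lastname": lastname, "sex": sex, "age": age}


rows = [
    person("u1", "Bob", "Smith", "M", 30),
    person("u2", "Bob", "Jones", "M", 40),
    person("u3", "Ann", "Smith", "F", 40),
    person("u4", "Ann", "Jones", "F", 30),
    # u5 and u6 are clones: extra accounts of the same person as u1.
    person("u5", "Bob", "Smith", "M", 30),
    person("u6", "Bob", "Smith", "M", 30),
]

columns = ["firstname", "lastname", "sex", "age"]
for column in columns:
    print(column, uid_counts(rows, "uid", [column]))
print(uid_counts(rows, "uid", columns))
```

No single column has a value tied to only one UID, yet the combination ("Bob", "Smith", "M", 30) shows three UIDs that all belong to one person, so it would pass a filter that counts UIDs.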