Open yoid2000 opened 5 years ago
@sebastian
Another way to solve this would be to label all PII columns, and the cloak would ensure that no PII data is displayed that does not derive from at least X distinct values. This would be done on a per-PII column basis. X is a small number like 3 or 4.
Compared to the above idea, it has a key advantage that it does not require the operator to figure out a clone_threshold.
It would also give an analyst a warm-and-fuzzy feeling about PII protection (can't literally view individual last names, just portions of last names).
> Another way to solve this would be to label all PII columns, and the cloak would ensure that no PII data is displayed that does not derive from at least X distinct values. This would be done on a per-PII column basis. X is a small number like 3 or 4.
Just to clarify what you are suggesting: Assuming X=3, we would only show "Bob" if there was also Bobs and Bobby in the result-set?
> Compared to the above idea, it has a key advantage that it does not require the operator to figure out a clone_threshold.
It does however seem to require more of a global view in the cloak again. I.e. it needs to look at the entire result-set as a whole and potentially merge rows... For example consider the query:
SELECT age, name, avg(salary)
FROM users
GROUP BY 1, 2
ORDER BY age ASC
Here the bobX's are not co-located in the answer at all. Would this mean we drop the bobX's which don't share an age? Or would we merge age too?
> Just to clarify what you are suggesting: Assuming X=3, we would only show "Bob" if there was also Bobs and Bobby in the result-set?
In this case it would mean that you only get to see anything at all if you write SELECT left(col,3) ..., in which case the bucket would be labeled 'Bob', and we'd be allowed to show it because there are three different values.
But in retrospect this idea doesn't make a lot of sense. Just ignore it.
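For the record, here is a minimal sketch of the rule as described above, assuming it is applied per displayed bucket. The helper name `displayable_buckets` and the sample names are made up for illustration; nothing here reflects actual cloak code.

```python
from collections import defaultdict

def displayable_buckets(values, bucket_fn, min_distinct=3):
    """Return bucket labels that derive from at least min_distinct distinct raw values.

    bucket_fn plays the role of the expression in the SELECT, e.g. left(col, 3).
    min_distinct is the X from the proposal.
    """
    distinct_raw = defaultdict(set)
    for value in values:
        distinct_raw[bucket_fn(value)].add(value)
    return sorted(
        label for label, raw in distinct_raw.items() if len(raw) >= min_distinct
    )

names = ["Bob", "Bobs", "Bobby", "Bob", "Alice", "Alicia"]

# Equivalent of SELECT left(name, 3): 'Bob' is backed by three distinct names.
print(displayable_buckets(names, lambda s: s[:3]))  # ['Bob']

# Equivalent of SELECT name: every bucket is backed by a single value.
print(displayable_buckets(names, lambda s: s))      # []
```

The second call shows the consequence mentioned above: selecting the raw column never shows anything, because each bucket derives from exactly one value.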
Probably the proper fix to clones is to properly deal with multiple user_ids. Essentially use both account_id and customer_id as UIDs (i.e. add noise according to the worst-case of the two, etc.)
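A rough sketch of what "worst case of the two" could mean for low-count filtering, assuming the worst case is the UID column with the fewest distinct values in the bucket. The function names, the threshold of 3, and the sample rows are all hypothetical, and the noise side is left out.

```python
def worst_case_distinct(rows, uid_columns):
    """Smallest distinct-entity count in the bucket over all candidate UID columns."""
    return min(len({row[col] for row in rows}) for col in uid_columns)

def passes_low_count(rows, uid_columns, threshold=3):
    """Hypothetical noise-free low-count check using the worst-case UID column."""
    return worst_case_distinct(rows, uid_columns) >= threshold

# One customer who opened four accounts with the same email.
bucket = [
    {"account_id": 1, "customer_id": "c1", "email": "x@example.com"},
    {"account_id": 2, "customer_id": "c1", "email": "x@example.com"},
    {"account_id": 3, "customer_id": "c1", "email": "x@example.com"},
    {"account_id": 4, "customer_id": "c1", "email": "x@example.com"},
]

print(passes_low_count(bucket, ["account_id"]))                 # True: looks like 4 users
print(passes_low_count(bucket, ["account_id", "customer_id"]))  # False: really 1 customer
```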
But what about the case where no clear secondary ID exists? My feeling is that there is not necessarily a clear overlap between the cases where there are clones and those where there are multiple uid-column candidates.
Yeah won't work everywhere, but will often work and sooner or later we need multi-uid anyway....
I'm realizing more and more that simply protecting an individual isn't enough. There are cases where you need to also protect say households, or partners (joint bank account), etc. I think GDPR actually comes up short in this regard, and it would be very cool if we had this capability, and then could talk about how we go beyond GDPR....
The question that comes to my mind is what is the utility of these PII columns? Why do we ever want them included in the result set? If a specific PII column has some non-PII information in it (like for example email address or SSN from which we can infer the email provider or the sex) why not write a decoder that reduces the column to the non-PII information only?
Another issue that I see is that the account id/number/etc. is not the actual user id. The user id is the email or SSN or something else that is always related to the user in a unique way.
As to detecting these cases: can't we use the results from the isolating queries to automatically make a decision? If a column is 99% isolating, then we should at least warn about it (as some values can leak from it) or maybe automatically censor it always (like we do when user ids are selected).
I am not sure the isolating cache is enough by itself. Say we have firstname, lastname, sex, and age in our dataset. None of these might be seen as isolating by our cache on whether columns are isolating or not, but collectively they most certainly would be. When we have a cloned natural person the combination of these attributes would still make it past the low count filter.
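To make the point concrete, here is a small sketch, assuming "isolating" means that a value (or combination of values) belongs to exactly one uid. The function `isolating_fraction` and the toy rows are invented for illustration and are not how the isolating cache is actually implemented.

```python
from collections import defaultdict

def isolating_fraction(rows, columns, uid="uid"):
    """Fraction of distinct value combinations that belong to exactly one uid."""
    uids_per_value = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in columns)
        uids_per_value[key].add(row[uid])
    isolating = sum(1 for uids in uids_per_value.values() if len(uids) == 1)
    return isolating / len(uids_per_value)

rows = [
    {"uid": 1, "firstname": "Ann", "lastname": "Lee", "sex": "F", "age": 30},
    {"uid": 2, "firstname": "Ann", "lastname": "Kim", "sex": "F", "age": 40},
    {"uid": 3, "firstname": "Bob", "lastname": "Lee", "sex": "M", "age": 40},
    {"uid": 4, "firstname": "Bob", "lastname": "Kim", "sex": "M", "age": 30},
]

# Each column on its own is shared by two uids, so nothing is isolating.
for col in ["firstname", "lastname", "sex", "age"]:
    print(col, isolating_fraction(rows, [col]))  # 0.0 for every column

# The combination pins down every uid.
print(isolating_fraction(rows, ["firstname", "lastname", "sex", "age"]))  # 1.0
```

A per-column check would report all four columns as safe here, even though together they identify each user.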
It may be the case that sometimes there are multiple UIDs associated with the same natural person. This can happen, for instance, if a column like account_number or contract_number is used as the UID, but a given natural person has the ability to create multiple accounts or contracts. Let's refer to this as a 'clone'.
This is the case with Telefonica. It also happened with the Kia Clinic dataset.
The main problem with this arises when there is personally identifying information (PII) like name or address in the dataset. Suppose, for instance, that there is a column email. If the analyst selects that column and the cloned natural person has used the same email in each account, then the cloak will happily display the email.
A second problem comes when the analyst uses the PII in a WHERE clause.
The current solution to this is to remove such PII columns, or (if the customer is willing to do the work), find the clones and individually remove or mask PII for the clones.
When cloning exists, we can basically solve the problem in the following way:
[...] UIDs associated with a given natural person). Call this the clone_threshold.
[...] clone_threshold. (I'm currently unclear how these are set right now, but basically there is a min_hard_thresh and a mean_soft_thresh of some sort, and both of these would be increased by clone_threshold).
Note that it isn't really necessary to raise the amount of noise. When we do low-effect detection, we can likewise raise the threshold when dealing with PII columns.
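The threshold adjustment could look roughly like the sketch below. It assumes a low-count filter with a hard lower bound plus a Gaussian noisy threshold, which is only a guess at how min_hard_thresh and mean_soft_thresh are used. The function name, the `sd` parameter, and all numeric values are hypothetical.

```python
import random

def passes_low_count(distinct_uids, min_hard_thresh, mean_soft_thresh,
                     sd, rng, clone_threshold=0):
    """Hypothetical low-count filter; both thresholds are raised by clone_threshold."""
    hard = min_hard_thresh + clone_threshold
    soft_mean = mean_soft_thresh + clone_threshold
    if distinct_uids < hard:
        return False
    # Noisy threshold, never allowed to drop below the hard bound.
    noisy_thresh = max(hard, rng.gauss(soft_mean, sd))
    return distinct_uids >= noisy_thresh

rng = random.Random(0)

# One natural person with 5 accounts sharing an email: the bucket has 5 UIDs.
# sd=0.0 is used only to keep this example deterministic.
print(passes_low_count(5, 2, 4, 0.0, rng))                     # True: email is shown
print(passes_low_count(5, 2, 4, 0.0, rng, clone_threshold=5))  # False: suppressed
```

With clone_threshold set to the maximum number of UIDs per natural person, a bucket made up of a single person's clones can no longer clear the hard bound, whatever the noise.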