Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform.

Discussion: support multiple distinct users per row #2991

Open sebastian opened 6 years ago

sebastian commented 6 years ago

There are cases where a table contains data about multiple distinct entities that need protecting.

In some instances the individuals are of the same class (a person sending another person money or an email – both being identified by their bank account numbers or email addresses respectively); in others the individuals might be of altogether different classes (an organisation and a human being, or the individual making a call and the cell tower through which the call went).

Our system is not built to deal with any of these cases. If we look at the bank transaction example, one could easily learn something about more common bank transaction recipients (assuming the system has been configured to protect the senders) by a query such as:

SELECT recipient.iban, count(*)
FROM table
GROUP BY 1

The simplest solution is to say that our system does not support these use cases. We have recommended that such secondary user identifiers should be removed from the dataset. This isn't, however, always straightforward. As we know from our work on isolating columns, there might be any number of columns that can act as user identifier substitutes. These are oftentimes hard to classify correctly, and if we do not have the secondary uid-columns in the dataset, the cloak cannot determine these connections either. What's more: there might be a lot of valuable information in these secondary IDs or secondary user identifier substitute columns that we would want to expose through our system. Say for example that the recipients of bank transactions could be individuals (those we want to protect) but could equally well be large corporations (Amazon, Lidl, IKEA, ...) about whom we would like to learn more. Without additional knowledge it isn't always possible to identify which is which. Or say we wanted to read out the country part of an IBAN, or the issuing bank.

Below I will list some ideas for possible ways we could attack the problem of multiple users per row. This is something we need to discuss further. Hopefully this leads to some useful solutions.

Splitting table by uid-column

@yoid2000 has been playing around with the idea of splitting tables with multiple distinct users into multiple tables, each containing only the subset of the data about a single protectable entity.

Doing this manually is error-prone as you might not capture all potentially isolating columns. One solution would be to allow specifying multiple user-id columns in your cloak config and having the cloak partition the table automatically. The cloak would run the isolating-columns check for each uid-column, and the columns that are isolating for a given uid-column would only be exposed in that uid's partition.

Different levels of protection within a column

If the customer can provide a whitelist of entities that do not require protection (e.g. large retail stores), then we could perform a table JOIN with the whitelist such that we allow whitelisted properties through the anonymizer, and redact those values that are not whitelisted. This process would also have to be done for other columns that are deemed to be isolating for the particular entity type (for example recipient.name for a user id of recipient.iban).
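
A minimal sketch of what such a whitelist JOIN could look like, assuming a hypothetical recipient_whitelist table keyed by IBAN (table and column names are made up for illustration):

-- Sketch: pass whitelisted recipients through, redact everything else.
SELECT t.amount,
       t.sender_iban,
       CASE WHEN w.iban IS NOT NULL
            THEN t.recipient_iban
            ELSE NULL              -- redact non-whitelisted recipients
       END AS recipient_iban,
       CASE WHEN w.iban IS NOT NULL
            THEN t.recipient_name
            ELSE NULL              -- same treatment for linked isolating columns
       END AS recipient_name
FROM transactions t
LEFT JOIN recipient_whitelist w ON w.iban = t.recipient_iban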

This makes data discovery more complicated as it is a whitelist approach where entities you do not know about ahead of time will vanish.

Composite uid

The most generic approach (but probably also the one that yields the worst results) would be to construct composite contaminating IDs. I.e. if Alice sends Bob a message, then we treat Alice and Bob as a single individual from that point on. These IDs would also have to be such that if Bob has sent Cecilia a message as well, then AliceBobCecilia has to become a composite ID in all cases where any of the IDs (or combinations of IDs) appear. In heavily connected datasets this quickly results in there being very few distinct user IDs, and the resulting anonymization becomes very destructive.

The reason we would have to make these composite IDs sticky across all occurrences of the IDs in the result set is that otherwise you would end up with additional user IDs, which would in turn have the potential to let data pass the anonymizer. Consider the following dataset:

Sender UID | Sender Secret | Recipient UID
---------- | ------------- | -------------
A          | x             | B
A          | x             | C
A          | x             | D
A          | x             | E

The query

SELECT "Sender Secret"
FROM table
GROUP BY 1

would quickly yield sender A's secret if each combination of sender and recipient were considered a separate new user.
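
For reference, the sticky composite IDs amount to connected components over the sender/recipient relation. A rough sketch of how they could be computed with a PostgreSQL-style recursive CTE (table and column names are hypothetical):

-- Sketch: assign every UID the smallest UID in its connected component,
-- treating sender/recipient links as undirected edges.
WITH RECURSIVE edges AS (
  SELECT sender_uid AS a, recipient_uid AS b FROM transactions
  UNION
  SELECT recipient_uid AS a, sender_uid AS b FROM transactions
),
reach (node, reaches) AS (
  SELECT a, a FROM edges
  UNION
  SELECT e.a, r.reaches
  FROM edges e
  JOIN reach r ON r.node = e.b
)
SELECT node, min(reaches) AS composite_uid
FROM reach
GROUP BY node

In heavily connected data this collapses most UIDs into one giant component, which is exactly the destructiveness mentioned above.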

Question:

Different classes of protectable entities

Say you are protecting organisations as well as users. If there is a clear hierarchy between the entity classes (say company and employee) then one could protect the top-level entity (the company) and the lower levels get protected automatically (the employees). This is what we are attempting to do at DATEV. However if there is no such clear relationship (say customer and shop), then such an option does not exist. I do not have any idea on how to solve this yet.

Other tricks

We can play some other slightly finicky tricks too. This brings us more into data masking territory. This particular example could be used in combination with splitting of tables.

Say we have a bank transactions dataset with a sender and a receiver, and that we have split the dataset in two such that there is a sender-specific version and a recipient-specific version. For datatypes that we fully understand, say IBAN, we could generate new sub-columns exposing the parts of the value that are safe. For example we include recipient.IBAN.country and recipient.IBAN.issuingBank in the table containing data about the sender. Likewise for phone numbers we could expose the region or country code.
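
For example, for German IBANs (positions 1–2 are the country code, positions 5–12 the bank identifier) the derived sub-columns could be computed with something like the following sketch (column names are hypothetical):

-- Sketch: expose only the safe parts of the recipient IBAN in the sender table.
SELECT amount,
       sender_iban,
       substring(recipient_iban from 1 for 2) AS recipient_iban_country,
       substring(recipient_iban from 5 for 8) AS recipient_iban_issuing_bank
FROM transactions

Note that the bank-identifier offsets differ per country, so in practice this needs per-country handling.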

We have to be very cautious here though. There are many cases where it might seem safe to expose a part, but where it really isn't. Consider email addresses. The domain or TLD might be considered safe in most cases, but clearly isn't always. If you learnt that someone with a probsteide.com email address is in the database, then the number of possible owners is very limited.

yoid2000 commented 6 years ago

I want to quickly mention that isolating columns, as we've defined them in the past, are not the problem here. An isolating column is one where more than some fraction of the values are distinct to one user (I think that fraction is currently set to 50%).

The problem is what I would call auxiliary UIDs, or XUIDs for short. An XUID is a column where every value in that column is the same for a given UID. Examples of XUIDs are firstname, street, SSN, etc.

This suggests to me that the approach where we make multiple views, each with a single UID, is not all that error-prone. Normally I would expect XUID columns not to be in the table that has multiple UIDs (i.e. a transactions table, or an emails table), but rather in a table that would normally be joined to the transactions table. When this is the case, then in fact all that needs to be done is to remove one of the UID columns. When this is not the case, then it should be fairly apparent what the XUID columns are. For instance, if there is a senderUID column, then one might expect something like senderAcctNumber, or anyway something that pretty clearly identifies the column as auxiliary to the senderUID...

I suggest we have a look at the TeamBank or DATEV schemas and see if there are counter-examples.

yoid2000 commented 6 years ago

As a side note, I would like to break this issue into three separate issues....

sebastian commented 6 years ago

> Normally I would expect XUID columns to be not in the table that has multiple UIDs (i.e. transactions table, or emails table), but rather in a table that would normally be joined to a transactions table.

I don't think this assumption holds. In an ideal world with normalized database tables the secondary personal information would be in tables that get joined into say a transaction table. Reality is messier. Especially for analytics purposes (and especially so in NoSQL databases) you often have heavily denormalized tables in order to provide easy access to data without having to do JOINs.

Even if the data is nicely normalized in the database, the customer might hide a complex (and frequently needed) JOIN behind a view which in turn is then exposed through Aircloak.

Let's take the TeamBank table as an example. Each transaction contains at least the following information for the recipient:

> As a side note, I would like to break this issue into three separate issues....

Note that this is not an implementation issue, but rather an issue to base discussions off of. If you think it aids the discussion to break these into separate issues, then we should definitely do so.

yoid2000 commented 6 years ago

I can imagine that it still sometimes gets messy, but I primarily wanted to make the point that not every isolating column is dangerous.

Anyway, I agree that long-term we should have something that isn't error prone. Short term, however, this seems manageable if we help our customers with this...

yoid2000 commented 6 years ago

There are two distinct problems here. One is dealing with multiple different entities that we want to protect. Let's call this the "multiple UID" problem.

The other is that there is a subset of protected UIDs that don't actually need to be protected. Let's call this the "safe UID" problem.

Here I talk about the multiple UID problem. The safe UID problem can be discussed elsewhere.

Regarding the multiple UID problem, there are two variants, symmetric and asymmetric. Symmetric is where the two UIDs are of more-or-less the same class, like sender/recipient in a transaction or email exchange. Asymmetric is where they are clearly different, like organization and user.

Symmetric multiple UIDs

So first I'll limit this to symmetric multiple UIDs, with the understanding that both symmetric and asymmetric can have similar mechanisms.

Some terms:

  UID1, UID2: the two protected uid columns that appear in the same table (e.g. sender and recipient).
  Linked XUID: an auxiliary column whose values identify the same entity as one specific UID (e.g. a sender account number column for the sender UID).
  Unlinked XUID: an auxiliary identifying column that is not tied to one specific UID (e.g. a name column in a separate account table keyed by a generic uid).

(Note that there could in principle be a UID3, UID4, ...)

I think there are basically two approaches one can take with this:

  1. Schema modification: Modify the schema so that there is only one UID per table or JOIN of tables. This requires no cloak changes.
  2. Query checking: Modify the cloak so that it is aware of multiple UIDs, and have the cloak inspect and reject queries that are not safe.

Schema modification

The goal is to make it appear that there are two separate schemas (schema1 and schema2), one with UID1 and one with UID2.

For a given schema (say schema1) the following holds: it exposes UID1, while UID2 and any XUID columns linked to UID2 are removed.

Note that unlinked XUID columns are left alone and not given any special treatment.

Now the question becomes: how can we implement these dual schemas? Following are some approaches, from least to most disruptive to the code base.

Change the data store

Modify the native data store to the new schema. In cases where the DB technology allows for views, this should not be very disruptive for the customer. Where views are not allowed, this could be very disruptive.
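
As a sketch, for a transactions table with sender and recipient columns (names hypothetical), the split could be expressed as two views:

-- View for the sender schema: recipient UID and its linked XUID columns are dropped.
CREATE VIEW transactions_sender AS
  SELECT sender_iban, sender_name, amount, booking_date
  FROM transactions;

-- View for the recipient schema: sender UID and its linked XUID columns are dropped.
CREATE VIEW transactions_recipient AS
  SELECT recipient_iban, recipient_name, amount, booking_date
  FROM transactions;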

Sebastian has raised the question of it being error-prone to correctly identify all linked XUID columns. One way to mitigate this would be to have a software tool that runs offline and identifies linked XUID columns. It could operate by selecting a smallish number of random UIDs from tables with two UIDs, and making queries of the form:

SELECT *
FROM table
WHERE uid = UID

on tables that have multiple rows per UID. It then identifies those columns where there is only a single distinct value for a given UID. This column is then a likely linked XUID. These would then be reported to the administrator.
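
The per-column check could also be expressed directly in SQL. A sketch, assuming hypothetical column names, a sampled_uids helper table holding the randomly selected UIDs, and sender_id as the UID under test:

-- Sketch: a column is a likely linked XUID for sender_id if every sampled
-- sender has exactly one distinct value in it.
SELECT max(cnt_name) = 1 AS name_is_linked_xuid,
       max(cnt_acct) = 1 AS acct_is_linked_xuid
FROM (
  SELECT sender_id,
         count(distinct sender_name) AS cnt_name,
         count(distinct sender_acct) AS cnt_acct
  FROM transactions
  WHERE sender_id IN (SELECT uid FROM sampled_uids)
  GROUP BY sender_id
) per_uid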

Note that with this approach, an analyst could get two noise samples for most (or all?) queries just by executing the same query on each schema.

Use Cloak views

Similarly we could use cloak views (those configured in the .json file by the cloak administrator, not those that can be configured by the analyst).

The advantage here is that it could be used with data stores that don't natively have views, making it less disruptive to the customer.

The substantial disadvantage is that every view made would have to be carefully constructed so that the schema is properly split. If you believe as I do that administrator views are a powerful tool for getting around cloak limitations in a safe way, then this would essentially destroy that use of admin views.

The "fix" would be to allow views to be made on views, but this sounds like a terrible idea ... complex and the wrong way to fix the problem overall.

This approach also has the double noise sample problem.

Query Checking

Cloak dynamically rejects queries based on configuration

In this approach, the UID and linked XUID columns are specified in the cloak configuration.

The cloak then rejects queries that try to mix UID1 and UID2 (or associated linked XUID) in the same query.

Noting that problems occur only when either two UIDs are used in the same query, or the wrong XUID is used (e.g. XUID2 with UID1), it seems that we only need two types of configured columns:

  1. UID column
  2. Linked XUID column

A linked XUID is one that is specifically associated with one or the other UID. For example, say you had a single table with the following four columns (among others): sender_id, receiver_id, sender_acct, receiver_acct. In this case, sender_id and receiver_id would be UIDs, sender_acct would be a linked XUID linked to sender_id, and receiver_acct would be linked to receiver_id.

By contrast, say you had two tables, acct_info and trans (transactions). The trans table has columns sender_id and receiver_id. These are configured as UIDs, and can be used to JOIN to other tables. The acct_info table has columns acct_num and name. Because these are unlinked XUIDs, they don't have to be configured. A query that joins the two tables would do so on one UID or the other, and the XUIDs would automatically be associated with the appropriate UID.

The following queries are rejected:
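
For instance, using the hypothetical sender_id / receiver_id / sender_acct / receiver_acct columns from the example above, a query mixing the two UIDs (or their linked XUIDs) would be rejected:

-- Rejected: mixes the sender UID with a column linked to the receiver UID.
SELECT sender_id, receiver_acct, count(*)
FROM trans
GROUP BY 1, 2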

The following queries are allowed (maybe partial list):
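
For example (again using the hypothetical columns above, and assuming acct_info carries a generic uid column to JOIN on), queries that stay within a single UID and its associated columns would be fine:

-- Allowed: uses only the receiver UID's linked XUID.
SELECT receiver_acct, count(*)
FROM trans
GROUP BY 1

-- Allowed: JOIN on one UID; acct_info's unlinked XUIDs then associate with that UID.
SELECT a.name, count(*)
FROM trans t
JOIN acct_info a ON a.uid = t.sender_id
GROUP BY 1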

Regarding admin views (the query object in the .json), I think if we do this we want to change the configuration structure. As it now stands, each admin view requires that the UID be specified, and if we stay with this model it means that the linked XUIDs also need to be specified. This seems redundant and increases the possibility of errors. Unless I'm missing something, I would think a better approach would be to have a separate structure for specifying UIDs and XUIDs, and then any admin view would import its UIDs and XUIDs from there.

I'm not sure if this has any bad interactions with projected tables.

Cloak dynamically suppresses columns based on auto-detection of XUIDs

Rather than have the admin configure the linked XUIDs, the cloak could do it. The idea here is that, for any given table with two or more UIDs, the cloak would examine each column to see if it is a linked XUID for one of the UIDs (like the offline tool mentioned above).

Once determined, the cloak operates as with configured UIDs and XUIDs above.

Failed Ideas

Randomize UID2

One idea was to randomize UID2 (in schema1), but otherwise allow it to be queried.

The benefit would have been that the analyst would be able to learn about relationships between users. For instance, supposing that UID1 is the sender and UID2 is the recipient, an analyst would still be able to ask "how many distinct recipients do senders have?", "how many distinct recipients above a transaction amount of $X?", and so on.

This doesn't work because one could establish a set of attributes for each random UID2 that would expose information about individual users like this:

SELECT col, UID2, count(*)
FROM table
GROUP BY 1,2

Don't display rather than delete

A second idea was to keep UID2 and XUID2 columns, but never allow the values themselves to be displayed. The benefit would have been that these columns could still be used in conditions. The problem is that an analyst could learn an arbitrary amount about each user with queries like:

SELECT col, count(*)
FROM table
WHERE UID2 = uid
GROUP BY 1

LCF Checking on both UIDs

Another idea is, rather than configure linked XUID columns, to float both UIDs, do LCF checking on both UIDs, and only allow the row if both LCF checks say it is ok. This would automatically deal with any dangerous columns.
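
Conceptually (a rough sketch only, with K standing in for the noisy low-count threshold rather than the cloak's actual LCF machinery), the per-bucket check would amount to something like:

SELECT col, count(*)
FROM table
GROUP BY col
HAVING count(distinct UID1) >= K   -- low-count check over the first UID
   AND count(distinct UID2) >= K   -- low-count check over the second UID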

The problem here is that it is hard to make this work where there is an aggregate function that has a GROUP BY on the UID. You'd have to prevent the GROUP BY from running in the DB and instead run it in the cloak. This would increase processing, whereas with no-uid we are trying to decrease it.