Handling of PII fields

rudigiesler commented 3 years ago

I'll start off with the specific reason for this request, and then follow with a suggestion for a generic way that this can be implemented, as I'm sure it's useful for a lot of use cases.

For the flow results service, we don't want to keep all data inside the database forever. This results in a large, slow, and expensive database. Rather, we would like to archive old data out into an S3-like service, similar to how RapidPro currently does.

When dealing with PII, you often need to provide a mechanism for a user to delete all of their PII. When all this data sits in a database, this is easy to do. But if you store it in an S3-like service, this becomes a lot more difficult. If there is PII in your archives, you will have to download, modify, and re-upload all your archives for a delete request. An easier way to do this is to not store the PII in the archives in the first place, either by encrypting the data, and only keeping the encryption key in the database (which you can then easily delete on request), or by doing a one-way hash on the data before archiving it.

There are two fields that I think could contain PII:

Contact ID. Currently some of our services use the user's phone number as their contact ID.
Response. Some questions ask for PII, which is then stored as the response.

For Contact ID, it makes sense just to always encrypt/hash this field.

For Response, we want to be able to mark which questions are PII, so that we only encrypt/hash those. For example, if there's a question that asks "Tea or Coffee?", we don't want to encrypt/hash those answers, as we might want to go back and look at the choices across all our historical data. But if we're asking for a user's ID number, then we do want that encrypted/hashed; there's not much analysis we can do on it anyway.

I'm proposing to add a pii boolean property to each question, defaulting to false, and not a required field, that will allow us to mark which questions might contain PII in their answers.
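As an illustration of how such a flag might be used before archiving, here is a minimal sketch. It assumes the hashing option rather than encryption, and the row layout, the question IDs, and the function names (`hash_value`, `prepare_for_archive`) are all hypothetical; only the optional `pii` property defaulting to false comes from the proposal itself.

```python
import hashlib


def hash_value(value: str) -> str:
    """Return the SHA-256 hex digest of a string (one-way)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical, simplified question definitions. Only `pii` is the
# proposed property; it is optional and treated as false when absent.
QUESTIONS = {
    "q_drink": {"type": "select_one", "label": "Tea or Coffee?"},
    "q_id_number": {"type": "text", "label": "ID number?", "pii": True},
}


def prepare_for_archive(row: dict, questions: dict) -> dict:
    """Return a copy of a response row with PII hashed for archiving."""
    archived = dict(row)
    # Contact ID is always hashed, as suggested above.
    archived["contact_id"] = hash_value(row["contact_id"])
    # Responses are hashed only for questions marked as PII.
    if questions[row["question_id"]].get("pii", False):
        archived["response"] = hash_value(row["response"])
    return archived
```

With this sketch, a "Tea or Coffee?" answer is archived unchanged, while an ID number answer is replaced by its digest; the contact ID is hashed in both cases.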

I'm happy to discuss other approaches, as well as some things I'm not sure about:

Do we consider Contact ID PII, or do we push the responsibility on to applications to ensure that it's not?
Do we have this in the flow results spec, or do we push the responsibility onto applications to encrypt/hash before storing it in the flow results format?
Do we encrypt the fields (reversible, but need to make sure that we delete the encryption keys on user request), or hash them (one way, so we don't need to do anything on user request)? (This could also just be an implementation detail; it doesn't need to be part of the spec.)

markboots commented 3 years ago

Hi @rudigiesler, thanks for those valuable thoughts; this is a really salient issue.

Adding on some use-cases we've seen from the Viamo perspective:

We've started using one-way secure hashes to de-identify PII at 3 levels: in data reports shared with partners/clients, inside our app's reporting environment for user accounts that don't have permission granted to see personal data, and to respect data-residency requirements. We've found this to be a good approach because researchers/users of the data can still identify repeated values in the dataset, even though they don't know who the individuals are.
We see use-cases where a data server might need to securely store PII, but expose only de-identified data to some users/reports/forms of API access.
We definitely agree on the principle that not all data is PII; it might be just a user's phone number and certain questions marked as such.
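The point about repeated values can be shown with a short sketch: because a hash is deterministic, equal inputs always map to equal digests, so per-person counts survive de-identification. The function names and the sample phone numbers are made up for illustration, and plain SHA-256 is an assumption; the exact scheme Viamo uses is not stated here.

```python
import hashlib
from collections import Counter


def pseudonym(value: str) -> str:
    """Deterministic one-way pseudonym: the SHA-256 hex digest of the value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def responses_per_contact(contact_ids):
    """Count rows per de-identified contact without keeping the raw IDs."""
    return Counter(pseudonym(contact_id) for contact_id in contact_ids)
```

For example, `responses_per_contact(["+15550000001", "+15550000002", "+15550000001"])` yields two keys, one with a count of 2 and one with a count of 1, and neither key is a phone number.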

From the spec perspective: Until now, Flow Results hasn't taken a stance on de-identification or said "this is the way to do it"; it's been up to applications to share or store what they want. With a growing focus on data privacy and security and strategies like pseudonymization, would it be helpful to put some standard or suggested ways of supporting this into the spec?

I'm proposing to add a pii boolean property to each question, defaulting to false, and not a required field, that will allow us to mark which questions might contain PII in their answers.

No concerns from me.

Additional questions on how we might want to go further:

Is anyone aware of any other standards in ICT4D for common patterns on de-identification? For example, something that says: "The universal way to represent a secure personal identifier is the SHA256 of <FIELD_ONE>-<FIELD_TWO>", which would mean that any applications using this would have the same (interoperable) value for that person's ID?
Is it useful for Flow Results to put in standard ways of handling de-identification? For instance, Flow Results servers could operate in a mode where they ingest data containing PII values, but then when serving them back, the PII values could be de-identified using a standard hashing mechanism, depending on the privileges of the API key accessing the data. An approved API key could request raw data, whereas a lower-privilege API key would get de-identified data. [This could also be an API query parameter for higher-privilege API keys.]
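To make the second question concrete, here is a rough sketch of the serving-side behaviour it describes. Everything named here is hypothetical (the `serve_row` function, the `key_can_view_pii` and `deidentify_param` arguments, and the row layout); it is not an existing Flow Results API, and SHA-256 stands in for whatever standard hashing mechanism might be agreed on.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 hex digest of a string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def serve_row(row, pii_question_ids, key_can_view_pii, deidentify_param=False):
    """Return the row as it would be served to a given API key.

    A privileged key gets raw data unless it asks for de-identified data
    via the (assumed) query parameter; any other key always gets
    de-identified data.
    """
    if key_can_view_pii and not deidentify_param:
        return dict(row)
    served = dict(row)
    served["contact_id"] = sha256_hex(row["contact_id"])
    if row["question_id"] in pii_question_ids:
        served["response"] = sha256_hex(row["response"])
    return served
```

In this sketch the server stores raw values and hashes only on the way out, so the same stored row can back both privileged and de-identified views.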

@pld, I think Ona's team has done a lot of thinking about GDPR and data governance. Any suggestions on this? @nditada?

markboots commented 3 years ago

Do we encrypt the fields (reversible, but need to make sure that we delete the encryption keys on user request), or hash them (one way, so we don't need to do anything on user request)? (This could also just be an implementation detail; it doesn't need to be part of the spec.)

If we include something on this within the spec*, I would lean toward hashing them with an approved "secure" hash like SHA256, since our GDPR research concluded that if there is any way to reverse the process and regain the personal information, it still constitutes personal information. Non-reversible hashed data would not count as personal information -- unless there was some other way of identifying the individual through patterns of other data.

(* I agree that we don't need this in the spec; apps can just store de-identified data from the start. But if it's a common and valuable use-case, we could consider support for it.)

pld commented 3 years ago

I do not think this is appropriate in the spec. Here is what FHIR says about why this is not included in its spec:

The appropriate protections for Privacy and Security are specific to the risks to Privacy and the risks to Security of that data being protected. This concept of appropriate protections is a very specific thing to the actual data. Any declaration of 'required' or 'optional' requirements that could be mentioned here are only recommendations for that kind of Resource in general for the most common use of that Resource. Where one uses the Resource in a way that is different than this most common use, one will have different risks and thus need different protections.

FLOIP / flow-results

Handling of PII fields #40