FLOIP / flow-results

Open specification for the exchange of "Results" data generated by mobile platforms using the "Flow" paradigm

Handling of PII fields #40

Open rudigiesler opened 3 years ago

rudigiesler commented 3 years ago

I'll start off with the specific reason for this request, and then follow with a suggestion for a generic way that this can be implemented, as I'm sure it's useful for a lot of use cases.

For the flow results service, we don't want to keep all data inside the database forever. This results in a large, slow, and expensive database. Rather, we would like to archive old data out into an S3-like service, similar to what RapidPro currently does.

When dealing with PII, you often need to provide a mechanism for a user to delete all of their PII. When all this data sits in a database, this is easy to do. But if you store it in an S3-like service, this becomes a lot more difficult. If there is PII in your archives, you will have to download, modify, and re-upload all your archives for a delete request. An easier way to do this is to not store the PII in the archives in the first place, either by encrypting the data, and only keeping the encryption key in the database (which you can then easily delete on request), or by doing a one-way hash on the data before archiving it.

There are two fields that I think could contain PII:

  1. Contact ID. Currently some of our services use the user's phone number as their contact ID.
  2. Response. Some questions ask for PII, which is then stored as the response.

For Contact ID, it makes sense just to always encrypt/hash this field.

For Response, we want to be able to mark which questions are PII, so that we only encrypt/hash those. For example, if there's a question that asks "Tea or Coffee?", we don't want to encrypt/hash those answers, as we might want to go back and look at the choices across all our historical data. But if we're asking for a user's ID number, then we do want that encrypted/hashed; there's not much analysis we can do on it anyway.

I'm proposing to add a pii boolean property to each question, defaulting to false, and not a required field, that will allow us to mark which questions might contain PII in their answers.
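To make the proposal concrete, here is a minimal Python sketch of how an archiving step might use such a flag. Only the `pii` key comes from the proposal; the question definitions, the row layout, and the names `pseudonymize` and `prepare_for_archive` are hypothetical simplifications, and a one-way SHA-256 hash stands in for whichever encrypt/hash approach is eventually chosen.

```python
import hashlib

# Hypothetical, simplified question definitions. Only the optional `pii`
# key is the proposed addition; it defaults to False when absent.
QUESTIONS = {
    "drink": {"type": "select_one", "label": "Tea or Coffee?"},
    "id_number": {"type": "text", "label": "ID number?", "pii": True},
}


def pseudonymize(value):
    """One-way hash of a value (SHA-256 used here as a stand-in)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def prepare_for_archive(row, questions=QUESTIONS):
    """Return a copy of `row` that is safe to write to the archive.

    `row` is assumed to be a dict with `contact_id`, `question_id`
    and `response` keys (a simplification of a real result row).
    """
    archived = dict(row)
    # The contact ID is always hashed, since it may be a phone number.
    archived["contact_id"] = pseudonymize(row["contact_id"])
    # The response is hashed only when the question is marked as PII.
    question = questions.get(row["question_id"], {})
    if question.get("pii", False):
        archived["response"] = pseudonymize(row["response"])
    return archived
```

With this sketch, a "Tea or Coffee?" answer stays readable in the archive for later analysis, while an ID number answer is replaced by its digest.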

I'm happy to discuss other approaches, as well as some things I'm not sure about:

markboots commented 3 years ago

Hi @rudigiesler, thanks for those valuable thoughts; this is a really salient issue.

Adding on some use-cases we've seen from the Viamo perspective:

From the spec perspective: Until now, Flow Results hasn't taken a stance on de-identification or said "this is the way to do it"; it's been up to applications to share or store what they want. With a growing focus on data privacy and security and strategies like pseudonymization, would it be helpful to put some standard or suggested ways of supporting this into the spec?

I'm proposing to add a pii boolean property to each question, defaulting to false, and not a required field, that will allow us to mark which questions might contain PII in their answers.

No concerns from me.

Additional questions on how we might want to go further:

@pld, I think Ona's team has done a lot of thinking about GDPR and data governance. Any suggestions on this? @nditada?

markboots commented 3 years ago

Do we encrypt the fields (reversible, but need to make sure that we delete the encryption keys on user request), or hash them (one way, so we don't need to do anything on user request)? (This could also just be an implementation detail; it doesn't need to be part of the spec.)

If we include something on this within the spec*, I would lean toward hashing them with an approved "secure" hash like SHA256, since our GDPR research concluded that if there is any way to reverse the process and regain the personal information, it still constitutes personal information. Non-reversible hashed data would not count as personal information -- unless there was some other way of identifying the individual through patterns of other data.

(* I agree that we don't need this in the spec; apps can just store de-identified data from the start. But if it's a common and valuable use-case, we could consider support for it.)
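As an illustration of the hashing option, the following Python sketch uses the standard library's `hashlib` to produce a SHA-256 digest of a field. The function name `hash_field` and the whitespace normalization are assumptions made for the example, not anything defined by the spec.

```python
import hashlib


def hash_field(value: str) -> str:
    """Return the SHA-256 hex digest of a field value.

    Stripping surrounding whitespace first is an assumption of this
    sketch, so that trivially different inputs map to the same digest.
    """
    normalized = value.strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# The digest is deterministic, so records belonging to the same contact
# can still be grouped across archives without storing the raw value.
first = hash_field("+27820000000")
second = hash_field(" +27820000000 ")
same_contact = first == second
```

Because the same input always yields the same digest, this fits the caveat above: a hashed value can still act as a stable identifier, so other data patterns attached to it need the same scrutiny.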

pld commented 3 years ago

I do not think this is appropriate in the spec. Here is what FHIR says about why this is not included in its spec:

The appropriate protections for Privacy and Security are specific to the risks to Privacy and the risks to Security of that data being protected. This concept of appropriate protections is a very specific thing to the actual data. Any declaration of 'required' or 'optional' requirements that could be mentioned here are only recommendations for that kind of Resource in general for the most common use of that Resource. Where one uses the Resource in a way that is different than this most common use, one will have different risks and thus need different protections.