RADAR-base / radar-prmt-android

Application to be run on an Android device to interact with the wearable devices & phone sensors for passive data streaming
Apache License 2.0

Differentiating between privacy-sensitive and anonymous data #49

Open blootsvoets opened 7 years ago

blootsvoets commented 7 years ago

We could consider processing privacy-sensitive data, but encrypting it before sending it to the server. The encryption keys could be provided to researchers who are allowed to access those data (for example, using PEP). For non-sensitive data, we could send the data in plain text as we do now, so that the Kafka streams can aggregate it properly. Using the PEP mechanism, those data could also be encrypted, but the Kafka streams could get a key for only those data.

blootsvoets commented 7 years ago

Read through the PEP paper; it is based on a new encryption algorithm. Implementing it would be a complete infrastructure effort. The issue remains: everything that does not have to be shown in the dashboard or retrieved using the API could be encrypted. This would especially involve privacy-sensitive data. In all cases, we would have to trust the researchers to handle their private keys with care, though... Another note: Tresorit, for example, has a nice key exchange algorithm as well, and does not encrypt the data itself with those keys; instead, it encrypts the encryption keys for the data. That makes the data encryption less heavy, and the data does not need to be re-encrypted if keys change. In any case, we'd have to properly follow their protocol (or another well-documented protocol) to avoid the pitfalls of encryption.
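To make the "encrypt the encryption keys" idea concrete, here is a minimal sketch of generic envelope encryption using only the standard Java crypto APIs. It is not Tresorit's protocol and not PEP; the class name `EnvelopeSketch`, its methods, and the choice of AES-256-GCM for the data plus RSA-OAEP for the data key are all assumptions made for illustration. The `rewrap` method shows why a key change does not require touching the encrypted data: only the small wrapped data key is re-encrypted.

```java
import java.security.GeneralSecurityException;
import java.security.Key;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical illustration only: names and algorithm choices are assumptions.
final class EnvelopeSketch {
    private static final String RSA_OAEP = "RSA/ECB/OAEPWithSHA-256AndMGF1Padding";

    /** Encrypted payload plus the data key, itself encrypted for one researcher. */
    static final class Envelope {
        final byte[] wrappedKey;
        final byte[] iv;
        final byte[] ciphertext;

        Envelope(byte[] wrappedKey, byte[] iv, byte[] ciphertext) {
            this.wrappedKey = wrappedKey;
            this.iv = iv;
            this.ciphertext = ciphertext;
        }
    }

    /** Encrypts the payload with a fresh AES key, then wraps that key with RSA. */
    static Envelope seal(byte[] plaintext, PublicKey researcherKey) {
        try {
            KeyGenerator gen = KeyGenerator.getInstance("AES");
            gen.init(256);
            SecretKey dataKey = gen.generateKey();

            byte[] iv = new byte[12]; // 96-bit nonce, the usual size for GCM
            new SecureRandom().nextBytes(iv);
            Cipher aes = Cipher.getInstance("AES/GCM/NoPadding");
            aes.init(Cipher.ENCRYPT_MODE, dataKey, new GCMParameterSpec(128, iv));
            byte[] ciphertext = aes.doFinal(plaintext);

            byte[] wrapped = rsa(Cipher.ENCRYPT_MODE, researcherKey, dataKey.getEncoded());
            return new Envelope(wrapped, iv, ciphertext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Unwraps the data key with the researcher's private key and decrypts. */
    static byte[] open(Envelope env, PrivateKey researcherKey) {
        try {
            byte[] rawKey = rsa(Cipher.DECRYPT_MODE, researcherKey, env.wrappedKey);
            Cipher aes = Cipher.getInstance("AES/GCM/NoPadding");
            aes.init(Cipher.DECRYPT_MODE, new SecretKeySpec(rawKey, "AES"),
                    new GCMParameterSpec(128, env.iv));
            return aes.doFinal(env.ciphertext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Key change: only the wrapped data key is re-encrypted, the data is reused. */
    static Envelope rewrap(Envelope env, PrivateKey oldKey, PublicKey newKey) {
        byte[] rawKey = rsa(Cipher.DECRYPT_MODE, oldKey, env.wrappedKey);
        return new Envelope(rsa(Cipher.ENCRYPT_MODE, newKey, rawKey), env.iv, env.ciphertext);
    }

    static KeyPair newResearcherKeyPair() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    private static byte[] rsa(int mode, Key key, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance(RSA_OAEP);
            cipher.init(mode, key);
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch the server would only ever store `Envelope` objects; granting access to a second researcher would mean adding another wrapped copy of the data key, not another copy of the data.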

fnobilia commented 7 years ago

If we encrypt data using a key that is unknown to the Platform, we cannot run any analysis on this data. What kind of data would you encrypt with this method?

blootsvoets commented 7 years ago

Exactly, that's the point. For example, absolute locations, IP addresses or unprocessed voice data are privacy-sensitive. However, we could choose to store them in an encrypted form. The platform would not be able to read or process them, but would just provide them as-is in the full data extracted from HDFS. Less sensitive data, such as battery levels, we'd send unencrypted so our platform can process it. We could decide on a stream-by-stream basis whether we want the data encrypted or unencrypted. Also, we could choose to always leave the keys unencrypted (anonymous patient ID), and just encrypt the values.
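A rough sketch of what the per-stream decision could look like on the sending side, assuming a hypothetical set of sensitive topic names and a symmetric key that the platform does not hold. The class `StreamPolicySketch`, the topic names, and the "IV prepended to ciphertext" layout are all made up for illustration and are not part of the existing app. The record key (anonymous patient ID) is always passed through in the clear; only the value is encrypted, and only for sensitive topics.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical illustration only: topic names and layout are assumptions.
final class StreamPolicySketch {
    private static final int IV_LENGTH = 12;

    // Assumed per-stream policy: values on these topics are encrypted on the device.
    static final Set<String> SENSITIVE_TOPICS =
            new HashSet<>(Arrays.asList("phone_location", "phone_audio", "phone_ip_address"));

    /** What would be handed to the sender: the key stays readable, the value may not. */
    static final class Outgoing {
        final String topic;
        final String recordKey; // anonymous patient ID, never encrypted
        final byte[] value;     // plaintext, or IV followed by AES-GCM ciphertext
        final boolean encrypted;

        Outgoing(String topic, String recordKey, byte[] value, boolean encrypted) {
            this.topic = topic;
            this.recordKey = recordKey;
            this.value = value;
            this.encrypted = encrypted;
        }
    }

    static Outgoing prepare(String topic, String recordKey, byte[] value, SecretKey streamKey) {
        if (!SENSITIVE_TOPICS.contains(topic)) {
            // Non-sensitive stream: leave as-is so the platform can aggregate it.
            return new Outgoing(topic, recordKey, value, false);
        }
        try {
            byte[] iv = new byte[IV_LENGTH];
            new SecureRandom().nextBytes(iv);
            Cipher aes = Cipher.getInstance("AES/GCM/NoPadding");
            aes.init(Cipher.ENCRYPT_MODE, streamKey, new GCMParameterSpec(128, iv));
            byte[] ciphertext = aes.doFinal(value);

            // Prepend the IV so the researcher can decrypt the value later.
            byte[] sealed = new byte[IV_LENGTH + ciphertext.length];
            System.arraycopy(iv, 0, sealed, 0, IV_LENGTH);
            System.arraycopy(ciphertext, 0, sealed, IV_LENGTH, ciphertext.length);
            return new Outgoing(topic, recordKey, sealed, true);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Researcher side: split off the IV and decrypt the rest. */
    static byte[] decrypt(byte[] sealed, SecretKey streamKey) {
        try {
            Cipher aes = Cipher.getInstance("AES/GCM/NoPadding");
            aes.init(Cipher.DECRYPT_MODE, streamKey,
                    new GCMParameterSpec(128, sealed, 0, IV_LENGTH));
            return aes.doFinal(sealed, IV_LENGTH, sealed.length - IV_LENGTH);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static SecretKey newStreamKey() {
        try {
            KeyGenerator gen = KeyGenerator.getInstance("AES");
            gen.init(256);
            return gen.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the record key is untouched, the platform could still group and store encrypted records per patient; it just could not interpret the values.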

blootsvoets commented 7 years ago

Another alternative is to do the data processing on another "trusted" host, where we would provide the decryption key as well. Right now, I don't think we have the budget + motivation to have this additional infrastructure cost though.

fnobilia commented 7 years ago

The vast majority of collected variables are privacy-sensitive (HR, Acc, etc.). We can absolutely design something to also provide this functionality, but we should bring WP8 into the discussion or wait for a clear need/requirement.

blootsvoets commented 7 years ago

As long as the HR and Acc are not coupled to a specific person, I'd consider them anonymised data, which would be fine to process if we don't know the identity. However, something like absolute location can be used to find someone's home and then their identity. Likewise with voice recognition and IP address.