GerkeLab / gerkelab-com

Website source for gerkelab.com
http://www.gerkelab.com

Post: Encrypting PHI in data resources #27

Open gadenbuie opened 5 years ago

gadenbuie commented 5 years ago

encryptr is interesting and allows you to do something like

gp %>% 
  encrypt(postcode, telephone)

This encrypts the columns postcode and telephone, so the data can be shared without the risk of exposing PHI.

encryptr uses RSA, so it has a similar key model to ssh: data is encrypted with the public key, and it seems that the private key is required for decryption.

Decryption requires the private key generated using genkeys() and the password set at the time.
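To make the public/private split concrete, here is a minimal sketch of column-level RSA encryption written directly against the openssl package. It is not encryptr's implementation: the `patients` data frame and the `encrypt_col()` / `decrypt_col()` helpers are hypothetical names made up for illustration, and the private key here is left without a password, unlike the one genkeys() creates.

```r
library(openssl)

# Hypothetical toy data standing in for a table with PHI columns
patients <- data.frame(
  id = 1:3,
  postcode = c("AB1 2CD", "EF3 4GH", "IJ5 6KL"),
  stringsAsFactors = FALSE
)

# Generate an RSA key pair; only the public half is needed to encrypt
key <- rsa_keygen(bits = 2048)
pubkey <- key$pubkey

# Hypothetical helper: encrypt each value with the public key,
# storing the ciphertext as base64 text so it fits in a character column
encrypt_col <- function(x, pubkey) {
  vapply(
    x,
    function(v) base64_encode(rsa_encrypt(charToRaw(v), pubkey)),
    character(1),
    USE.NAMES = FALSE
  )
}

# Hypothetical helper: decrypt each value with the private key
decrypt_col <- function(x, key) {
  vapply(
    x,
    function(v) rawToChar(rsa_decrypt(base64_decode(v), key)),
    character(1),
    USE.NAMES = FALSE
  )
}

# The shareable version only needs the public key to produce
shared <- patients
shared$postcode <- encrypt_col(patients$postcode, pubkey)

# Getting the original values back requires the private key
restored <- shared
restored$postcode <- decrypt_col(shared$postcode, key)
stopifnot(identical(restored$postcode, patients$postcode))
```

Anyone holding `pubkey` can produce `shared`, but only the holder of `key` can produce `restored`, which is the asymmetry the rest of this thread relies on.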

The package README really doesn't spend much time explaining how to use and share keys with others.

From How does RSA work?

RSA is an asymmetric system, which means that a key pair will be generated (we will see how soon), a public key and a private key, obviously you keep your private key secure and pass around the public one. https://hackernoon.com/how-does-rsa-work-f44918df914b

A blog post could explore an example with more details about key generation, key sharing, etc.

There is also a related rOpenSci package, cyphr, which seems to be more oriented towards encrypting files. This might be a better package choice (better community support, etc.), but there is a gap in the README in terms of column-specific encryption.

Finally, another interesting package for secret sharing is secret by Gabor Csardi et al. This package is oriented towards sharing API keys but the UseR! 2017 presentation about secret could provide a good starting point for sketching out the ideal key-sharing workflow.

gadenbuie commented 5 years ago

There's also the rOpenSci package sodium, which has a pretty decent overview of how encryption can be handled in R:

library(sodium)

# Bob's keypair:
bob_key <- keygen()
bob_pubkey <- pubkey(bob_key)

# Alice's keypair:
alice_key <- keygen()
alice_pubkey <- pubkey(alice_key)

# Bob sends encrypted message for Alice:
msg <- charToRaw("TTIP is evil")
ciphertext <- auth_encrypt(msg, bob_key, alice_pubkey)

# Alice verifies and decrypts with her key
out <- auth_decrypt(ciphertext, alice_key, bob_pubkey)
stopifnot(identical(out, msg))

# Alice sends encrypted message for Bob
msg <- charToRaw("Let's protest")
ciphertext <- auth_encrypt(msg, alice_key, bob_pubkey)

# Bob verifies and decrypts with his key
out <- auth_decrypt(ciphertext, bob_key, alice_pubkey)
stopifnot(identical(out, msg))

gadenbuie commented 5 years ago

The main idea behind the private/public key pair is that users share their public keys with others. Data is encrypted for a particular person by using their public key (and your private key). They can then decrypt using the reverse keys, i.e. their private key and your public key.

```
sequenceDiagram
  participant O as Data Owner
  participant U as User
  Note over O: Has Pub/Private Key
  Note over U: Has Pub/Private Key
  O->>U: Here's my public key
  U->>O: Cool, here's my public key too
  Note over O: Encodes data with<br/>Owner's Private Key<br/>+ User's Pub Key
  O->>U: Here's the data
  Note over U: Decodes data using<br/>User's Private Key<br/>+ Owner's Pub Key
```

The main point is that decrypting the data requires both halves of the exchange (the recipient's private key and the sender's public key), and in all cases a private key should never be transmitted, moved, or sent.
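A quick way to check this with sodium is to add a third party who has seen both public keys but holds neither private key. This is only a sketch: the `owner_*`, `user_*`, and `eve_key` names and the message are made up for illustration.

```r
library(sodium)

# Each party generates a key pair and only ever shares the public half
owner_key <- keygen()
owner_pubkey <- pubkey(owner_key)
user_key <- keygen()
user_pubkey <- pubkey(user_key)

# Eve is a third party with her own key pair; she can see both public keys
eve_key <- keygen()

# Owner encrypts for the user: owner's private key + user's public key
msg <- charToRaw("postcode: AB1 2CD")
ciphertext <- auth_encrypt(msg, owner_key, user_pubkey)

# The intended user decrypts with the reverse keys
out <- auth_decrypt(ciphertext, user_key, owner_pubkey)
stopifnot(identical(out, msg))

# Eve's private key does not match, so auth_decrypt() raises an error
eve_result <- tryCatch(
  auth_decrypt(ciphertext, eve_key, owner_pubkey),
  error = function(e) "decryption failed"
)
stopifnot(identical(eve_result, "decryption failed"))
```

Knowing the public keys is not enough on its own, so they can be passed around freely as long as each private key stays where it was generated.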

So when @tgerke and I talked about this originally, we thought we could later provide keys to the end user to let them decrypt data they have. This probably wouldn't be a good idea from a security perspective.

What we could do instead is initially deliver data encrypted using the owner's private/public keys, knowing that it cannot be decrypted by anyone else. If at a later point the user is granted access, we could either:

  1. Regenerate the data set using the user's public key and send them the new data
  2. Have the user return the encrypted data, which is then decrypted using the owner's key pair and then re-encrypted using the user's public key.

In both cases end users can use/manipulate/etc the unencrypted data as they see fit. In the first case, the regenerated data might be updated, contain more records, etc. but would hopefully be the same shape. The second case could be used for any derivative data or for situations where the source data may have changed but the user only has access to the version they received.
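The second option could be sketched with sodium roughly as follows. This is an illustration under assumptions, not a worked-out workflow: the key names and the example value are hypothetical, and a real version would apply the same steps to each encrypted column value.

```r
library(sodium)

owner_key <- keygen()
owner_pubkey <- pubkey(owner_key)
user_key <- keygen()
user_pubkey <- pubkey(user_key)

phi <- charToRaw("telephone: 555-0100")

# Initial delivery: the owner encrypts to their own key pair,
# so nobody else can read the value
locked <- auth_encrypt(phi, owner_key, owner_pubkey)

# The user holds the data but cannot decrypt it yet
user_attempt <- tryCatch(
  auth_decrypt(locked, user_key, owner_pubkey),
  error = function(e) NULL
)
stopifnot(is.null(user_attempt))

# Later, access is granted: the user returns the locked data,
# the owner decrypts it with their own key pair...
plain <- auth_decrypt(locked, owner_key, owner_pubkey)

# ...and re-encrypts it for this specific user
unlocked <- auth_encrypt(plain, owner_key, user_pubkey)

# The user can now decrypt with their private key and the owner's public key
out <- auth_decrypt(unlocked, user_key, owner_pubkey)
stopifnot(identical(out, phi))
```

No private key changes hands at any step; only ciphertext and public keys move between the owner and the user.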

tgerke commented 5 years ago

Good find re: not providing keys later. Does the "Providing a public key" section of the encryptr README (https://github.com/SurgicalInformatics/encryptr) help? TBH I don't think I fully understand how that's different from the initial solution, but it must be, since it's got a section of its own.

gadenbuie commented 5 years ago

I'm not sure I fully understand either, so I think that's where the blog post can go: walking through a scenario with multiple collaborators sharing data.

My current understanding is that making the Owner's (or data pool's) shared public key available would handle the first arrow above, i.e. the "User" getting the Owner's pub key. But I still think the data needs to be encrypted for someone specific, otherwise anyone with the data pool public key could just decrypt the data.