OpenMined / opus

Apache License 2.0

Trust - User has to trust that the host of the PIS will not use their SSO credentials for nefarious purposes #21

Open carrollgt91 opened 4 years ago

carrollgt91 commented 4 years ago

In addition to being an attractive target for hackers, storing these SSO credentials presents an interesting trust problem from the perspective of more privacy-conscious users.

We are just committing to the user that we are not storing their information. We are not providing strong guarantees, cryptographic or otherwise, that we will not use this information for our own gain.

Especially for more powerful SSO integrations, such as bank accounts, it might be hard to convince folks to trust us.

PlamenHristov commented 4 years ago

Solution:

I would suggest we keep everything encrypted. What this will ensure:

  1. Even if the identity server gets hacked (even though anonymised) user data will not be compromised.
  2. Even if we want to, we cannot do anything nefarious.

Implementation

Encryption

Let's put in place a facility for the user to generate a public/private key pair. Specifically, let's use BIP-39. This lets us generate a random seed phrase for the user, which we can combine with BIP-32 to derive the master public/private key (m/0'/0'/0'):
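A minimal sketch of that derivation, using only the Python standard library. It assumes the seed phrase already exists (generating one needs the BIP-39 word list, which is not reproduced here, and the phrase's checksum is not validated). The function names are hypothetical. Only the private key at m/0'/0'/0' is derived; obtaining the matching public key needs secp256k1 point multiplication, which an existing BIP-32 library would normally supply.

```python
import hashlib
import hmac
import unicodedata
from typing import Sequence, Tuple

# Order of the secp256k1 group; BIP-32 does its key arithmetic modulo this value.
SECP256K1_N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
HARDENED = 0x80000000  # BIP-32 marks hardened child indexes by setting the top bit


def seed_from_mnemonic(mnemonic: str, passphrase: str = "") -> bytes:
    """BIP-39: stretch the seed phrase (plus optional passphrase) into a 64-byte seed."""
    password = unicodedata.normalize("NFKD", mnemonic).encode("utf-8")
    salt = unicodedata.normalize("NFKD", "mnemonic" + passphrase).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha512", password, salt, 2048, dklen=64)


def master_key(seed: bytes) -> Tuple[int, bytes]:
    """BIP-32: split HMAC-SHA512 of the seed into master private key and chain code."""
    digest = hmac.new(b"Bitcoin seed", seed, hashlib.sha512).digest()
    key = int.from_bytes(digest[:32], "big")
    if key == 0 or key >= SECP256K1_N:
        raise ValueError("seed yields an invalid master key")
    return key, digest[32:]


def hardened_child(parent_key: int, chain_code: bytes, index: int) -> Tuple[int, bytes]:
    """BIP-32: derive the hardened child (index') of a private key."""
    data = b"\x00" + parent_key.to_bytes(32, "big") + (index | HARDENED).to_bytes(4, "big")
    digest = hmac.new(chain_code, data, hashlib.sha512).digest()
    tweak = int.from_bytes(digest[:32], "big")
    child = (tweak + parent_key) % SECP256K1_N
    if tweak >= SECP256K1_N or child == 0:
        raise ValueError("invalid child key; BIP-32 says to try the next index")
    return child, digest[32:]


def derive_private_key(mnemonic: str, passphrase: str = "",
                       path: Sequence[int] = (0, 0, 0)) -> bytes:
    """Walk a fully hardened path (default m/0'/0'/0') and return the 32-byte key."""
    key, chain = master_key(seed_from_mnemonic(mnemonic, passphrase))
    for index in path:
        key, chain = hardened_child(key, chain, index)
    return key.to_bytes(32, "big")
```

Because every step is deterministic, calling `derive_private_key` with the same phrase and passphrase on another device yields the same key, while a different passphrase yields an unrelated one.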

Data consumer integration

From here we can either:

Consequence:

  1. We'll have one encrypted copy (each under a different public/private key pair) for each data consumer per user. So if we have m data consumers and n users, we may (at some point in time) need to store m × n copies, plus 1 copy encrypted with the master (root) key.
  2. If you know the seed phrase (optionally combined with a password for extra security), the key pair can be restored on whatever device the user is using.
  3. The user experience should be quite sleek and fairly intuitive
  4. It should also solve this issue

NiWaRe commented 4 years ago

@PlamenHristov, although I had a different approach in mind, your encryption approach also sounds good! :) Picking up on your first option in the section "Data consumer integration", I thought about using the OpenMined PySyft and PyGrid libraries (including for encryption, etc.).

Goal

As discussed on the Slack channel, also with @carrollgt91, I understood the goal of this team (this repo) to be building a server that acts as the middleman between the sensitive data of a user who wants to be automatically authenticated and a data consumer who wants authentication or some data (e.g. another app that wants to train on some sensitive data). So, based on the blog entry, I imagined the SSO credentials not being stored on the PIS but rather being part of the client-side data scraping directly on the user's device (also leveraging the possibility that the user is still signed in to different apps, as @carrollgt91 suggested in the #covid_mobile_data_collection channel), with the scraped data then sent to the PIS on demand. The PIS (our work) should then:

  1. provide the SSI (Self-Sovereign Identity) team with the necessary data from service-endpoints (which then would be on the clients themselves as explained above) to be able to populate some DID documents or do whatever they need to do to issue the authenticating credentials.
  2. provide data to other COVID-apps which can then learn on the sensitive data, scraped from services or stored from device-sensors by the data-mining team.

Suggestion

If my description of the goal of this specific repo is correct, I would suggest using the Public or Private Grid Platform from the PyGrid library to make the exchange of sensitive data from service endpoints, or of training data, possible.

If I understood our goal correctly, in both scenarios no direct channel would need to be established between the user and the data consumer, because either the data consumer would train their model using PyGrid (the second use case) or the data would only be provided to the SSI team, which would then do the issuing, validation, etc. of credentials for authentication with the data consumer.

I may have too limited knowledge about the detailed working of PyGrid and the specific data needs of the SSI team, but potentially this could help us use much of the already existing code from other OpenMined projects.

carrollgt91 commented 4 years ago


@PlamenHristov Great thoughts here - I think some form of user-managed encryption scheme does solve a lot of the issues here, this one and the security breach piece, which is great. Just to make sure I understand your proposal, it seems that

Assuming those assumptions are correct...

One thing I really like about this proposal is how easy the UX is for the user to share data with data consumers when they're on the same device that has the key pair on it. It's not meaningfully different from an SSO handshake where the app you're signing into requests certain data from the sign-on provider, e.g. "Sign in with Facebook" -> provide your name and profile photos.

However, there are some additional challenges we'd need to overcome with this strategy. I'm not as familiar with what we'll need to do to hook into the rest of OpenMined infrastructure (i.e. PySyft), so I'm not going to comment much on that piece, and instead I'll focus on the data consumer use case.

  1. The key exchange would need to be implemented in a way that allows the data to be used immediately within a data consumer. I think we'd want to supply client libraries that make this process very easy, similar to how there are tons of off-the-shelf client libraries for the OAuth and OpenID protocols. Ease of integration for the data consumer is really important, and if we have to ask them to implement custom decryption, I think that will reduce the number of applications willing to integrate. The more we can lean on existing libraries for this, the better - there's a lot to like about the BIP-based crypto you linked to. There are a good number of libraries for it in different ecosystems. However, I think it's worth examining alternative encryption schemes to find the one that would be easiest for the data consumer to integrate with.

  2. We'll definitely need to have more robust client-side applications built to generate/manage these keys, as well as house the user's sensitive data. Here are a few things that we'd need:

    • Key invalidation (this would also need to be implemented server-side to allow for data to be re-encrypted under the new key)
    • Multi-Device syncing - this would be tricky to do in a secure way without some sort of peer-to-peer handshake or a "master password" concept akin to how password managers implement key sharing
    • Client-side data storage - in order for key invalidation to work, the client will need to store all of the user's sensitive data locally. Given that we're going to be obtaining much of this data via web-scraping, we're already going to need to solve the problem of ensuring that the data we're storing client-side is verifiable, but this is doubly true given this approach. In a situation where the server is allowed to at least momentarily gain access to unencrypted data, it can compare the hash of that data to the hash that was generated during the initial collection of the data to ensure that the user hasn't tampered with the data in the meantime. There is likely a way to accomplish this with modern crypto, but I am not aware of a solution to that problem off the top of my head.
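The hash comparison described in the last bullet could look roughly like the sketch below. It covers only the case where the server briefly sees the unencrypted data, not the open question of verifying data that stays encrypted. The record layout and the function names `fingerprint` and `is_untampered` are assumptions for illustration, as is the choice of SHA-256 over a canonical JSON encoding.

```python
import hashlib
import hmac
import json


def fingerprint(record: dict) -> str:
    """Serialise the record canonically as JSON and return its SHA-256 hex digest."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def is_untampered(record: dict, stored_fingerprint: str) -> bool:
    """Compare a record against the digest saved at collection time.

    hmac.compare_digest performs the comparison in constant time.
    """
    return hmac.compare_digest(fingerprint(record), stored_fingerprint)


# At collection time the server keeps only the digest, not the data itself.
collected = {"source": "example-service", "visits": 3}  # hypothetical scraped record
stored = fingerprint(collected)

# Later, when the client presents the data again, the server re-hashes and compares.
unchanged = is_untampered({"visits": 3, "source": "example-service"}, stored)
modified = is_untampered({"source": "example-service", "visits": 4}, stored)
```

Sorting keys before hashing makes the digest independent of field order, so only a change in content is reported as tampering. If the stored digest must not be guessable from low-entropy data, a keyed HMAC under a server-held secret would be one variation, though the thread leaves that choice open.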