Blind signatures to address collusion between backend and HA

DP-3T / documents

Decentralized Privacy-Preserving Proximity Tracing -- Documents

2.25k stars 180 forks source link

Blind signatures to address collusion between backend and HA #210

Open a8x9 opened 4 years ago

a8x9 commented 4 years ago

This is a proposal for improving infected users anonymity against a collusion between the health authorities (HA) and the backend.

Problem with current design

HA have all personal details of patients
They are the one who provide an authorization code to the patient
Patient uploads this authorization code with their EphIDs to the backend
A collusion between the HA and the backend allows to personally attribute all EphIDs using the authorization code

Problem 1: Any EphID recorded can now be personally attributed, without needing any camera or other ways of identifying a user.

Problem 2: It is possible to determine which patients uploaded their EphIDs and which ones did not, this can jeopardize the voluntary aspect of this protocol.

Assuming a collusion between these two parties is not unrealistic given that they are likely to be controlled by the same government, or be part of an entity on which said government would have a lot of influence.

One possible solution to problem 1 would be to make the information published in case of infection unlinkable to the information broadcasted by Bluetooth, as I proposed in #66. Unfortunately the DP^3T team seems unwilling to go in this direction, so there needs to be some other protection against the attack described above. In any case, something is needed to address problem 2.

Proposed solution

Let the user generate the authorization code, and use blind signatures to prove ownership of a valid code.

Let's define:

e, N; RSA public key exponent and modulus
d: corresponding RSA secret exponent known only by backend (and HA in case of collusion)
H: a cryptographic hash function
x: large random number, the authorization code generated by user
r: large random number, generated by user, used in blinding factor

When going to the hospital for a test:

User provides to HA: m = H(x) * r^e (mod N)
HA asks backend to sign m and gives it back to user: s_b = m^d = H(x)^d * r (mod N)
User "unblind" signature: s = s_b * r^-1 = H(x)^d (mod N)

When the user gets informed of a positive test result:

User uploads the tuple (s, x) attached to their list of EphIDs to the backend,
Backend verifies: s^e = H(x) (mod N)
Backend keeps track of used x to avoid re-use of the same code for flooding.

This allows the user to prove that they have a valid authorization code, and thanks to the blinding factor r^e, the backend and HA cannot link an authorization code back to a specific patient.

The reason why H(x) is used instead of x directly, is because of the commutativity of RSA. This prevents users colluding by multiplying their authorization code together in order to generate new valid signatures.

The set of EphIDs in design 2, or the tuple (SK_t, t) in design 1, could also be used as value x in order to prevent transferability of the authorization code, solving the issue exposed in #113. This would however prevent the report of EphIDs generated between the time at which the above signature is performed, and the time at which the user uploads to the backend.

Disadvantages of proposed solution

HA needs to scan QR code from patient (m), and then patient needs to scan QR code from HA (s_b). This is an additional QR code scan compared to current design.
Authorization code cannot be activated at a later time. At least not using QR codes at the time of the 1st visit to the hospital. If this is considered problematic, a connection between HA and user, instead of QR codes, is required so the above information exchange can be done later remotely when the test results are provided.

I think this scheme should solve the concerns raised in #206, but I might have overlooked something which would make user identification trivial. Feedback and improvements are welcome.

FishmanL commented 4 years ago

One thing that might make this cleaner is having that QR code-gen be part of the covid test/appointment flow (use patient info + random entropy in their mouse movements to derive X, etc.)

FishmanL commented 4 years ago

That way, you can just have phone-to-phone comms (via e.g. HTTPS)

lbarman commented 4 years ago

Thanks a lot for the detailed input @a8x9! Sounds like a great idea (only concern: each country/region might want to specify this themselves, however we probably can make a good default recommendation). We are working in a similar direction to define exactly how the authorization code will work, stay tuned.

pevab commented 4 years ago

What about the following scenario ?

Malicious user generates an auth code x, and the associated m
Malicious user asks backend to sign m
Malicious user unblinds the signature, and gets s
Malicious can upload its SK_t and ephIDs to backend. The backend will accept.

Thus Malicious user can pretend to be sick. This increases the overall false positive rate.

The main point here is : how the backend trusts HA when being asked to sign ?

(maybe I missed something)

ps: just advocating a bit more for IRMA :) https://privacybydesign.foundation/irma-explanation/

FishmanL commented 4 years ago

@pevab that's solvable by just having the HA sign the message with their signing key

a8x9 commented 4 years ago

@pevab This design, as well as the current one, assume an authenticated channel between the HA and the backend.

The whitepaper should probably clarify how this authenticated channel is achieved, e.g., username+password or public key as mentioned by @FishmanL. Then I guess it's up to each country to determine how they distribute these credentials to the different HA actors.

Now, regarding IRMA more specifically:

I assume you consider the HA are different issuers (of the infected attribute) and the backend is a verifier.
The HA and backend also need an authenticated channel to share their public keys.
IRMA provides what they call multi-show unlinkability, this means that a code could be infinitely re-used without being detected.
The idemix paper (on which IRMA is based) mentions that you can have "one-show" attributes but that significantly complexify the design and requires constant interaction between all issuers and verifiers to check for "double-spend". And in this case you also need some authenticated channels between these different parties.

So in conclusion, IRMA could be used to achieve the same goals as this proposal, but it is significantly more complex (key generation, key distribution, zero-knowledge proofs for verification, ...) and does not provide any advantage for this use-case as far as I can tell. Don't hesitate to correct me if I'm missing some properties which would make IRMA's complexity worth it.

ryanbnl commented 4 years ago

If the workflow requires a QR code to be scanned by both doctor and patient at the moment a test is made, why not also upload the empIDs (or whatever set of data needs to be uploaded depending on the final design choices) there and then?

Pack them, sign the package with a PK from the authority (government); transfer them to the HCW's device which then uploads them with the status pending test for which they get an batch id back.

Test authority then reports the test id as being positive or negative via a trusted back channel. Data uploads are automatically deleted after 14 days no matter what.

This give you:

Minimal time between test result and warning contacts.
No exposure of a specific patient making a report which will be visible or derivable during transport.
Report is made even if the testee dies or is incapacitated between taking test and receiving results.
Infected person retains the rights not to be informed of the result of the test (they will have to isolate for 14 days if this happens). This is required in Dutch law for most tests. COVID might possibly be exempt from the law, however IANAL.

I assume that there must be some massive privacy issues with the design?

FishmanL commented 4 years ago

What about contacts between 'when you get the test' and 'when it comes back positive'?

ryanbnl commented 4 years ago

What about the contacts between when you report yourself via the app after getting the test result and when you're no longer infectious? :)

FishmanL commented 4 years ago

Any good frontend should enable you to select 'positive test result' and then autoupload nondummy data til you reselect 'no longer infectious' :p

ryanbnl commented 4 years ago

They're not going to get your empIDs after you upload the data, so there's always some time you're not covered.

But the second you're tested you have to isolate until you get a negative test result back. So it makes little difference if you upload when the test is made or when the result is known.

My arguments against doing that are the same as my arguments against requiring physical contact or interaction at testing time. I.e. technical problems causing a bottleneck there, and putting extra requirements on the testing.

a8x9 commented 4 years ago

@ryanbnl The first goal of this proposal is that neither the HA, nor the backend, even if colluding, can link back the uploaded EphIDs to your real identity. The second goal is that neither of these entities can determine if you uploaded your EphIDs or not after receiving the test results.

Giving your list of EphIDs to the health care worker who knows your identity is the exact opposite of what is intended here. You lose the pseudonymous nature of EphIDs and you lose the voluntary aspect of this protocol.

FishmanL commented 4 years ago

This is true -- even if you encrypt the ephids with some backend key before sending, the HA can link the package to an ident.

pevab commented 4 years ago

@pevab This design, as well as the current one, assume an authenticated channel between the HA and the backend.

Ok thanks.

The whitepaper should probably clarify how this authenticated channel is achieved, e.g., username+password or public key as mentioned by @FishmanL. Then I guess it's up to each country to determine how they distribute these credentials to the different HA actors.

Agreed.

Now, regarding IRMA more specifically:

I assume you consider the HA are different issuers (of the infected attribute) and the backend is a verifier.

The HA and backend also need an authenticated channel to share their public keys.

IRMA provides what they call multi-show unlinkability, this means that a code could be infinitely re-used without being detected.

The idemix paper (on which IRMA is based) mentions that you can have "one-show" attributes but that significantly complexify the design and requires constant interaction between all issuers and verifiers to check for "double-spend". And in this case you also need some authenticated channels between these different parties.

So in conclusion, IRMA could be used to achieve the same goals as this proposal, but it is significantly more complex (key generation, key distribution, zero-knowledge proofs for verification, ...) and does not provide any advantage for this use-case as far as I can tell. Don't hesitate to correct me if I'm missing some properties which would make IRMA's complexity worth it.

Actually, it's not exactly what I had in mind. Assume Alice, having been tested positive, wants to disclose SK_t (or a set of ephIDs) to the backend. With Idemix, she can do as follows:

Alice commits to the value SK_t, i.e. C <- Commit(SK_t). She sends C to HA. The commitment C has a hiding and binding property. C does not reveal the value SK_t. And Alice cannot prove that C is the commitment of another value.
HA issues a (blind signed) credential to Alice, which "contains" the commitment C.
Alice can disclose SK_t to backend, and provide a zero-knowledge proof stating that she has rightful signature from HA along with the commitment to the value SK_t.

Multi-show unlinkability would mean, in this case, that Alice may disclose SK_t this way as many times as she wants, the backend cannot correlate these disclosures. It is not a problem though, as SK_t becomes public information. Alice cannot, however, uses her same credential to upload different SK_t's.

A small advantage of this scheme over the one proposed here is that it avoids communication between HA and backend. I do agree that Idemix is more involved. But, a more interesting aspect, in my opinion, is that there are existing implementations of Idemix (IRMA for instance), which are currently deployed and running.

Having said all that, I am not a specialist of IRMA and Idemix (neither am I working for them ^^). But, from a practical standpoint, I think it is worth having a look at these protocols, as they seem to have been experimented with them for some time (well, you could say that the java implem https://github.com/privacybydesign/irma_api_common is 5 years old, so not that much ...).

I do agree with you though on Idemix being more complicated than the scheme proposed here. If integrating IRMA's implementation turns out to be more difficult/painful than directly implementing the current scheme, I would go for the current scheme.

a8x9 commented 4 years ago

@pevab Thanks a lot for your reply and the clarifications.

Alice commits to the value SK_t, i.e. C <- Commit(SK_t). She sends C to HA. The commitment C has a hiding and binding property. C does not reveal the value SK_t. And Alice cannot prove that C is the commitment of another value.

Do you have an example of a commitment scheme with such properties? More specifically, a commitment scheme with such properties, where C cannot be attributed back to Alice when she discloses her SK_t to the backend, while still limiting her to a single upload?

As far as I can tell, the issuing process of IRMA requires the attributes to be in clear. So Alice's SK_t cannot be used as an attribute.

If I understand correctly, the commitment scheme is used by Alice to tie together her secret key and a set of attributes generated by a service provider, i.e., the OK [+DisclosureProof] message content in the following diagram. This secret key is used in the zero-knowledge proof of attribute ownership, and needs to be kept secret, so Alice cannot use the SK_t as this value.

But honestly I don't know much about IRMA and zero-knowledge proofs, and I didn't understand how the protocol you described relates to the IRMA API (v1 or v2). If I completely misunderstood your design, could you please clarify which API endpoints, or which function in irma_api_common would be used with which data?

A small advantage of this scheme over the one proposed here is that it avoids communication between HA and backend.

Please note that the design I proposed here can have the same property:

Each HA entity has an RSA private key.
Backend only needs the corresponding public keys for signature verification during upload.
In this case no authenticated communication is needed between HA and backend after the key distribution phase, exactly like in the IRMA design.

However, I proposed to store the private key on the server for the following reasons:

1) Only one key pair -> the backend cannot identify in which hospital Alice was tested when she uploads her EphIDs*. 2) Significantly lower risk of a private key leaking as they are not stored on a large number of scanning devices (probably dozens per hospital). 3) One less QR code scan, i.e., no need to first retrieve the public key of a specific HA entity if there is only a single key pair.

But yes, in exchange of these advantages, you need an active connection between the HA and the backend.

* Sharing the same private key between all hospitals in a country would also mitigate this, but it makes point 2. significantly worse when this private key leaks, i.e., all devices are revoked at the same time and they all need to be reconfigured with the new private key which needs to be distributed through an encrypted and authenticated channel.

a8x9 commented 4 years ago

PoC code for this proposal.

pevab commented 4 years ago

@pevab Thanks a lot for your reply and the clarifications.

Alice commits to the value SK_t, i.e. C <- Commit(SK_t). She sends C to HA. The commitment C has a hiding and binding property. C does not reveal the value SK_t. And Alice cannot prove that C is the commitment of another value.

Do you have an example of a commitment scheme with such properties? More specifically, a commitment scheme with such properties, where C cannot be attributed back to Alice when she discloses her SKt to the backend, while still limiting her to a single upload?

I found a nice summary of idemix here https://privacybydesign.foundation/pdf/Idemix_overview.pdf

As far as I can tell, the issuing process of IRMA requires the attributes to be in clear. So Alice's SKt cannot be used as an attribute.

If I understand correctly, the commitment scheme is used by Alice to tie together her secret key and a set of attributes generated by a service provider, i.e., the OK [+DisclosureProof] message content in the following diagram. This secret key is used in the zero-knowledge proof of attribute ownership, and needs to be kept secret, so Alice cannot use the SKt as this value.

But honestly I don't know much about IRMA and zero-knowledge proofs, and I didn't understand how the protocol you described relates to the IRMA API (v1 or v2). If I completely misunderstood your design, could you please clarify which API endpoints, or which function in irma_api_common would be used with which data?

I'm not familiar with the api either, I'll have a look.

A small advantage of this scheme over the one proposed here is that it avoids communication between HA and backend.

Please note that the design I proposed here can have the same property:

Each HA entity has an RSA private key.

Backend only needs the corresponding public keys for signature verification during upload.

In this case no authenticated communication is needed between HA and backend after the key distribution phase, exactly like in the IRMA design.

However, I proposed to store the private key on the server for the following reasons:

Only one key pair -> the backend cannot identify in which hospital Alice was tested when she uploads her EphIDs*.

Significantly lower risk of a private key leaking as they are not stored on a large number of scanning devices (probably dozens per hospital).

One less QR code scan, i.e., no need to first retrieve the public key of a specific HA entity if there is only a single key pair.

But yes, in exchange of these advantages, you need an active connection between the HA and the backend.

Sharing the same private key between all hospitals in a country would also mitigate this, but it makes point 2. significantly worse when this private key leaks, i.e., all devices are revoked at the same time and they all need to be reconfigured with the new private key which needs to be distributed through an encrypted and authenticated channel.

a8x9 commented 4 years ago

Do you have an example of a commitment scheme with such properties? More specifically, a commitment scheme with such properties, where C cannot be attributed back to Alice when she discloses her SKt to the backend, while still limiting her to a single upload?

I found a nice summary of idemix here https://privacybydesign.foundation/pdf/Idemix_overview.pdf

I don't think this answers my question. As far as I see, this summary does not contain anything resembling a commitment scheme with the properties you described.

The part related to IRMA /api/v2/issue/issueID/commitments API endpoint is present in the first table on page 2.

The U value is Alice's commitment to her secret key m₀.
She send this value with her in-clear attributes m₁, ..., m_l
The issuer "binds" together, Alice's commitment to her secret key (U), and her attributes, with a "blind signature".

Now here is where lies the problem that I tried to explain in my previous post:

m₀ cannot be SK_t because Alice later reveals SK_t and m₀ has to be kept secret to preserve the unlinkability of the issuing and verifying steps.
SK_t cannot be an attribute (m₁, ..., m_l) because she provided them in-clear during the issuing process.
So either Alice has a signature on a generic attribute which can be used an "infinite" number of time, i.e., she can ZK proves having a valid signature with all possible r values (see last table on page 2). These proofs cannot be linked together, i.e, multi-show property of idemix.
Or Alice has a blind signature on a unique attribute which makes it trivial to link it back to her as this attribute is in clear during the issuing process as well as the EphIDs upload process.

That's why the idemix paper proposes the alternative "one-show" attributes which require interactions between the issuers and verifiers to check for double-spending.

But maybe I misunderstood what you were trying to describe. If this is the case, please provide a detailed description of your commitment scheme, and explain how it provides the properties you described in this post.

Now, on a more general note, I think this idemix summary clearly shows my point about the complexity of IRMA, or any other system based on idemix. It is already far too complex for what is needed here, despite the fact that it ignores the most difficult points, e.g., zero-knowledge proof that n (used in Camenisch–Lysyanskaya signatures) is the product of two same primes.

togermer commented 4 years ago

I'd like to extend the proposal in order to remove the second QR code scan and to introduce a practical workflow for laboratories. Our goal is to set up a system that HA can work with now, without the need to extend or adapt their infrastructure.

I introduce two components to the system:

reporting server (RS): An extension to the backend that could also be hosted independently. It keeps track of reported infections without being able to link this information to the corresponding EphIDs.
a: an alias for the blinded authorization code "m" to facilitate manual input. We use a 128 bit random number, base32 encoded, maybe with an added checksum. So this might be a 26-27 digit alphanumeric string suitable for printing and manual input.

I propose the following protocol:

When going to the hospital for a test:

The user app generates "m" as described above by a8x9.
The user uploads "m" to RS. RS generates "a", saves (a, m) in a table and returns "a" to the user app.
"a" is attached to the test sample (by scanning QR code / creating a printout) and given to HA.

When the lab detects an infection:

The lab worker opens a web form hosted by RS, authenticates, and (manually) inputs "a" to register the infection. RS marks (a,m) as infected.
The user app regularly transmits "a" to RS to request the infection status.

If RS returns an infection we proceed as proposed above:

RS signs m and returns the blind signature s_b to the user app.
User unblinds the signature and uploads (s, x) together with their EphIDs to the backend.
Backend verifies s and keeps track of x.

This protocol requires a basic authentication scheme between HA and RS. Furthermore, the lab worker is required to input "a" to a web form, which probably has the lowest technical requirements. If technically possible on the HA side, this could be avoided by adapting the infrastructure.

The user gets notified about an infection as soon as it is detected, without delay. Thanks to the blind signature it still should be impossible to link the uploaded data to the user identity. At the same time, the user can only upload if he has gotten s_b from RS, so he is actually infected. This should avoid misuse.

Does this work? Any feedback?

pevab commented 4 years ago

Hi,

Do you have an example of a commitment scheme with such properties? More specifically, a commitment scheme with such properties, where C cannot be attributed back to Alice when she discloses her SKt to the backend, while still limiting her to a single upload?

I found a nice summary of idemix here https://privacybydesign.foundation/pdf/Idemix_overview.pdf

I don't think this answers my question. As far as I see, this summary does not contain anything resembling a commitment scheme with the properties you described.

The part related to IRMA /api/v2/issue/issueID/commitments API endpoint is present in the first table on page 2.

The U value is Alice's commitment to her secret key m0.

She send this value with her in-clear attributes m1, ..., ml

The issuer "binds" together, Alice's commitment to her secret key (U), and her attributes, with a "blind signature".

Now here is where lies the problem that I tried to explain in my previous post:

m0 cannot be SKt because Alice later reveals SKt and m0 has to be kept secret to preserve the unlinkability of the issuing and verifying steps.

SKt cannot be an attribute (m1, ..., ml) because she provided them in-clear during the issuing process.

So either Alice has a signature on a generic attribute which can be used an "infinite" number of time, i.e., she can ZK proves having a valid signature with all possible r values (see last table on page 2). These proofs cannot be linked together, i.e, multi-show property of idemix.

Or Alice has a blind signature on a unique attribute which makes it trivial to link it back to her as this attribute is in clear during the issuing process as well as the EphIDs upload process.

That's why the idemix paper proposes the alternative "one-show" attributes which require interactions between the issuers and verifiers to check for double-spending.

But maybe I misunderstood what you were trying to describe. If this is the case, please provide a detailed description of your commitment scheme, and explain how it provides the properties you described in this post.

Sure, let's use the same notations as in the idemix overview. First, consider the issuance protocol.

Alice commits to message m_0
- She picks a random value v'
- She computes the commitment U = S^{v'}R_0^{m_0} (mod n)
- She buils a ZKP stating that she knows the value of v' and m_0 (without revealing them).
Alice sends U, and ZKP to the HA.
HA blindly signs the commitment
- HA picks a random value v''
- HA computes A s.t. Z = A^e U S^{v''} (mod n)
- HA sends (A, e, v'') to Alice
Alice finalizes
- Alice computes v = v' + v'' (mod n)
- Alice checks that (A,e,v) is a valid signature on m_0, i.e. that Z = A^e S^v R_0^{m_0} (mod n)

At this point, Alice has committed to value m_0, and has a signature from HA that asserts this fact. Now, say Alice wants to prove to backend that she knows a valid signature of HA on commitment to m_0. I assume here that she discloses m_0 to backend.

Alice randomizes the signature (A, e, v)
- Alice picks a random value r
- Alice computes A' = A S^{-r}, and w = v - e r
- She builds a ZKP stating that she knows e and w s.t. Z R_0^{-m_0} = A'^e S^w (mod n)
Alice sends A', m_0 and the ZKP to backend.
By checking the ZKP, backend is convinced that Alice knows a valid signature by HA.

To sum up:

HA has seen the commitment U, and the corresponding zkp P that Alice knows v' and m_0.
Backend has seen m_0 and a zk proof P' that Alice knows a valid signature from HA.

For HA and backend to identify Alice, they need to relate (m_0,P') and (U,P). Because of the blinding factor S^{v'} and the fact that P and P' are zero-knowledge, they cannot.

Now, on a more general note, I think this idemix summary clearly shows my point about the complexity of IRMA, or any other system based on idemix. It is already far too complex for what is needed here, despite the fact that it ignores the most difficult points, e.g., zero-knowledge proof that n (used in Camenisch–Lysyanskaya signatures) is the product of two same primes.

I still agree with the fact that Idemix is more complex. My proposal stood on the fact that IRMA is already deployed and running in production, with libraries in java, go, js, and an api. The complexity that needs to be taken into account is IRMA's api.

To be fair, I am not sure that IRMA's api allows to perform easily and precisely the above procedure, or anything with the same outcome. And in that case, I'll dismiss IRMA.

a8x9 commented 4 years ago

@togermer Thanks a lot for your improvement proposal, it is a very interesting addition.

However, I see one problem:

HA knows: user identity -> a
RS knows: a -> user IP
backend knows: user IP -> EphIDs

In case of collusion, you lose both properties of this proposal.

One solution would be to provide the RS service and the backend upload service as Tor Onion Services (or any other mixnet). So I really hope that the DP^3T team reconsider their position on this subject, especially given the fact that IP addresses are considered as personnal data in the European Union.

If a secure mixnet is used, then your improvement would simplify the interaction between the app user and the healthcare workers, and it would solve the "activating code later" problem.

Note: the app should ask for user confirmation between point 6. and 7. It should also delay the upload of EphIDs by a random time interval in order to prevent time based correlation.

a8x9 commented 4 years ago

@pevab Thanks a lot for the detailed explanation.

...

She buils a ZKP stating that she knows the value of v' and m_0 (without revealing them).

Alice sends U, and ZKP to the HA. ... I assume here that she discloses m_0 to backend. ...

Alice sends A', m_0 and the ZKP to backend.

I don't think that works. If Alice reveals her secret m₀, it becomes possible to link it to a specific issuing transaction.

During the signature verification, the verifier/issuer could check for all issued signatures (A, U, v"):

d = e^{-1} (mod λ(n)) -> issuer knows factorization of n
S^v = U * S^{v"} * R_0^{-m_0}
S^{er} = Z * R_0^{-m_0} * S^{-v} * A'^{-e}
S^r = S^{er}^d
check if: A' * S^r ?= A

I think the above test will work only for the correct (m_0, A, U, v") combination. There might be a smarter non-bruteforce way to find the right combination but I didn't find one.

For HA and backend to identify Alice, they need to relate (m_0,P') and (U,P). Because of the blinding factor S^{v'} and the fact that P and P' are zero-knowledge, they cannot.

The zero-knowledge proof P' proves that Alice knows m₀ without revealing it. But I don't think zero-knowledge proofs provide unlinkability between the proof and its secret if Alice reveals the secret.

I'm honestly not 100% sure about the above test, so don't hesitate to tell me if I misunderstood something.

Edit: my original answer was completely wrong because I forgot to take into account the S^{v'} blinding factor.