ROBERT-proximity-tracing / documents

Protocol specification, white paper, high level documents, etc.

The 'Exposure Status Request' mechanism exposes the full social graph of all users (infected or not) to the authority. #11

Closed gdanezis closed 4 years ago

gdanezis commented 4 years ago

Many thanks for putting this proposal up for review by the community. I have provided some advice to the TCN coalition, and reviewed a few designs that proposed reporting / checking 'seen' beacons, as you propose too, as compared to designs reporting emitted beacons (like TCN, Google / Apple & DP3T). They all share an issue that, I believe, may also affect ROBERT.

The problem is that the service implementing the Exposure Status Request can reconstruct a very good approximation of the full social graph of the users querying it. This can be done within the honest-but-curious model the ROBERT scheme should be protecting against (and does not assume leakage of K_s which would be a separate, but devastating, attack).

The social graph reconstruction attack

The server end of the Exposure Status Request protocol regularly observes a tuple (user_a, time, [ EBIDi ]) from each user. The service needs to know who the user is in order to route back a response. (This is not a place where a mixnet is contemplated -- and the scale would make one impractical.) The vector of EBIDi represents the beacons seen by a user within a period of time, say a few days.

Those beacons contain the encrypted IDs of all the other users that user_a encountered. Even without the key K_s, the size of the intersection between user_a's list of EBIDs and the list belonging to another user_b is a very strong proxy for the strength of their social tie. The size of the intersection is therefore a measure of social adjacency (as well as proximity). This is because social graphs, and location graphs, have a very large number of triangles -- so user_a and user_b seeing the same set of users in common indicates they have a strong relationship. Of course they are also likely to see each other's EBIDs, and the authority can infer the EBID of each user by looking for the most frequent EBID in 'adjacent' lists of EBIDs.
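The adjacency measure described above can be sketched in a few lines. This is an illustration with hypothetical data, not a claim about any deployed server: the EBIDs are opaque strings here, and no key is needed -- only set intersection.

```python
# Illustrative sketch (hypothetical data): the server treats each user's
# uploaded list of seen EBIDs as a set and uses intersection size as a
# proxy for social adjacency, without decrypting any EBID.

from itertools import combinations

# EBIDs observed by each querying user (opaque strings to the attacker).
seen = {
    "user_a": {"e1", "e2", "e3", "e7"},
    "user_b": {"e1", "e2", "e3", "e9"},
    "user_c": {"e8"},
}

# Edge weight = |EBIDs_u ∩ EBIDs_v|; a large overlap suggests a strong tie.
adjacency = {
    (u, v): len(seen[u] & seen[v])
    for u, v in combinations(sorted(seen), 2)
}

print(adjacency)  # user_a and user_b share 3 EBIDs: a strong candidate tie
```

Repeating this over successive reporting periods yields the time series of proximity graphs discussed below.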

Of course the above is made worse by the fact that the server is assumed to know K_s and can therefore decrypt the long term identities of the users behind the EBIDs provided and simply read the full social graphs over time. Since this information leaks at the Exposure Status Request stage this leakage does not just affect infected users, or users that have been exposed, but all users in the system all the time. But I guess you are already aware of this issue. However, it is not clear why you do not, at least, allow for random EBIDs to prevent this trivial de-anonymization.

So the leakage happens both with knowledge of the key K_s and without it. This makes pure key-management mitigations (keeping K_s in an HSM) insufficient, since the bulk EBID lists are sensitive in themselves.

Is this a big deal?

In effect this scheme gives the server, and anyone else who can get lists of EBIDs from the server, the capability provided by the NSA co-traveller program (https://www.washingtonpost.com/world/national-security/nsa-tracking-cellphone-locations-worldwide-snowden-documents-show/2013/12/04/5492873a-5cf2-11e3-bc56-c6ca94801fac_story.html), which used co-proximity to do contact tracing in an intelligence / national security setting. This capability, if sought by national or foreign intelligence agencies, would not be prevented by the GDPR, since personal data processed for the purposes of safeguarding national security or defence is outside the GDPR's scope. Helpfully, the UK IP Act even provides a framework for processing such 'bulk data sets' for intelligence, and a code of practice that explains how such data can be used: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/715478/Bulk_Personal_Datasets_Code_of_Practice.pdf Needless to say, since this information has been the target of foreign signals intelligence agencies, any protections in national law are irrelevant anyway.

nadimkobeissi commented 4 years ago

The NSA example may be supplemented with Libra and Calibra, other excellent examples in which we can examine the dangers of having lists of correlated user events handled by a coterie of hand-picked trusted authorities, and the historic, disastrous, appalling privacy consequences which lie therein.

vaudenay commented 4 years ago

I am confused. Is this leaking the social graph to the honest-but-curious health authority, or just the who-contaminated-whom graph by design? Isn't the latter already required by law and regulated by a code of ethics? If this is the case, maybe the question is whether ROBERT can be proven compatible with the code of ethics.

kaythxbye commented 4 years ago

@vaudenay As far as I understand the protocol, it will leak all contacts of an infected user to the server, even contacts from a very brief encounter that are later not determined to be "at risk". This is necessary because the risk assessment is done centrally. These contacts would not be required by law to be reported, at least in Germany (because the app will not notify them about an infection).

aboutet commented 4 years ago

@gdanezis The beacons (i.e. hello messages) seen by a user within a period of time are kept secret locally on the user's mobile phone. These beacons are only exposed to the server if the user is diagnosed COVID-positive (with their agreement). To avoid leaking information about the social graph of infected users, their IDs are not revealed to the server when they share the received beacons, and these beacons are not linked together.

kaythxbye commented 4 years ago

@aboutet

their IDs are not revealed to the server when they share the received beacons, and these beacons are not linked together

Are there any hard technical guarantees for that, or is it just assumed that the server is operated in an honest way?

ReichertL commented 4 years ago

The NSA example may be supplemented with Libra and Calibra, other excellent examples in which we can examine the dangers of having lists of correlated user events handled by a coterie of hand-picked trusted authorities, and the historic, disastrous, appalling privacy consequences which lie therein

Libra and Calibra as in the digital currencies?

ramsestom commented 4 years ago

@aboutet

their IDs are not revealed to the server when they share the received beacons, and these beacons are not linked together

Are there any hard technical guarantees for that, or is it just assumed that the server is operated in an honest way?

See section 6.1 of the specification document.

gdanezis commented 4 years ago

I am confused. Is this leaking the social graph to the honest-but-curious health authority, or just the who-contaminated-whom graph by design? Isn't the latter already required by law and regulated by a code of ethics? If this is the case, maybe the question is whether ROBERT can be proven compatible with the code of ethics.

@vaudenay My reading of section 7 of 'ROBERT-specification-EN-v1_0.pdf' is that the social graph over time of all users is leaked to the server continuously, not merely the graph of who contaminated whom. Specifically it states:

"In order to check whether user UA is ”at risk”, i.e. if she has encountered infected and contagious users in the last CT days, application AppA regularly sends ”Exposure Status” Requests (ESR REQUEST) to the server Srv for IDA"

These Exposure Status Requests contain vectors of EBIDs. Therefore the vector of EBIDs seen over time is routinely provided by each user, allowing for the reconstruction of time series of proximity graphs and social graphs. An 'honest-but-curious' service can therefore reconstruct this graph simply by being curious (without mounting any active attacks or deviating from the protocol).

To avoid leaking information about the social graph of infected users, their IDs are not revealed to the server when they share the received beacons, and these beacons are not linked together.

@aboutet thanks for the clarification. However, the EBID of a user that has submitted a report can be extracted with probability much better than random (I think it would be very high in many cases). As I discussed in the original post, you can define a social proximity between submitted vectors of EBIDs. So to recover the EBID of the user you just need to find the EBID that appears most frequently in the lists of EBIDs that are close to the submitted one. Since recording of EBIDs is symmetric with good probability (if you see me, I see you), and triangle relationships are very common (if you see Jack, I will also see Jack, and Jack will see both of us), the people who saw a similar set of EBIDs to the user have also likely seen the user.
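The de-anonymization step above can be sketched as follows. Again this is a hypothetical illustration, not code from any implementation: pick the lists closest to the target's list (its likely contacts), then count which EBID those neighbours report that the target itself does not.

```python
# Hedged sketch (hypothetical data): guess the EBID of the user behind a
# submitted list by majority vote among the most similar other lists.

from collections import Counter

# Each anonymously submitted list of seen EBIDs (submitter unknown).
submissions = {
    "list_0": {"bob", "carol", "dave"},    # secretly submitted by Alice
    "list_1": {"alice", "carol", "dave"},  # secretly submitted by Bob
    "list_2": {"alice", "bob", "dave"},    # secretly submitted by Carol
    "list_3": {"eve"},                     # unrelated user
}

target = "list_0"

# Rank the other lists by overlap with the target (social proximity).
neighbours = sorted(
    (k for k in submissions if k != target),
    key=lambda k: len(submissions[k] & submissions[target]),
    reverse=True,
)[:2]

# An EBID seen by most of the target's neighbours but absent from the
# target's own list is a strong candidate for the target user's own EBID
# (you do not record your own beacon, but your contacts do).
counts = Counter(
    e for k in neighbours for e in submissions[k]
    if e not in submissions[target]
)
guess, _ = counts.most_common(1)[0]
print(guess)  # "alice"
```

This exploits exactly the symmetry and triangle properties mentioned above: Alice's neighbours (Bob, Carol) both recorded Alice, while Alice could not record herself.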

Needless to say, network-level information and attacks (IP addresses, cookies, etc.) may also provide a very strong indication of who the user is, either replacing or supplementing the above de-anonymization attack.

vaudenay commented 4 years ago

@gdanezis Hi George. My reading of 6.1 "Upload by the application" is that only diagnosed users upload the subset of their LocalProximityList corresponding to the contagion period, based on medical authorization. So the curious server only learns whom diagnosed people contaminated. Users who trigger the "Exposure Status Request" only learn whether they have been listed by these uploads or not. Am I correct?

gdanezis commented 4 years ago

So the curious server only learns whom diagnosed people contaminated.

@vaudenay Hi Serge -- we agree on what is uploaded and learned by the server as part of reporting, i.e. Section 6.1: indeed there is a mapping between a user and a list of their own EBIDs. However, as per the passage I quote above (section 7), the server also regularly learns a vector of EBIDs seen by each user. Hamming distance between those vectors is a measure of proximity, first spatial and then, through repetition, social, so the server also learns a full graph of all users.

Now, @aboutet clarified that as part of Section 7 the mapping between a user's EBID or long-term ID and the uploaded vector of EBIDs is not revealed. However, as I mention, such a mapping can be reconstructed from the information that is available, and of course from any side information.

Users who trigger the "Exposure Status Request" only learn whether they have been listed by these uploads or not. Am I correct?

They indeed get back this 1 bit, but for this 1 bit to be generated they need to upload a vector of seen EBIDs, which results in the information leakage I highlighted. This is separate from, and in addition to, the leakage in section 6.1 on reporting infections.

vaudenay commented 4 years ago

I am still confused. Are you referring to p.11 "Application processing: AppA queries the server by sending the following request over a TLS channel: ESR_REQUEST_{A,i} = [EBID_{A,i} | Time | MAC_{A,i}]"? I don't understand this as AppA sending a vector of collected EBIDs. I understand it as AppA using its current EBID to authenticate itself to the server. Are we talking about the same thing?
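My reading of the quoted request format can be sketched as follows. Field widths and the MAC construction are illustrative assumptions (the spec excerpt quoted here does not fix them); the point is only that the request carries a single current-epoch EBID, not a list of collected beacons.

```python
# Minimal sketch of an ESR request per the quoted format
# [EBID_{A,i} | Time | MAC_{A,i}]. Encodings are assumptions for
# illustration, not taken from the ROBERT specification.

import hashlib
import hmac
import time
from dataclasses import dataclass

@dataclass
class EsrRequest:
    ebid: bytes  # the app's OWN EBID for the current epoch i (not a list)
    time: int    # coarse timestamp
    mac: bytes   # MAC binding ebid and time under a per-user key

def build_esr_request(ebid: bytes, key: bytes) -> EsrRequest:
    t = int(time.time())
    msg = ebid + t.to_bytes(8, "big")
    # Hypothetical choice: HMAC-SHA256 as the MAC.
    return EsrRequest(ebid, t, hmac.new(key, msg, hashlib.sha256).digest())

req = build_esr_request(b"\x01" * 8, key=b"per-user-mac-key")
print(len(req.mac))  # 32 (SHA-256 HMAC)
```

Under this reading the server sees one pseudonym per request, used for authentication and routing, which is the distinction at issue in this thread.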

ramsestom commented 4 years ago

However, as per the passage I quote above (section 7) the server also learns regularly a vector of EBIDs seen by each user.

No, it does not. The only thing the server knows is, for each user, which of their EBIDs were observed in proximity of some infected user(s). It does not know at all who the observer(s) (= the infected user(s)) responsible for reporting these EBIDs were.

gdanezis commented 4 years ago

Many thanks for your patience @vaudenay -- I think I get it. EBID_{A,i} is the single-epoch pseudonym, not the list of beacons seen. The matching is done against those revealed by infected users (this is where the list appears, not at the 'Exposure Status Request' stage), so indeed the trivial network reconstruction I described here is not the issue.

I am closing this issue to not confuse matters, and continue to analyse this and other protocols. Many thanks!