When consuming the EphIDs, utilize k-anonymity

michaelsmoody commented 4 years ago

While not specific to the implementation of DP-3T, in looking at a system for determining contacts that may be used by hospitals, it seems it would be ideal to leverage the property of k-anonymity.

An example of this being used in the real world is the Pwned Passwords v2 check, built in collaboration with Cloudflare. It checks whether the hash of a password appears among compromised passwords: you type in your password, but the full hash is never sent to the server. In a similar way, an extra layer of privacy and security could be added to any contact tracing system, minimizing the information disclosed by those who use an app or other client to check for cross-contact.
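To make the password example concrete, here is a minimal sketch of the client side of a hash-range check. It assumes the scheme described in the posts linked below: SHA-1 the password, send only the first 5 hex characters, and receive back lines of the form `SUFFIX:COUNT`. The function names are my own, and the network call is left out on purpose.

```python
import hashlib

PREFIX_LEN = 5  # hex characters revealed to the server (assumed, per the linked posts)


def split_hash(password: str) -> tuple:
    """Return (prefix, suffix) of the uppercase SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:PREFIX_LEN], digest[PREFIX_LEN:]


def suffix_in_range_response(suffix: str, body: str) -> bool:
    """Check locally whether our suffix appears in a 'SUFFIX:COUNT' response body."""
    for line in body.splitlines():
        candidate, _, _count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return True
    return False


# Only `prefix` would ever leave the client; the comparison happens locally.
prefix, suffix = split_hash("password")
fake_body = "0000000000000000000000000000000000A:3\n" + suffix + ":42\n"
found = suffix_in_range_response(suffix, fake_body)
```

The server only ever learns the prefix, which is shared by many other hashes, so it cannot tell which one the client was interested in.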

For more information on the specifics of the Pwned Passwords implementation, please see here:

https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/ https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/

Are there recommendations for anyone who might hope to implement these proposals according to best-practices?

Thanks in advance, Michael S. Moody

mex2meou commented 4 years ago

Hi @michaelsmoody, thanks for this suggestion!

Could you please clarify at which point in the processes of the DP-3T system you would envision the utilization of k-anonymity?

michaelsmoody commented 4 years ago

EphIDs need to leave the possession and custody of an individual at some point, and be stored in a central location, presumably upon trigger of the condition for which they were gathered, according to my current understanding.

Using the example of Covid-19, I'll assume for a moment that EphIDs have been generated over a period of time. An individual confirms a positive diagnosis, and the EphIDs then need to be transferred to the custody of a central repository. At this point, the central repository is in possession of information. Using k-anonymity in the manner described for password hashes would allow a querying party to query the repository without disclosing which particular EphIDs it is looking for.

I understand, of course, that this isn't specific to the DP-3T system so much as a recommendation for implementing the repository's retrieval and search mechanism. It should allow querying for results while protecting privacy at the same time. In the case of Covid-19, you would prevent repositories from holding information about querying users, because the processing for matches would be handled client side: hand the querying user the subset of data that may include their relevant EphIDs, without requiring them to expose their own data to find a match.

Please let me know if this helped to explain my suggestion.

TL;DR - allow queries to use partial matches, which provide a subset of data which could match, handle the processing client side, inform the querying user of a match.
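A rough sketch of what that TL;DR could look like, purely as an illustration and not anything taken from the DP-3T documents or reference implementation. The `Repository` class, `check_contacts`, the SHA-256 digest, and the 4-character prefix are all assumptions on my part.

```python
import hashlib
from collections import defaultdict
from typing import List, Set

PREFIX_HEX_CHARS = 4  # hypothetical: a shorter prefix gives a larger, more anonymous bucket


def ephid_digest(ephid: bytes) -> str:
    """Hex digest of an EphID, used so that prefixes are uniformly distributed."""
    return hashlib.sha256(ephid).hexdigest()


class Repository:
    """Hypothetical central store of reported EphIDs, bucketed by digest prefix."""

    def __init__(self) -> None:
        self._buckets = defaultdict(set)

    def report(self, ephid: bytes) -> None:
        digest = ephid_digest(ephid)
        self._buckets[digest[:PREFIX_HEX_CHARS]].add(digest)

    def query(self, prefix: str) -> Set[str]:
        # The server sees only the prefix, never the full digest being checked.
        return set(self._buckets.get(prefix, set()))


def check_contacts(observed: List[bytes], repo: Repository) -> List[bytes]:
    """Client side: fetch each candidate bucket and do the exact match locally."""
    matches = []
    for ephid in observed:
        digest = ephid_digest(ephid)
        bucket = repo.query(digest[:PREFIX_HEX_CHARS])
        if digest in bucket:
            matches.append(ephid)
    return matches


repo = Repository()
repo.report(b"A" * 16)                 # EphID uploaded after a positive diagnosis
observed = [b"A" * 16, b"B" * 16]      # EphIDs this client has seen nearby
matches = check_contacts(observed, repo)
```

The prefix length is the knob here: the repository still learns the prefix of each query, so a shorter prefix hides the client among more candidates at the cost of a larger response.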

Thank you again for the excellent work on this proposed system, and for considering any application of my suggestion of k-anonymity. It's not necessarily part of the workflow of DP-3T, and more specifically applies to anyone who creates an API to query the information once gathered.

michaelsmoody commented 4 years ago

Slightly related: Cloudflare has experience building a system for these types of queries, and I noticed a few other issues mentioning CDNs, etc. If you built this type of query system, it could be heavily optimized and cached at the CDN. I'm going to reach out to the individual who built the system for Pwned Passwords over at Cloudflare and ask them to chime in here.

michaelsmoody commented 4 years ago

I've tweeted @IcyApril. Hopefully they'll chime in, as they built the system used for a similar implementation.

kennypaterson commented 4 years ago

Thanks. We are already talking to them.

kennypaterson commented 4 years ago

Where them = Cloudflare

michaelsmoody commented 4 years ago

I looked over the reference implementation and re-reviewed some documents that have had slight updates and the FAQ updates.

So, specific to this discussion on this issue....

The current implementation is to download a list (assuming a full list, yes?) of EphIDs from infected users. I do understand that processing is already done client side. However, constantly syncing a full list of infected users and all associated IDs seems expensive in both storage and transfer. Reversing the process, so that users pull from the repository only the EphIDs they may have come into contact with via a k-anonymity-enabled query mechanism, would still respect privacy without disclosing contact.

Given the regular rotation of EphIDs, the possible length of time between installation and infection, and the eventual count of users, it seems the data needing to be synced could get out of hand, even if done on an incremental basis using deltas. It must also be stored client side, and processing it would require greater and greater resources as the dataset grew. (It's entirely possible, perhaps even probable, that I have fundamental misunderstandings about what that process looks like.)
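For a sense of scale, a back-of-the-envelope sketch. Every number in the usage lines is a made-up, purely illustrative input (not a figure from DP-3T or from any deployment), and the 16-byte EphID size is an assumption exposed as a parameter.

```python
def full_sync_bytes(reporting_users: int, ephids_per_day: int,
                    days_reported: int, ephid_size_bytes: int = 16) -> int:
    """Raw size of a full list of reported EphIDs, ignoring any encoding overhead."""
    return reporting_users * ephids_per_day * days_reported * ephid_size_bytes


def expected_bucket_bytes(total_bytes: int, prefix_hex_chars: int) -> float:
    """Average size of one prefix bucket, assuming digests are spread uniformly."""
    return total_bytes / (16 ** prefix_hex_chars)


# Illustrative inputs only: 1,000 reporting users, 96 EphIDs per day, 14 days each.
total = full_sync_bytes(1_000, 96, 14)
per_bucket = expected_bucket_bytes(total, 4)
```

With these hypothetical inputs the full list is about 21.5 MB, while a single 4-hex-character bucket averages a few hundred bytes. A client checking many observed EphIDs would fetch many buckets, so the real saving depends on how many it has collected.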

Nevertheless, it seems that querying the data in the method suggested has merit, at least to me. I wanted to be more specific in the recommendation as I didn't feel I explained it well in the follow-up.

EDIT: Fantastic on being in discussions with Cloudflare already.

DP-3T / documents

When consuming the EphIDs, utilize k-anonymity #138