DP-3T / documents

Decentralized Privacy-Preserving Proximity Tracing -- Documents

Risk of de-anonymization of infected users is much higher than estimated #169

Open kholtman opened 4 years ago

kholtman commented 4 years ago

Introduction

This issue relates to the privacy risk analysis done in the April 12 version of the white paper. Basically, I believe this analysis grossly under-estimates the risk that infected users will be de-anonymized. I will illustrate this with some attack scenarios below. I will also outline a redesign approach which I believe will better meet the goals of the project.

These comments are mainly informed by my experience in doing industrial R&D work on the privacy issues for various types of smart phone based location tracking infrastructure. Based on my experience with accuracy issues, I have other concerns too, but these are mostly reflected in the issue reports of others already, so they are not included in the long write-up that follows.

Threat considered

The threat model and analysis (section 5.1 of the white paper dated April 12) recognizes the risk of an eavesdropper who uses a stationary or mobile device that collects more data than just broadcast EphID data. This device might also collect and store camera images, or the plain GPS location of the eavesdropping device and the point in time at which each specific EphID was captured. As soon as the eavesdropper has access to (information allowing the matching of) the EphID values of self-reported infected users, this eavesdropper might be able to use the extra collected data to de-anonymize one or more infected users. Success rates can be very high here, especially if a lot of 'crowd-sourced' eavesdropping information is combined.

The threat model analysis made in the white paper estimates that an eavesdropping attack is costly if a large area has to be covered, and also notes that it is illegal under EU law, leading to the conclusion that the privacy risk associated with this attack is low. I will argue that the analysis made is incomplete. Specifically, the per-victim up-front cost of implementing certain types of scalable, geographically widespread eavesdropping and de-anonymization attacks is low. Though these attacks are typically illegal under the GDPR and/or other laws, this in itself is not enough to conclude the attacks will not happen. What matters is the equation that compares the immediate benefits to the attacker(s) against their risk of getting caught and being brought to justice. This couples to the ability of EU governments to effectively enforce the GDPR in the worldwide app economy and advertising/tracking ecosystem, and this ability is unfortunately very low.

I will now outline two specific attack scenarios that I have not seen mentioned earlier in other issue reports.

Attack scenario 1

This attack is performed by an independent app author and the users of that app. Imagine that the government-approved or government-promoted standard app for a country will tell its uninfected users whether they have recently encountered an infected user with high probability, but for privacy reasons it does not reveal where and when this encounter might have happened. Now consider an app author who releases an improved tracking app that will also tell its uninfected users exactly where and when the encounter took place. This will often have the effect of de-anonymizing the infected user. In risk terms, it exposes infected users to a much greater risk of de-anonymization than was intended or promised, so it will suppress the willingness of these users to self-report. Basically we get a bad dynamic here, one that goes beyond mere privacy concerns.

To make this improved but attacking app work, the app author needs to design an app that a) scans for and logs EphID beacons in just the same way that the approved app does, b) logs not just the EphID values collected, but also the time and place at which each value was collected, and c) has access to infected user EphID values, or a means to match them with the logged data. This matching means can be implemented by a routine that makes a query call to a remote server set up by the app author, a server located outside of the EU, with this server containing the necessary match-enabling data related to infected users. Now, one mitigation against the creation of such a server, as envisaged in the white paper I believe, is to implement access permission restrictions to the official database(s) that hold this data, so that only legitimate government-blessed apps can get the data. However, such access permissions represent a very low hurdle to the attacker: by running a government approved app on a rooted phone equipped with the right hacking tools, the access credentials, or the data itself, can be extracted. The necessary expertise can be hired on the dark-nets I believe, and the cost per affected user will unfortunately be very low.
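
To make the data flow concrete, here is a rough sketch of the extra logging and remote matching such an 'improved' app would need. It is illustrative only: the endpoint, function names and data format are hypothetical, not taken from the white paper or any real app.

```python
# Hypothetical sketch of attack scenario 1: log EphIDs together with time and
# place, then ask a matching server (run by the app author, outside the EU)
# which of them belong to self-reported infected users.
import time
import requests  # any HTTP client would do; used here purely for illustration

MATCH_SERVER = "https://match.example.invalid/api/match"  # hypothetical endpoint

observations = []  # list of (ephid_hex, unix_time, lat, lon)

def on_beacon_received(ephid_hex, lat, lon):
    """Called by the BLE scanning layer for every EphID beacon seen."""
    observations.append((ephid_hex, int(time.time()), lat, lon))

def deanonymize():
    """Return the time and place of every encounter with an infected user."""
    ephids = [e for (e, _, _, _) in observations]
    matched = set(requests.post(MATCH_SERVER, json=ephids, timeout=10).json())
    return [(t, lat, lon) for (e, t, lat, lon) in observations if e in matched]
```

The point is how little code separates the approved behaviour from the attacking behaviour: only the time/place logging and the remote query are new.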

As another mitigating approach, governments could make an attempt to discourage such app authors by prosecuting them, but if they reside outside of Europe, and if the matching data processing is done outside of Europe with some plausible level of deniability built in, this all gets very difficult. Banning the apps from the official app store might be possible, but app store policing is an imperfect game, and on Android there is the option of side-loading. Going after end users who install the improved app will also be difficult: there are privacy and right-to-repair and right-to-root implications. At a political level, attempts to ban these improved apps and their use might be hugely unpopular with the users of the improved apps. Such users might organize, or be organized, into pressure groups that force the government to break its promises to infected users, accepting a mission creep that gradually erodes their privacy.

Note that the attack scenario outlined in GitHub issue #37 can be seen as a particular refinement of the above scenario. Definitely, the mitigation considered under design 2 of the white paper can reduce the risks of de-anonymization: in my view it helps but it is not sufficient to eliminate the problem.

In terms of lowering the risk to acceptable levels for infected users, the most effective mitigating action I can see is to change the data flow design, so that the government app can no longer download any descriptive information about infected user EphID beacon values. This means that matching will no longer happen on uninfected users' phones. Instead, the government-approved app will have to send the EphID beacon values collected by the uninfected user's phone to a trusted third party database server or matching service, where they will be matched with the infected user EphID beacon values, with only a single yes/no bit, or some slightly richer result value, being returned to the phone of the non-infected user.

To make the above mitigation measure effective, a query rate-limiting system with active monitoring is needed on the above third-party service. This is needed to lower the risk of success of attacks which use a series of specially constructed queries to narrow down time windows, or attacks with queries designed to extract all latent information in the database. Beyond a rate-limiting system, a proof-of-right-to-query approach, based on fuzzy matching to tracking logs collected by the infected person's app, has great promise as an effective countermeasure. There are more options; I won't expand further here.
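
As an illustration only, the core of such a rate-limited, yes/no-only matching service could look like the sketch below. The limits, names and single-bit result are assumptions for the example, not a specification.

```python
# Sketch of a third-party matching service that returns only a yes/no bit and
# rate-limits queries per client. Limits and names are illustrative assumptions.
import time

INFECTED_EPHIDS = set()       # filled from data voluntarily uploaded by infected users
MAX_QUERIES_PER_DAY = 4       # assumed policy; needs tuning and active monitoring
MIN_BATCH_SIZE = 50           # refuse tiny batches crafted to narrow down time windows

_query_log = {}               # client_id -> list of recent query timestamps

def match_query(client_id, observed_ephids):
    """Single yes/no answer: did any observed EphID belong to an infected user?"""
    now = time.time()
    recent = [t for t in _query_log.get(client_id, []) if now - t < 86400]
    if len(recent) >= MAX_QUERIES_PER_DAY:
        raise PermissionError("rate limit exceeded")
    if len(observed_ephids) < MIN_BATCH_SIZE:
        raise ValueError("batch too small; refusing narrowly targeted queries")
    _query_log[client_id] = recent + [now]
    return any(e in INFECTED_EPHIDS for e in observed_ephids)
```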

To be clear, the above mitigating solution definitely increases the privacy risks for non-infected users: their phones will now be sending some data to a third party, data that is never sent in the current design. However, I believe this solution direction is needed to achieve acceptable levels of privacy risk for infected users.

In fact, the current design can be said to be unbalanced in exactly the wrong way: it assigns all privacy risks to infected users, just so that the non-infected users do not have to rely on trusting some third-party data matching service operator. However, non-infected users still have to trust other third parties, e.g. smart phone OS makers and the makers of the government-promoted app, so I believe that the burden of requiring trust in an extra third party server or service is not a big step.

Attack scenario 2

This is an attack by an ad library developer. In today's app ecosystem, many free apps rely on revenue from showing ads, and these ads are typically inserted using third-party ad libraries that are linked into the free app. These libraries typically collect privacy-sensitive data from the phone. There is a legitimate need to collect this data and send it to the ad library maker's back-end systems: it can significantly improve the selection of the ad to show. But this also gives the ad library maker the technical ability to go beyond acceptable or allowed data processing or sharing practices. The ad library back-end might process or forward collected data in a way that is illegal in the jurisdiction of the app end user, illegal even if the end user has pressed 'accept' on some multi-page agreement.

This is relevant here because ad libraries can also include code that accesses the GPS and Bluetooth systems of the phone (see also the links further below). Typically, access requires that the user also clicks OK on a permission dialog presented by the smart phone OS, but most end users have been trained by industry to always click OK. We may have some specific luck here: for power-saving reasons, Bluetooth is often turned off, and this is somewhat decoupled from the permissions system.

To outline the attack data flow: GPS and Bluetooth EphID data is collected by the ad library and sent to the ad library developer's back-end systems. The ad library developer then sells this data to diverse data brokers, typically sharing a revenue cut with the app developer who included the library. When sufficient EphID and related sensor metadata is present in the databases of these data brokers, and if these brokers also obtain the matching data for infected users, these brokers will be able to de-anonymize many infected users with high confidence. If past history is any guide, some of them will not hesitate to sell both high-confidence and low-confidence identifications. So this is a scalable, low-cost, and profitable attack that will typically proceed without the smart phone users even being aware that their phone is helping in the attack, and that they might even be running some legal risks themselves because of this.

For some background information on the above, see for example https://www.ftc.gov/news-events/blogs/business-blog/2016/09/ad-libraries-app-developers-check-out-advice and https://money.cnn.com/2013/12/18/pf/data-broker-lists/ To be clear, the last time I studied the state of the smart phone ad library and data broker ecosystem in detail was around 2016. At that time it was a privacy and privacy enforcement dumpster fire. I have not checked if and how much it might have improved since. My expectation is that attack scenario 2 is still feasible right now, especially in the Android ecosystem. I would be delighted if someone makes an analysis which concludes that scenario 2 is now low-risk.

Mitigation: In theory, a perfectly enforced GDPR would make the above scenario impossible. A problem here is that an ad library developer may hide behind a claim that they and all their partners will never process or forward data in a way that violates the GDPR or associated safe harbor rules. This makes enforcement very difficult. Historically speaking, governments have done very little to develop an effective enforcement capability. The only effective mitigation that I see is the third-party service based matching I described earlier.

Concluding remarks

Reading the project white paper and other documents, there are many things to like. I like especially that the design and analysis recognize the role which the GDPR and GDPR enforcement can play in improving privacy outcomes. However, I believe the design and risk analysis are currently very unbalanced.

The expected societal result of this unbalance, if not corrected, is that infected users will be asked to take a fairly large risk that they will be de-anonymized. This is especially bad because the consequences to infected users of being de-anonymized can be very severe, compared to the consequences for non-infected users. I believe it is better to pivot the design, to create a more equitable distribution of risks and benefits among user types.

Once we accept the idea that a trusted third party should be in charge of matching, and that it cannot happen on the phones themselves, a large design space opens up that allows risks to be managed and re-balanced differently. In one corner of this space, it is the uninfected EphID related data that is stored for a few weeks outside of the originating phones, whereas infected EphID data is read out once and then immediately discarded once matches have been made.

Note that all of the above does not apply to the share-with-epidemiologists part of the proposal. This has a different implied risk structure, which is less problematic in my view. In other issue descriptions, I have seen suggestions that the project should more strongly decouple this part of the design. I support these suggestions.

I believe the project is partly motivated by a very valid trust-in-government question. However, there are several trust-in-government questions that need to be asked. To put it bluntly, regarding the threat model of eavesdropping and associated data processing, I have fairly low trust in the current ability of EU governments and institutions to enforce GDPR adherence among the various private-party players and commercial ecosystems described above, especially when these players reside outside of the EU. I definitely hope that GDPR enforcement abilities will improve long-term, but I see no quick fixes here. By comparison, my trust in an EU government's ability to set up, and keep honest, a third party service that will handle data from both infected and uninfected users is much higher.

monty241 commented 4 years ago

Reading attack scenario 1, I came across this passage:

However, such access permissions represent a very low hurdle to the attacker: by running a government approved app on a rooted phone equipped with the right hacking tools, the access credentials, or the data itself, can be extracted.

In my humble opinion, this is the same problem that hinders the use of the OAuth2 code grant flow on untrusted devices while protecting application secrets. Sadly, for lack of a good solution, even industry leaders now ship products with the necessary application credentials packaged in.

With OAuth, a man-in-the-middle attack is generally easy: since the device is untrusted, the HTTPS traffic can generally be accessed and manipulated, for example by re-routing traffic and/or loading certificates. The risks in attack scenario 1 can be mitigated, but not removed, by adding additional layers of "undocumented magic" to the client's algorithm or seeded data such as a client certificate. Changing the algorithm in an undocumented way is not really desirable.

Forwarding data from the device (untrusted for the application's purpose) to a central infrastructure under the control of the application's owner (or a delegated party, to reduce control by the state) indeed seems like a solution that can secure privacy in the long term and reduces the number of actors that can harm privacy. Educating millions of possibly malicious users and their devices to meet software supplier expectations has so far proven impossible to achieve, especially in the Android ecosystem. A central infrastructure allows for easier auditing and change when security issues arise.

kennypaterson commented 4 years ago

@kholtman I think your first attack scenario is discussed in the whitepaper on pages 24 and 25, especially at the top of page 25. Please would you check there and let me know if we are talking about the same thing?

kennypaterson commented 4 years ago

@kholtman For your second attack scenario, if I am reading it correctly, are you saying that there may be third party software running permanently on the phone that is able to record and forward Bluetooth beacons to a centralised gathering point?

kholtman commented 4 years ago

@kennypaterson On the second attack scenario: yes indeed that is what I am saying, in the sense that a single free app (e.g. a free game app) may contain software from several parties. The third party I am worried about is the ad library writer. The ad library code may be able to access Bluetooth, capture beacons, and forward the beacon info to a server.

In practice, power saving related mechanisms and several other mechanisms in the OS may suppress the ability of the ad library code to access Bluetooth capturing functionality permanently, but there are modes and situations, especially if the game is being played as a foreground app, where these mechanisms may offer little practical help in mitigating the attack. My practical experience with this is several years out of date. Without naming names or model numbers, in a past experiment we saw the OS provide the Bluetooth scanning facility to the foreground app for several tens of minutes, after which it would be silently shut down. However, it could be restarted by closing and then re-opening the OS-provided API.

kholtman commented 4 years ago

@kennypaterson On my first attack scenario, this is indeed similar to several of the attacks considered on page 24 and 25 of the April 12 whitepaper.

In my view, the analysis of the white paper over-estimates how tech-savvy an attacker needs to be to pull off several of the detailed attacks discussed in the last paragraph on page 24 and the first paragraphs on page 25. You can read attack scenario 1 as a case where one single tech-savvy app developer acts to empower millions of non-tech-savvy end users, allowing all of them to perform the de-anonymization attack types mentioned on those pages, and more. The app author can also organise and automate information sharing among these millions to make de-anonymisation even more likely to succeed. As the saying goes, there is an app for that.

So scenario 1 causes me to end up with a much higher risk estimate than in the white paper. Now, obviously there is no way to completely eliminate all risk of unwanted de-anonymization. The risk exists even without any app entering the picture, and to some extent we will always be asking people who self-identify as infected to accept the risk of being de-anonymized beyond their control. But I believe the current design creates too high of a risk, and that it can be improved.

lciti commented 4 years ago

@kholtman Disclaimer: I have only a basic understanding of network security and encryption, so my question may be completely off the mark. I work in AI and I understand the basics of homomorphic encryption, and I wonder if it could be used here. In your scenario 1, would private set intersection (or something similar) allow checking whether the app's EphIDs for a given day have any intersection with the server's collection of infected people's EphIDs? This would give the non-infected user's "improved" app only a single yes/no bit, without the non-infected users having to send all their collected EphIDs to the server.

kdagley commented 4 years ago

In your scenario 1 mitigation you require an uninfected user to upload beacon values to a remote server. This is not necessary. The uninfected user would just load their beacon values to the local official app for comparison. The same rate limiting, etc you suggested for the remote server could be implemented in the local app to prevent leakage of the infected user’s identity. This keeps the uninfected user data local. Additionally the infected user’s beacon data would first be encrypted then sent and loaded into the official app on the uninfected user’s phone. The uninfected user would never have access to unencrypted infected user data even on a jailbroken device.

gardners commented 4 years ago

Also, I think it is key to point out, that as soon as you have some centralised database, you introduce much nastier problems, including the usual suspects of governments and attackers wanting to get their hands on the data. That is, here I think your cure is worse than the disease. While it remains decentralised, the risks are all physically local: Is there someone in my vicinity who is running such a modified app? With the centralised database, you can no longer control for this.

vaudenay commented 4 years ago

Similar threats: #100, #121.

kholtman commented 4 years ago

@lciti Yes, the use of private set intersection could be part of a mitigating solution. By itself it is not enough, however, to address my concerns: you need a kind of access/rate limiting mechanism that infected users who disclose their data can trust. I'll try to unpack this a bit further than I did in my original issue report above.

Say we have an attacker who is dissatisfied that, with the official government app, they will only get a yes/no signal if they met someone in the last 10 days who self-reported as infected. The attacker wants to know when exactly they met the infected user. We can see this as an attack that partially or fully de-anonymizes the particular infected user they met: it is an important step to getting closer to their identity.

Say the attacker has a set A of time-stamped EphID values collected by their phone, and the attacker also has access to the set B of all EphID values that were sent by all self-reported infected users' phones in the country in the last 10 days (how to get access, under the design disclosed in the white paper, is discussed in my description of attack scenario 1). Then the attacker can match each individual element of set A to set B. Say there is one A[23] in A that matches an element in B. Then the attacker can look up the time stamp of A[23], which they also recorded, to get the time of the encounter. If there are multiple matches, yielding multiple time stamps of multiple encounters, typically this will offer greater de-anonymization opportunities.

OK, so now we try to introduce some cryptographic magic like private set intersection to make this attack more difficult. Say the authorities (working on behalf of the infected user) never disclose the set B, but only disclose an information block B' for downloading by the government-approved apps, where these apps will run the cryptographic operation doesintersect(A,B'), for the locally collected A, that returns true if one or more elements in A match elements in B. This modifies the attack as follows. The attacker also obtains B', as described in attack scenario 1, through unapproved means, and also obtains the program code needed to implement doesintersect(A,B'). Now, the attacker can run the operation doesintersect({A[i]},B') N times on their phone or computer, once for every index i in their set A. This gives them A[23] again, so nothing has improved. Using further cryptographic magic, we might construct a B' that only allows for doesintersect(A,B') operations where the size of A is N elements. However, then the attacker can run doesintersect({A[i],r2,r3,...,rn},B') N times, where the rj values are random EphID identifiers that they just invented, but never actually received on their app. Again, A[23] can be identified. What we would need to block this is a B' and a doesintersect operation that refuse to process inputs containing randomly invented values. This can be done; however, it would require massively changing the data collection and trust regime, and/or the way that EphIDs are generated, away from how the white paper foresees things. These changes will have implications that increase privacy risks for uninfected users.
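
The enumeration attack just described is easy to express in code. The sketch below uses doesintersect() as a stand-in for whatever (cryptographic or plain) matching primitive the app embeds; everything else about it is an illustrative assumption.

```python
# Sketch of the enumeration attack: if the attacker can run the matching
# primitive as often as they like, a yes/no answer leaks which EphID matched.
import secrets

def doesintersect(A, B_prime):
    # Stand-in: in the real system this would operate on B', the published
    # representation of infected EphIDs, not on a raw set.
    return any(a in B_prime for a in A)

def enumerate_matches(A, B_prime):
    """One query per observed EphID reveals exactly which ones matched."""
    return [a for a in A if doesintersect({a}, B_prime)]

def enumerate_with_padding(A, B_prime, n):
    """Same attack when queries must contain n elements: pad each singleton
    with freshly invented EphIDs that were never actually observed."""
    hits = []
    for a in A:
        padding = {secrets.token_hex(16) for _ in range(n - 1)}
        if doesintersect({a} | padding, B_prime):
            hits.append(a)
    return hits
```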

The mitigation direction I propose in my original issue report can be interpreted as a system that does rate-limiting on the number of (cryptographic or plain) doesintersect(A,B') operations that anybody, including an attacker, is able to run. This means that B or B' need to be made less widely available: they need to be hidden in a trusted third party server (or a more complex third-party system) that constrains everybody's query access to the doesintersect(A,B') operation. Cryptographic magic can then still play a role as follows. In the baseline non-magic design, the government-approved apps (and the attacker) will have to send their sets A to the above third party server (over a secure channel) in order to get a matching result. This implies a risk that the third party server will store or otherwise process these sets A, in violation of trust. Operations like private set intersection can be used to avoid sending such sets: instead, the server acts as a conduit that rate-limits communications which are otherwise opaque to it. But note that such solutions add complexity, including complexity in the chain of trust and threat model analysis. Also, they create performance issues that have to be resolved.

The DP-3T team has so far I believe stayed away from considering this type of refinement by using cryptographic magic. Based on my own design experience with this type of system, I consider this to be a valid choice, given time limitations and the need for transparency and broad review.

kholtman commented 4 years ago

@gardners You write:

Also, I think it is key to point out, that as soon as you have some centralised database, you introduce much nastier problems, including the usual suspects of governments and attackers wanting to get their hands on the data. That is, here I think your cure is worse than the disease. While it remains decentralised, the risks are all physically local: Is there someone in my vicinity who is running such a modified app? With the centralised database, you can no longer control for this.

Sorry, I think you will have to tilt the system design you are holding in your head to see the hole that I am pointing at.

In the system of the white paper, the government of each country acts as a clearing house, to publish a set of EphID keys voluntarily submitted by infected users. This set of EphID keys is a curated dataset.

Any dataset, if it is to be accessed or used by running software somewhere, implies the existence of some kind of database system that mediates access. I am using database here in a very broad sense of the word: the Web is a database system in this sense.

A very distributed database system example is the distributed ledger in a cryptocurrency system, which has copies of the dataset floating around everywhere, with no attempt being made to restrict read access to these copies. On the extreme other end of the spectrum, we have a secure centralized data repository with heavy access restrictions, e.g. restrictions where every visitor has to identify themselves based on personal digital credentials given to them earlier in an office where they had to show their passport and were fingerprinted. In such systems, all access is typically logged for possible later forensic use.

The above examples represent different points along a design spectrum that makes a risk tradeoff between a) privacy and related risks for the people whom the data is about (these are the self-reported infected users for the dataset of the white paper), versus b) privacy and related risks for people who seek access to the data (the uninfected users).

The cryptocurrency distributed ledger can occupy the point in the design space it occupies only because of the implicit assumption that the data in the ledger poses only a very low risk in terms of digital currency owners being de-anonymized. This low risk assumption is not perfectly met in practice: we have seen side-channel attacks where currency owners have been successfully de-anonymized.

I can read your comment as assuming that there is no database whatsoever in the current system design of the whitepaper. This is not true: there is one and it is of the more distributed kind, consisting of a set of backend servers that provide copies of the dataset to government-blessed apps. These servers have some access protection to keep out unblessed apps, but it can be subverted pretty easily. In the phrasing of your comment, we can be pretty sure that the usual suspects of governments and attackers wanting to get their hands on the dataset will indeed get their hands on this dataset. This is not necessarily disastrous in itself, but you have to look at all the implications.

To summarize: The white paper makes a tradeoff choice which implies pretty weak access restrictions being in place for its database system. This choice definitely improves privacy for those seeking access. But the validity of this tradeoff choice rests on correctly estimating the privacy risks for the people who the data is about. I have argued that these are under-estimated, so that the current design has an unwanted risk imbalance between the two user types.

kholtman commented 4 years ago

@kdagley thanks for proposing refinements to the mitigation strategy: I definitely did not consider all possible refinements in my original comment post. However, your refinements are based on the assumption that the official app will be able to download the infected user data and then keep it sufficiently protected, via a set of measures like the following:

Additionally the infected user’s beacon data would first be encrypted then sent and loaded into the official app on the uninfected user’s phone. The uninfected user would never have access to unencrypted infected user data even on a jailbroken device.

In my analysis, such measures add hurdles, but these hurdles are too low to protect the data in practice, given the state of hacking tools available on the grey and black market. Specifically, while the infected user data can be downloaded and stored in an encrypted form, the official app will also have to have a decryption key built in somewhere, else it cannot use the data itself for the intended purpose. Hacking tools can be used to locate and extract this key as well as the data.

Note that some phones, specifically newer iOS phones, have more protection against hacking tool attacks than others. For example, phones may have trusted platform module hardware that adds a layer of protection against rooting. Trusted platform modules can also help in other ways, if available and if the OS allows the app to access their services. But these improved facilities are only available in some phones: the government approved app also has to run on phones without them, so attackers can focus on those phone models if they want the data.

lciti commented 4 years ago

@kholtman Thanks for your detailed reply. I understand the issue a bit better now.

This implies a risk that the third party server will store or otherwise process these sets A, in violation of trust. Operations like private set intersection can be used to avoid sending such sets: instead, the server acts as a conduit that rate-limits communications which are otherwise opaque to it. But note that such solutions add complexity, including complexity in the chain of trust and threat model analysis. Also, they create performance issues that have to be resolved.

Please let me know if the following could address some of these concerns. Let's start from the "Unlinkable decentralized proximity testing" in the current version (a0a88c3) of the white paper. Let's consider a commutative encryption function (or two functions that commute) so that Ek2(Ek1(m)) = Ek1(Ek2(m)). The key k1 is only known to the local app while the key k2 is only known to the backend and possibly generated anew every time a new cuckoo filter is created. For each observed EphID, the local app will store H(EphID || i) as it does now. The backend creating the cuckoo filter would do so by using k2 and insert Ek2(H(EphID || i)) into the filter for each EphID corresponding to someone who tested positive. After receiving a new version of the cuckoo filter, each app is allowed to send an encrypted batch of EphIDs to the backend, but only once. Calling A[1],A[2],...,A[n] the hashed H(EphID || i) observed over time, the app will send Ek1(A[1]),Ek1(A[2]),...,Ek1(A[n]). The backend will reply with a random permutation of these, encrypted with k2, i.e.: Ek2(Ek1(A[25])),Ek2(Ek1(A[12])),...,Ek2(Ek1(A[2])). The app will then decrypt them using k1, thus obtaining Ek2(A[25]),Ek2(A[12]),...,Ek2(A[2]) (note that Ek1^-1(Ek2(Ek1(m))) = Ek1^-1(Ek1(Ek2(m))) = Ek2(m)). The local app will then check the presence of each Ek2(A[j]) in the cuckoo filter, but since the order has been randomized by the backend it would not know which of the contacts was with an infected person.
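
For concreteness, here is a toy sketch of a commutative encryption that behaves this way, using Pohlig-Hellman/SRA-style exponentiation modulo a prime. The prime size, hashing and key generation are illustrative assumptions only, not a claim about the right construction; this is not production crypto.

```python
# Toy commutative encryption: E_k(m) = m^k mod P with gcd(k, P-1) = 1,
# so Ek2(Ek1(m)) == Ek1(Ek2(m)). Illustrative parameters only.
import hashlib
import secrets
from math import gcd

P = 2**127 - 1  # a Mersenne prime; a real deployment would use a far larger safe prime

def keygen():
    while True:
        k = secrets.randbelow(P - 3) + 2
        if gcd(k, P - 1) == 1:
            return k

def H(ephid, i):
    digest = hashlib.sha256(ephid + b"||" + str(i).encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1   # map into [1, P-1]

def enc(k, m):          # E_k(m)
    return pow(m, k, P)

def dec(k, c):          # E_k^-1(c)
    return pow(c, pow(k, -1, P - 1), P)

# Round trip used above: app blinds with k1, backend re-encrypts with k2 (and
# shuffles, omitted here), app unblinds and tests Ek2(A[j]) against the filter.
k1, k2 = keygen(), keygen()
m = H(b"observed-ephid", 7)
assert enc(k2, enc(k1, m)) == enc(k1, enc(k2, m))
assert dec(k1, enc(k2, enc(k1, m))) == enc(k2, m)
```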

While the installations of the "improved" app could collaborate to identify a few contacts by coordinating their requests to the backend in some clever ways, due to the rate limiting they will not be able to identify a significant number of contacts. Therefore users will have little incentive to install the "improved" app (which would waste their opportunity to interrogate the backend about actual contacts to help some remote user de-anonymize their contacts).

Some obvious problems with this solution are the increased processing and communication costs, and the fact that the contacts of healthy individuals will need to leave their phones, albeit encrypted (this may be a psychological barrier, which may be hard to overcome for the general population).

kholtman commented 4 years ago

@lciti Thanks for the mitigation design proposal above. I am currently allocating my time to some other system-level issues in the white paper design, so it may take me several days before I can write a detailed opinion about this specific approach.

tom-leclerc commented 4 years ago

Hi, as it happens, at my company, Proximus Luxembourg, we are also working on a privacy-preserving solution of our own, and looking at the DP-3T design, we discovered the same flaw as described in this issue (which brought me here).

Our design has a key difference from DP-3T: instead of uploading the SK from the infected user, we actually upload the infected user's encounters! Thereby, the server does not contain any information on infected users, protecting privacy even more than DP-3T's current design.

Please find below a document describing our design in detail. From what I understand, the EphID could be made much simpler (i.e. completely random, without any crypto involved) since we do not need it.

The design we propose has the following key properties:

  1. The backend (server) and the mobile devices never know any of the infected EphIDs at any point in time.
  2. The backend (server) cannot pinpoint which user had an encounter with an infected user, neither with 100% accuracy nor with high probability.
  3. The chances of an enumeration attack, as described in this issue by github user “kholtman”, are greatly reduced.

Properties 1) and 2) together imply that even if the backend is compromised, the attacker doesn’t get access to any privacy-sensitive data.

LVP_PXS_Lu_contribution_to_DP-3T.docx

monty241 commented 4 years ago

Going over the design (I like the partial obfuscation approach), I see a possibility for the government to control the value of k and thereby the degree of partial obfuscation. Maybe, as a countermeasure to mitigate the risk, provide lower and upper bounds into which the values should fall? These bounds could be chosen considering that, with a population of 1 million plus, a minimum usage percentage is necessary anyway for any general health effect, also considering the reproduction number R. Also, the user interface could display k or a calculated value based upon it. This of course requires open source code to avoid manipulation at the client level.

Finally, I feel certain that a single partially obfuscated value cannot reliably be de-anonymized, but across multiple measurements (collecting more values, or combining them with other information already known) there is often a way to increase the accuracy of lifting anonymity. Has any consideration been given to such an approach to lifting anonymization?

tom-leclerc commented 4 years ago

Agreed on k: there must be an upper bound up to which it should be done, of course. A lower bound is not really necessary, since this partial obfuscation is merely there for performance reasons, i.e. to limit the upload to the mobile rather than to hide anything. In an unlimited-bandwidth scenario we could just upload the full list every time.

Yes, we thought about the multiple measurements and think it is not an actual issue. The way we thought about generating the UUIDs (or EphIDs) was to make them start with a timestamp (e.g. unix time) and then append the actual UUID behind it.

So the partial UUID we are sending out can be seen as merely giving out the unix time at which the ID was generated.
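
As a rough illustration of the idea (a sketch only, with made-up sizes and field layout, not the exact format from our document):

```python
# Illustrative sketch: IDs that start with a coarse timestamp prefix, so the
# prefix alone reveals only when the ID was generated, and the public encounter
# list can be filtered by prefix. Sizes are made up for the example.
import os
import struct
import time

def new_ephid():
    hour = int(time.time()) // 3600            # coarse unix-time prefix
    return struct.pack(">I", hour) + os.urandom(12)

def hour_prefix(ephid):
    return struct.unpack(">I", ephid[:4])[0]

def filter_encounters(encounters, hours_of_interest):
    """Keep only encounter IDs whose prefix falls in hours this phone cares
    about, so the download stays small without revealing exact IDs seen."""
    return [e for e in encounters if hour_prefix(e) in hours_of_interest]
```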

The most important part is that the server never gets any confirmation of whether one of the IDs matched on the mobile side. Without any feedback, it has basically no information on whether the user corresponds to any of the IDs sent out.

Secondly, the server does not contain the full encounters, just the remote part of each encounter. Thus there is no way to trace encounters over time, which stops any tracing. This is, for me, actually a design flaw in DP-3T: by "signing" the EphIDs it also provides a way to tie EphIDs together (retrospectively) across more than just one single EphID.

Finally, the neat thing about having the list of encounters (instead of the list of infected users) on the public server is that it does not give out any personal information about the user (whether he was infected or not, which in the end is the only personal information there is in the system). Thus the public server does not actually need any particular protection (in terms of read access). Each line on the public server only shows one side of an encounter. Encounters in this list cannot be tied together in any way (you cannot tell that they were encounters of the same user, nor that they were seen by the same infected person originally).

Nonetheless, we might have missed something, so feel free to counter-argue! Thanks a lot for the reply!

lciti commented 4 years ago

@tom-leclerc The approach of uploading the infected person's observed contacts rather than their own EphIDs is briefly discussed in the FAQs.

Also, I am not sure your solution provides any extra anonymisation compared to the current approach as an "improved" app could keep track of when and where (using GPS) it broadcasted a given EphID. Therefore, when that EphID is found in the list of contacts of infected people it may find out that it was generated on Monday at 2pm while the user was discussing business with Ms Smith in her office. The "improved" app can still pool together information across users to create a map with red spots corresponding to the places where these contacts occurred.

tom-leclerc commented 4 years ago

For your second point: yes, the only way to know who is infected in our design is by being able to connect one's own EphID with another encounter. But we reduce the impact of that tracing much more (as described in the document):

The attacker will not be able to prove that the encounter in which one of his own EphIDs occurs is indeed tied to that particular infected person. Since communications are asymmetrical, it may be that a Bluetooth device received the attacker's beacon while the attacker did not receive that particular user's beacons (at least it is a possibility). In DP-3T's current design the attacker (with the exact same tools) has certainty about who the infected person is.

Nonetheless, yes, the problem is not gone, it is just much harder: a single EphID does not suffice anymore, one must trace the user over time and space and be sure no other (same) users are close by during that time. If there is a group of people in a meeting, we will not be able to tell who exactly is infected in that meeting; it might even be several people... That's the main advantage: as soon as there is more than one person around (which is very likely), we ensure their privacy.

Regarding the FAQ: yes, we know that our design requires much more data on the server, but as you've seen we found a solution that requires merely basic filtering (the partial UUIDs). Moreover, the more users we have, the bigger k becomes, hence we can easily adapt the upload size. We don't need cuckoo filters here, so it is much simpler to compute. The data and filtering we have will be easily managed by any cloud platform.

Regarding an attacker that might create fake encounters: well, let them, they will not really gain anything. A replay attack (e.g. replaying an ID of another user) will only be possible within the same hour (if we renew IDs every hour). Then the user whose ID the attacker managed to replay must actually get infected to achieve a false positive. So the attacker must: 1) know a valid UUID another user is using, 2) replay it within the same timeframe as the UUID (say 1 hour), 3) be physically near the target for that hour, and 4) the target must actually become infected to have any impact.

So fake events either only affect the attacker himself, or he must be able to achieve all four steps above.

kholtman commented 4 years ago

@lciti I have some general observations related to the FAQs and the status and validity of the privacy design in the April 12 white paper.

The FAQs are very helpful as a documentation of the design trade-offs made by the DP-3T team for the current design, and as a reviewer I greatly appreciate the effort that is being made to be open and transparent.

However, if I put my GDPR review best practices and privacy by design hat on, ignoring for the moment that exceptions could be made for the sake of an emergency, what I see is a work in progress, not a finished system design that any EU government could hope to get past a GDPR review. The team will need to do more work if the goal is to create a protocol that can be used for up to several years. In particular, it will need to argue convincingly that no privacy improvement low-hanging fruit has been left unpicked.

There is definitely the option of trying to claim special emergency GDPR dispensation, where a V1 of the DP-3T protocol is released for immediate use in an experimental-type app: the doctrine of informed consent can do a lot here. But I expect that most country-level data protection authorities would strongly reject the idea that an official government-promoted app could be released with a V1 protocol based on the April 12 white paper design, unless the government can show convincingly that it has clear and technically feasible plans to do a rolling update to an improved V2 protocol only a few months later.

The FAQ motivates many of the choices made by noting that they lead to a reduction in the data transfer rates from backend servers to phones. This is nice, but if a design alternative is identified that can reduce the possible impact of de-anonymization attacks by (counter-intuitively) transferring say 5 times more data, then the cost/benefit trade-off should be investigated. I have not reviewed the design by @tom-leclerc above yet, but from the general description, to me this 'hiding in the crowd' approach is a promising route to reducing de-anonymisation risks for self-reporting infected users, e.g. in the attack scenario of people wardriving with an 'improved' app at a party.

In any case, even after a 5x increase in data size, we are still talking about data flow rates which are tiny compared to the capacity of CDNs in Europe. If data size really is limiting, then there is the option of dividing Europe or individual countries into 50x50 km regional tiles, where apps can select and download the tiles they need, and maybe some more just to satisfy paranoia about traffic analysis. This would greatly reduce data rates with a very limited impact on de-anonymization threat levels for users. (Also, I figure that there are still some rural areas in the EU with way-below-EU-average connectivity, so we may need some type of tiling approach or equivalent anyway to ensure that people in these areas are not left behind.)
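
As a sketch of what such tiling could look like (the tile size, the rough geometry, and the key format below are just assumptions for illustration):

```python
# Back-of-envelope sketch: map a latitude/longitude to a ~50x50 km tile key so
# an app only downloads the infected-EphID data for tiles it recently visited
# (plus, optionally, a few decoy tiles). Tile size and key format are assumptions.
import math

TILE_KM = 50.0
KM_PER_DEG_LAT = 111.0  # rough average

def tile_key(lat_deg, lon_deg):
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(lat_deg))
    row = int((lat_deg * KM_PER_DEG_LAT) // TILE_KM)
    col = int((lon_deg * km_per_deg_lon) // TILE_KM)
    return f"{row}:{col}"

print(tile_key(47.37, 8.54))   # e.g. the tile covering Zurich
```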

ramsestom commented 4 years ago

That is what I like about the ROBERT protocol (https://github.com/ROBERT-proximity-tracing/documents): it does not rely on any exchange of information between users concerning their infected status or their list of contacts (which, in both cases, allows any user to have a pretty clear idea of which infected people he was in contact with). This is a huge problem in my opinion, considering all the issues we have already seen with caregivers stigmatized because they could more easily be carriers of COVID-19. So imagine what some people would unfortunately be able to do if they could know who infected them... People tend to think that a decentralized solution is necessarily better than a centralized one at protecting their privacy. But they don't realize that, in this case, it actually means putting more trust in every person they crossed paths with, and to whom they will send some information about their infected status if they declare themselves infected, than in their own government running a centralised server (which is, moreover, probably more strongly required to respect the GDPR than a random individual).

tom-leclerc commented 4 years ago

@ramsestom Thanks for this! We saw ROBERT and I agree with you that it does not reveal who is infected. At some point in our designs (before finding the truncated ID idea that avoided the scalability issue) we had the same mechanism. The issue I see with ROBERT is that the public server can identify every user of the application (i.e. via connection meta-data). We had in mind that this should be avoided to avoid transforming the app into an actual people tracing app for governments with bad intentions.

Nonetheless, ROBERT is still a much better solution than the current DP-3T design. The issues mentioned in your comment and in mine have different impacts. ROBERT: the public server could completely identify users throughout the system (it has all the UUIDs a user has used), but this is still not straightforward, as one must capture connection metadata and tie it to the actual user. OUR DESIGN: the user may find out, with statistics, which ID (but not proven) infected them. Here such a user must also go to some lengths to isolate the target user in order to statistically pinpoint the infected person. Several conditions must hold for it to actually work (isolating the user for a relevant time, etc.); it is not straightforward, but yes, it could be done.

That said, yes, it is easier to "trust" a government or enforce the law (GDPR, etc.) there than on individual users.

At least, if the DP-3T team is considering ROBERT as the next design, it would be a step in the right direction, and maybe we can add a concept/mechanism to ROBERT that resembles the truncated EphIDs and helps limit the server-side "knowing all users" issue I was mentioning.

burdges commented 4 years ago

ROBERT cannot be considered privacy preserving by any measure because it reports movements by all users to a central authority.

Yes, ROBERT prevents users from downloading the infected users' ephids, but users can query ROBERT about a specific ephid by rotating their own configuration, using multiple devices, etc. ROBERT thus provides no privacy benefits even for infected users.

DP-3T protects location privacy only for uninfected people. I agree this sucks, but you should remember that contact tracing becomes mostly useless once some non-negligible portion of the population becomes infected.

At least one Swiss organization advocates for contact tracing once Switzerland reaches fewer than 25 cases per day for all of Switzerland. Singapore had fairly aggressive contact tracing from even smaller numbers, but they still lost control over the situation and imposed a lockdown, although obviously Singapore has a much higher population density than Switzerland.

There are countries like the U.S. and U.K. in which the government, media, etc. want contact tracing merely as an excuse to ignore the pandemic and restart the economy. I'd agree DP-3T cannot be considered privacy preserving in such places.

tom-leclerc commented 4 years ago

"ROBERT cannot be considered privacy preserving by any measure because it reports movements by all users to a central authority." mmh I don't think this really correct? First, it doesn't report movements, just the a random ID of one side of an encounter. From my understanding ROBERT only upload the encounters of an infected, so the date is by far not about all users! So it is just a small portion of users in the end. At least this is the case in our design!

"Yes, ROBERT prevents users from downloading the infected users ephids, but users can query ROBERT about a specific ephid by rotating their own configuration, using multiple device, etc. ROBERT thus provides no privacy benefits even for infected users." Well yes, multiple devices might work together to better pinpoint an infected, but this is by far more complex to setup than in current DP-3T's design where it works out of the box with 1 single device, any number of surrounding devices and 1 single UUID received. Remember, the information on the server is NOT about the infected but just encounters, so even rotating IDs and querying multiple times the server etc. will only help you to statistically get a better knowledge on who is infected (which again is out of the box currently in DP-3T). This information only helps to know who might (because here it is only statistically) be infected it does not permit to know who actually participated in the encounters. So, I see this as a step forward, not perfect, but better.

I agree it is an extreme situation, but if we can help preserve privacy even a bit better, shouldn't we do it? I mean, if DP-3T will be used in any case, then we should make it as safe for privacy as possible. But for sure, no tracing app will be perfect; there are issues inherent to the principle itself (e.g. people just spreading rumors about an infected person without any proof or app).

ramsestom commented 4 years ago

@ramsestom Thanks for this! We saw ROBERT and I agree with you that it does not reveal who is infected. At some point in our designs (before finding the truncated ID idea that avoided the scalability issue) we had the same mechanism. The issue I see with ROBERT is that the public server can identify every user of the application (i.e. via connection meta-data). We had in mind that this should be avoided to avoid transforming the app into an actual people tracing app for governments with bad intentions.

Theoretically yes, it would be possible for the owner of the server (the government) to link each user ID to a real identity through metadata (the IP). But it would require the government to break the GDPR and ask ISPs (which are the only ones to own the "IP" to "real identity" mapping) for this information for the millions of user IPs it will have. And even with this information, the government would not be able to do much with it, as the only relevant information it could get is the list of people tagged as "at_risk" (meaning they have been in contact with an infected person). It would not be able to know who was really infected or who was in contact with whom. Anyway, this problem can be solved quite easily by having a second trusted server, maintained by a trusted organisation not linked to the government and of which we could be sure that it would preserve the anonymity of the users (an association for the protection of civil liberties, a central hospital, ...), with only one role: act as a proxy between users and the government server to mask their IPs. All user requests would go through this trusted server, which would relay them (without storing any data) to the government server under its own IP. The other solution would be to use Tor for each user connection to the government server, but it would probably be a bit complicated to implement the Tor client at the app level.

Nonetheless, ROBERT is still a much better solution than the current DP-3T design. The issue mentioned in your comment or in mine have different impacts: ROBERT: The public server could completely identify users throughout the system (it has all the UUIDs a user has used), but this still is not straightforward as one must capture meta-data of the connection and tie it to the actual user. OUR DESIGN: The user may find out, with statistics, which ID (but not proven) infected the user. Here such an user must also go to some length to isolate that target user to be able to pinpoint statistically the infected user. There must be several conditions for it to actually work (isolating the user for a relevant time, etc.), it is not straightforward, but yes could be done.

From what I understand of DP-3T (I have not studied it in much detail, especially since it is constantly evolving), it seems quite straightforward in many cases to identify, with great confidence, the infected user you have been in contact with. I will take a simple example. Imagine that every morning I go to the baker at the bottom of my house to buy some croissants, and that I use an application which records the EphIDs of other users I crossed as well as the exact time and date of contact (and possibly my GPS location at that time). My baker uses the official contact tracing application (as do other customers of his) and one day is declared positive for COVID-19. As he declared himself infected, I receive the list of his EphIDs. It is enough for me to cross-reference this list with the EphIDs that I saw during the last 14 days to see that I crossed an infected person at the bakery every morning (it is probably enough to have gone to the bakery only twice, or even once if I was the only customer at the bakery at the time) and deduce with near certainty that my baker is infected.
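
The cross-referencing in this example is trivial to automate; a toy sketch (with made-up data and names) of what an "improved" app would do:

```python
# Toy sketch of the bakery example: intersect locally observed (EphID, time, place)
# tuples with the published EphIDs of a self-reported infected person.
observed = [
    ("ephid-aaa", "2020-04-14 08:05", "bakery"),
    ("ephid-bbb", "2020-04-14 12:30", "office"),
    ("ephid-ccc", "2020-04-15 08:03", "bakery"),
]
infected_ephids = {"ephid-aaa", "ephid-ccc"}   # published after the baker self-reports

matches = [(t, place) for (ephid, t, place) in observed if ephid in infected_ephids]
print(matches)   # repeated morning matches at the bakery -> "my baker is infected"
```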

ramsestom commented 4 years ago

ROBERT cannot be considered privacy preserving by any measure because it reports movements by all users to a central authority.

No, it does not at all. The only thing the central authority receives is the information that a given user has been in contact with an infected user at a given time. It does not know who this infected person was, or which other users were also present at that time. And the only thing a user can query is his own risk status (the "chance" that he has been infected by another person). He can't ask for any information on any other user.

tom-leclerc commented 4 years ago

@ramsestom Thanks for this! We saw ROBERT and I agree with you that it does not reveal who is infected. At some point in our designs (before finding the truncated ID idea that avoided the scalability issue) we had the same mechanism. The issue I see with ROBERT is that the public server can identify every user of the application (i.e. via connection meta-data). We had in mind that this should be avoided to avoid transforming the app into an actual people tracing app for governments with bad intentions.

Theoretically yes, it would be possible for the owner of the server (the government) to link each user ID to a real identity through metadata (the IP). But it would require the government to break the GDPR and ask ISPs (which are the only ones to own the "IP" to "real identity" mapping) for this information for the millions of user IPs it will have. And even with this information, the government would not be able to do much with it, as the only relevant information it could get is the list of people tagged as "at_risk" (meaning they have been in contact with an infected person). It would not be able to know who was really infected or who was in contact with whom. Anyway, this problem can be solved quite easily by having a second trusted server, maintained by a trusted organisation not linked to the government and of which we could be sure that it would preserve the anonymity of the users (an association for the protection of civil liberties, a central hospital, ...), with only one role: act as a proxy between users and the government server to mask their IPs. All user requests would go through this trusted server, which would relay them (without storing any data) to the government server under its own IP. The other solution would be to use Tor for each user connection to the government server, but it would probably be a bit complicated to implement the Tor client at the app level.

Nonetheless, ROBERT is still a much better solution than the current DP-3T design. The issues mentioned in your comment and in mine have different impacts:

- ROBERT: the public server could completely identify users throughout the system (it has all the UUIDs a user has used), but this is still not straightforward, as one must capture connection metadata and tie it to the actual user.
- OUR DESIGN: a user may find out, statistically (but not provably), which ID infected them. Such a user must also go to some length to isolate the target user in order to pinpoint the infected user statistically. Several conditions must hold for it to actually work (isolating the user for a relevant period, etc.); it is not straightforward, but yes, it could be done.

From what I understand of DP-3T (I have not studied it in great detail, especially since it is constantly evolving), it seems quite straightforward in many cases to identify, with high confidence, the infected user you have been in contact with. Take a simple example. Imagine that every morning I go to the bakery at the bottom of my building to buy croissants, and that I use an application which records the EphIDs of the users I cross as well as the exact time and date of each contact (and possibly my GPS location at that moment). My baker (like other customers of his) uses the official contact-tracing application, and one day he is declared positive for COVID-19. Since he reported himself as infected, I receive the list of his EphIDs. It is enough for me to intersect this list with the EphIDs I observed during the last 14 days to see that I crossed an infected person at the bakery every morning (in fact, having gone to the bakery only twice, or even once if I was the only customer at the time, would probably be enough) and to deduce with near certainty that my baker is infected.

I think you mixed comments, and my terminology "Our design" was not helping to be clear. "Our Design" refers to the design PXS Luxembourg proposed (see a few comments above, let's call it LVP from now on ;) ). So "Our Design" aka LVP is very similar to ROBERT and quite far away from DP-3T's current design.

Anyway, I agree with your comments, just the quoting of my message confused me ;). ROBERT or something similar (storing encounters instead of infected EphIDs on servers), with a bit of Tor or crowd-based hiding, is better than the current DP-3T approach.

burdges commented 4 years ago

ROBERT reports the secret key from which honest users generate bluetooth ids during its query phase. There are existing ad serving networks that report bluetooth ids, so many movements get exposed.

ROBERT could only avoid doing this with expensive MPCs or some PIR scheme. In fact, I suspect homomorphic hashing integrated with PIR might make ROBERT secure, but doing this sounds beyond what folks would consider right now.

If you believe "attackers must write some code" provides a defense, then obviously DP-3T protects infected users' private data too. We should not pretend that fig leaves provide measurable security.

I think DP-3T could be fixed by decrypting and searching the infected ephids database inside a TEE, but only Android devices have powerful enough TEEs, so iOS users need their device paired with a laptop, android device, etc.

There are some European countries approaching daily infection levels for which contact tracing might become effective. Those should use DP-3T if they need a contact tracing app now. In particular, ROBERT violates GDPR by not minimizing data for uninfected people, which DP-3T avoids by only processing data for people infected by a reportable disease.

In the longer term, we might improve the situation if some application can switch to a PIR or DP-3T+TEE scheme.

ramsestom commented 4 years ago

ROBERT reports the secret key from which honest users generate bluetooth ids during its query phase. There are existing ad serving networks that report bluetooth ids, so many movements get exposed.

I suggest you read the ROBERT specification paper, because once again you are wrong about how ROBERT works. This secret key is only known by the server (and would probably be rotated for better security; see https://github.com/ROBERT-proximity-tracing/documents/issues/8#issuecomment-616430302), so it is not reported by ROBERT in any way.

tom-leclerc commented 4 years ago

ROBERT reports the secret key from which honest users generate bluetooth ids during its query phase. There are existing ad serving networks that report bluetooth ids, so many movements get exposed.

ROBERT could only avoid doing this with expensive MPCs or some PIR scheme. In fact, I suspect homomorphic hashing integrated with PIR might make ROBERT secure, but doing this sounds beyond what folks would consider right now.

If you believe "attackers must write some code" provides a defense, then obviously DP-3T protects infected users' private data too. We should not pretend that fig leaves provide measurable security.

I think DP-3T could be fixed by decrypting and searching the infected ephids database inside a TEE, but only Android devices have powerful enough TEEs, so iOS users need their device paired with a laptop, android device, etc.

There are some European countries approaching daily infection levels for which contact tracing might become effective. Those should use DP-3T if they need a contact tracing app now. In particular, ROBERT violates GDPR by not minimizing data for uninfected people, which DP-3T avoids by only processing data for people infected by a reportable disease.

In the longer term, we might improve the situation if some application can switch to a PIR or DP-3T+TEE scheme.

I actually don't know the details of ROBERT. Could you take a look at our design (see the document here: LVP_PXS_Lu_contribution_to_DP-3T_v3.docx), which is much simpler, requires no cryptography, permits a multi-server setup, and ensures the anonymity of all involved users (not perfectly, but statistically reasonably)?

lbarman commented 4 years ago

(@tom-leclerc: your document is very hard to read on my version of LibreOffice, see this screenshot)

ramsestom commented 4 years ago

I think you mixed comments, and my terminology "Our design" was not helping to be clear. "Our Design" refers to the design PXS Luxembourg proposed (see a few comments above, let's call it LVP from now on ;) ). So "Our Design" aka LVP is very similar to ROBERT and quite far away from DP-3T's current design.

Do you have any github repository where potential issues with the LVP design might be discussed? (to avoid polluting this thread with comments not directly related)

tom-leclerc commented 4 years ago

Indeed that's unreadable, sorry for that, I will reformat it and paste it here.

For github not yet, but I think it is time we put it somewhere, will update the thread soon!

burdges commented 4 years ago

ROBERT's Exposure Status Requests are described on page 10 of https://github.com/ROBERT-proximity-tracing/documents/blob/master/ROBERT-specification-EN-v1_0.pdf. In those, uninfected users upload their EBIDs to the server, from which the server decrypts some "permanent" identifier. This exposes uninfected user movements to the server, which is unacceptable.

In principle, one could avoid this permanent identifier, but this still exposes user movements to the server due to query metadata.

ramsestom commented 4 years ago

(@tom-leclerc: your document is very hard to read on my version of LibreOffice, see this screenshot)

Made a pdf version: https://easyupload.io/hwgr6o

ramsestom commented 4 years ago

ROBERT's Exposure Status Requests are described on page 10 of https://github.com/ROBERT-proximity-tracing/documents/blob/master/ROBERT-specification-EN-v1_0.pdf. In those, users upload their EBIDs, from which the server decrypts some "permanent" identifier. This exposes uninfected user movements to the server, which is unacceptable.

The EBID does not contain any "movement"-related data (it is just an encrypted ID, so that users cannot be tracked over time through their emitted Bluetooth beacon), so how would it expose uninfected users' movements?
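
For context, here is a rough sketch of a ROBERT-style EBID as I understand it; the block size and fields are simplified (the real specification uses a 64-bit EBID and additional fields), and the `cryptography` package is assumed:

```python
# Hypothetical simplification: the EBID is just (permanent_id, epoch) encrypted
# under a key that never leaves the server, so the broadcast itself carries no
# location data, and only the server can map an EBID back to an identifier.
import os
import struct
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

server_key = os.urandom(16)   # known only to the server; can be rotated

def make_ebid(permanent_id: int, epoch: int) -> bytes:
    block = struct.pack(">QQ", permanent_id, epoch)            # one AES block
    enc = Cipher(algorithms.AES(server_key), modes.ECB()).encryptor()
    return enc.update(block) + enc.finalize()

def resolve_ebid(ebid: bytes):
    dec = Cipher(algorithms.AES(server_key), modes.ECB()).decryptor()
    return struct.unpack(">QQ", dec.update(ebid) + dec.finalize())
```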

tom-leclerc commented 4 years ago

(@tom-leclerc: your document is very hard to read on my version of LibreOffice, see this screenshot)

Made a pdf version: https://easyupload.io/hwgr6o

Good point... a PDF is much simpler :) However, you didn't take the latest design; here it is in PDF: LVP_PXS_Lu_contribution_to_DP-3T_v3.pdf

ramsestom commented 4 years ago

Just a few comments while waiting for a dedicated GitHub repository to track issues with LVP:

tom-leclerc commented 4 years ago

You did not discuss the size of the database required on the server to store the contact UUIDs of all infected people (it can be quite large depending on the number of infected users and the number of their contacts).

Yes, it is not discussed, but it is something we think any CDN or cloud infrastructure will be able to handle. Plus, we can easily set up multiple servers for load balancing if it really becomes necessary, though the cloud will scale automatically anyway.

You should probably mix the lists of contact UUIDs reported by infected users through some kind of mixnet (like in ROBERT), otherwise the contact graph could be reconstructed probabilistically (even with the truncated UUIDs users send when querying the server).

I did not look at ROBERT's mixnet in detail, but yes, we could add a few random IDs to the truncated IDs to further conceal the requester's identity. However, from what I understood, in ROBERT the verification is done on the server side, so the server will be able to know the requester's identity; hence we already have "better" protection on this particular point.
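
A minimal sketch of what adding such decoys could look like; this is purely illustrative and not part of the current LVP document:

```python
# Hypothetical decoy scheme: the client queries the server with its own
# truncated IDs mixed with random ones, then filters the answer locally, so
# the server cannot tell which truncated IDs really belong to the requester.
import os
import random

def truncate(uuid_bytes: bytes, n: int = 3) -> bytes:
    return uuid_bytes[:n]                      # e.g. keep only the first 3 bytes

def build_query(own_uuids, decoys_per_id: int = 4):
    query = {truncate(u) for u in own_uuids}
    target = len(own_uuids) * (1 + decoys_per_id)
    while len(query) < target:
        query.add(truncate(os.urandom(16)))    # random decoy prefixes
    query = list(query)
    random.shuffle(query)                      # hide which entries are real
    return query
```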

How do you prevent a user from dishonestly including the UUIDs of some other users (same as here: [ROBERT-proximity-tracing/documents#7](https://github.com/ROBERT-proximity-tracing/documents/issues/7)) to trigger false alerts for many users?

For now, attacks based on copying, replaying or using fake UUIDs appear to have only limited applicability. Replays can only be done during a UUID's validity window (e.g. 1 hour), and they are only relevant if they are replayed to a user who is later declared infected. Fake UUIDs are even worse for the attacker, because they would have to generate the exact same UUID at the exact same Unix time in the same country to have any impact.

We did think, however, about attacks where many fake encounters would be generated, but this can easily be kept in check by limiting the number of uploaded encounters to a humanly reasonable amount.

Nonetheless, these parts of our design (LVP) could definitely be improved with better protection against such attacks.

With your design, it is quite easy to identify another user you have crossed as infected (see my example with the baker, which still stands), whereas with ROBERT you would be able to do so only if you had crossed one single other user during the whole period in which you used the app, since ROBERT reports to the user neither the UUID nor the time of a crossed infected user.

We are at the same level as ROBERT regarding this issue; it is only possible with a one-to-one encounter. While ROBERT does hide some parts on the server, it is still possible to know whether one of your own EphIDs has been in an encounter with an infected person. Thus the same applies there: you can easily guess who it was in a one-to-one scenario such as the one you describe.

There are actually many other such cases, for example where you use a phone for one single encounter only and then wait for it to pop up as "in contact with an infected person".

Note that we are not necessarily saying: throw away ROBERT and DP-3T and take LVP. We are really just trying to improve the concepts. ROBERT seems better than DP-3T so far. Our LVP design may improve on ROBERT by better hiding the identity of mobile users in general.

But DP-3T and ROBERT have nice features and ideas that might be useful for LVP, or the other way around :)

ramsestom commented 4 years ago
You should probably mix the lists of contact UUIDs reported by infected users through some kind of mixnet (like in ROBERT), otherwise the contact graph could be reconstructed probabilistically (even with the truncated UUIDs users send when querying the server).

I did not look at ROBERT's mixnet in detail, but yes, we could add a few random IDs to the truncated IDs to further conceal the requester's identity. However, from what I understood, in ROBERT the verification is done on the server side, so the server will be able to know the requester's identity; hence we already have "better" protection on this particular point.

I am not talking about using a mixnet when a user performs a request with his own set of truncated UUIDs, but when an infected user uploads the list of his contacts' UUIDs. Without that, you can easily infer links between UUIDs by crossing the information you have on: the IP used to report a set of contact UUIDs; the set of those contact UUIDs (meaning these people were close to each other at a given time); the IP used to perform, for a given truncated UUID, a lookup of the UUIDs reported as having been in contact with an infected user; and the set of UUIDs matching this truncated UUID. I can elaborate more on this once you have a dedicated issue tracker, if you want.
Additionally, I want to point out that unless the report of contact UUIDs to the server is done not directly by the infected user's device but by a trusted third party (such as a hospital server), the government server has direct access to the IPs of infected users just by looking at the type of request.
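
A rough illustration of the crossing described above; the data shapes and values are hypothetical:

```python
# Hypothetical server-side logs: (a) the IP that uploaded a set of contact
# UUIDs, and (b) the IPs that later look up truncated UUIDs. Joining the two
# lets the server operator start linking UUIDs to requesters and
# reconstructing contact-graph edges.
upload_log = [
    {"ip": "203.0.113.7", "contact_uuids": {"aa11", "bb22", "cc33"}},
]
lookup_log = [
    {"ip": "198.51.100.9", "truncated": "bb"},
]

for lk in lookup_log:
    for up in upload_log:
        matches = {u for u in up["contact_uuids"] if u.startswith(lk["truncated"])}
        if matches:
            # The requester at lk["ip"] plausibly owns one of `matches`, and was
            # therefore in contact with whoever uploaded from up["ip"].
            print(lk["ip"], "likely owns one of", matches, "-> contact of", up["ip"])
```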

How do you prevent a user from dishonestly including the UUIDs of some other users (same as here: [ROBERT-proximity-tracing/documents#7](https://github.com/ROBERT-proximity-tracing/documents/issues/7)) to trigger false alerts for many users?

For now, attacks based on copying, replaying or using fake UUIDs appear to have only limited applicability. Replays can only be done during a UUID's validity window (e.g. 1 hour), and they are only relevant if they are replayed to a user who is later declared infected. Fake UUIDs are even worse for the attacker, because they would have to generate the exact same UUID at the exact same Unix time in the same country to have any impact.

I am not talking about replaying a UUID, but about the possibility for an infected person to artificially grow his contact list before sending it to the server (see my comments on the ROBERT repository for more details).

We did think, however, about attacks where many fake encounters would be generated, but this can easily be kept in check by limiting the number of uploaded encounters to a humanly reasonable amount.

It would be really hard to implement such a rule, as you do not want to miss potential super-spreaders (a bus driver, for example) who may legitimately have had many contacts over the infectious period.

With your design, it is quite easy to identify another user you have crossed as infected (see my example with the baker, which still stands), whereas with ROBERT you would be able to do so only if you had crossed one single other user during the whole period in which you used the app, since ROBERT reports to the user neither the UUID nor the time of a crossed infected user.

We are at the same level as ROBERT regarding this issue; it is only possible with a one-to-one encounter. While ROBERT does hide some parts on the server, it is still possible to know whether one of your own EphIDs has been in an encounter with an infected person. Thus the same applies there: you can easily guess who it was in a one-to-one scenario such as the one you describe.

No, you are not at all on the same level as ROBERT on this aspect. With your protocol I do not need a one-to-one encounter to identify an infected person I crossed: I only need to have crossed him more than once, or to have been alone with him when I crossed him (I can have crossed many other people the rest of the time; it does not matter). With ROBERT you can only identify an infected person if you crossed a single person during the whole time you used the app, meaning you would have to create and use a new account for every person you cross (actually for every EBID (= your UUIDs) you see, as you cannot know whether two EBIDs come from the same actual person or not), which makes this attack almost impossible to carry out.

Our LVP design may improve on ROBERT by better hiding the identity of mobile users in general.

In my opinion, it does not.

lbarman commented 4 years ago

Trying to untangle the Proximus discussion as best I can; please freeze this thread for now...

edit: Done. Please continue all discussions about Proximus LVP here: https://github.com/DP-3T/documents/issues/230. Let us try to keep this thread about the de-anonymization risk.

I will now minimize/hide comments solely about Proximus; they have been copied over in https://github.com/DP-3T/documents/issues/230. The full conversation (before I hid any comment) has been archived here: https://web.archive.org/web/20200423112802/https://github.com/DP-3T/documents/issues/169.

Sorry for the complexity and thanks for your understanding.

christiano-git commented 4 years ago

Looking at this from a practical point of view: it is difficult in life to get to 100%, and resolving the last 5% will delay the project on a large time scale. Right now it is about speed and quality.

Maybe it is okay to spell out all the measures in place against de-anonymisation, to show that the intention is not to collect user data the way Google does. If possible, indicate the probability; hopefully it is still around 0.0001.

pzboyz commented 4 years ago

Should it be recommended that once a person is infected and has uploaded their EphIDs, they choose a new secret key to help avoid any potential future tracking? Given that we believe HMAC-SHA256 is secure, this step is arguably not needed, but it might give assurance to the user.

lbarman commented 4 years ago

@pzboyz yes, it is in the protocol (and is needed to avoid tracking in the future as you say).
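
For reference, a minimal sketch of what this key handling looks like in the low-cost design as I understand it (not the reference implementation):

```python
# The daily secret key is a hash chain; EphIDs for the day are derived from it.
# After an infected user uploads the relevant key, the app throws the chain
# away and draws a fresh random key, so past uploads cannot be linked to the
# EphIDs broadcast afterwards.
import os
import hashlib

def next_day_key(sk: bytes) -> bytes:
    return hashlib.sha256(sk).digest()         # SK_{t+1} = H(SK_t)

def rotate_after_report(_old_sk: bytes) -> bytes:
    return os.urandom(32)                      # fresh, unlinkable secret key
```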

IcyApril commented 4 years ago

Hi there;

I posted the following response on a similar project with the same problem, but I wanted to cross-post it here as it should help here as well:

Hi;

I've previously done a lot of work on hash-based k-Anonymous search; I created the k-Anonymous search approach used in Have I Been Pwned, which I later worked with Cornell University to formally analyse and refine into new C3S protocols [ACM]. This work fed into the efforts by Google and Stanford to create Google Password Checkup [Usenix].

Before the pandemic, I was doing work on anonymising wireless unique identifiers (for example, in Bluetooth Journey Time Monitoring Systems). This work provides formal analysis and experimental data for applying k-Anonymity to hashes for the purpose of anonymisation. The pre-print of the paper is here (conference accepted): https://arxiv.org/abs/2005.06580

Recently, I've been working on using k-Anonymity to prevent de-anonymisation attacks in existing contact tracing protocols. I have formed a hypothesis for using a cross-hashing approach to provide a cryptographic guarantee of minimum contact time and additionally prevent "Deanonymizing Known Reported Users". This uses a k-Anonymous search approach to reduce the communication overhead and offers additional protection against data leaks from the server (using Private Set Intersection). The hypothesis can be found here alongside discussions of the risk factors, but do note there are no experimental results at this stage and the paper is not peer-reviewed: https://arxiv.org/abs/2005.12884
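
For readers unfamiliar with the idea, here is a minimal sketch of the hash-prefix style of k-anonymous lookup; the parameters are illustrative and this is not the construction from the paper:

```python
# Hypothetical prefix query: the client reveals only a short hash prefix, the
# server returns the whole bucket of matching hashes, and the actual
# membership test happens on the client.
import hashlib

PREFIX_LEN = 5                                  # hex characters revealed to server

def bucket_of(identifier: bytes) -> str:
    return hashlib.sha256(identifier).hexdigest()[:PREFIX_LEN]

def client_check(identifier: bytes, server_bucket: set) -> bool:
    full = hashlib.sha256(identifier).hexdigest()
    return full in server_bucket                # checked locally, never sent
```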

If anyone has any feedback on this work, please do reach out to me (my email is on the papers).

Thanks