
Add plausible deniability by using randomized responses #198

Open · adewes opened this issue 4 years ago

adewes commented 4 years ago

Thinking about the privacy risk to infected people, I wondered if it could be reduced further by creating plausible deniability for them via a randomized response technique. This could be done by randomly changing the reporting of a Covid-19 test, so that people who have been tested negative have a small chance of being reported as positive and vice versa. This would obviously increase the error rate, but it would make it harder for attackers to learn the infection status of people they have re-identified, e.g. by observing EphID values and correlating them with other data such as video footage.

The drawback of this method is of course that people might be quarantined because of a contact who was randomly reported as infected, or that an infected person might be reported as healthy. How strongly this affects the precision of the system depends on the other errors that are present.

Threat Analysis

Scenario: An adversary obtains EphID values of an individual as well as further information about that individual, e.g. an image or other personal data that can be used to identify them. If the individual now gets marked as infected by the system, the adversary will be able to obtain this information with 100 % confidence. Such an attack is easy to perform for any adversary with access to a suitable BLE device and to relevant context information such as video footage, so it has a high likelihood of occurring in practice.

Defense Mechanism

To defend against this attack, we use randomization of the reported results in order to create plausible deniability. This can e.g. be done by randomly modifying the test results of the subset of individuals that get tested for Covid-19. The randomization uses two Bernoulli-distributed binary variables with probabilities $p_1$ and $p_2$. After performing a Covid-19 test with result $r$ (either $0$ or $1$), a value $x_1$ is drawn from the first distribution, which will be $x_1=0$ with probability $p_1$ or $x_1=1$ with probability $1-p_1$. If $x_1 = 1$, the system reports the true test result. If not, another value $x_2$ is drawn from the second distribution, which will be $x_2=0$ with probability $p_2$ or $x_2=1$ with probability $1-p_2$. That value is reported as the test result instead of the true value.
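
A minimal sketch of this mechanism in Python (the function name and the 0/1 encoding of results are illustrative, not part of the proposal):

```python
import random

def randomized_report(true_result: int, p1: float, p2: float) -> int:
    """Two-coin randomized response as described above.

    true_result is the real test result: 0 (negative) or 1 (positive).
    With probability p1 (x1 = 0) the true result is discarded and a
    random value x2 is reported, which is 1 with probability 1 - p2
    and 0 with probability p2; otherwise (x1 = 1) the true result is
    reported unchanged.
    """
    if random.random() < p1:                      # x1 = 0: randomize
        return 1 if random.random() >= p2 else 0  # x2 = 1 w.p. 1 - p2
    return true_result                            # x1 = 1: report truthfully
```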

This procedure creates a differentially private randomization mechanism and can thereby protect the anonymity of every individual whose data is submitted to the system.

Effect on Tracing Accuracy and Privacy

The effect of the randomization on the accuracy of the tracing system depends on the parameter $p_1$: the larger $p_1$, the more random noise is injected, reducing the precision of the tracing data. The question is which value of $p_1$ is necessary to create enough plausible deniability for an individual. This depends on the context information that an adversary has access to. If he/she does not have any additional information about the individual, the best estimate that the person is infected, without any data from the tracing system, is the base infection rate of the population. Now, if the adversary observes a positive test result for this individual via the system, he/she can improve this estimate. Without any randomization, the adversary would learn the true value with 100 % accuracy. At the other extreme, if $p_1=1$ (i.e. all values are generated randomly), the adversary would learn nothing about an individual from the test result.

In practice, a balance needs to be found between these two extremes that sufficiently reduces the privacy risk to the individual and at the same time does not reduce the precision of the contact tracing scheme too much. For a given value of $p_1$, the adversary then has to take all possible combinations of ($x_1$, $x_2$, $r$) into account to update his/her estimate. Likewise, the precision of the differentially private tracing system can be estimated based on the value of $p_1$ and the estimates of other relevant error sources.

As a first naive guess one could choose $p_1=p_2=p_r$, where $p_r$ is the non-randomized empirical frequency of positive test results (i.e. the empirical probability that a Covid-19 test will be positive). With this choice a reported positive test result is exactly as likely to have been produced by the randomization process (probability $p_1(1-p_2) = p_r(1-p_r)$) as by a truthfully reported real positive result (probability $(1-p_1)p_r = (1-p_r)p_r$). This parameter choice therefore reduces the confidence in a reported positive result to roughly 50 % though. Whether this is problematic depends on the other sources of error that are present in the system. If we choose a smaller value for $p_1$, the confidence of the adversary (and of a legitimate user) in his/her estimate of the infection status of the individual increases. Even a small $p_1$ value might create enough plausible deniability for an individual though.
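
To make the 50 % figure concrete, here is a quick check of the adversary's Bayesian update under this naive choice (the base rate value is purely illustrative):

```python
p_r = 0.05     # illustrative base rate of positive tests
p1 = p2 = p_r  # the naive parameter choice

# Probability of a reported positive, conditional on the true result:
rep_pos_if_pos = (1 - p1) + p1 * (1 - p2)  # truth kept, or randomized to 1
rep_pos_if_neg = p1 * (1 - p2)             # only via randomization

# Bayes' rule: P(truly positive | reported positive)
posterior = (rep_pos_if_pos * p_r) / (
    rep_pos_if_pos * p_r + rep_pos_if_neg * (1 - p_r))
print(posterior)  # (1 + p_r) / 2 = 0.525, i.e. roughly 50 % confidence
```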

Effect on Statistical Use of the Data

As the randomization mechanism is known (and does not need to be kept secret), its statistical effect can be corrected for. The randomization introduces additional noise into the system, which reduces the precision of any statistical estimate. With a large enough sample size this effect quickly becomes negligible though.
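
For example, since the reporting probabilities are public, the observed rate of reported positives can be de-biased by inverting the mechanism defined above (a sketch, assuming results are encoded as 0/1 as before):

```python
def estimate_true_rate(observed_rate: float, p1: float, p2: float) -> float:
    """De-bias the observed positive rate by inverting the randomization.

    E[observed] = (1 - p1) * true_rate + p1 * (1 - p2), so solving for
    true_rate removes the bias; the remaining random noise shrinks with
    the sample size n as 1 / sqrt(n).
    """
    return (observed_rate - p1 * (1 - p2)) / (1 - p1)
```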

Let me know what you think! If there's general interest in exploring this approach further, I'm happy to work out a detailed optimization scheme for picking the randomization parameters.

lbarman commented 4 years ago

Hi again @adewes. Thanks for the detailed suggestion!

To better understand, let me ask: when Alice feels sick and goes to the hospital but is diagnosed negative, she still uploads (the seed of) her EphIDs with some probability, as if she were infected?

> In practice, a balance needs to be found between these two extremes that sufficiently reduces the privacy risk to the individual and at the same time does not reduce the precision of the contact tracing scheme too much.

would you see a sweet spot?

Don't get me wrong, I like plausible deniability; just trying to see how many false positives this creates :)

FishmanL commented 4 years ago

Hey, in terms of the DP here, we're gonna want to eliminate false negatives (so we'll want a small probability of randomized yeses rather than a larger one of randomized noes), since type 1 and type 2 errors don't have equal utility in this case!

(otherwise no issues, I've been working on DP for a while for the GPS case and this works perfectly well for the bluetooth case)

adewes commented 4 years ago

Hey @lbarman, happy if I can contribute! Yes, the idea is to mix randomized data into the process in a way that is indistinguishable from the real data to any adversary. Usually one would design the system to generate both false positives and false negatives, but it's also possible to e.g. only generate false positives (depending on your definition the system might no longer be differentially private then, but it might still provide protection against specific privacy risks). This probably also answers @FishmanL's question.

The strength of the randomization depends on many factors; there is no unambiguous way of choosing the parameters, but with a quantitative risk and utility model it's possible to derive an optimal set (for the given model). On the one hand it would be necessary to take into account as many error sources as possible (e.g. the accuracy of Covid-19 tests, the precision of the Bluetooth measurements and of the contact tracing scheme) to see how strongly the randomization affects the overall precision of the system. On the other hand it would be necessary to model a "Bayesian adversary" who possesses different amounts and types of context information and tries to estimate whether a given person has Covid-19 or not. Based on the provided context information and the information the adversary can learn from the DP3T system, one can then calculate the gain in confidence of the estimate, which in turn allows estimating the privacy risk of the system to the individual.

There is no absolute guideline for this; in practice a deniability of at least 20-30 % could already be regarded as sufficient, as a given user only publishes a single data point. It would even be possible to let each user choose a desired amount of deniability according to his/her perceived risk (see the sketch below). This would allow users to precisely choose how much information they want to provide to the system, instead of forcing them to choose between publishing no information at all or the complete information about their health status. Again, please note that the deniability also depends on the amount of context information the adversary has.
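
As a sketch of how a per-user deniability target could be translated into a randomization parameter, assuming the toy adversary model above (an adversary whose only prior knowledge is the population base rate; all numeric values are illustrative):

```python
def adversary_confidence(p1: float, p2: float, base_rate: float) -> float:
    """P(truly positive | reported positive) for the mechanism above,
    against an adversary whose prior is the population base rate."""
    rep_pos_if_pos = (1 - p1) + p1 * (1 - p2)
    rep_pos_if_neg = p1 * (1 - p2)
    return (rep_pos_if_pos * base_rate) / (
        rep_pos_if_pos * base_rate + rep_pos_if_neg * (1 - base_rate))

def p1_for_confidence(target: float, p2: float, base_rate: float) -> float:
    """Smallest p1 that pushes the adversary's confidence below `target`.

    Bisection works because the confidence falls monotonically from 1
    at p1 = 0 to the base rate at p1 = 1; `target` must lie in between.
    """
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if adversary_confidence(mid, p2, base_rate) > target:
            lo = mid
        else:
            hi = mid
    return hi

# 30 % deniability (confidence at most 70 %) with the naive p2 = p_r:
print(p1_for_confidence(0.70, p2=0.05, base_rate=0.05))  # ~0.024
```

With these illustrative numbers a very small $p_1$ already suffices, in line with the remark above that even a small $p_1$ value might create enough plausible deniability.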

Developing such a model will take some effort but I'm happy to do this, I just want to make sure that you think this is worth including before doing that, so please give me a quick thumbs up or thumbs down first.

A similar randomization mechanism could be used for the epidemiological data as well, though a different type of randomized response would be necessary. Happy to elaborate on this as well if you think it's interesting.

adewes commented 4 years ago

On a related note, I think the randomization scheme could solve the problem of potentially releasing person-relatable health data into the public domain, which has been brought up repeatedly by proponents of more centralized solutions and which I think should not be easily dismissed. The partially randomized data would have a much stronger basis for being considered anonymous even if an adversary managed to break the pseudonymization using context information, as the randomization provides strong mathematical / information-theoretic anonymity guarantees and the anonymity is not conditional on the secrecy of cryptographic keys or the non-observability of EphIDs. From a GDPR perspective this could make it much easier for health authorities and governments to accept this scheme over a centralized solution.

I can also provide a detailed analysis of the anonymity of the randomized data in the context of the GDPR, following the Article 29 Working Party guidelines and existing court decisions.

FishmanL commented 4 years ago

@adewes for centralized data releasing you'd need location data, e.g. for geographic heatmaps -- thankfully, those can be approximated with a counting query, and we in the DP sphere know well how to handle counting queries :p (individual/path-based modeling is a bit harder, but I've been working for the past 2 weeks on a modified PTR framework for distance noisifying that gets some pretty freakin good guarantees assuming people are distributed approx. uniformly)

adewes commented 4 years ago

I'm not sure I understand what you're referring to @FishmanL. If you want to use a DP noise mechanism to publish path-based GPS data (e.g. a sequence of positions of an individual) I would strongly argue against that as it's usually not secure. GPS data is extremely correlated so DP's usual assumptions of additive privacy budgets won't hold, and using DP to add noise to an entire GPS trace will (in my experience) quickly destroy all utility for a realistic privacy budget.

Maybe I don't understand this correctly though. I would be very curious to hear about your approach as I have worked on related problems for a while and don't know of any mechanisms that can produce high-utility DP anonymized datasets from high-dimensional data like GPS traces. Maybe it's better to open a different issue for that though as I think it's a different topic and it would be nice to keep this discussion about the randomization of test results.

FishmanL commented 4 years ago

So, you've got 2 options for that:

1. Model GPS as a series of fixed positions; for each position, generate a single noisy ring statically, then have all other distance/heatmap queries done by picking the closest point on the ring.
2. You don't need to publish paths for modeling: you can just allow people to submit paths and publish the number of contacts, adding Laplacian noise to the minimum distance for each contact.
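
For illustration, the counting-query part of option 2 might look like this (a sketch; the `epsilon` parameter and the sensitivity of 1 are assumptions about how contacts are counted, not part of the proposal above):

```python
import numpy as np

def noisy_contact_count(true_count: int, epsilon: float) -> float:
    """Publish a contact count via the Laplace mechanism: one person
    changes the count by at most 1 (the sensitivity), so adding
    Laplace(1 / epsilon) noise gives epsilon-differential privacy."""
    return true_count + np.random.laplace(scale=1.0 / epsilon)
```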

adewes commented 4 years ago

Ok, maybe open another issue for that though, I think it's a bit off-topic in this discussion.

FishmanL commented 4 years ago

oh, yeah, 100%, i just figured since you asked :)

adewes commented 4 years ago

Any update on this @lbarman?

lbarman commented 4 years ago

Not at the moment, sorry! I missed this message of yours actually:

> Developing such a model will take some effort but I'm happy to do this, I just want to make sure that you think this is **worth including** before doing that, so please give me a quick thumbs up or thumbs down first.

(emphasis mine)

Sorry about the delay. In general we're very happy to see multiple extensions sprout and we encourage them! We will probably only be able to produce a minimal working version of the concept at first, and your ideas are great. At the same time (since you specifically mention "included") I cannot guarantee this will be included, since our dev team is under so much pressure. But if you're willing to suggest something more concrete, there is also a higher chance that we include it early on. Thanks!

edit: formatting

FishmanL commented 4 years ago

The simple concrete method is having every tested individual have a ~0.5 % chance of sending their IDs to the server as infected, regardless of test result. Given that high-test countries have about 4-5 % positive test rates, that's only raising false positives by ~8 %, but it gives you pretty simple anonymity properties with enough testing.
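
A quick back-of-the-envelope check of these numbers (the exact relative increase depends on the assumed positive rate):

```python
p_upload = 0.005  # ~0.5 % chance of uploading regardless of the result
p_pos = 0.05      # positive test rate in high-test countries (~5 %)

extra = p_upload * (1 - p_pos)  # random "positive" uploads from negatives
print(f"{extra / p_pos:.1%}")   # 9.5 %: the same order as the ~8 % above
```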

adewes commented 4 years ago

Sure, so what would be a good way to propose something more concrete? Write a technical description and possibly some example code and open a PR? I think this weekend I might have time to work through the details; I can publish e.g. a Jupyter notebook that explains the trade-offs and contains the relevant code.

FishmanL commented 4 years ago

(@adewes lmk if there's any way I can help, btw)

lbarman commented 4 years ago

Hmm, what would be helpful for everybody is (1) a short, detailed list of changes and (2) a comprehensive analysis of the implications: say we do X, what are the pros and cons (with "proof" that the cons do not destroy the accuracy of the system, in this case). One example would be concrete values for the probabilities and their overall effect on the system (# of false positives).

But in general we can parse any format ;) it's just that we have very limited bandwidth to think about proposals, so the more "digested" and ready they are, the easier :)

kollokollo commented 4 years ago

Is it necessary to send the ID automatically (so that an attacker could monitor this)? Why not display the ID on the screen of the smartphone and make the user re-enter it manually into a web interface or something like this? If the web interface is run in another browser, even on a different device (a laptop), it would be very hard to identify the person. Even the doctor (who performed the test) could handle this. How many IDs are we talking about in a typical case?

(The IDs could be displayed as a QR code and scanned by the doctor.)

lbarman commented 4 years ago

Hi @kollokollo, thanks. Yes, why not! This is a "manual proxy", which in my opinion falls under "the health authorities provide a trusted proxy to upload", which we have already discussed. May I redirect this discussion to https://github.com/DP-3T/documents/issues/206, which is more relevant.