DP-3T / documents

Decentralized Privacy-Preserving Proximity Tracing -- Documents
2.25k stars 180 forks source link

Proposal for a reduction in transfered data of design 2 #218

Open timoll opened 4 years ago

timoll commented 4 years ago

I have a proposal that could reduce the amount of downloaded data significantly. This works especially good for the proposal #66.

The Whitepaper makes an estimate of 140'000 contacts that need to be checked with 100 contacts per 15 minutes 24/7. This seems possible at a festival, disco or concert. It is reasonable to assume, that such events will be banned until new infections are really low.

As long as the infections are reasonably high, a reduction of 10 contacts per epoch seems reasonable.

The server could offer a second mode of delivering data should the case numbers be high.

A user can query the first n bits of each recorded contact. n is chosen that there is around a reasonable chance of a secret or EphID match but low enough to offer significant reduction in data transfer.

Each request would be 14'000 n bits and the response p 14'000 * 16 bytes where p is the probability of a new EphID matching n bits.

Each user would need to ask for the same EphIDs or secrets with each request. This allows fingerprinting and should be prevented by hashing all EphIDs or secrets each epoch.

lbarman commented 4 years ago

Hi @timoll, thanks for the input !

Indeed this is something that we were considering; it does reveal the first bits of each of your contacts to the server, which is not great but might be acceptable. Would you see how to quantify how many of these bits would be acceptable for privacy ?

timoll commented 4 years ago

The hashing of the secrets or EphIDs is to prevent linking a request to the next request.

I didn't consider the possibility that a re-identification could happen if multiple users ask for the same EphIDs and have some other common contacts.

With leakage over TCP/IP, there could be enough information to build a social graph if multiple users ask for the same set of EphIDs, especially if you aggregate the data over a longer period.

So this proposal would definitely not work for design 2.

However, because each secret of proposal #66 would be unique, this is not a problem for this design and would solve the higher bandwidth requirement of the design.

hitd010000 commented 4 years ago

I have a proposal that could reduce the amount of downloaded data significantly.

Not really understand why is a need of it ?

I assume, that *we divide download from upload servers. Because of that, download server could scaled up by a lot of reverse proxies and DNS round robin.

*mobile phone provider ( and other in case cellphone is connected to WLAN ) does not request paying for connections to contact tracking server. As long contact tracking is voluntary, the government should enforce a user not to has to pay for if using. Otherwise fewer user will use it.

timoll commented 4 years ago

Not really understand why is a need of it ?

As I have pointed out, this solution won't work with design 2. However as with #222 I hope that the DP-3T team will include the proposal #66 to prevent such attacks. The only downside of this proposal is, that every connection has a different secret and a higher bandwidth requirement. This proposal would solve the only downside of the secret sharing approach.

lbarman commented 4 years ago

@timoll something I'm missing: this does reveal the first n bits of my contacts to the backend server right ? if n is not extremely low, the server can match it against its (public) infected database and now can estimate whether the user doing this request is infected or not ?

timoll commented 4 years ago

My assumption would be that n is low enough that many people will ask for the same first n bits. I assume that this would be the case up to around 20 bits.

With an average of 10 contacts that are longer than 5 minutes per 15 minutes, we have to ask for around 14'000 contact events. If there were less, we would add some dummy data so that no leakage over the number of request is possible. If we assume 4 million app users in Switzerland this would mean around 56 billion requests. For 20 bits we would see around 53'000 requests, or 13.7 million for 12 bits.

For design 2, family members and close contacts will ask for the same EphIDs that they observed. This is leakage and may be used to build a social graph of such clusters. Especially if there is other leakage over TCP/IP. I would strongly recommend to not implement this proposal in a system where multiple close contacts ask for the same first n bits of the same identifier.

However, I want to point out that it still works for the proposal #66 as every secret is only shared between two contacts. There may be the possibility to combine TCP/IP leakage and frequent close contacts. To prevent this possibility, the bits of one secret could be rotated by 64 bits, using the broadcast G SA and G SB to decide who rotates the key. Now each request would be for an unique secret that nobody else asks for.

a8x9 commented 4 years ago

TL;DR: results are in bold from the middle of this post. This proposal significantly decreases the amount of transferred data with a minimal leakage of the contact graph.

@timoll This is a very interesting idea and could indeed greatly help with the larger data transfer needed in #66. The hashing of secrets between each query is a great addition to prevent linkability between different requests from the same user. Note that this "hashing epoch" does not need to be the same as the "EphID rotation epoch", i.e., if users query the backend at most once in each X hours window, then secrets can be rehashed by clients every X hours.

I've done some estimations using the following values:

Using these numbers, we can estimate the number of sent and received bytes per user when combining this desigh with #66. We assume that 2X >> N, i.e., the number of different prefixes is roughly the same as the number of local secrets.

We can also estimate the probability of the backend identifying an infected user based on their sent prefixes. We compare the difference of matching secrets between the case where only random secrets match, and the case where additional secrets of a contact match.

Note: I considered the difference of matching secrets as a representation of how much of the graph was leaked. The proper way to do it, is to estimate the statistical relevance of this difference (p-value) given the distribution of number of matching secrets. I can improve this post if the DP^3T team plans to include this proposal into their protocol.

The big caveat about the above computation, is the assumption that the server does not keep track of which IP sent which secret. Otherwise it is trivial to identify, in the case of constant contact, that a large number of matching secrets are linked to a single contact.

We can now do some comparison of the total amount of transfered data, and contact graph leak, in different scenarios.

Design 2 without cuckoo filter:

Design 2 with cuckoo filter:

ECDH design without cuckoo filter:

ECDH design with cuckoo filter:

This proposal + ECDH with 16 bits prefixes:

This proposal + ECDH with 20 bits prefixes:

This proposal + ECDH with 24 bits prefixes:

This proposal + ECDH with 32 bits prefixes:

In order to confirm that my above estimations are correct and that the few shortcuts I took (assuming no prefix collision to avoid binomial distribution) did not significantly change the results, I wrote a simulation script.

Thanks to this proposal from @timoll, the only drawback from #66 is now solved. If we chose a 20 bits prefix, the amount of transfered data would still be significantly lower than with design 2. The extremely low leakage of the contact graph is more than compensated by the benefits of having a protocol resistant against passive attackers and more resitant to replay attacks.

lbarman commented 4 years ago

Hi @a8x9, @timoll. Thanks for the detailed input, if we decide to go for this these kind of analysis are very helpful.

To play the devil's advocate (and understand better):

The proper way to do it, is to estimate the statistical relevance of this difference (p-value) given the distribution of number of matching secrets.

Wouldn't this give you an estimate/mean of the leakage rather than a worse-case ? Privacy is often not about the mean, but the worst case. Say I use your system, I connect to the server using my IP located in some specific rural village where we are not many users. Now instead of downloading public information on the global system, which leaks no information, I upload the X=20 first bits of my contacts for "efficient download" (the server sees my IP and rough location, naturally).

(1) since it contains only infected users, wouldn't this give the backend the ability to tell whether some specific contact of mine infected me (with probability X/len(sk_t)) and who (knowing at least X bits of the sk_t, and not all sk_t exists so there might even be one exact match - for which the server might also have the IP, location from its upload) ?

(2) wouldn't this give information to the backend to my social graph ? I agree, not if we manage to have "many users" for every query with X=20 first bits, but how could this be enforced ? It sounds like differential privacy, but here I'm not sure how we could give any guarantees.

But I agree that if X=1 or something like this, you divide by two the size of downloads for a (limited) leakage.

timoll commented 4 years ago

I think you are right, unless technology that hides the IP is used, this proposal has the risk of letting the server identify at risk persons.

Because data payload is dependant of the number of infected per country, it may make sense to implement design 1, 2 and ECDH. (I assume it is possible to use the same broadcast for all 3 designs)

The number of daily new infections can be used to determine which design should be used for the uploaded data.

a8x9 commented 4 years ago

Hi @lbarman, thank you for your response.

I might be looking at this problem from the wrong direction, so I'll try to explain what is my mental model of this proposal, and then you can tell me if I'm missing something obvious.

Now, here is my mental model of what happens seen from the server side.

Here the value of N is only important to determine the amount of downloaded data, a lower value does not reduce privacy or expose more information about the user. So I don't think the number of users requesting the same set of 20-bits prefixes is relevant to this model.

What I tried to computed with the "contact leak" number was, what would happen if a user was constantly in contact with an infected user during the last 14 days, and that all their secrets were somehow added after the bucket selection process. Would the server be able to detect the difference of number of secrets returned compared to the number of returned secrets when only selecting buckets?

But in practice, these shared secrets would be uniformely disstributed in the different buckets. So you wouldn't be able to differentiate a response containing a constant / frequent contact from a response containing only false positive until you start to have a high probability of empty bucket. Basically, the condition T >> 2X should be respected. When this condition is not respected anymore, the prefix length should be adapted.

Now there is the caveat I mentioned in the above post. If the server is malicious and attaches to each secret its uploader's IP, then in case of frequent / constant contact, one IP would be significantly more present in the buckets selected by this frequent contact, compared to a random selection.

Your current design assumes that the server does not keep track of the uploader's IP address. But if that changes, there might be possibilities to adapt this design to this threat, e.g., whitelisting of frequent contacts on the client side so they are not uploaded to the server in case of infection, or mixnet use during infection upload (similar to what ROBERT proposes at the top of p. 10).

Please tell me if I completely overlooked a way to differentiate between a request containing prefixes of a contact with an infected user, from a request containing only prefixes of non-relevant contacts.

timoll commented 4 years ago

I think a single case example can show a bit more.

Worst case is that 10% of the secret of a close contact of an infected person match the secrets of the infected person. The rest is a bit less than N*0.9/2^X. Where N is the Number of requested prefixes and X the length of the prefix.

So a close contact has a probability p=0.1+N*0.9/2^X for a N=4800 and X=16 this would mean p=0.167. A random person would have a match probability of p=N/2^X or 0.073

This difference will be very easy to detect with a 4800 sample, even for millions of requests.

To prevent this, you would need to split up the requests into small enough bits that can't be traced together.

a8x9 commented 4 years ago

@timoll Thank you for your response.

If I understand your calculation, you assume the caveat I put about a malicious backend is true, i.e., the backend keeps track of which infected secrets were uploaded by the same IP.

I think that assuming that the backend is malicious is indeed reasonable.

Regarding the counter-measures for this scenario that I mentioned in my previous post:

So yes, I completely agree, if the backend is malicious, then it can determine: IP1 requesting data is very likely to be a frequent contact of the infected user who uploaded via IP2.

peterboncz commented 4 years ago

Right, maybe by combining a number of measures the DL volume of https://github.com/DP-3T/documents/issues/66 can still be reduced pragmatically.

The idea to use a cuckoo filter to represent the keyset will save a factor 3, I understand

My proposal would be to partition the keys into several files (every file being a cuckoo filter), which also aids content distribution in CDNs.

Together cuckoo filters (x3), regions (x15?) and time (x4?) could reduce download volume by two orders of magnitude. This may make it comparable again in download volume to a simple EN / DP3T implementation (which could also adopt the region opt, BTW).

Still, even without any partitioning, if you live in a remote village with 10 houses, and the IPs are traceable to that, no scheme whatsoever will protect you from a malicious backend. The app is better for urban use.