Proposal for a reduction in transferred data of design 2

timoll commented 4 years ago

I have a proposal that could reduce the amount of downloaded data significantly. This works especially well for proposal #66.

The whitepaper estimates 140'000 contacts that need to be checked, assuming 100 contacts per 15 minutes, 24/7. This seems possible at a festival, disco or concert. It is reasonable to assume that such events will be banned until new infections are really low.

As long as infections are reasonably high, a reduction to 10 contacts per epoch seems reasonable.

The server could offer a second mode of delivering data should the case numbers be high.

A user can query the first n bits of each recorded contact. n is chosen so that there is a reasonable chance of a secret or EphID match, yet a chance low enough to offer a significant reduction in data transfer.

Each request would be 14'000 * n bits and the response p * 14'000 * 16 bytes, where p is the probability of a new EphID matching n bits.

Each user would need to ask for the same EphIDs or secrets with each request. This allows fingerprinting and should be prevented by hashing all EphIDs or secrets each epoch.

lbarman commented 4 years ago

Hi @timoll, thanks for the input!

Indeed, this is something that we were considering; it does reveal the first bits of each of your contacts to the server, which is not great but might be acceptable. Do you see a way to quantify how many of these bits would be acceptable for privacy?

timoll commented 4 years ago

The hashing of the secrets or EphIDs is to prevent linking a request to the next request.

I didn't consider the possibility that a re-identification could happen if multiple users ask for the same EphIDs and have some other common contacts.

With leakage over TCP/IP, there could be enough information to build a social graph if multiple users ask for the same set of EphIDs, especially if you aggregate the data over a longer period.

So this proposal would definitely not work for design 2.

However, because each secret of proposal #66 would be unique, this is not a problem for this design and would solve the higher bandwidth requirement of the design.

hitd010000 commented 4 years ago

> I have a proposal that could reduce the amount of downloaded data significantly.

I don't really understand why this is needed.

I assume that:

* we separate download servers from upload servers. The download servers could then be scaled up with many reverse proxies and DNS round robin.
* mobile phone providers (and others, in case the phone is connected to WLAN) do not charge for connections to the contact tracing server. As long as contact tracing is voluntary, the government should ensure that users do not have to pay for using it. Otherwise fewer users will use it.

timoll commented 4 years ago

> I don't really understand why this is needed.

As I have pointed out, this solution won't work with design 2. However, as with #222, I hope that the DP-3T team will include proposal #66 to prevent such attacks. The only downside of proposal #66 is that every contact has a different secret, which leads to a higher bandwidth requirement. The present proposal would solve this one downside of the secret sharing approach.

lbarman commented 4 years ago

@timoll something I'm missing: this does reveal the first n bits of my contacts to the backend server, right? If n is not extremely low, the server can match them against its (public) infected database and can then estimate whether the user making this request is infected or not?

timoll commented 4 years ago

My assumption would be that n is low enough that many people will ask for the same first n bits. I assume that this would be the case up to around 20 bits.

With an average of 10 contacts that are longer than 5 minutes per 15 minutes, we have to ask for around 14'000 contact events. If there were fewer, we would add some dummy data so that nothing leaks through the number of requests. If we assume 4 million app users in Switzerland, this would mean around 56 billion requested prefixes. For 20 bits we would see around 53'000 requests per prefix, or 13.7 million for 12 bits.
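These figures can be reproduced with a few lines of arithmetic. The following is only a sketch: the function name is made up for illustration, and it assumes that requested prefixes are uniformly distributed over all 2^n possible values.

```python
def requests_per_prefix(users: int, prefixes_per_user: int, n_bits: int) -> float:
    """Expected number of requests for one n-bit prefix.

    Assumption: prefixes are uniformly distributed over 2**n_bits values.
    """
    return users * prefixes_per_user / 2 ** n_bits


users = 4_000_000           # assumed app users in Switzerland
prefixes_per_user = 14_000  # contact events per user (padded with dummies)

print(users * prefixes_per_user)                                   # total requested prefixes
print(round(requests_per_prefix(users, prefixes_per_user, 20)))    # roughly 53'000
print(round(requests_per_prefix(users, prefixes_per_user, 12)))    # roughly 13.7 million
```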

For design 2, family members and close contacts will ask for the same EphIDs that they observed. This is leakage and may be used to build a social graph of such clusters, especially if there is other leakage over TCP/IP. I would strongly recommend not implementing this proposal in a system where multiple close contacts ask for the same first n bits of the same identifier.

However, I want to point out that it still works for proposal #66, as every secret is only shared between two contacts. It may still be possible to combine TCP/IP leakage and frequent close contacts. To prevent this, the bits of one secret could be rotated by 64 bits, using the broadcasts G * S_A and G * S_B to decide who rotates the key. Now each request would be for a unique secret that nobody else asks for.

a8x9 commented 4 years ago

TL;DR: results are in bold from the middle of this post. This proposal significantly decreases the amount of transferred data with a minimal leakage of the contact graph.

@timoll This is a very interesting idea and could indeed greatly help with the larger data transfer needed in #66. The hashing of secrets between each query is a great addition to prevent linkability between different requests from the same user. Note that this "hashing epoch" does not need to be the same as the "EphID rotation epoch", i.e., if users query the backend at most once in each X hours window, then secrets can be rehashed by clients every X hours.
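To make the rehashing idea concrete, here is a minimal sketch. It assumes one possible construction that is not specified anywhere in this thread: each secret is replaced by its truncated SHA-256 hash once per hashing epoch, and the prefix is simply the first X bits of the current value. The server would have to apply the same rehashing to uploaded secrets so that prefixes still line up. The function names are hypothetical.

```python
import hashlib


def rehash(secret: bytes) -> bytes:
    """Advance a 16-byte secret by one hashing epoch.

    Assumption: a plain hash chain using SHA-256 truncated to 16 bytes.
    """
    return hashlib.sha256(secret).digest()[:16]


def prefix(value: bytes, n_bits: int) -> int:
    """Return the first n_bits of value as an integer."""
    return int.from_bytes(value, "big") >> (len(value) * 8 - n_bits)


secret = bytes(range(16))        # stand-in for a shared secret
next_secret = rehash(secret)     # value used in the next hashing epoch

# The prefixes sent in two consecutive epochs are derived from different
# values, so the server cannot link them by simple comparison.
print(prefix(secret, 20), prefix(next_secret, 20))
```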

I've done some estimations using the following values:

new infections per day: 40K
contagious window: 5 days
local contact history: 14 days
epoch duration: 15 minutes
average close contacts per epoch: 10
prefix length: X bits
secret length: 16 bytes
cuckoo filter bits per entry: 48

Using these numbers, we can estimate the number of sent and received bytes per user when combining this design with #66. We assume that 2^X >> N, i.e., the number of different prefixes is roughly the same as the number of local secrets.

number of local secrets: N = 14 * 24 * 4 * 10 = 13440
sent data: (X / 8) * N = 1680 * X
server secrets per day: T = 40K * 5 * 24 * 4 * 10 = 192M
matching secrets on server per request: M = T / 2^X * N = 2580.48G / 2^X
downloaded data: M * 16 = 41287.68G / 2^X
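For anyone who wants to play with the parameters, the formulas above translate directly into a small function. This is just a sketch of the same arithmetic; the function and argument names are made up for illustration, and it keeps the 2^X >> N assumption.

```python
def traffic_estimate(prefix_bits: int,
                     new_infections_per_day: int = 40_000,
                     contagious_days: int = 5,
                     history_days: int = 14,
                     epochs_per_day: int = 24 * 4,
                     contacts_per_epoch: int = 10,
                     secret_bytes: int = 16) -> dict:
    """Per-user traffic estimate for a given prefix length X (in bits)."""
    # N: secrets stored locally over the contact history
    n_local = history_days * epochs_per_day * contacts_per_epoch
    # T: secrets uploaded to the server per day
    t_server = (new_infections_per_day * contagious_days
                * epochs_per_day * contacts_per_epoch)
    # M: server secrets expected to match one of the N requested prefixes
    matching = t_server / 2 ** prefix_bits * n_local
    return {
        "N": n_local,
        "T": t_server,
        "M": matching,
        "sent_bytes": prefix_bits / 8 * n_local,
        "downloaded_bytes": matching * secret_bytes,
    }


estimate = traffic_estimate(prefix_bits=20)
print(estimate)
```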

We can also estimate the probability of the backend identifying an infected user based on their sent prefixes. We compare the difference of matching secrets between the case where only random secrets match, and the case where additional secrets of a contact match.

Note: I considered the difference of matching secrets as a representation of how much of the graph was leaked. The proper way to do it is to estimate the statistical relevance of this difference (p-value) given the distribution of the number of matching secrets. I can improve this post if the DP^3T team plans to include this proposal into their protocol.

best case (1 contact with infected user): 1.0 - (M / (M + 1))
worst case (constant contact with the infected user): 1.0 - (M / (M + 14 * 24 * 4))
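The two cases above can be written as one helper. This is a sketch of the same ratio, not a proper statistical test; the function name is hypothetical and M is computed from the parameter list above.

```python
def contact_leak(matching: float, extra_secrets: int) -> float:
    """Share of the response that stems from real contacts with one infected user.

    matching: M, the number of secrets matching by chance.
    extra_secrets: additional matches caused by the real contact.
    """
    return 1.0 - matching / (matching + extra_secrets)


N = 14 * 24 * 4 * 10            # local secrets
T = 40_000 * 5 * 24 * 4 * 10    # server secrets per day
M = T / 2 ** 20 * N             # chance matches for a 20-bit prefix

best = contact_leak(M, 1)               # a single contact epoch
worst = contact_leak(M, 14 * 24 * 4)    # constant contact for 14 days
print(f"best case: {best:.6%}, worst case: {worst:.4%}")
```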

The big caveat about the above computation is the assumption that the server does not keep track of which IP sent which secret. Otherwise it is trivial to identify, in the case of constant contact, that a large number of matching secrets are linked to a single contact.

We can now compare the total amount of transferred data and the contact graph leak in different scenarios.

Design 2 without cuckoo filter:

Total data: 16 * 40K * 5 * 24 * 4 = 307.2 MB
Contact leak: 0%

Design 2 with cuckoo filter:

Total data: 48/8 * 40K * 5 * 24 * 4 = 115.2 MB
Contact leak: 0%

ECDH design without cuckoo filter:

Total data: 16 * 40K * 5 * 24 * 4 * 10 = 3072 MB
Contact leak: 0%

ECDH design with cuckoo filter:

Total data: 48/8 * 40K * 5 * 24 * 4 * 10 = 1152 MB
Contact leak: 0%

This proposal + ECDH with 16 bits prefixes:

Matching secrets: M = 2580.48G / 2¹⁶ ~= 39.375M
Total data: 1680 * 16 + 16 * 39.375M ~= 630 MB
Contact leak: 1.0 - (39.375M / (39.375M + 14 * 24 * 4)) ~= 0.0034 %

This proposal + ECDH with 20 bits prefixes:

Matching secrets: M = 2580.48G / 2²⁰ ~= 2.46M
Total data: 1680 * 20 + 16 * 2.46M ~= 39.41 MB
Contact leak: 1.0 - (2.46M / (2.46M + 14 * 24 * 4)) ~= 0.054 %

This proposal + ECDH with 24 bits prefixes:

Matching secrets: M = 2580.48G / 2²⁴ ~= 153.8K
Total data: 1680 * 24 + 16 * 153.8K ~= 2.5 MB
Contact leak: 1.0 - (153.8K / (153.8K + 14 * 24 * 4)) ~= 0.87 %

This proposal + ECDH with 32 bits prefixes:

Matching secrets: M = 2580.48G / 2³² ~= 601
Total data: 1680 * 32 + 16 * 601 ~= 63376 bytes
Contact leak: 1.0 - (601 / (601 + 14 * 24 * 4)) ~= 69.1 %

In order to confirm that my above estimations are correct and that the few shortcuts I took (assuming no prefix collision to avoid binomial distribution) did not significantly change the results, I wrote a simulation script.

Thanks to this proposal from @timoll, the only drawback of #66 is now solved. If we choose a 20-bit prefix, the amount of transferred data would still be significantly lower than with design 2. The extremely low leakage of the contact graph is more than compensated by the benefits of having a protocol resistant against passive attackers and more resistant to replay attacks.

lbarman commented 4 years ago

Hi @a8x9, @timoll. Thanks for the detailed input; if we decide to go for this, this kind of analysis is very helpful.

To play the devil's advocate (and understand better):

> The proper way to do it is to estimate the statistical relevance of this difference (p-value) given the distribution of the number of matching secrets.

Wouldn't this give you an estimate/mean of the leakage rather than a worst case? Privacy is often not about the mean, but the worst case. Say I use your system and connect to the server using my IP, located in some specific rural village where there are not many users. Now, instead of downloading public information on the global system, which leaks no information, I upload the first X=20 bits of my contacts for "efficient download" (the server sees my IP and rough location, naturally).

(1) Since the database contains only infected users, wouldn't this give the backend the ability to tell whether some specific contact of mine infected me (with probability X/len(sk_t)), and who (knowing at least X bits of the sk_t, and since not all sk_t exist, there might even be one exact match, for which the server might also have the IP and location from its upload)?

(2) Wouldn't this give the backend information about my social graph? I agree, not if we manage to have "many users" for every query with the first X=20 bits, but how could this be enforced? It sounds like differential privacy, but here I'm not sure how we could give any guarantees.

But I agree that if X=1 or something like this, you divide by two the size of downloads for a (limited) leakage.

timoll commented 4 years ago

I think you are right. Unless technology that hides the IP is used, this proposal carries the risk of letting the server identify at-risk persons.

Because the data payload depends on the number of infected people per country, it may make sense to implement designs 1, 2 and ECDH. (I assume it is possible to use the same broadcast for all 3 designs.)

The number of daily new infections can be used to determine which design should be used for the uploaded data.

a8x9 commented 4 years ago

Hi @lbarman, thank you for your response.

I might be looking at this problem from the wrong direction, so I'll try to explain what is my mental model of this proposal, and then you can tell me if I'm missing something obvious.

First, the above numbers are computed based on using this proposal with the ECDH key exchange design described in #66.
Each close contact produces a unique shared secret through ECDH key exchange.
In case of infection, users upload all their recorded shared secrets during the contagious window (~ 4800 secrets per infection using above numbers).
Users don't request a single prefix, but a set of prefixes corresponding to all their locally stored secrets.
Users never request the same set of prefixes, because secrets are rehashed at each "hashing epoch" as proposed by @timoll.
In this post I'll use numbers based on a 20-bits prefix length.
I'll assume that data is aggregated per day, but numbers can be adapted if the clients request data more frequently.

Now, here is my mental model of what happens seen from the server side.

Server has 2²⁰ buckets in which it puts secrets from newly reported infections based on their prefixes.
At the end of the 24h collection period, each bucket contains on average 183 secrets (192M / 2²⁰).
During the next 24h window, clients request N buckets corresponding to the prefixes of their locally stored secrets.
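As a toy illustration of this mental model (not of any actual backend), the server side could look like the sketch below. The class and method names are invented for the example, and secrets are assumed to be 16-byte values whose first X bits form the prefix.

```python
from collections import defaultdict


class BucketServer:
    """Toy model: uploaded secrets are grouped by their X-bit prefix."""

    def __init__(self, prefix_bits: int) -> None:
        self.prefix_bits = prefix_bits
        self.buckets = defaultdict(list)

    def prefix(self, secret: bytes) -> int:
        """First prefix_bits of the secret, as an integer bucket index."""
        return int.from_bytes(secret, "big") >> (len(secret) * 8 - self.prefix_bits)

    def upload(self, secrets) -> None:
        """Store the secrets of a newly reported infection."""
        for secret in secrets:
            self.buckets[self.prefix(secret)].append(secret)

    def query(self, prefixes) -> list:
        """Return the content of every requested bucket."""
        return [s for p in set(prefixes) for s in self.buckets.get(p, [])]


server = BucketServer(prefix_bits=8)
shared = b"\x2a" + b"\x01" * 15      # secret shared with an infected contact
unrelated = b"\x2a" + b"\x02" * 15   # same prefix, different secret
server.upload([shared, unrelated, b"\x99" * 16])

local_secrets = {shared, b"\x07" * 16}
response = server.query(server.prefix(s) for s in local_secrets)
matches = local_secrets & set(response)   # the exact comparison happens on the client
print(len(response), len(matches))
```

The client receives both secrets in bucket 0x2a but only recognises the one it actually shares, so the server never learns which of the returned secrets was a real match.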

Here the value of N is only important to determine the amount of downloaded data, a lower value does not reduce privacy or expose more information about the user. So I don't think the number of users requesting the same set of 20-bits prefixes is relevant to this model.

What I tried to compute with the "contact leak" number was what would happen if a user was constantly in contact with an infected user during the last 14 days, and all their secrets were somehow added after the bucket selection process. Would the server be able to detect the difference between the number of secrets returned and the number returned when only selecting buckets?

But in practice, these shared secrets would be uniformly distributed across the different buckets. So you wouldn't be able to differentiate a response containing a constant / frequent contact from a response containing only false positives until you start to have a high probability of empty buckets. Basically, the condition T >> 2^X should be respected. When this condition is no longer respected, the prefix length should be adapted.
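One way to check the T >> 2^X condition is to look at the expected share of empty buckets. The sketch below assumes that the T secrets fall uniformly and independently into the 2^X buckets, so a given bucket stays empty with probability (1 - 2^-X)^T; the function name is made up for the example.

```python
import math


def empty_bucket_fraction(server_secrets: int, prefix_bits: int) -> float:
    """Expected fraction of empty buckets for T secrets and 2**X buckets.

    Assumption: secrets are uniformly and independently distributed.
    """
    return math.exp(server_secrets * math.log1p(-1.0 / 2 ** prefix_bits))


T = 192_000_000  # server secrets per day, from the estimate above
for bits in (16, 20, 24, 28, 32):
    print(bits, empty_bucket_fraction(T, bits))
```

With these numbers, empty buckets are essentially non-existent for 20 bits, while for 32 bits almost every bucket is empty, which is where a non-empty response starts to stand out.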

Now there is the caveat I mentioned in the above post. If the server is malicious and attaches to each secret its uploader's IP, then in case of frequent / constant contact, one IP would be significantly more present in the buckets selected by this frequent contact, compared to a random selection.

Your current design assumes that the server does not keep track of the uploader's IP address. But if that changes, there might be possibilities to adapt this design to this threat, e.g., whitelisting of frequent contacts on the client side so they are not uploaded to the server in case of infection, or mixnet use during infection upload (similar to what ROBERT proposes at the top of p. 10).

Please tell me if I completely overlooked a way to differentiate between a request containing prefixes of a contact with an infected user, from a request containing only prefixes of non-relevant contacts.

timoll commented 4 years ago

I think a single case example can show a bit more.

The worst case is that 10% of the secrets of a close contact of an infected person match the secrets of the infected person. The rest match with a probability of a bit less than N*0.9/2^X, where N is the number of requested prefixes and X the length of the prefix.

So a close contact has a match probability of p=0.1+N*0.9/2^X; for N=4800 and X=16 this would mean p=0.166. A random person would have a match probability of p=N/2^X, or 0.073.

This difference will be very easy to detect with a 4800 sample, even for millions of requests.
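A rough way to see how large this difference is: treat each of the 4800 secrets as an independent trial and express the gap in standard deviations using a normal approximation. Both the independence assumption and the helper names below are simplifications for illustration only.

```python
import math


def match_probabilities(n_prefixes: int, prefix_bits: int, contact_share: float = 0.1):
    """Return (p_contact, p_random) as defined in the example above."""
    p_random = n_prefixes / 2 ** prefix_bits
    p_contact = contact_share + (1 - contact_share) * p_random
    return p_contact, p_random


def z_score(p_observed: float, p_null: float, samples: int) -> float:
    """Distance of p_observed from p_null in standard deviations (normal approximation)."""
    return (p_observed - p_null) * math.sqrt(samples) / math.sqrt(p_null * (1 - p_null))


p_contact, p_random = match_probabilities(n_prefixes=4800, prefix_bits=16)
z = z_score(p_contact, p_random, samples=4800)
print(f"p_contact={p_contact:.3f}, p_random={p_random:.3f}, z={z:.1f}")
```

Under these assumptions the close contact sits well over 20 standard deviations away from a random requester, which is why the difference remains detectable even among millions of requests.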

To prevent this, you would need to split up the requests into small enough bits that can't be traced together.

a8x9 commented 4 years ago

@timoll Thank you for your response.

If I understand your calculation, you assume the caveat I put about a malicious backend is true, i.e., the backend keeps track of which infected secrets were uploaded by the same IP.

I think that assuming that the backend is malicious is indeed reasonable.

Regarding the counter-measures for this scenario that I mentioned in my previous post:

I don't think the mixnet approach (slowly uploading the infected secrets from different IPs in order to break the IP to infected user link) is doable when used in combination with an authorization code. The code can be used to re-link together the secrets from the same user.
I also don't think that the frequent contact whitelisting on the client side is realistic. Even if the procedure was as simple as a QR code scan, I doubt most people would make the effort to scan their family's, friends' and colleagues' codes.

So yes, I completely agree, if the backend is malicious, then it can determine: IP₁ requesting data is very likely to be a frequent contact of the infected user who uploaded via IP₂.

peterboncz commented 4 years ago

Right, maybe by combining a number of measures the DL volume of https://github.com/DP-3T/documents/issues/66 can still be reduced pragmatically.

The idea to use a cuckoo filter to represent the keyset will save a factor of 3, as I understand it.

My proposal would be to partition the keys into several files (every file being a cuckoo filter), which also aids content distribution in CDNs.

In order to restrict the amount of info to download, one could regionalise the country (at a coarse level such as provinces), and give users the option to select the provinces from which to download diagnosis keys. The app would explain that you would miss notifications if you travel to a province without selecting it in that screen, but that excluding provinces saves network bandwidth and battery. Because IPs being monitored by the malicious backend are already known to be in the same region, this partitioning leaks no additional information.
We could send an unencrypted time-of-day ID (e.g. in hours) in the BLE packet, and upload the keys into a partition corresponding to the hour. The phones only download the cuckoo filters of the relevant hours. This admittedly leaks some privacy in sparse situations; it could be partly counteracted by uploading some fake keys such that the hour distribution of the uploader becomes more uniform.

Together, cuckoo filters (x3), regions (x15?) and time (x4?) could reduce download volume by two orders of magnitude. This may make it comparable again in download volume to a simple EN / DP3T implementation (which could also adopt the region option, BTW).

Still, even without any partitioning, if you live in a remote village with 10 houses, and the IPs are traceable to that, no scheme whatsoever will protect you from a malicious backend. The app is better for urban use.

DP-3T / documents

Proposal for a reduction in transferred data of design 2 #218