DP-3T / dp3t-sdk-backend

The backend implementation for DP3T
Mozilla Public License 2.0

Performance considerations of the list distribution #2

Closed (Vanuan closed this 4 years ago)

Vanuan commented 4 years ago

Is there any design document outlining the different approaches that were considered with respect to the performance of the list distribution? For example:

Are there any estimates of the potential traffic requirements?

The original comic suggests that a "cuckoo filter" could be used. Any comments on that?

jeffallen commented 4 years ago

The current implementation produces an ETag based on the pk_exposed_id of the last item in the exposed list. This assumes that the database always returns the rows in the same order. The SQL includes "order by pk_exposed_id".

This is a good strategy for cacheability, because the ETag changes during the day as more exposed IDs are added and then stops changing at midnight.

However, when reviewing the code regarding cacheability, I did notice that the select from the database (dataService.getSortedExposedForDay) happens before the ETag check, so successful ETag-based caching does not reduce the database load at all.

    @RequestMapping(value = "/exposed/{dayDateStr}")
    public @ResponseBody ResponseEntity<ExposedOverview> getExposed(
            @PathVariable
            @Documentation(description = "The date for which we want to get the SecretKey.", example = "2019-01-31")
            String dayDateStr, WebRequest request) {
        DateTime dayDate = DAY_DATE_FORMATTER.parseDateTime(dayDateStr);
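        // NOTE: the full per-day list is loaded from the database here, before the ETag check below.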
        List<Exposee> exposeeList = dataService.getSortedExposedForDay(dayDate);
        int max = exposeeList.isEmpty() ? 0 : exposeeList.get(exposeeList.size() - 1).getId();
        ExposedOverview overview = new ExposedOverview(exposeeList);
        String etag = etagGenerator.getEtag(max);
        if (request.checkNotModified(etag)) {
            return ResponseEntity.status(HttpStatus.NOT_MODIFIED).build();
        } else {
            return ResponseEntity.ok().cacheControl(CacheControl.maxAge(Duration.ofMinutes(exposedListCacheContol))).body(overview);
        }
    }

In the above code, the ETag should be calculated in a lightweight way ("select max(pk_exposed_id)", or perhaps a value cached in the server), and the expensive call to getSortedExposedForDay should happen only inside the else branch.
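Something along these lines would do it (a rough sketch only, reusing the fields from the snippet above; getMaxExposedIdForDay is a hypothetical lightweight DAO method, not something that exists in the SDK today):

    @RequestMapping(value = "/exposed/{dayDateStr}")
    public @ResponseBody ResponseEntity<ExposedOverview> getExposed(@PathVariable String dayDateStr,
            WebRequest request) {
        DateTime dayDate = DAY_DATE_FORMATTER.parseDateTime(dayDateStr);

        // Cheap query: only the highest pk_exposed_id for the day (hypothetical DAO method).
        int max = dataService.getMaxExposedIdForDay(dayDate);
        String etag = etagGenerator.getEtag(max);

        if (request.checkNotModified(etag)) {
            // Cache hit: no list is loaded from the database at all.
            return ResponseEntity.status(HttpStatus.NOT_MODIFIED).build();
        } else {
            // The expensive query only runs when the client's copy is stale.
            List<Exposee> exposeeList = dataService.getSortedExposedForDay(dayDate);
            ExposedOverview overview = new ExposedOverview(exposeeList);
            return ResponseEntity.ok()
                    .cacheControl(CacheControl.maxAge(Duration.ofMinutes(exposedListCacheContol)))
                    .body(overview);
        }
    }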

jeffallen commented 4 years ago

I was also surprised to see that the only format for the exposed list is JSON, which dramatically increases the size of this essentially binary data (the majority of the returned data are the secret keys, which must be encoded in hex, and are not good candidates for gzip).

@ubamrein, how would you feel about a PR that arranged for the output to be MsgPack-encoded when there's an "Accept: application/binary" header in the request?

(And then corresponding later PRs on the SDKs to pull via MsgPack as well.)
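For what it's worth, a rough sketch of what I have in mind (using jackson-dataformat-msgpack; the application/binary media type and the configuration class name are just placeholders, not an agreed API):

    import java.util.Collections;
    import java.util.List;

    import org.msgpack.jackson.dataformat.MessagePackFactory;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.http.MediaType;
    import org.springframework.http.converter.HttpMessageConverter;
    import org.springframework.http.converter.json.MappingJackson2HttpMessageConverter;
    import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

    import com.fasterxml.jackson.databind.ObjectMapper;

    @Configuration
    public class MsgPackConfig implements WebMvcConfigurer {

        @Override
        public void extendMessageConverters(List<HttpMessageConverter<?>> converters) {
            // Jackson with a MessagePack backend: same object model, binary wire format.
            ObjectMapper msgPackMapper = new ObjectMapper(new MessagePackFactory());
            MappingJackson2HttpMessageConverter converter =
                    new MappingJackson2HttpMessageConverter(msgPackMapper);
            converter.setSupportedMediaTypes(
                    Collections.singletonList(MediaType.parseMediaType("application/binary")));
            // Added alongside the existing JSON converter, so plain JSON clients keep working.
            converters.add(converter);
        }
    }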

ubamrein commented 4 years ago

@jeffallen you are totally right, we missed the ordering of the ETag check. This will be fixed soon.

Regarding the data format:

We are currently trying to define a good way of signing the response. Hence we will move away from JSON anyway, towards something with a well-defined standard for normalisation, i.e. a canonical form to sign (e.g. COSE). So I'm afraid the PR, although very much appreciated, could not be used.

If you have any suggestions for nice and simple signing of the key list, we could discuss that further, but I would open another PR specifically to enable signing.

@Vanuan We considered sending push notifications, but have not included them in the SDK yet, since we felt this should be part of an app-specific backend (everyone needs their own push server, with its own secrets and so on).

The load on the backend should not be too high, though. As shown in our diagram, we plan to put a CDN in front of the web service. Together with the caching headers (at the moment defaulting to 5 minutes) and the ETag, we think we can offload most of the traffic to the CDN.

Since we separate the data into per-day lists, and since previous days (at least in the current design) won't change, we always provide the full list for a day whenever it changes, but a client only ever has to reload the list for the current day. Ideally the payload should not be too large, especially if a more compact representation is used.

Regarding the cuckoo filters: we are currently working on multiple fronts, and our first iteration aimed at providing a low-effort approach without such filters. As soon as this implementation is further improved, we will eventually include the cuckoo filters.

jeffallen commented 4 years ago

In https://github.com/dedis/cothority (a framework for experimenting with decentralisation) we use protobuf to marshal in a canonical way before signing. Protobuf has proven stable and a good fit for this use (except maps, which are unordered and thus cause signature verification failures if not handled carefully).
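As an illustration only (ExposedListProto and its fields are made up for this sketch, and ECDSA is just an example algorithm, not a proposal for the SDK):

    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import java.security.Signature;
    import java.util.Collections;
    import java.util.List;

    import com.google.protobuf.ByteString;

    public class SignedListSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical protobuf-generated message holding the day's keys.
            ExposedListProto list = ExposedListProto.newBuilder()
                    .addAllKeys(loadKeysForDay())
                    .build();

            // The protobuf wire encoding of this message is what gets signed and shipped;
            // clients verify over the exact bytes they received, then parseFrom(...) them.
            byte[] encoded = list.toByteArray();

            KeyPair keyPair = KeyPairGenerator.getInstance("EC").generateKeyPair();
            Signature signer = Signature.getInstance("SHA256withECDSA");
            signer.initSign(keyPair.getPrivate());
            signer.update(encoded);
            byte[] signature = signer.sign();
        }

        private static List<ByteString> loadKeysForDay() {
            return Collections.emptyList(); // placeholder
        }
    }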

ralfr commented 4 years ago

Have you guys ever contemplated the idea of using the BitTorrent protocol and working with signed blobs?

Vanuan commented 4 years ago

It would be nice to see some real-world numbers to see whether there is any issue at all.

Global new daily cases are currently close to 100k. Let's say DP-3T is deployed in an EU country with about 3,000 new cases per day. Let's consider a scenario where tracing is mandatory. We only store 14 days of historic cases, so that's 42k in total. One Exposee is probably 128 bytes at most. So the initial download is about 5 MiB, and the daily download is about 375 KiB.

The number of users could be in the tens of millions. Let's say an average download takes about 5 seconds and downloads are spread throughout the day. It seems the load on the server would be around 500-3,000 requests per second. But maybe it's much faster after the initial download and takes only 1 second; even then, the minimum load is about 100 requests per second.
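For reference, the arithmetic behind these figures, under explicit assumptions (128 bytes per Exposee; roughly 40 million users each polling once a day, spread evenly over 86,400 seconds; these are assumptions, not measured numbers):

    42,000 keys × 128 B = 5,376,000 B ≈ 5.1 MiB   (initial 14-day download)
     3,000 keys × 128 B =   384,000 B ≈ 375 KiB   (daily download)
    40,000,000 requests / 86,400 s ≈ 460 requests/s on average
    460 requests/s × ~5 s per download ≈ 2,300 downloads in flight at any moment

More users, or downloads bunched into part of the day, push this towards the upper end of the range above.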

But this estimate doesn't account for a peak infection rate, which would probably be much higher. And we're not even talking about cross-border distribution yet.

To put it into perspective: Twitter handles about 6,000 tweets per second, and around 3,000 images are downloaded per second.

ubamrein commented 4 years ago

The question is which server actually sees this load. The CDN should honour the caching headers and ETags. Hence, ideally (the expiry headers are currently set to 5 minutes), the CDN would only make a couple of requests to the actual web service every 5 minutes.
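To make that concrete (assuming, say, 100 CDN edge locations, each caching a day's list independently with the 5-minute TTL): each edge revalidates a given list at most 288 times per day, so the origin sees on the order of 100 × 288 ≈ 29,000 requests per day for that list, i.e. well under one request per second on average, regardless of how many clients are polling. Revalidations that match the ETag are answered with a 304 and no body.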

Vanuan commented 4 years ago

@ubamrein

> Since we separate the data into per-day lists, and since previous days (at least in the current design) won't change, we always provide the full list for a day whenever it changes, but a client only ever has to reload the list for the current day.

What's the date in this case? Is it the day of upload or the date of the initial secret key?

Vanuan commented 4 years ago

Another question: how often is the list for the current day updated and distributed?

ralfr commented 4 years ago

> Another question: how often is the list for the current day updated and distributed?

In discussions I've had elsewhere, the consensus was that once a day would be the minimum, based on mathematical models. I don't have a source for that at hand right now.

thomasmueller commented 4 years ago

> As soon as this implementation is further improved, we will eventually include the cuckoo filters.

I think it's a very good plan to wait on the approximate membership filter. It is an optimization that may not actually be necessary. Once you plan to implement it, please let me know; I would be glad to help. I'm a co-author of the paper "Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters" and of various filter implementations. I guess you could use a Bloom filter, cuckoo filter, or xor filter. The xor+ filter would be about 10% smaller than the cuckoo filter. It is immutable, but for your use case that would be fine (you can regenerate it once a day or so). The cuckoo filter is mutable, but there is a risk of insertion failure, and the code is more complex.

dirkx commented 4 years ago

@thomasmueller I am not sure that at country scale (i.e. many millions of citizens, a few percent infected, etc.) xor is actually smaller; I cannot get it to work with NL- and DE-level numbers. Cuckoo seems to be a reasonable first cut, can use the raw hash/key, and is quick on the check side.

thomasmueller commented 4 years ago

@dirkx I would love to discuss the details about Bloom / cuckoo / xor filter, maybe there is a better place than here?

Bloom filter

Cuckoo filter

Xor filters

I don't think you care about lookup speed (at all), or about the ability to remove entries. I would initially go for Bloom filters: they are quite easy to implement and get you 95% of the way there. If you still need a 10-20% smaller size, then I would see whether specially configured Bloom filters (smaller k, compressed for transfer) or xor filters can help. I would keep things simple first.
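To illustrate, a minimal sketch using Guava's BloomFilter over raw EphID bytes (the 1% false-positive rate is only for the example; the real target would need to be far lower, as discussed further down):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.util.List;

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    public class EphIdFilterSketch {
        public static void main(String[] args) throws Exception {
            List<byte[]> publishedEphIds = List.of(new byte[16], new byte[16]); // placeholder EphIDs

            // Server side: build the day's filter from the published EphIDs.
            BloomFilter<byte[]> filter =
                    BloomFilter.create(Funnels.byteArrayFunnel(), publishedEphIds.size(), 0.01);
            publishedEphIds.forEach(filter::put);

            // Serialize for distribution (what clients would download instead of the raw list).
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            filter.writeTo(out);

            // Client side: deserialize and test locally observed EphIDs.
            BloomFilter<byte[]> downloaded = BloomFilter.readFrom(
                    new ByteArrayInputStream(out.toByteArray()), Funnels.byteArrayFunnel());
            boolean possiblyExposed = downloaded.mightContain(new byte[16]);
        }
    }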

For large sets, I think you need to partition the data anyway, for example by region (e.g. "Bundesland" in Germany). Of course this is a slight privacy concern, but I doubt it's an issue for most people. Data from infected people near a border, or who travelled, needs to be added to multiple regions (or countries). I would just go for a Bloom filter and make regions as small as needed for this to work well.

Different topic: what I consider (and can't find in the docs) is storing "meeting hashes" instead of EphIDs. A meeting hash would be the hash of the EphIDs of the two parties meeting each other. That way you can prevent attacks by silent parties that just listen for EphIDs (without sending). Sure, it results in more data... Also, I wouldn't switch the EphID on a fixed schedule, but on a random schedule, to improve privacy. If you switch the EphID just once a day on a fixed schedule, it's easier to track someone based on repeated travel patterns.
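A minimal sketch of what I mean by a meeting hash (my own illustration, not anything from the DP-3T documents; sorting the two EphIDs just makes the hash independent of who computes it):

    import java.security.MessageDigest;
    import java.util.Arrays;

    public class MeetingHashSketch {
        // Both parties can compute the same value from the two EphIDs they exchanged.
        static byte[] meetingHash(byte[] ephIdA, byte[] ephIdB) throws Exception {
            // Order the inputs so that hash(A, B) and hash(B, A) are identical.
            byte[] first = Arrays.compareUnsigned(ephIdA, ephIdB) <= 0 ? ephIdA : ephIdB;
            byte[] second = (first == ephIdA) ? ephIdB : ephIdA;

            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            sha256.update(first);
            sha256.update(second);
            return sha256.digest();
        }
    }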

ghost commented 4 years ago

Aye, and to add: Bloom's inability to delete is irrelevant in this case; the backend rebuilds the filter every 24 hours (or a set of them, for efficient download by people 1, 2, 4 or 8 days behind).

Cuckoo, I would add, makes it harder to reconstruct keys. I personally found it easy to implement but harder to optimise. But it's easier to get under 20% than with Bloom, and it's smaller at the sizes we are talking about for NL and DE as countries.

Xor: immutability is not a concern in this model, and neither is construction complexity. My gut feeling, based on simulations, is that I always end up over 16 bits, so xor jumps to 32 bits.

I am not too concerned about partitioning; we can do a BF/CF/xor and then, on a match, fetch 1/256th of the full list, with any extra metadata.

On the different topic: the crux behind not sending those, but rather a proof that you could generate them, is that you cannot easily introduce smearing (e.g. cause a whole church, noisy bar or political group pain, even if you are the government or a state actor).

But yes, the Apple/Google and PACT versions are indeed in that realm.

ghost commented 4 years ago

Do you have a xor implementation I can compare with mine? Just to see if I can get better numbers with yours than with the CF I have at https://github.com/dirkx/DP-3T-Documents/tree/editable-version/impl/design-2-openssl-C

thomasmueller commented 4 years ago

@dirkx-reno-2020 Sure, the easiest is probably https://github.com/FastFilter/xor_singleheader. It only supports 8- and 16-bit fingerprints. Keys are 64-bit and need to be unique; I wonder if that's the case? (By the way, you can ignore the fuse filters; they are work in progress.)

A different implementation is https://github.com/FastFilter/fastfilter_cpp, which has lots of filter implementations, including cuckoo filters and many Bloom filter variants. This is the research project.
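If the raw identifiers are longer than 64 bits, one way to get unique-enough 64-bit keys (my assumption about how this would be wired up, not part of either library) is to truncate a cryptographic hash:

    import java.nio.ByteBuffer;
    import java.security.MessageDigest;

    public class Key64Sketch {
        // Derive a 64-bit key from a (e.g. 128-bit) EphID; at tens of thousands of keys per day
        // collisions are negligible, but they should still be checked at filter-build time.
        static long toKey64(byte[] ephId) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(ephId);
            return ByteBuffer.wrap(digest).getLong(); // first 8 bytes of the hash
        }
    }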

ghost commented 4 years ago

So I used the latter, and had CF come out best for NL and DE population sizes (with the 14-day assumption). I'll check 16 bits, but I suspect it is not enough (because if more than 5% of users have to fetch extra data, you lose). Yes, keys are unique; they are lovely hashes going in, no need to hash them further cryptographically (we had a discussion on that in the DP-3T whitepaper issues).

wouterl commented 4 years ago

Great discussion! Just a small remark: for the low-cost DP-3T design (which is what the SDK currently implements), we cannot compress the list, as we need the raw keys to roll them forward.

Of course, if you go with the unlinkable design, you can encode all the EphIDs in a filter. We suggested a cuckoo filter, but other filters would work as well. Now, the key constraint here is that you want your false-positive rate to be very low. For now I would argue against using a much higher FPR and then doing follow-up lookups: these lookups leak information to the server, and it is non-trivial to analyze this leakage.

The same holds for prefix techniques where you reveal some bits of your query to optimize download cost. These are standard techniques, but they all leak private information.

thomasmueller commented 4 years ago

@dirkx-reno-2020 thanks! I will have a look at the cuckoo filter implementation in the next few days.

> we cannot compress the list, as we need the raw keys to roll them forward.

@wouterl Well, you can still compress the list a bit if you can sort the values (which I assume you can), for example using a Golomb-compressed sequence (GCS). One implementation is here. GCS is usually used like a Bloom filter, but you can keep all the bits. GCS was used e.g. by Google Chrome to protect users from malicious sites (not sure if it's still used; I think not).
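Roughly what I mean (a toy Golomb-Rice encoder over sorted hash values; the parameter choice and the bit sink are simplified for the sketch):

    import java.util.Arrays;

    public class GolombSketch {
        // Encode sorted, distinct values as Golomb-Rice codes of their gaps.
        // rice = number of remainder bits; a reasonable choice is log2(average gap).
        static StringBuilder encode(long[] sortedValues, int rice) {
            StringBuilder bits = new StringBuilder(); // toy bit sink; real code would pack bits
            long previous = 0;
            for (long value : sortedValues) {
                long gap = value - previous;
                previous = value;
                long quotient = gap >>> rice;
                for (long i = 0; i < quotient; i++) {
                    bits.append('1');                 // unary part
                }
                bits.append('0');
                for (int b = rice - 1; b >= 0; b--) { // fixed-width remainder, MSB first
                    bits.append((gap >>> b) & 1L);
                }
            }
            return bits;
        }

        public static void main(String[] args) {
            long[] values = {3, 10, 17, 60};          // placeholder sorted hashes
            Arrays.sort(values);
            System.out.println(encode(values, 3).length() + " bits");
        }
    }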

> other filters would work as well

Yes. I would probably try a Bloom filter first, as the implementation is simpler and lets you configure it precisely (exact size and k). If you only need compression during transfer (which I think you do), then a low k combined with e.g. arithmetic coding works very well.

I fully agree on the low false-positive rate and the information leaks. You want to download the whole filter, similar to how we did the password filter; you don't want server lookups. You will need to reveal some information, namely the region where you live or that you visited, but that's hard to avoid: you leak location info anyway via IP address / cell-phone location.

ghost commented 4 years ago

@thomasmueller We tried this sorting, but at NL and DE numbers there is no meaningful compression as far as we can see. Perhaps there is at a pan-European level, but that is too large (and I doubt it: even if 90% of Europe were sick, the keys are too far apart; the key space is just huge compared to the world headcount).

@wouterl For useful numbers (again DE and NL population size) it seems that a cuckoo filter with a cycle rate of > 6 hours and < 14 days (so about 50 times the keys worst case at the end of a wave) versus just the keys is still not a bad trade-off. It appears that in the worst case they are about equal, but in the run-up and run-down of a wave it's much better, quicker to compare, and reveals fewer bits.

@wouterl False positives are very cheap to resolve: in a lot of schemes (TCN, PACT), and possibly in any realistic DP-3T scheme, additional metadata may have to be fetched anyway. So < 1% of the phones needs to fetch an additional 1/256th or 1/1024th payload package (if you want to hide their match), or just those few groups of 10-30 bytes.

Keep in mind that for the app to work and be relatively useful to society, relatively few people need to be sick. It rather loses its benefit if it tells 20% of the population on days 1, 2 and 3 to stay home.

thomasmueller commented 4 years ago

Additional metadata: to improve privacy, I would try to avoid additional backend fetches. Can you tell me what this metadata is, and how much of it is needed? The metadata could be integrated into the cuckoo filter, so that no additional backend fetch is needed: each entry in the cuckoo filter would contain both the fingerprint and the metadata (as payload). This is the retrieval problem. On a lookup, the cuckoo filter would return all candidate entries (up to 8 for a standard cuckoo filter), and the client would compare their fingerprints with the queried fingerprint and extract the metadata from the matching record (if there is one). With an xor filter, only one entry is returned.

There are other ways to store the data: instead of an approximate membership filter, a minimal perfect hash function (MPHF) of the keys is generated, which needs about 1.8 bits per key. Then an array or list of fingerprints is needed, plus an array or list of metadata; a lookup computes the array/list index using the MPHF. Such a data structure would even allow fingerprints and metadata of variable size. I would love to explain this better. It would need a bit less space than a cuckoo filter.

Compressing via sorting: you are right, if the keys are too far apart the compression rate is very low, so it doesn't make sense. I would still sort the list, to allow binary search, or even interpolation search for O(log log n) lookups.

Sure, tracing is only effective if few people are sick. But the amount of data to be stored depends on multiple factors (number of sick people in the region, amount of data per sick person). Generally, better privacy requires more data per sick person.

ghost commented 4 years ago

So the papers of PACT, TCN and the Austrian approach show a few tens of bytes of extra data. If you assume cross-border operation, that may double to 30-60 bytes. So it could be put in the CF (or another filter), simply increasing the size by Nseeds × 10-60 bytes, which is considerable.

Given the numbers involved (extremely few of the clients in the field need this extra data; very few people are found to be infected each day, and if that's not the case, R is too high and the app is useless within days), I could easily see the follow-up being a fetch of 1/256th of the entire metadata list (e.g. keyed on the first byte of the found key). The < 1% that has to do a follow-up fetch will typically need just one, and then only discloses the first 8 bits of their match.
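For concreteness (the endpoint path below is made up; the point is only that the bucket index comes from the first byte of the matched key, so at most 8 bits are revealed):

    public class MetadataFetchSketch {
        // matchedKey is the exposed key that the local filter lookup hit on.
        static String metadataUrl(String dayDateStr, byte[] matchedKey) {
            int bucket = matchedKey[0] & 0xFF;          // 0..255, i.e. 1/256th of the list
            return "/v1/exposed/" + dayDateStr + "/metadata/" + bucket; // hypothetical endpoint
        }
    }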

Or just take the hit and have 1% of your population fetch the full metadata list each day in countries like NL and DE, but be gentler in, say, Spain, given local infrastructure constraints.

thomasmueller commented 4 years ago

@dirkx-reno-2020 Thanks a lot! Yes, 10-60 bytes of metadata is quite a lot... and yes, the data needs to be partitioned in some way, and some information is revealed by that.

One way is by region (e.g. provinces); in my view that would make sense: clients would subscribe to multiple provinces, matching where they were at the time, so you reveal which regions you were in. Another partitioning scheme is by a few bits of some hash, but I don't see how that would be possible here... and I'm not sure it would be better than partitioning by region. Personally, I would prefer partitioning by region.

If you only download metadata in case of a match, there is a risk that you reveal you (probably) came into contact with an infected person, which may or may not be OK. Sure, you could somewhat hide that fact by also downloading the full metadata with some probability even when you don't need it. But personally I would try to avoid this approach; it just feels wrong to me.