ahupowerdns / covid-backend

Backend for privacy preserving corona contact tracing

Some initial calculations #10

Open spycrowsoft opened 4 years ago

spycrowsoft commented 4 years ago

We need to distribute daily tracking keys (dtks) of all infected users.

If we assume that 500k persons will be infected simultaneously (which is about 1/34 of the population of The Netherlands), we will receive (46 * 20 * 500 * 10^3) / (1024^2) = about 438 MebiBytes of keys for their entire disease period (46 days, 20 bytes per dtk, 500 * 10^3 persons).

If we assume that (1/46-th) of that amount of data comes in each day as new dtks, that still means we will have an influx of (20 * 500 * 10^3) / (1024^2) = about 10 MebiBytes of new key material that we have to distribute each day.

These numbers are small enough to handle with a simple combination of nginx / apache, PHP and MySQL on the write-side of the backend.

On the read-side of the backend we will have to serve these 10 new MebiBytes each day to all 17M smartphones in The Netherlands.

This means that we have an output of ((20 * 500 * 10^3) * (17 * 10^6)) / (1024^4) = about 154 TebiBytes every 24 hours! Which roughly translates to an output rate of ((20 * 500 * 10^3) * (17 * 10^6)) / (3600 * 24 * 1024^3) = about 2 GibiBytes per second!

This is ((20 * 500 * 10^3) * (17 * 10^6) * 8) / (3600 * 24 * 1000^3) = about 16 gigabits/second.

If we use the current situation as a benchmark (this is a bad idea, because we should expect underreporting) and assume that only 20k infections will be active at the same time, the numbers look like this:

Daily new dtks: (20 * 20 * 10^3) / 1024 = about 390 KibiBytes.
Daily data output: ((20 * 20 * 10^3) * (17 * 10^6)) / (1024^4) = about 6 TebiBytes.
Average output rate: ((20 * 20 * 10^3) * (17 * 10^6)) / (3600 * 24 * 1024^2) = about 75 MebiBytes per second.

Or in megabits/second: ((20 * 20 * 10^3) * (17 * 10^6) * 8) / (3600 * 24 * 1000^2) = about 630 megabits/second.

So this is doable, but I think we need about 12 servers, each with its own 1 gigabit/second connection for the read side.
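For reference, a quick sketch of the arithmetic above, parameterised on the number of simultaneously infected persons (20 bytes per dtk, a 46-day disease period and 17M smartphones, as used in the calculations above):

```python
# Back-of-envelope sizing for dtk distribution, using the figures from this thread.
DTK_BYTES = 20          # 16-byte key + 4-byte DayNumber
DISEASE_DAYS = 46       # disease period covered by one full upload
PHONES = 17 * 10**6     # smartphones in The Netherlands

def sizing(infected: int) -> None:
    full_upload = DISEASE_DAYS * DTK_BYTES * infected   # bytes for the entire disease period
    daily_in = DTK_BYTES * infected                     # bytes of new dtks per day
    daily_out = daily_in * PHONES                       # bytes served to all phones per day
    print(f"infected={infected:,}")
    print(f"  full upload : {full_upload / 1024**2:8.1f} MiB")
    print(f"  daily influx: {daily_in / 1024**2:8.1f} MiB")
    print(f"  daily output: {daily_out / 1024**4:8.1f} TiB")
    print(f"  avg rate    : {daily_out / (3600 * 24 * 1024**2):8.1f} MiB/s"
          f" ({daily_out * 8 / (3600 * 24 * 10**6):,.0f} Mbit/s)")

sizing(500_000)  # pessimistic scenario
sizing(20_000)   # current-situation benchmark
```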

bwbroersma commented 4 years ago

It's way more efficient to partition by the day number. Since the number of day numbers is limited, e.g. 14, you can provide 14 append-only logs, and you can calculate with sending 16 bytes per key again (plus a fixed overhead for fetching 14 files).
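A minimal sketch of that partitioning, assuming one append-only file per day number so that only the 16-byte key itself has to be stored and served (the directory layout and file names are illustrative):

```python
import os

DATA_DIR = "dtk"  # one append-only log file per day number (illustrative layout)

def append_keys(day_number: int, keys: list[bytes]) -> None:
    """Append 16-byte daily tracing keys to the log for one day number."""
    os.makedirs(DATA_DIR, exist_ok=True)
    with open(os.path.join(DATA_DIR, f"{day_number}.bin"), "ab") as f:
        for key in keys:
            assert len(key) == 16  # only the key itself is stored
            f.write(key)

# The 4-byte DayNumber moves into the file name, so each key costs 16 bytes
# instead of 16 + 4: that shaves 4/20 = 20% off the distributed data.
```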

I would advise a PostgreSQL + PostgREST solution so you get a free API if you define a schema + views.
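For illustration, PostgREST exposes each view in the schema as a REST endpoint, so a client read could look roughly like the sketch below (the `published_dtks` view, the `daynumber` column and the host name are hypothetical, not something defined in this repo):

```python
import requests

# Hypothetical deployment: a view named `published_dtks` with a `daynumber`
# column would be exposed by PostgREST at /published_dtks automatically.
BASE_URL = "https://api.example.org"

resp = requests.get(
    f"{BASE_URL}/published_dtks",
    params={"daynumber": "eq.18350"},  # PostgREST filter syntax: <column>=eq.<value>
    timeout=10,
)
resp.raise_for_status()
for row in resp.json():
    print(row)
```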

spycrowsoft commented 4 years ago

> It's way more efficient to partition by the day number. Since the number of day numbers is limited, e.g. 14, you can provide 14 append-only logs, and you can calculate with sending 16 bytes per key again (plus a fixed overhead for fetching 14 files).

> I would advise a PostgreSQL + PostgREST solution so you get a free API if you define a schema + views.

No, you can't, because the number of day-numbers is variable and not limited to a 14-day window. See: https://github.com/ahupowerdns/covid-backend/issues/7#issuecomment-613015708

Also: This is off-topic here, because it is addressed in https://github.com/ahupowerdns/covid-backend/issues/5

Also number 2: PostgREST is not a good idea, because the volume of READ-operations is enormous (greater than (new DayNumbers) * 24 * 17 million per day), so we probably can't allow every read to hit the database.

Also number 3: We need to consider simplicity, because this code probably has to be audited in a very short amount of time. If we include PostgREST, we might become unauditable, because an entire audit of the PostgREST version we use might have to be done.

bwbroersma commented 4 years ago

I used 14 as X; it was not my purpose to repeat that discussion. The point about partitioning stays the same. It's an easy optimization that will shave 20% off the data. BTW, four bytes are overkill anyway.

I agree with exporting at an hourly rate to a static file (e.g. to a CDN). PostgreSQL and PostgREST could even handle every READ-operation if you wanted to (which is probably unwise, given a static export/append at interval T): by replicating the database to some read-only database instances, it's easy to scale this horizontally.
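A sketch of such an hourly export job, assuming a PostgreSQL table `dtk(id, daynumber, key)` (a hypothetical schema) and the per-day-number append-only files suggested earlier; the resulting files would then be served statically or pushed to a CDN:

```python
import os
import psycopg2  # any PostgreSQL driver would do; table/column names below are assumptions

OUT_DIR = "export"

def export_new_keys(conn, since_id: int) -> int:
    """Append keys added since `since_id` to per-day-number files; return the new high-water mark."""
    os.makedirs(OUT_DIR, exist_ok=True)
    last_id = since_id
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, daynumber, key FROM dtk WHERE id > %s ORDER BY id",
            (since_id,),
        )
        for row_id, daynumber, key in cur:
            with open(os.path.join(OUT_DIR, f"{daynumber}.bin"), "ab") as f:
                f.write(bytes(key))
            last_id = row_id
    return last_id

# Run from cron every hour, e.g.:
#   conn = psycopg2.connect("dbname=covid")
#   mark = export_new_keys(conn, previous_mark)
```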

spycrowsoft commented 4 years ago

> I used 14 as X; it was not my purpose to repeat that discussion. The point about partitioning stays the same.

Noted.

> It's an easy optimization that will shave 20% off the data. BTW, four bytes are overkill anyway.

Are you referring to the 4 bytes of the DayNumber? Those are uint32 values dictated by the Apple/Google standard.

I'm not very keen on diverging from that, but we can shave those 4 bytes off with the file-naming scheme that is also described in: https://github.com/ahupowerdns/covid-backend/issues/5

> I agree with exporting at an hourly rate to a static file (e.g. to a CDN). PostgreSQL and PostgREST could even handle every READ-operation if you wanted to (which is probably unwise, given a static export/append at interval T): by replicating the database to some read-only database instances, it's easy to scale this horizontally.

I agree that it might be useful (especially for the write-side of the back end), so your input is certainly not bad.

The problem is that I don't know if the government has an audited and approved Haskell compiler that can compile PostgREST. And if such a compiler is available, I don't know whether or not PostgREST would pass such an audit.

Those are a few dangerous known-unknowns that could wreck this project.

I'd rather eliminate those by delivering a backend application that an employee of the AIVD (or some other auditor) would only need a couple of hours to understand and sign off on.

Every additional line of code, product or library we add, slows this process down while increasing our potential attack-surface at the same time.

In this case, extreme simplicity is best.

almer-t commented 4 years ago

With the client distribution you are assuming that we need to distribute the updates from a set of servers to a set of operational (mobile) clients. However, there are other ways of distribution besides that approach. Firstly, updates could be written as files to a bucket (read-only for the clients) that is replicated with a CDN. This may be a less expensive approach. Secondly, another alternative is to have client devices sync among themselves in a peer-to-peer fashion, using either some structured approach for message passing or a more slowly converging gossip network (or anything in between). The point here is: we are not really restricted to a strict thick-server/thin-client architecture here.

spycrowsoft commented 4 years ago

> With the client distribution you are assuming that we need to distribute the updates from a set of servers to a set of operational (mobile) clients. However, there are other ways of distribution besides that approach. Firstly, updates could be written as files to a bucket (read-only for the clients) that is replicated with a CDN. This may be a less expensive approach. Secondly, another alternative is to have client devices sync among themselves in a peer-to-peer fashion, using either some structured approach for message passing or a more slowly converging gossip network (or anything in between). The point here is: we are not really restricted to a strict thick-server/thin-client architecture here.

Good point! I think you can find an answer to your question in what has been discussed here: https://github.com/ahupowerdns/covid-backend/issues/4

dirkx commented 4 years ago

I'd have a good look at the Cuckoo filter (CF) in Design 2 of DP3T. That saves an order of magnitude (or two) in size (with better privacy as a side effect).

Secondly, I'd use a pyramid download scheme for this, so you can use stupid CDN servers with just 'get if modified'.

So generate day 1, day 2+3 (combined), days 4-7 (combined) and days 8 to 16. For a daily key (I'd argue for Design 1 and certainly Design 2 of DP3T that you can change keys every 15 minutes with no real drawbacks and great privacy benefits).

That way a client does not have to download what it already has. And there is not much of a penalty (with get-if-modified) to fetching very regularly, as long as you honour the 'not modified' responses of the CDNs.

Which CDNs are good at.
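A rough sketch of what such a pyramid fetch could look like on the client side, using conditional GETs (If-None-Match/ETag here) so unchanged files cost almost nothing; the CDN URL and chunk file names are made up for illustration:

```python
import requests

BASE_URL = "https://cdn.example.org/dtk"
# Pyramid partitioning as suggested above: the newest day alone, then
# progressively larger combined chunks for older days (names are illustrative).
CHUNKS = ["day-1.bin", "day-2-3.bin", "day-4-7.bin", "day-8-16.bin"]

def fetch_updates(etags: dict[str, str]) -> dict[str, bytes]:
    """Fetch only the chunks that changed since the ETags seen on the previous run."""
    updated = {}
    for name in CHUNKS:
        headers = {}
        if name in etags:
            headers["If-None-Match"] = etags[name]  # conditional GET
        resp = requests.get(f"{BASE_URL}/{name}", headers=headers, timeout=10)
        if resp.status_code == 304:                 # not modified: nothing to download
            continue
        resp.raise_for_status()
        etags[name] = resp.headers.get("ETag", "")
        updated[name] = resp.content
    return updated
```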

spycrowsoft commented 4 years ago

> I'd have a good look at the Cuckoo filter (CF) in Design 2 of DP3T. That saves an order of magnitude (or two) in size (with better privacy as a side effect).
>
> Secondly, I'd use a pyramid download scheme for this, so you can use stupid CDN servers with just 'get if modified'.
>
> So generate day 1, day 2+3 (combined), days 4-7 (combined) and days 8 to 16. For a daily key (I'd argue for Design 1 and certainly Design 2 of DP3T that you can change keys every 15 minutes with no real drawbacks and great privacy benefits).
>
> That way a client does not have to download what it already has. And there is not much of a penalty (with get-if-modified) to fetching very regularly, as long as you honour the 'not modified' responses of the CDNs.
>
> Which CDNs are good at.

Can you link to the relevant documents and elaborate further?

dirkx commented 4 years ago

@spycrowsoft So the main document is the design document (Design 2), from page 15 onwards: https://github.com/DP-3T/documents/blob/master/DP3T%20White%20Paper.pdf

There is a description of a serialisation in https://github.com/dirkx/DP-3T-Documents/blob/implementation-profile-start/implementation-profiles/profile.md and my code is at https://github.com/dirkx/DP-3T-Documents/tree/editable-version/impl/design-2-openssl-C, the python version is at and the rust implementation is at:

dirkx commented 4 years ago

It boils down to roughly 1/3 of the bits you'd normally have to distribute (the theoretical limit is much lower, and if you tweak the params you can get it (server-side driven) down quite far) for NL/BE/LU plus a bit of DE.
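As a rough illustration of where a factor like that can come from (the parameters below are assumptions for the sake of the example, not the values from the DP-3T profile): sending the raw 16-byte keys costs 128 bits per key, while a cuckoo filter needs only a few dozen bits per entry for a small false-positive rate.

```python
import math

# Illustrative only: the false-positive rate and load factor are assumed values.
RAW_BITS_PER_KEY = 16 * 8     # shipping the key itself: 128 bits
FALSE_POSITIVE_RATE = 1e-9    # assumed target
LOAD_FACTOR = 0.95            # typical cuckoo filter occupancy

# Standard cuckoo-filter space estimate: a fingerprint of about
# log2(1/eps) + 3 bits per entry, divided by the achievable load factor.
filter_bits_per_key = (math.log2(1 / FALSE_POSITIVE_RATE) + 3) / LOAD_FACTOR

print(f"raw keys     : {RAW_BITS_PER_KEY} bits/key")
print(f"cuckoo filter: {filter_bits_per_key:.0f} bits/key "
      f"(~{filter_bits_per_key / RAW_BITS_PER_KEY:.0%} of the raw size)")
```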

spycrowsoft commented 4 years ago

Thanks! I'll look into it.

Meanwhile: please think about how we can sell the added complexity.

We have to convince regular users and medical specialists as well as technologists. I'm not too worried about the technologists, but too much mathematical and/or technological "stuff" might scare the other groups off.

Note: This is by no means a rejection of the idea, but it's the next hurdle we have to consider.

dirkx commented 4 years ago

Fraction of the bandwidth, less data leakage, less risk of data being abused for smearing / quelling movements of large groups.

And I do not see this as much harder to sell than a particular form of compression.

Everyone knows what 'zip' does. This is just a better zip for this type of sensitive data.