DP-3T / documents

Decentralized Privacy-Preserving Proximity Tracing -- Documents

Design 2 Space Saving: Reduce amount of daily transferred data by using more than one cuckoo filter #164

Open dan-blank opened 4 years ago

dan-blank commented 4 years ago

Hi all,

I am Daniel Blank, currently studying at the Albert-Ludwigs-Universität Freiburg, and I thought about a way to reduce the overall amount of data that has to be downloaded. I propose a simple extension to how hashes of EphIDs are stored and distributed that requires only minimal change to the status quo (same data structure & serialization, but now one might download 2 or 3 cuckoo filters to get a reliable answer). I wrote a rough simulation in Python to give some initial credence to the idea (the simulation is far from a formal analysis though). In my conservative simulation (conservative = does not allow falsification of false positives, for security reasons), the amount of transferred data could be reduced by at least ~55% compared to the status quo (as understood by me).

Key idea

Instead of providing one cuckoo filter with a really low rate of false positives (from now on referred to as the status quo filter), provide one status quo filter that users can rely on, plus possibly more than one cuckoo filter smaller than the status quo filter, which should prevent most users from having to download the big, reliable cuckoo filter.

The gain is that most users have to download less data than in the status quo, while a minority of users has to download slightly more data (since they have to download all filters, including the status quo filter).

The intuition behind the proposal: Right now, a single data structure has to do at least three jobs: "Privacy", "Reliability" and "minimizing data transfer". We can do "minimizing data transfer" better by using one or more dedicated data structures while not having to compromise on "Privacy" and "Reliability". (Edit: Regarding "Privacy": It is not known to me whether using multiple cuckoo filters might enable ways to derive the EphIDs used in the cuckoo filters' hashes.)

Possible attacks

1: Having access to multiple cuckoo filters generated from the same original set might make inference of the original set more likely. (One mitigation might be adding small amounts of junk data to the tables.)

2: The server might be able to tell which IPs are more likely to be infected than others. Why? The IPs that download the last, biggest filter are the ones that were not yet filtered out by a negative check, hence they are more likely to have had contact with an infected person.

Concrete Description of a configuration using 3 cuckoo filters

Let cuckoo filter cf_high be a cuckoo filter with high reliability, cf_low one with low reliability, and cf_lowest one with the lowest reliability. (In this example I assume for simplicity's sake that most people won't encounter an infected EphID; in my simulation below, I try to account for true positives.)

0) A user wants to check whether they had contact with an infected person.

1) The user downloads cf_lowest and checks whether any of their numbers is contained in this filter. With a probability of at least ~70%, the check will give a negative result that can be trusted - no further downloads need to be done in that case.

2) (only if the check in 1. yielded positive): With high probability, a false positive occurred. Download cf_low and check again. With a probability of at least ~99%, the check will give a negative result that can be trusted - no further downloads need to be done in that case.

3) (only if the check in 2. yielded positive): With high probability, a false positive occurred. Download cf_high (corresponding to the status quo cuckoo filter) and do a final check in order to be sure about the result.
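To make the cascade concrete, here is a minimal Python sketch of the client-side logic. `SetFilter` and `had_risky_contact` are illustrative names of mine, not part of any DP-3T code; a real client would use actual cuckoo filters and download each stage lazily.

```python
class SetFilter:
    """Stand-in for a cuckoo filter in this sketch. A real cuckoo filter
    is approximate: some false positive rate, but no false negatives."""
    def __init__(self, items):
        self._items = set(items)

    def contains(self, item):
        return item in self._items


def had_risky_contact(observed_hashes, filters_cheapest_first):
    """Check observed EphID hashes against a cascade of filters,
    least reliable/cheapest first, status quo filter (cf_high) last."""
    candidates = list(observed_hashes)
    for cf in filters_cheapest_first:
        # Negatives are trustworthy (no false negatives), so anything
        # a filter rejects can be dropped for good.
        candidates = [h for h in candidates if cf.contains(h)]
        if not candidates:
            return False  # trusted negative: no bigger download needed
        # Otherwise: probably a false positive; in a real client, this
        # is where the next, bigger filter would be downloaded.
    return True  # even the most reliable filter still matches


# Example: cascade of three stand-in filters (least reliable first).
cascade = [SetFilter({"a", "b"}), SetFilter({"a"}), SetFilter({"a"})]
print(had_risky_contact({"x", "y"}, cascade))  # False after the first filter
print(had_risky_contact({"a"}, cascade))       # True: survives all three
```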

Simulation + Results

I wrote a small Python program to roughly simulate configurations like this; an unpolished, work-in-progress version can be found here: https://github.com/dan-blank/contact-tracing-storage-simulator

The numbers are not meant to be taken at face value, but I think they show the promise of the approach. main.py will by default simulate three configurations, assuming a population of 80 000 000, 40 000 new infections per day, and an fpp of 2.67e-13 considered reliable.

a) cf2.67e-13 | 79939405 104.9MB
b) cf2.67e-13#cf2e-05 | 71784013 42.2MB, 8209763 104.9MB
c) cf2.67e-13#cf1e-06#cf1e-05 | 75752147 44.6MB, 4221860 52.6MB, 25972 104.9MB

How to read: Before the '|' is the configuration (how many cuckoo filters with which false positive rates); after it are pairs that communicate how many users had to download how much data. In b), 71 784 013 users had to download 42.2MB and 8 209 763 users had to download an additional filter of 104.9MB.

a) uses a single cuckoo filter with fpp = 2.67e-13 and is meant to represent the status quo. b) uses a cuckoo filter with fpp = 0.00002 and a status quo cuckoo filter in case a check was positive. Note how only 8 209 763 of 80 000 000 users need to download the big status quo cuckoo filter.

Last, let's have a closer look at configuration c), which resembles the configuration described in "Concrete Description of a configuration using 3 cuckoo filters": We use two cuckoo filters to provide cheap access to negative checks and a status quo cuckoo filter for reliable answers. Let us calculate the amount of transferred data in MB for configurations a), b) and c):

used_a = 79939405 * 104.9 = 8385643584.5

used_b = 71784013 * 42.2 + 8209763 * 104.9 = 3890489487.3

used_c = 75752147 * 44.6 + 4221860 * 97.2 + 25972 * 202.1 = 3794159678.76

The ratio used_c / used_a is 3794159678.76 / 8385643584.5 ~= 0.45. That means that in comparison with the status quo a) we save more than 55% of transferred data when using configuration c)! The tradeoff is that 25 972 users downloaded 202.1MB instead of 104.9MB (all other users had to download less compared to the status quo).
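For the record, the arithmetic above as a few lines of Python (same inputs as the simulation output; tiny rounding differences against the printed totals are expected):

```python
used_a = 79939405 * 104.9                                   # status quo only
used_b = 71784013 * 42.2 + 8209763 * 104.9                  # two filters
used_c = 75752147 * 44.6 + 4221860 * 97.2 + 25972 * 202.1   # three filters

for name, used in (("a", used_a), ("b", used_b), ("c", used_c)):
    print(f"{name}: {used / 1e6:,.0f} TB transferred, "
          f"{used / used_a:.0%} of the status quo")
```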

Further Investigations

Acknowledgments

@a8x9 made several helpful remarks and corrections.

a8x9 commented 4 years ago

From the numbers you provide, it seems you assume that only 1 EphID is compared against the filter. In practice, all EphIDs stored on the phone have to be compared against the filter.

If we assume:

  • retention period: 14 days
  • epoch: 15 min.
  • average contacts per epoch: 4
  • false positive rate: 0.3
  • X = probability that at least one EphID will match against the cuckoo filter

X = 1.0 - (1.0 - 0.3) ^ (14 * 24 * 4 * 4) ~= 100%

Now with a false positive rate of 0.001 for cf_low:

X = 1.0 - (1.0 - 0.001) ^ (14 * 24 * 4 * 4) ~= 99.5%
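These figures are straightforward to reproduce (a quick check under the same assumptions):

```python
entries = 14 * 24 * 4 * 4   # days * hours * epochs/hour * contacts/epoch = 5376

def p_at_least_one_match(fpp, n=entries):
    """Probability that at least one of n membership tests is positive,
    assuming independent tests at false positive rate fpp."""
    return 1.0 - (1.0 - fpp) ** n

print(p_at_least_one_match(0.3))     # ~1.0: practically every user matches
print(p_at_least_one_match(0.001))   # ~0.995
print(p_at_least_one_match(2e-05))   # ~0.10, cf. configuration b) above
```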

gardners commented 4 years ago

However, in the case where the false positives are detected, this should be a relatively small set, and there may be a mechanism to efficiently download smaller amounts of data that allow checking whether those values are true or false positives. The concept has merit; it's just a matter of finding the approach that materialises the most benefit.


a8x9 commented 4 years ago

This subset approach will however leak some bits of information about your local EphIDs to the server. So it really depends on what role the server plays, and what is considered acceptable in your threat model.

gardners commented 4 years ago

Correct. Everything is a trade-off. Allowing devices on poor connections to still participate is where it makes the most sense. Or the subsets can be based on geographic location (see #127 for another motivating reason to divide the data in these ways), so that very little of actual concern is leaked, only which country or region a given EphID was in during a particular time frame.

dan-blank commented 4 years ago

@a8x9 Thank you for your remarks! You are correct, and I adjusted my simulation to take your correction into account (mainly by lowering the guaranteed false positive rates of the cuckoo filters to account for multiple testing). We can still save more than 55% of the data, but Bloom filters are now out of the equation, since they pay too high a price for having a sufficiently low false positive rate.

dan-blank commented 4 years ago

@gardners Yes, in the scenario you describe (= that we can falsify positive checks), the proposal could be more useful and able to save more data. Bloom filters might also be viable again in these scenarios. (My current simulation of the proposal is very conservative and aimed at Design 2.)

dirkx commented 4 years ago

@dan-blank could you elaborate on 'now out of the equation, since they pay too much of a price for having a sufficiently low false positive'? I am finding that we get a file that is 10-20% of the non-filter size (an 80% win), revealing only 50 or so bits of the 256 bits, if we assume 5-10 million contaminated and 1:100 or better false positives for the 1-5 infected persons someone may have met in the past 14 days out of those 5-10 million.

Running code that should show this at https://github.com/dirkx/DP-3T-Documents/tree/editable-version/impl/design-2-openssl-C

We found that with CF we can get higher packing than with Bloom; and the above number is for the generic case - if you tune for a number you can shave off 1/3.

dan-blank commented 4 years ago

@dirkx I will try my best - it is not unlikely that I got some things wrong ;)

Short answer

Careful: That statement is based on my experience running the simulation and depends on the simulation being correct. My intuitive explanation: Cuckoo filters start to beat Bloom filters at around 3% fpp. Testing thousands of entries with an fpp of 3% per check makes it highly likely that at least one of them tests positive - prompting another download with higher reliability and denying any potential win. If we don't want most of the population to have a positive check, we need to choose an fpp of less than 3%. At this fpp, Bloom filters can't compete. At higher fpp, they are useless.

Long answer

First, my assumptions (following the credo "rather make one explicit assumption too many than one too few"):

1) We have to use something like cuckoo filters or Bloom filters so that the user cannot see the actual EphIDs of the infected persons (instead they only see the hashes of the EphIDs).

2) The size of an entry in a cuckoo filter does not matter and can be calculated by the Python expression (math.log(1 / fpp, 2) + 2) / 0.955, where fpp is the false positive rate and 0.955 is the load factor.

3) Cuckoo filters become more compact than Bloom filters at around fpp = 3%.

4) There is no server interaction beyond downloading cuckoo filters. So it is not possible to query a small set of EphIDs that correspond to the positively checked entries.

5) As a consequence of 4): When a person gets a positive check for any entry (false or true), they will have to download another filter that is bigger than the previous one. I assume that this will already happen at one positive check. (One might argue that 4 positive entry checks do not warrant downloading a new filter. This might save much data, but I am ignorant about whether you can make such an assumption.)

6) So a person only wins when they get a negative check for every entry on their phone. We assume there are 14 * 24 * 4 * 4 entries.
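To illustrate assumptions 2) and 3): comparing the per-entry cost from assumption 2) with the textbook space-optimal Bloom filter cost of log2(1/fpp) / ln(2) bits per entry makes the crossover near fpp = 3% visible (a quick sketch of mine, not simulation output):

```python
import math

def cuckoo_bits_per_entry(fpp, load_factor=0.955):
    # assumption 2): fingerprint of log2(1/fpp) + 2 bits, divided by the load factor
    return (math.log(1 / fpp, 2) + 2) / load_factor

def bloom_bits_per_entry(fpp):
    # space-optimal Bloom filter: log2(1/fpp) / ln(2) ~= 1.44 * log2(1/fpp)
    return math.log(1 / fpp, 2) / math.log(2)

for fpp in (0.05, 0.03, 0.01, 0.001, 2e-05, 2.67e-13):
    print(f"fpp={fpp:g}: cuckoo {cuckoo_bits_per_entry(fpp):5.2f} bits/entry, "
          f"bloom {bloom_bits_per_entry(fpp):5.2f} bits/entry")
```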

Then we can do the calculation from the first answer that @a8x9 provided: When we check 14 * 24 * 4 * 4 entries against the downloaded filter and expect an fpp of 1/1000, then our chance of at least one check giving a positive result is 1.0 - (1.0 - 0.001) ^ (14 * 24 * 4 * 4) ~= 99.5%. By assumption 5), that means that 99.5% of the population gained nothing - and by 6), 0.5% of the population gained a very cheap check that returned negative.

If we want to actually exclude big parts of the population, like I tried in my simulation, the fpp has to be way below 3% (or even 0.1%). For example, consider configuration b), where I excluded ~89% of the population with the first cuckoo filter - a filter that has an fpp of 2e-05. That is way below the fpp of 0.03 where Bloom filters could still have an advantage.

I am sure one could calculate this rigorously, but by playing around with my simulation, I noticed that Bloom filters are not worth it: When they are too loose, they don't filter out enough people for their cost. When they start to filter out more than, say, 1% of the population, they get beaten by cuckoo filters.

kennypaterson commented 4 years ago

@dan-blank If I understand your initial proposal correctly, it is quite similar to what was proposed in the CRLite system by Larisch et al. from Oakland 2017: https://obj.umiacs.umd.edu/papers_for_stories/crlite_oakland17.pdf It may be interesting for you to scan this paper and see how it relates to your proposal.

dirkx commented 4 years ago

That is a good answer -- but I am trying to relate it to my numbers and the whitepaper of DP3T. I am getting the right FPs (and, if I tune manually, good packing; if not, it is still good enough for production at 80% or better -- so a file 20% of the original size).

And we leak relatively little in terms of bits, so I can imagine that that 'is it'.

And no - this was not my idea - this is trying to be the serialisation and implementation details of "Design 2" in the Whitepaper of DP3T (2020/4/14 version).

dirkx commented 4 years ago

PS - the design did remind me of that paper - and of the Merkle hashes of the Certificate Observatory pushed by Google et al. So yes - fully agreed there!

kennypaterson commented 4 years ago

@dirkx For the avoidance of doubt, due to time pressure, let me say that we don't plan to specify or implement anything beyond a single Cuckoo filter for now.

dirkx commented 4 years ago

Thanks! And that - with the confirmation that the H(TRUNC(H(seed ... | i) can be used as is - I think we have enough specificity for a normative implementation profile.

As we can 'cheat' by keeping the depth, hash-length and verify-length explicitly in the serialisation format, doing it really well becomes a pure backend issue which does not require updating the readers/apps in the field (they just benefit from better packing, lower false positives, etc. - but no code changes).
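To illustrate the 'cheat' (purely hypothetical field names and layout of mine, not the actual profile.md format): if the reader parses the filter parameters from a small explicit header, the backend can retune them daily without an app update.

```python
import struct

# Hypothetical 16-byte header: version, fingerprint bits, entries per
# bucket, reserved byte, number of buckets, number of inserted items.
HEADER = struct.Struct(">BBBBIQ")

def pack_header(version, fingerprint_bits, entries_per_bucket,
                num_buckets, num_items):
    return HEADER.pack(version, fingerprint_bits, entries_per_bucket,
                       0, num_buckets, num_items)

def unpack_header(blob):
    v, fp_bits, epb, _reserved, buckets, items = HEADER.unpack(blob[:HEADER.size])
    return {"version": v, "fingerprint_bits": fp_bits,
            "entries_per_bucket": epb, "num_buckets": buckets,
            "num_items": items}

# The backend can change these numbers tomorrow; deployed readers just
# parse the header and benefit from the better packing.
blob = pack_header(1, 32, 4, 1 << 20, 3_000_000)
print(unpack_header(blob))
```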

dirkx commented 4 years ago

Feel free to close this one if you feel that is right - then I will mark it as a 'done' design decision/confirmation on my end.

kennypaterson commented 4 years ago

I think the discussion of space-saving ways of doing the CF functionality is useful, and further ideas might emerge here that could be useful in a later release of the system, so I am happy to keep the issue open for now.

dirkx commented 4 years ago

Ok - thanks. For what it is worth - I think enough people have now shown with code that you get < 20% and that the exposed bits are low enough - so from a practical perspective it clears all hurdles, I (personally) think.

dan-blank commented 4 years ago

@kennypaterson At first skim, it does look related. Interesting, thank you! I understand that there is time pressure to get an initial version out now. Is there any particular aspect of this approach that could be interesting later on that I could focus on right now or after some time? On my mind are, for example: a) trying to simulate a Design 1 scenario with a filter cascade, b) polishing up my simulation and making the assumptions more transparent.

Btw, I plan to treat the initial proposal as a living artefact for discussing "space saving by using a filter cascade", so I will keep rewriting the issue to make it clearer - is that okay?

@dirkx thank you for another paper to look at! I would love to contribute to your efforts for the initial release, but I fear I won't be able to catch up and make a valuable contribution in time before the initial release - therefore I want to focus on potential space savings in later versions :)

dan-blank commented 4 years ago

@kennypaterson I digested the paper enough to be able to tell how it relates to our situation. Here are my summary & thoughts:

strubbi77 commented 4 years ago

I don't get why 80 000 000 is the base population. For Europe, the base population would be 300M. Why is the data not split into smaller regions? Divide and conquer. In Germany, Bundesland (1M-17M), Regierungsbezirk (~1M) and Landkreis (100k-1M) are smaller entities the data could be split into. Sure, you would give additional data about where you have probably been, and you have to store the information about which entities you have been in during the last days. But doing this optionally would decrease the amount of data for people using this feature by (1 - (population of entity / total population)) * (percentage of people using geo reference) * (avg number of entities a person has been in). People who don't use it will always download the complete data, and upload data without any reference.

dan-blank commented 4 years ago

Thank you for your remarks! :)

> I don't get why 80 000 000 is the base population. For Europe, the base population would be 300M.

80M is just some starting point (maybe I should make this a bit clearer). But you made me consider taking Europe-specific numbers into account so this idea is more relatable.

> Why is the data not split into smaller regions? Divide and conquer. In Germany, Bundesland (1M-17M), Regierungsbezirk (~1M) and Landkreis (100k-1M) are smaller entities the data could be split into.

Technically, because I wanted to present my approach in isolation from other approaches. I also wanted to present my idea under the most privacy-preserving assumptions possible, namely that the server does not even know the rough location of a hash of an EphID.

But the approaches could absolutely be combined! The idea I present is a general way to reduce the data payload for a population that has to download cuckoo filters. It would also benefit a solution that splits the data into smaller regions.

dirkx commented 4 years ago

Right - but there are also countries that are more of a whole (including the border areas just around them) and generally act as one traveling region.

So IMHO for NL, BE, LU & the adjacent Ruhrgebiet in Germany - those sorts of numbers are useful to test/validate your designs against.

And for what it is worth - a single CF filter seems to be good enough given bandwidth, infection rates and days needed.

dan-blank commented 4 years ago

> So IMHO for NL, BE, LU & the adjacent Ruhrgebiet in Germany - those sorts of numbers are useful to test/validate your designs against.

Yes, I agree. The conversation with you and @strubbi77 made me decide to use EU-numbers in a future version of this proposal, since that is likely the core audience at the moment :)

> And for what it is worth - a single CF filter seems to be good enough given bandwidth, infection rates and days needed.

We should certainly focus on getting a secure, working version out fast. That is priority 1. But I think it makes sense to continue looking for space reductions later on, even more so if it is an unintrusive extension to the protocol (e.g. if it uses the same serialization process) - especially considering that downloading a new filter happens daily on every smartphone and that this protocol might be used in countries with sub-optimal infrastructure. (However, I think the basic question of "How much effort should we invest in optimizing space storage?" should be discussed in another topic; I would like to keep this issue about methods that use more than one filter and related approaches.)


Next up in my pipeline: 1) Fine-tuning the filter configuration. 2) Exploring whether two or more hashtables can be correlated to derive the original data.

dirkx commented 4 years ago

@dan-blank the CF version I put at https://github.com/dirkx/DP-3T-Documents/tree/editable-version/impl/design-2-openssl-C and the serialisation are done in such a way that it can be used for a single CF 'now'.

And from my numbers (just run the above to see/tweak), I get manageable sizes for the daily downloads and not too many bits revealed.

I tried to do the serialisation as described in https://github.com/dirkx/DP-3T-Documents/blob/implementation-profile-start/implementation-profiles/profile.md in such a way that later versions can do better at the CF part - without instantly needing to change deployed apps.

This matters as it will become increasingly harder to re-deploy apps under the current roll-out schemes.

If you could confirm that this passes muster for your cases - that would be goodness.

As that would give us enough cross-border interoperable details for the countries going into production next week (and would only require redeploy in, AFAIK, the 2 countries that are now in production).

dan-blank commented 4 years ago

@dirkx First, kudos on doing the implementation! Your file sizes look reasonable and it is great that they seem actually feasible. Yes, I can try to help; I will likely look into it again on Monday. But I fear I did not exactly understand how I can help you. I see these options for how I could help:

Assuming you want me to help / check with your implementation:


1) Adapt my simulation so that it simulates a cross-border scenario as realistically as possible. To be on the safe side, we could just assume that the app is deployed in all of the EU: 446 million inhabitants and ~25-30k new infections per day. Result: I do some more research, change my simulation and change the text of the issue to include the new numbers.

2) I could try to match your numbers and see if there is a difference between my scenario "1 cuckoo filter only" and the number that you get. Result: I update the text of the issue and maybe find a wrong assumption either in my idea or in the implementation you linked.

Assuming you think that my idea can be implemented and should be implemented fast:


3) Implement my idea and see if it is feasible. Result: An actually running version of the idea; hopefully it would show a reduction in data transfer.


4) Something else: If I can help in another way, go ahead and tell me :)

Regarding 3): Before actually using this idea in apps that are being deployed, I would like to have more opinions regarding the privacy aspect of this. Using three different cuckoo filters that are derived from the same data set might be problematic. (One solution might be to only use the original data once, and then generate the 2nd cuckoo filter from the 1st, and so on. I have to think this through.)

dirkx commented 4 years ago

I think, given that we have implementations underway, that '1 then 2' is most valuable.

Regardless of what language you implement it in, you'll find that by tweaking how/when you change the hash-length vs. the table size, you can get much better performance. I am not so concerned by this, as this calculation is only done once a day in a central backend - so it is very easy to change later (as opposed to having to update tens of millions of mobile phones).

And test/example cases are of course always very useful.

dan-blank commented 4 years ago

Alright, then 1 and 2 is the direction of work I will continue for now :) (Edit: Ended up doing public communication work instead. I think for now, this optimization and further simulation work has a low priority.)

dan-blank commented 4 years ago

> However, in the case where the false positives are detected, this should be a relatively small set, and there may be a mechanism to efficiently download smaller amounts of data that allow checking whether those values are true or false positives.

@gardners after working through the paper https://obj.umiacs.umd.edu/papers_for_stories/crlite_oakland17.pdf, I am now convinced that there is unfortunately no way to provide a pre-processed set of false positives to check against when no data should leak from the user to the server. This is because there is a near-infinite number of possible false positives and we don't know which ones the users will encounter. (We could know, but that would require the server to know all observed IDs.) The idea still gives a considerable reduction in transferred data size, but this path of additional optimization is closed, I fear.

In the case where it would be okay for the server to know the IDs encountered by the user, what you suggested might be a viable option. The set of false positives would be small and could be enriched with lots of fake IDs and then checked by the server. (My approach presented in this issue would make less sense then - a single filter would likely suffice, given the expected-to-be low number of false positives.)
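A rough sketch of that variant, under the assumption that such a server endpoint existed (`fetch_infected_subset` is entirely hypothetical; nothing like it is specified in DP-3T):

```python
import secrets

def fetch_infected_subset(candidate_ids):
    """Hypothetical server endpoint: would return which of the submitted
    IDs are genuinely in the infected set. Not specified anywhere."""
    raise NotImplementedError

def resolve_matches(matched_ids, decoys_per_match=50, id_len=16):
    """Hide the few filter matches in a cloud of random fake IDs before
    asking the server; the server only learns a padded superset."""
    decoys = [secrets.token_bytes(id_len)
              for _ in range(decoys_per_match * max(len(matched_ids), 1))]
    query = list(matched_ids) + decoys
    secrets.SystemRandom().shuffle(query)
    confirmed = set(fetch_infected_subset(query))
    return [i for i in matched_ids if i in confirmed]
```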

dan-blank commented 4 years ago

A small update: The proposed method should save more space than I thought, since I did not factor in the following:

1) When downloading two or more cuckoo filters, the user can compare which IDs matched positive between the tables. When an ID is included in one table but not in the other, it was a false positive.

2) When checking against the second filter, the user does not need to check all observed IDs, but only the ones that matched the first table. This should produce way fewer false positives.

3) The expected false positive rate is multiplicative: If two tables t1 and t2 were downloaded and an ID id1 appeared in both, then the likelihood of id1 being a false positive is t1 * t2 (the product of their false positive rates). That means that the last users in the chain don't need to download a high-precision filter anymore.
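A quick numerical illustration of points 2) and 3), assuming independent filters and the per-phone entry count used earlier in the thread (the fpp values are illustrative, not simulation output):

```python
entries = 14 * 24 * 4 * 4       # observed EphID hashes per phone
fpp1, fpp2 = 1e-05, 1e-06       # false positive rates of the first two filters

survivors_after_stage1 = entries * fpp1            # expected IDs left to re-check
p_both = fpp1 * fpp2                               # a clean ID passing both filters
p_need_third_filter = 1 - (1 - p_both) ** entries  # user still unresolved

print(survivors_after_stage1)   # ~0.05: usually nothing left to re-check
print(p_need_third_filter)      # ~5.4e-08: almost nobody downloads cf_high
```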

@lbarman @kennypaterson At this point, would it make sense to pursue this direction further? I could update my proposal with this new insight and the data of the current whitepaper. I expect maybe an additional 10-15% of saved data and a minimal overhead for the least lucky users of around 5% (down from something like > 100% overhead). This approach would benefit both the hybrid and the unlinkable approach. However, this costs time, and if it is not helpful, I would like to spend my time on other things for now.

Last, I want to thank you all for the great work you have achieved!