DP-3T / documents

Decentralized Privacy-Preserving Proximity Tracing -- Documents

Cuckoo filter is probably overkill and overly complicated #130

Open snakehand opened 4 years ago

snakehand commented 4 years ago

I have been looking at various implementations of the cuckoo filter, and though it is a very performant data structure for realtime caching of data, that is not how it is actually used in Design 2. Rather, it is used as a data exchange format to transfer a condensed list of hashes from the server to the client. It is conjectured that the compactness for run-time lookup automatically translates into a compact (high entropy) format for sending the data to the client. But the fact that there will necessarily be a good number of empty bins in the hash table is proof that the compactness is not optimal.

The most compact representation possible would be just a simple list of hashes. The cuckoo filter will give a faster lookup time O(1), but if the list is sorted, lookup is O(log n) with binary search.

Moreover the paper does not specify the hashing function to be used, which is strange since the hashing function can easily become more expensive than the lookup itself. What I would suggest is to do a simple xor of all 32 / 64 bit words, and use that as a hash. This should be sufficient since the input has high entropy from cryptographic mixing. Some leading bits can be stripped to meet the desired rate of false positives. This together with a sorted list will make lookup more than fast enough IMHO. It will also be much simpler and faster to implement.

EDIT: 64 bits might be a more appropriate size for the xor hashing.
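
For illustration only (the white paper does not specify any of this), a minimal sketch of the xor-fold hash plus sorted-list lookup; the list sizes and the false-positive budget below are made-up numbers:

import bisect
import math
import os
import struct

def xor_fold(h):
    # Xor the 64-bit words of the hash together; fine here because the input
    # already has high entropy from the cryptographic hash.
    acc = 0
    for i in range(0, len(h), 8):
        acc ^= struct.unpack("<Q", h[i:i + 8])[0]
    return acc

def needed_bits(n_server, n_client, max_false_hits):
    # Expected false matches per client are roughly n_server * n_client / 2**bits.
    return math.ceil(math.log2(n_server * n_client / max_false_hits))

# Purely illustrative numbers: 5M published hashes, 2k observations per client,
# at most 0.01 expected false matches per client.
BITS = needed_bits(5_000_000, 2_000, 0.01)
MASK = (1 << BITS) - 1

server_hashes = [os.urandom(16) for _ in range(10_000)]  # stand-in data
table = sorted(xor_fold(h) & MASK for h in server_hashes)

def seen(observed):
    key = xor_fold(observed) & MASK
    i = bisect.bisect_left(table, key)
    return i < len(table) and table[i] == key

assert seen(server_hashes[0])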

burdges commented 4 years ago

I only mentioned cuckoo filters because folks were discussing bloom filters, but cuckoo filters are always more efficient with lowish false positive rates.

Actually modern cuckoo filter designs achieve like 98% fullness efficiently, but yes they get complex especially since the size stops being a power of two.

Yes, a sorted list achieves 100% fullness and is much simpler. Also, the sorted list's O(log n) penalty can be amortized across the client's list. After amortization, the cuckoo filter's 2% bandwidth penalty could well outweigh the sorted list's computational penalty.

snakehand commented 4 years ago

@burdges "the sorted list's O(log n) penalty can be amortized across the client's list": Are you thinking that if the client keeps a sorted list (or lists) of observed ID hashes, then lookup for all stored IDs effectively becomes O(1)? (Sorted list comparison.)

dirkx commented 4 years ago

@burdges So testing this - using https://github.com/dirkx/DP-3T-Documents/blob/implementation-profile-start/implementation-profiles/profile.md for the params that are not specified in the white paper - and setting it to 1:100 false positives when considering contact lists of a few 1000's, and a few million contaminated people in a 14 day window with a few hits at most -- I get it roughly as good as https://github.com/dirkx/DP-3T-Documents/tree/editable-version/impl/design-2-openssl-C -- which gives us 58 bits of hash reveal needed & a factor 2x larger than the approach described by @snakehand. So I guess this needs a bit of thought - or some explanation in the white paper as to what we're missing here.

burdges commented 4 years ago

Yes, exactly @snakehand: the "depth first" comparison algorithm gives O(m+n), I think.
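
For illustration, a sketch of that amortized comparison (names and data made up): with both the published list and the client's observed hashes sorted, a single merge pass finds all matches in O(m+n) instead of m separate binary searches.

def matches(server_sorted, client_sorted):
    # Two-pointer walk over both sorted lists; O(m + n) total.
    out = []
    i = j = 0
    while i < len(server_sorted) and j < len(client_sorted):
        a, b = server_sorted[i], client_sorted[j]
        if a == b:
            out.append(a)
            i += 1
            j += 1
        elif a < b:
            i += 1
        else:
            j += 1
    return out

assert matches([1, 3, 5, 9], [2, 3, 9, 11]) == [3, 9]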

You achieve 80% fullness even with early cuckoo filter designs, so someone implemented or tuned that cuckoo filter wrong, @dirkx, which happens a lot btw. Also, if you do estimates then please use only modern cuckoo table designs, like https://arxiv.org/abs/1707.06855 and https://theory.stanford.edu/~rinap/papers/erichash.pdf which give 96.5% fullness with sliding buckets of size two, although larger sliding buckets give higher fullness.

Also, these cuckoo filters encode like log_2 num_entries - 1 bits into the position, which impacts your false positive rate and hence bitsize, although you do not compute it that way. If you do the cuckoo filter correctly vs a naive sorted list, then the cuckoo filter takes less space because the naive sorted list wastes prefix space, but the fair comparison is a sorted list with some prefix encoding, at which point the sorted list should dominate by roughly 2% or whatever.

I agree the prefix encoded sorted list sounds best because its "tricky" components like this prefix encoding and double sourced list comparison are orthogonal, assuming an expansion phase, and individually remain simpler than a modern cuckoo filter.
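
As a very rough back-of-envelope for that comparison (my own illustrative numbers, not from the white paper), using the classic Fan et al. cuckoo filter bound and Golomb/Rice-coded gaps for the sorted list, at the same per-query false-positive rate:

import math

# Classic cuckoo filter (Fan et al.): false-positive rate eps <= 2*b / 2**f,
# so the fingerprint needs f ~ log2(2*b/eps) bits, stored at load factor alpha.
# Sorted truncated hashes: eps ~ n / 2**w, and the (roughly geometric) gaps
# cost about log2(2**w / n) + 1.5 bits each with Golomb/Rice coding.
eps = 1e-6            # illustrative per-query false-positive rate
b, alpha = 4, 0.955   # common cuckoo parameters: bucket size 4, ~95.5% full

cuckoo_bits = math.log2(2 * b / eps) / alpha
sorted_bits = math.log2(1 / eps) + 1.5

print(f"cuckoo filter     : {cuckoo_bits:.1f} bits/entry")
print(f"sorted list + gaps: {sorted_bits:.1f} bits/entry")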

dirkx commented 4 years ago

@burdges as to the tuning - we've run through a range of scenarios -- trying to keep the occupancy very high (well above 92%) & allowing for a very high number of chain evictions - as we reasoned that build times could be 'endless', since they would be done once, on the backend with lots of CPU and memory. We tried the 2 way, the 3.53(+next block) and 4 way versions of the filter.

But we did try to minimise how much of the hash is disclosed. We did not go for the log_2 num_entries - 1 -- but rather took a low number and added a few bits to get a low enough false-positive rate. We did find though that log_2 of the buckets was a fairly good ballpark to start with - and you usually only need 1 or 2 extra bits to get the false positives down & the packing up.

As soon as we drop that constraint on 'how much' is disclosed - then I fully agree with you - and it gets way better.

But then the serialised volume goes up - as you need the bucket-verify bits exported as you serialise. And a few million hashes for a typical large country area is a low entry count for this.

Or can you share a specific variation which already has this benefit in the low millions range and still has low false positives for 'our' likely numbers?

I'll check meanwhile - we did not look at arXiv:1707.06855 yet - and that perhaps does solve the packing issue we now have due to the low incident count.

Update: thanks for Eric's paper; it is not exactly what we used - they only let one item cross block boundaries; so that should be easy to try as well.

dirkx commented 4 years ago

Thanks.

So we have tried the two filters you suggested; they are both significantly better for very high numbers (500M - 500G entries); but we're not understanding how they can be tweaked for low sizes (0.5 - 5 million entries). Esp. with high packing ratios we seem to need a lot of bits for the discrimination.

dirkx commented 4 years ago

@burdges, @snakehand - I stand corrected; once you tweak the balance between the verify bits and the hash bits to the (rough) dataset size and allow for non-byte-aligned hashes -- I am getting results which are easily 1/5 of the normal file sizes.

And the number of bits disclosed of H(TRUNC128(H(seed)||i)) is decent: 60-100 bits. Which may give rise to https://github.com/DP-3T/documents/issues/145.

burdges commented 4 years ago

I'd expect that's because you've no prefix compression for the sorted file. They really should come close with prefix coding :) although I don't know the optimal prefix coding scheme.
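
For what it's worth, one concrete prefix-coding option (a suggestion, not something settled here) is to Rice-code the gaps between consecutive sorted truncated hashes; the gaps of uniform random values are roughly geometric, for which Rice coding with k ≈ log2(mean gap) is close to optimal. A small sketch with made-up values:

import math

def rice_encode(gaps, k):
    # Each gap becomes unary(quotient) + '0' + k remainder bits.
    bits = []
    for g in gaps:
        q, r = g >> k, g & ((1 << k) - 1)
        bits.append("1" * q + "0" + (format(r, f"0{k}b") if k else ""))
    return "".join(bits)

def rice_decode(bitstr, k, count):
    out, pos = [], 0
    for _ in range(count):
        q = 0
        while bitstr[pos] == "1":
            q += 1
            pos += 1
        pos += 1  # skip the '0' separator
        r = int(bitstr[pos:pos + k], 2) if k else 0
        pos += k
        out.append((q << k) | r)
    return out

# Tiny illustration with made-up sorted truncated hash values.
values = sorted([0x2001, 0x12A4F, 0x13B00, 0x9FF01, 0xABCDE])
gaps = [values[0]] + [b - a for a, b in zip(values, values[1:])]
k = max(0, int(math.log2(sum(gaps) / len(gaps))))

encoded = rice_encode(gaps, k)
decoded = rice_decode(encoded, k, len(gaps))
assert [sum(decoded[:i + 1]) for i in range(len(decoded))] == values
print(f"{len(encoded)} bits for {len(values)} values (k = {k})")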

dirkx commented 4 years ago

Agreed - but I am not overly worried by that (and bzip brings it to about 10% of the N x 256 bit size). We are now in the manageable domain; and this can be optimised over time (and some of it purely server side).

The crux now is to get enough implementation detail to allow interoperable cross-country implementations to be built & tested independently. Better versions can come later (esp. as that would be a second file format generated on the backend, with (only the) newer apps picking it up).

Remember that the first calls for tenders are, in some countries, due on Tuesday already.

And right now DP3T and the google/apple proposal are the only two that are close to meeting the requirements of the EU (https://ec.europa.eu/info/files/recommendation-apps-contact-tracing_en) that will likely be signed into force on the 13th of April.

snakehand commented 4 years ago

Here is a test that indicates how a sorted list can be compressed by storing the deltas.

Generate:

import hashlib
import struct

def val(i):
    # Derive a pseudo-random 64-bit value from the SHA-256 hash of the index.
    m = hashlib.sha256()
    m.update(b"Nobody inspects %d" % i)
    return struct.unpack("<Q", m.digest()[:8])[0]

def save(fname, lst):
    # Write each value as a little-endian 64-bit word.
    with open(fname, "wb") as f:
        for e in lst:
            f.write(struct.pack("<Q", e))

keys = [val(i) for i in range(1000000)]
save("unsorted.bin", keys)
keys.sort()

# Store the first value as-is and every later value as the delta to its
# predecessor; the small deltas compress much better than the raw values.
keys2 = [v if i == 0 else v - keys[i-1] for (i, v) in enumerate(keys)]
save("sort_diffed.bin", keys2)

Sizes of bzip2 -9 compressed output:

6592432 Apr 13 14:35 sort_diffed.bin.bz2
8000000 Apr 13 14:35 unsorted.bin

In this case 1,000,000 64-bit random values were reduced to 82% of the original size.

dirkx commented 4 years ago

Nice! And simple. Though with the cuckoo filter we're going down to 10-20% of the original size - and only reveal 40-60 bits of the 256.

snakehand commented 4 years ago

If I create 40 significant bits like this

    return struct.unpack("<Q", b"\000\000\000" + m.digest()[:5])[0]

I get a bzip2 -9 compressed file size of 2986637 bytes, which is 59.7% of the 5000000 bytes of unsorted entropy.

jaromil commented 4 years ago

Hi mates, just FWIW I share two opinions here about the topic:

  1. design 2 and esp. CF may well be undesirable
  2. when space is an issue I'd go for linear-time Lempel-Ziv factorization (brieflz), which should be suitable for future IoT solutions as well

I'm grateful to the authors of DP-3T for an amazing whitepaper and for keeping all designs around.

@dirkx is right in mentioning the urgency and that decisions need to be taken this week in the EU.

good luck

dirkx commented 4 years ago

@snakehand: brought https://github.com/dirkx/DP-3T-Documents/blob/implementation-profile-start/implementation-profiles/profile.md and the sample code in line with the info from #145 and #100.

dirkx commented 4 years ago

Could you elaborate on why CF is undesirable?

W.r.t. 2 - is that still the case if we're talking 15-20% of the original size (so 80% less)?

B.t.w.: we've meanwhile verified IoT (and old iPhone 4) use -- and found that the use/scan of keys in the CF is fast, even on an ESP32 IoT chip with a few million infected seeds in the CF. So for the Corona timescale / hardware in Europe - good enough.

jaromil commented 4 years ago

My concerns about CF are mostly about your earlier reports and the fact that it is not fully deterministic, but the trade-off with performance you report now is relevant to the conclusions.

I think design 1 is also flexible enough to deliver some computation from the server side if really needed; but admittedly I haven't studied design 2 and its implications in depth.