Closed: ladnir closed this issue 2 years ago
Do you have a paper or write-up?
I'd like to compare your "structured linear function" implementation to mine, specifically the part that solves M * X = V. I'm not as interested in the other CRL-specific stuff. Can you point me to an example that focuses on this? I see the benchmark code; is there a specific one for this? Ideally something like the running time for encoding n = 2^20 GF128 or GF64 values.
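To pin down what I mean by "the part that solves M * X = V", here's a rough sketch (hypothetical Rust, not your actual API) of the relation I have in mind: each key hashes to a sparse GF(2) row, encoding solves for X, and a query is just the sparse dot product.

```rust
// Hypothetical sketch of the relation, not either codebase's real API.
// Each key k hashes to a sparse row m_k over GF(2); "encode" solves M * X = V
// so that m_k * X = v_k for every pair, and a query is the sparse dot product,
// i.e. the XOR of the selected 64-bit entries of X.

fn query(x: &[u64], row_support: &[usize]) -> u64 {
    // row_support = positions of the 1s in the hashed row for this key
    row_support.iter().fold(0u64, |acc, &i| acc ^ x[i])
}

fn is_valid_encoding(x: &[u64], rows: &[Vec<usize>], values: &[u64]) -> bool {
    // A solution X of M * X = V satisfies query(X, M[k]) == V[k] for all keys k.
    rows.iter().zip(values).all(|(r, &v)| query(x, r) == v)
}
```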
Also, I'm a C++ dev, so bear with me :)
Very interesting. I wasn't aware of the "linear OKVS" line of research. There are also Ribbon Filters (trade off speed vs size): https://engineering.fb.com/2021/07/09/data-infrastructure/ribbon-filter/ and binary fuse filters (much faster, but worse size) https://arxiv.org/abs/2201.01174.
The rows each contain two aligned 32-bit blocks at block numbers (i, j), where i - j ≈ 1 + k log(n)/r^2 for a uniform random r in (0, 1), so the difference is usually small. I have a heuristic argument for why this should usually be invertible up to a fairly large number of rows (> 2^32) with appropriate k, and experimental evidence, but not a proof. Note that "usually" just means ≈95% of the time, because there's no security cost to retrying.
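Schematically, the sampling looks something like this (simplified sketch; the constant k and the clamping are illustrative, not the real code):

```rust
use rand::Rng; // assumes the `rand` crate

// Simplified sketch of the block-number sampling described above; the constant
// k and the clamping are illustrative, not the real code.
fn sample_block_pair(rng: &mut impl Rng, n: usize, nblocks: usize, k: f64) -> (usize, usize) {
    let r: f64 = rng.gen_range(1e-9..1.0); // uniform random in (0, 1)
    // i - j  ~  1 + k * log(n) / r^2: usually small, but heavy-tailed.
    let ideal = 1.0 + k * (n as f64).ln() / (r * r);
    let diff = (ideal.round() as usize).clamp(1, nblocks - 1);
    let i = rng.gen_range(0..nblocks - diff);
    (i, i + diff) // this row touches aligned 32-bit blocks i and j = i + diff
}
```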
To solve, first focus only on the rows with i⊕j = 1, forming pairs of blocks. Solve each such pair to systematic form, then project the result out of every other row that touches either that i or that j. Then repeat on rows where i⊕j is 2 bits long, then 3 bits long, and so on.
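In outline, the scheduling looks like this (just the bucketing by level; the per-pair GF(2) elimination on the 32-bit blocks is elided):

```rust
use std::collections::BTreeMap;

// Outline of the solve order only; the actual systematic-form elimination on
// the 32-bit blocks is elided. Rows are processed in increasing bit length of
// i XOR j: level 1 when i ^ j == 1, then 2-bit differences, and so on.
fn bucket_by_level(rows: &[(u32, u32)]) -> BTreeMap<u32, Vec<(u32, u32)>> {
    let mut levels: BTreeMap<u32, Vec<(u32, u32)>> = BTreeMap::new();
    for &(i, j) in rows {
        let level = 32 - (i ^ j).leading_zeros();
        levels.entry(level).or_default().push((i, j));
    }
    // Walking the map in increasing order: at each level, solve the paired
    // blocks to systematic form, then project the result out of every
    // remaining row touching either block, so later levels see a smaller system.
    levels
}
```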
The largest matrix has dimensions O(log n · sqrt n) by O(log n · sqrt n + epsilon·n), where epsilon = 2^-10 is the overprovision factor, so solving it takes Õ(n^(3/2) + epsilon·n^2). In practice, though, the bottleneck for smallish n is mostly pulling the rows and columns apart for the systematic-form computations, and for largish n it's memory consumption. This could probably be improved slightly by eliminating some of the columns strategically to get rid of that n^2 term.
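For scale, at n = 2^20 with epsilon = 2^-10 (and the hidden big-O constants dropped), the final dense system is only on the order of 20k x 21k:

```rust
// Back-of-the-envelope for n = 2^20, epsilon = 2^-10; big-O constants dropped.
fn main() {
    let n = (1u64 << 20) as f64;
    let eps = 2f64.powi(-10);
    let rows = n.log2() * n.sqrt(); // ~ log n * sqrt n      ≈ 20_480
    let cols = rows + eps * n;      // ~ rows + epsilon * n  ≈ 21_504
    println!("final dense system on the order of {rows:.0} x {cols:.0}");
}
```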
The "uniform" benchmark builds OKVS with 10^{3,4,5,6,7,8} values, but only 8-bit outputs. I changed this to 2^20 keys, 64-bit outputs. Currently it's limited to 64 for the CompressedRandomMap
map code arbitrarily, but that's easy to change. This takes 686ms on a 14" MacBook Pro (M1 Max, average over 100 runs). There's also multithreaded build code but it isn't very good currently, especially for small n: it still takes 474ms. These should be read as slight underestimates because, with the default benchmarking seed, the build succeeds every time instead of retrying about 5% of the time. Querying a row takes 71 ns with H = SipHash(1,3): the hash function is important on this scale.
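(For what it's worth, the timing is just a plain mean over repeated builds; the numbers above come from the repo's benchmark, but the shape of the measurement is roughly this:)

```rust
use std::time::Instant;

// Shape of the measurement: mean wall-clock milliseconds over repeated builds.
// The numbers quoted above come from the repo's benchmark, not this snippet.
fn mean_ms<T>(mut build: impl FnMut() -> T, iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        std::hint::black_box(build()); // keep the optimizer from eliding the work
    }
    start.elapsed().as_secs_f64() * 1e3 / f64::from(iters)
}
```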
In the older C version, building takes 598ms and queries take 115ns. Querying is faster on M1 than on Intel, and building is also slightly faster. The code doesn't use clustering / sharding, since it's apparently not necessary in the CRL case.
The Rust build times should actually be slightly longer. The code is tuned empirically for around 95% success rate, but with the default seed for the benchmark program it happened to succeed every time.
Overall this seems marginally slower than your code. Does your code offer (nearly) optimal space usage (e.g. 64 bits per key-value pair) for the OKVS, or does it have an expansion factor? In one place it says optimal, but in the table there seems to be a 1.2x-1.3x coefficient. On the other hand, I'm not sure my structure would meet your obliviousness requirement: a low expansion factor presumably helps just from information theory, but I'm not sure of the concrete analysis.
I'm guessing you may have seen my Real World Crypto presentation today (slides findable at https://rwc.iacr.org/2022/program.php)? Otherwise, I don't have a writeup right now... basically I needed to rewrite everything once Facebook's Ribbon Filters were published (I had a similar construction as well and needed to rip it out and just cite them), but I was too busy at work to finish that, so right now it's a hot mess.
"Optimal" in my paper means optimal for the specific distribution/setting we consider, which gives an expansion rate of 1.21 or something like that. We weren't aware of the line of work achieving effectively optimal expansion, 1.005 or whatever it is. My guess is that those works do something outside our model.
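Concretely, for 64-bit values (and ignoring any fixed per-structure overhead), the expansion factor translates to bits per key-value pair like this:

```rust
// Bits per key-value pair for 64-bit values at a few expansion factors,
// ignoring any fixed per-structure overhead.
fn main() {
    for expansion in [1.005, 1.21, 1.3] {
        println!("{expansion}x expansion -> {:.1} bits per kvp", 64.0 * expansion);
    }
}
```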
I tried this idea of windowing (having the non-zeros clumped together), but for my peeling algorithm and failure probability it gives worse results.
Yes, it doesn't appear to be a drop-in replacement, since we require negligible failure probability. But maybe that could be addressed; I'll have to dig into the details. The running time sounds great, though, if we could get back that last 20% of bandwidth/storage.
Cheers
Yeah. The biggest issue is probably that I don't have a proof of the failure probability. It doesn't seem impossible to produce one, though, and it would probably help the data-structure use case too, because you could optimize the matrix shape better with some understanding of where the cutoffs are.
You'd also, of course, have to change the parameters for negligible failure probability. That would hurt performance, but maybe not disastrously, depending on how much they have to go up.
FYI, there appears to be concurrent work https://eprint.iacr.org/2022/320