copumpkin opened this issue 9 years ago
It would be interesting to know how large the Bloom filter for (say) cache.nixos.org would be, which has on the order of a million store paths. If it's something like 10 bits per path, the filter would be on the order of 1 MiB, which would be pretty good. But you'd have to re-fetch it fairly frequently.
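A quick back-of-envelope check of that estimate, using the textbook formulas for an optimally sized Bloom filter (a sketch; the million-path count is taken from the comment above, nothing here is measured against the real cache):

```python
import math

def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Optimal size in bits (m) and number of hash functions (k)
    for a Bloom filter holding n elements at false-positive rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # total bits
    k = round(m / n * math.log(2))                      # hash functions
    return m, k

m, k = bloom_parameters(1_000_000, 0.01)
print(f"{m} bits (~{m / 8 / 2**20:.2f} MiB), {k} hash functions")
```

For a million paths at a 1% false-positive rate this comes out to roughly 9.6 bits per path and a bit over 1 MiB total, consistent with the estimate above.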
Long term, I think we should support HTTP/2 because it has much better support for multiplexing requests. Then we could send all those binary cache requests over a single TCP connection. Unfortunately, I'm not holding my breath for S3/CloudFront to get HTTP/2 support...
IIRC it's about 2 + log(1/epsilon) bits per path, where epsilon is the false-positive probability. A quick wiki search claims 10 bits at a 1% error rate. In any case, I remember for certain that it's relatively easy (even in practice) to get within a factor of two of the information-theoretic lower bound on the required size.
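For concreteness, the textbook numbers behind that recollection (a sketch; these are the standard formulas for a classic Bloom filter and the general lower bound for approximate-membership structures, not anything specific to Nix):

```python
import math

def bloom_bits_per_entry(p: float) -> float:
    # classic Bloom filter at its optimal hash count: -ln(p) / (ln 2)^2,
    # i.e. about 1.44 * log2(1/p) bits per stored element
    return -math.log(p) / math.log(2) ** 2

def lower_bound_bits_per_entry(p: float) -> float:
    # any structure with false-positive rate p needs >= log2(1/p) bits/element
    return math.log2(1 / p)

p = 0.01
print(bloom_bits_per_entry(p), lower_bound_bits_per_entry(p))
```

At p = 1% the classic filter needs about 9.6 bits per element against a lower bound of about 6.6, a ratio of 1/ln 2 ≈ 1.44 — within the factor of two mentioned above.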
Actually, what we want exactly is to estimate the intersection of two sets, each held by a different party, while minimizing communication. We might be able to do a bit better than a plain Bloom filter there. By coincidence, I've researched problems very close to this, mostly the theory side, so I might think about it a bit more within the next few weeks.
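One standard tool from that literature, purely as an illustration (the comment above doesn't say which technique is meant), is MinHash: each party sends only a constant-size signature, the fraction of agreeing positions estimates the Jaccard similarity |A∩B| / |A∪B|, and from that plus the known set sizes you can estimate the intersection size.

```python
import hashlib

def minhash_signature(items, num_hashes=128):
    """For each of num_hashes seeded hash functions, record the
    minimum hash value seen over the whole set."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # distinct salt = distinct hash fn
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), salt=salt,
                                digest_size=8).digest(), "big")
            for item in items))
    return sig

def jaccard_estimate(sig_a, sig_b):
    """Fraction of positions where the signatures agree; an unbiased
    estimator of |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

With 128 hash functions each side ships about 1 KiB regardless of how many store paths it holds, at the cost of estimation error on the order of 1/√128.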
As for HTTP/2, I'd expect using UDP queries on *.narinfo instead might also yield good results, but I'm not proficient in such things.
Thinking about this more, I don't see much to gain from Bloom-style filtering of the requests. For cheap paths, e.g. NixOS units, we (can) now have allowSubstitutes = false; for expensive paths we have to build them on a cache miss anyway, so the cost of the failed request will likely be negligible compared to the total work, unless we're in --dry-run mode or using multiple caches that supplement each other.
Another thing that already helps, IIRC: the results of these requests are cached.
Quick suggestion, if a Bloom filter were to become a thing: binary delta updates to the filter, perhaps optionally?

I know that e.g. Fedora uses binary delta RPMs to great success, at least on slower connections; on faster connections it's not always worth the CPU trade-off, but I'd imagine that's less of an issue with something relatively small like this.
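Deltas would suit a Bloom filter particularly well: since paths are only ever added to the cache, bits are only ever set, so the delta between two same-size snapshots is just the newly set bits and should compress very well. A minimal sketch of that idea (hypothetical wire format, assuming the filter size and hash functions stay fixed between snapshots):

```python
def bloom_delta(old: bytes, new: bytes) -> bytes:
    """XOR of two equal-size Bloom filter snapshots. Because bits are
    only ever set, this contains exactly the newly set bits."""
    assert len(old) == len(new)
    return bytes(a ^ b for a, b in zip(old, new))

def apply_delta(old: bytes, delta: bytes) -> bytes:
    """Reconstruct the new snapshot; OR would work equally well here,
    since the delta only ever contains newly set bits."""
    return bytes(a ^ d for a, d in zip(old, delta))
```

A mostly-zero delta like this is exactly the kind of input that gzip or zstd shrinks dramatically, so the periodic re-fetch could be cheap.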
I run a Hydra server that builds a lot (tens of thousands) of derivations that are not in cache.nixos.org. I'd still like to be able to use the upstream binary cache, but the queries take too long at startup of hydra-queue-runner. A Bloom filter might avoid most of these queries.
I marked this as stale due to inactivity.
It seems like Nix's current strategy for checking whether paths are in the cache is to fire off hundreds of HTTP requests, one per hash. What if binary caches could hand out periodically updated Bloom filters of their contents? Clients would fetch them every so often and query the Bloom filter before hitting the cache. False positives could result in the occasional unnecessary HTTP request, but otherwise it seems like it could be helpful.
Or is the blast of HTTP requests to check binary cache presence really not that bad? My understanding is that Nix first queries the cache to see whether a hash is there, then separately requests the full NAR if it is.
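The client side of the proposal can be sketched in a few lines (a toy illustration, not Nix's implementation; query_cache and fetch_narinfo are hypothetical names, and the filter layout here is not any real cache.nixos.org format):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over store-path hashes."""
    def __init__(self, num_bits: int, num_hashes: int):
        self.bits = bytearray(num_bits // 8)
        self.num_hashes = num_hashes
        self.m = num_bits

    def _positions(self, key: str):
        for seed in range(self.num_hashes):
            h = hashlib.blake2b(key.encode(),
                                salt=seed.to_bytes(8, "big"),
                                digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means present or false positive
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

def query_cache(store_hash: str, bloom: BloomFilter, fetch_narinfo) -> bool:
    """fetch_narinfo stands in for the usual HTTP request for
    <hash>.narinfo; it is only issued when the filter says 'maybe'."""
    if not bloom.might_contain(store_hash):
        return False                      # definitely absent: no request
    return fetch_narinfo(store_hash)      # hit, or an occasional false positive
```

The server would periodically publish the filled-in filter; clients then skip the HTTP round-trip for every path the filter rules out, and only false positives cost an unnecessary request, exactly as described above.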