copumpkin opened this issue 9 years ago
It would be interesting to know how large the Bloom filter for (say) cache.nixos.org would be, which has on the order of a million store paths. If it's something like 10 bits per path, the filter would be on the order of 1 MiB, which would be pretty good. But you'd have to re-fetch it fairly frequently.
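A quick back-of-envelope check of that estimate, using the textbook formulas for an optimally sized Bloom filter (a sketch; the million-path count is taken from the comment above, nothing here is measured against the real cache):

```python
import math

def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Optimal size in bits (m) and number of hash functions (k)
    for a Bloom filter holding n elements at false-positive rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # total bits
    k = round(m / n * math.log(2))                      # hash functions
    return m, k

m, k = bloom_parameters(1_000_000, 0.01)
print(f"{m} bits (~{m / 8 / 2**20:.2f} MiB), {k} hash functions")
```

For a million paths at a 1% false-positive rate this comes out to roughly 9.6 bits per path and a bit over 1 MiB total, consistent with the estimate above.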
Long term, I think we should support HTTP/2 because it has much better support for multiplexing requests. Then we could send all those binary cache requests over a single TCP connection. Unfortunately, I'm not holding my breath for S3/CloudFront to get HTTP/2 support...
IIRC it's about 2 + log(1/epsilon) bits per path, where epsilon is the false-positive probability. A quick wiki search claims 10 bits at a 1% error rate. In any case, I remember for certain that it's relatively easy (even in practice) to get within a factor of two of the information-theoretic lower bound on the required size.
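For concreteness, the textbook numbers behind that recollection (a sketch; these are the standard formulas for a classic Bloom filter and the general lower bound for approximate-membership structures, not anything specific to Nix):

```python
import math

def bloom_bits_per_entry(p: float) -> float:
    # classic Bloom filter at its optimal hash count: -ln(p) / (ln 2)^2,
    # i.e. about 1.44 * log2(1/p) bits per stored element
    return -math.log(p) / math.log(2) ** 2

def lower_bound_bits_per_entry(p: float) -> float:
    # any structure with false-positive rate p needs >= log2(1/p) bits/element
    return math.log2(1 / p)

p = 0.01
print(bloom_bits_per_entry(p), lower_bound_bits_per_entry(p))
```

At p = 1% the classic filter needs about 9.6 bits per element against a lower bound of about 6.6, a ratio of 1/ln 2 ≈ 1.44 — within the factor of two mentioned above.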
Actually, what we want exactly is to estimate the intersection of two sets, each held by a different party, while minimizing communication. We might be able to do a bit better than a plain Bloom filter there. By coincidence, I've researched problems very close to this, mostly the theory side, so I might think about it a bit more within the next few weeks.
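One standard tool from that literature, purely as an illustration (the comment above doesn't say which technique is meant), is MinHash: each party sends only a constant-size signature, the fraction of agreeing positions estimates the Jaccard similarity |A∩B| / |A∪B|, and from that plus the known set sizes you can estimate the intersection size.

```python
import hashlib

def minhash_signature(items, num_hashes=128):
    """For each of num_hashes seeded hash functions, record the
    minimum hash value seen over the whole set."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # distinct salt = distinct hash fn
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), salt=salt,
                                digest_size=8).digest(), "big")
            for item in items))
    return sig

def jaccard_estimate(sig_a, sig_b):
    """Fraction of positions where the signatures agree; an unbiased
    estimator of |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

With 128 hash functions each side ships about 1 KiB regardless of how many store paths it holds, at the cost of estimation error on the order of 1/√128.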
As for HTTP/2, I'd expect using UDP queries on *.narinfo instead might also yield good results, but I'm not proficient in such things.
Thinking about this more, I don't see much to gain from Bloom-style filtering of the requests. For cheap paths, e.g. NixOS units, we (can) now have allowSubstitutes = false; for expensive paths we have to build them on a cache miss anyway, so the cost of the failed request will likely be negligible compared to the total work, unless we're in --dry-run mode or using multiple caches that supplement each other.
Another thing that already helps, IIRC: the results of these requests are cached.
Quick suggestion, if a Bloom filter were to become a thing: binary delta updates to the filter, perhaps optionally?

I know that e.g. Fedora uses binary delta RPMs to great success, at least on slower connections; on faster connections it's not always worth the CPU trade-off, but I'd imagine that's less of an issue with something relatively small like this.
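Deltas would suit a Bloom filter particularly well: since paths are only ever added to the cache, bits are only ever set, so the delta between two same-size snapshots is just the newly set bits and should compress very well. A minimal sketch of that idea (hypothetical wire format, assuming the filter size and hash functions stay fixed between snapshots):

```python
def bloom_delta(old: bytes, new: bytes) -> bytes:
    """XOR of two equal-size Bloom filter snapshots. Because bits are
    only ever set, this contains exactly the newly set bits."""
    assert len(old) == len(new)
    return bytes(a ^ b for a, b in zip(old, new))

def apply_delta(old: bytes, delta: bytes) -> bytes:
    """Reconstruct the new snapshot; OR would work equally well here,
    since the delta only ever contains newly set bits."""
    return bytes(a ^ d for a, d in zip(old, delta))
```

A mostly-zero delta like this is exactly the kind of input that gzip or zstd shrinks dramatically, so the periodic re-fetch could be cheap.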
I run a Hydra server that builds a lot (tens of thousands) of derivations that are not in cache.nixos.org. I'd still like to be able to use the upstream binary cache, but the queries take too long at startup of hydra-queue-runner. A Bloom filter might avoid most of these queries.
I marked this as stale due to inactivity.
It seems like Nix's current strategy for checking whether paths are in the cache is to fire off hundreds of HTTP requests, one per hash. What if binary caches could hand out periodically updated Bloom filters of their contents? Clients would fetch them every so often and query the Bloom filter before hitting the cache. False positives could result in the occasional unnecessary HTTP request, but otherwise it seems like it could be helpful.
Or is the blast of HTTP requests to check binary cache presence really not that bad? My understanding is that Nix first queries the cache to see whether a hash is there, then separately requests the full NAR if it is.
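The client side of the proposal can be sketched in a few lines (a toy illustration, not Nix's implementation; query_cache and fetch_narinfo are hypothetical names, and the filter layout here is not any real cache.nixos.org format):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over store-path hashes."""
    def __init__(self, num_bits: int, num_hashes: int):
        self.bits = bytearray(num_bits // 8)
        self.num_hashes = num_hashes
        self.m = num_bits

    def _positions(self, key: str):
        for seed in range(self.num_hashes):
            h = hashlib.blake2b(key.encode(),
                                salt=seed.to_bytes(8, "big"),
                                digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means present or false positive
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

def query_cache(store_hash: str, bloom: BloomFilter, fetch_narinfo) -> bool:
    """fetch_narinfo stands in for the usual HTTP request for
    <hash>.narinfo; it is only issued when the filter says 'maybe'."""
    if not bloom.might_contain(store_hash):
        return False                      # definitely absent: no request
    return fetch_narinfo(store_hash)      # hit, or an occasional false positive
```

The server would periodically publish the filled-in filter; clients then skip the HTTP round-trip for every path the filter rules out, and only false positives cost an unnecessary request, exactly as described above.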