Ericson2314 opened this issue 8 years ago
Well, if you store NARs as files in a (true) deduplicating storage/FS, e.g. in IPFS, you will get the insides deduplicated. AFAIK git packfiles aren't that efficient for binary files, which are our main focus.
@vcunat Yeah, I am more interested in the hashing scheme than the exact representation for exchange. I kinda also figured git had enough critical mass that, if IPFS or anything else wanted to do transport for it, it would want to special-case git's hashing scheme.
On the other hand, git uses SHA-1, which is of dubious security, and last I checked it had no worked-out migration plan. That makes me less sure whether this is a good idea.
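For concreteness, a git object ID is just a SHA-1 over a short type header plus the content, which is also why moving off SHA-1 means rehashing every object. A minimal sketch of the blob case (the function name is mine; the expected hash is what `git hash-object` prints):

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Compute the object ID git assigns to a blob: SHA-1 over the
    ASCII header "blob <size>", a NUL byte, then the raw content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Matches `echo hello | git hash-object --stdin` (echo appends a newline):
assert git_blob_hash(b"hello\n") == "ce013625030ba8dba906f756967f9e9ca394464a"
```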
IPFS may soon support git: https://github.com/ipfs/specs/issues/130
What about something like SquashFS?
@spacekitteh that would be, uh, squashed? That means no space/bandwidth saving on identical files.
Git is migrating `sha1`'s into `object_id`; here is the latest in the series: https://public-inbox.org/git/20170101191847.564741-1-sandals@crustytoothpaste.net/. Checkout could be done with `git checkout derivation-hash` (`git checkout-index` could be used for a custom path).

If NAR files would preferably be stored with block-level deduplication, I wonder if /nix/store requires only up to file-level deduplication?
NAR: I haven't benchmarked to see the order-of-magnitude range, but block-level deduplication is likely fastest on btrfs/zfs, while IPFS (whether mounted with FUSE and accessed through the POSIX API, or accessed through the POSIX-like `ipfs files` interface) is still much slower. But since NAR files are used mainly for archival purposes (only occasionally accessed), IPFS-for-dedup could almost be used right away as of now.
/nix/store (unpacked NARs?): I think git packfiles could potentially be used here, but only for switching nix-envs; IPFS-for-dedup would be too slow.
(... #859 is too crowded >_<)
@rht, this is good stuff, but I'd encourage you to be careful about the data model vs. concrete representations (on disk, wire protocols, or otherwise).
I'm interested in git trees+blobs because the data type is exactly what we use (e.g. yes executable bit, no setuid) and it's widely used. Remember, there is no way to convert hashes without access to the hashed data, so it's nice to request hashes that many computers in principle already have.
Yes, go-ipfs is probably slow as hell, and git is supposedly bad with large binaries (though I wouldn't be surprised if that is really just the lack of block-level deduplication combined with keeping history; we wouldn't have that problem, as there is no mandatory history to keep). But we need not use either implementation long term.
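To make the data-type point concrete: a git tree entry records only a mode (100644 vs. 100755 carries the executable bit), a name, and a child hash; there is nowhere to put setuid bits, owners, or timestamps. A rough sketch of the serialization, using the same hashing scheme as blobs and glossing over git's rule that directory names sort as if suffixed with "/":

```python
import hashlib

def git_tree_hash(entries) -> str:
    """entries: (mode, name, child_sha1_hex) tuples, where mode is
    "100644" for a file, "100755" for an executable, "120000" for a
    symlink, and "40000" for a subtree. Each entry is serialized as
    "<mode> <name>" + NUL + 20 raw hash bytes, sorted by name."""
    body = b"".join(
        mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(sha)
        for mode, name, sha in sorted(entries, key=lambda e: e[1])
    )
    return hashlib.sha1(b"tree %d\x00" % len(body) + body).hexdigest()
```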
Another alternative could be the `catar` file format from casync, a content-addressable storage system by Lennart Poettering: https://github.com/systemd/casync
It seems very similar in goals to `nar` (it being a reproducible version of `tar`).
casync itself can then be used as a deduplication mechanism; that's what the project is for (storing chunked versions of `catar` files).
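For readers who haven't seen content-defined chunking: casync gets its deduplication by cutting the stream wherever a rolling hash over the last few dozen bytes hits a fixed pattern, so boundaries resynchronize shortly after an insertion. A toy sketch of the idea (casync's real chunker is buzhash-based with different window and size parameters):

```python
import hashlib

WINDOW = 48            # bytes in the rolling window (toy value)
MASK = (1 << 12) - 1   # aim for ~4 KiB average chunks

def chunks(data: bytes):
    """Cut wherever the low bits of a rolling sum over the last WINDOW
    bytes are all ones. Identical regions of two similar inputs then
    yield mostly identical chunks, however the surroundings shift."""
    start, rolling = 0, 0
    for i, b in enumerate(data):
        rolling += b
        if i >= WINDOW:
            rolling -= data[i - WINDOW]
        if (rolling & MASK) == MASK and i + 1 - start >= WINDOW:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

def chunk_ids(data: bytes) -> set:
    """Chunk hashes; a store keyed on these keeps one copy per chunk,
    so similar files share most of their entries."""
    return {hashlib.sha256(c).hexdigest() for c in chunks(data)}
```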
@arianvp `catar` looks pretty great on paper, but it seems to have some Linux-specific bits in it (https://github.com/systemd/casync/issues/147), making it not quite viable on Darwin. desync is a project which attempts to reimplement as much of upstream casync and `catar` as possible while being compatible with Darwin, but the `catar` archives it produces could have slight incompatibilities on Linux, and vice versa, according to its README.
I also don't like the way it doesn't chunk along logical boundaries. The proper solution to more reuse is finer logical boundaries. That heuristic would yield results which are harder to predict and therefore to rely on.
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/optimise-store-while-building-downloading-from-the-cache/11022/4
@Ericson2314 I missed this when you mentioned it, and I had the same thought and worked it out in https://gist.github.com/wmertens/eceebe0fc05461ebdc8fb106d90a6871
The formatting isn't great yet, and I decided halfway through that it can work not only with $cas but also with $out, so that needs some changes.
Summary: by stripping store references from files, git tree objects can achieve maximum deduplication, more than would be achievable by runtime-binding wrapper tricks. It would also be amazing for downloading updates.
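A minimal sketch of the stripping step as I understand it (the same-length zero placeholder is a made-up convention here; the gist may do it differently): rewrite every store reference to a fixed placeholder and keep the digests on the side, so files differing only in embedded store paths normalize to identical bytes.

```python
import re

# Nix store-path digests are 32 chars of Nix's base-32 alphabet
# (0-9 and a-z minus e, o, u, t).
STORE_RE = re.compile(rb"/nix/store/([0-9abcdfghijklmnpqrsvwxyz]{32})-")

def strip_references(data: bytes):
    """Return (normalized, refs): each digest is replaced by a
    same-length run of zeros so offsets are preserved, and refs lists
    the digests in order of occurrence so extraction can substitute
    them back in the same order."""
    refs = []
    def repl(m):
        refs.append(m.group(1).decode())
        return b"/nix/store/" + b"0" * 32 + b"-"
    return STORE_RE.sub(repl, data), refs
```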
@Ericson2314 It's a bit of a tradeoff I think. Logical boundaries will work only as well as your logic does, and your logic must work given an incomplete picture of the world.
You can of course do something very simple where there is no deduplication across slight modifications/patches of a file, but it's worth keeping in mind that something is being lost there.
I had the same idea and wrote it up at https://gist.github.com/wmertens/eceebe0fc05461ebdc8fb106d90a6871
One difference is that I propose patching out store references so that there's more deduplication.
I once started a script to try it out but got a little stuck on streaming reference recognition.
EDIT: Argh, re-reading the thread I see I already commented on this 🙈. Shouldn't comment on phones in the morning.
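For the streaming-recognition snag: one approach (a sketch under my own assumptions, not the gist's code) is to carry a 43-byte tail between reads, one byte less than a full "/nix/store/" + digest + "-" match (11 + 32 + 1 = 44 bytes), so no complete match can start inside the carried tail and nothing is dropped or double-counted at read boundaries:

```python
import re

STORE_RE = re.compile(rb"/nix/store/([0-9abcdfghijklmnpqrsvwxyz]{32})-")
OVERLAP = 43  # one byte less than a full match

def scan_references(reader):
    """Yield store-path digests from a binary file-like object without
    reading it whole. A complete match needs 44 bytes, so it always
    starts at least 44 bytes before the end of `buf`; carrying the last
    43 bytes into the next read neither misses nor repeats a match."""
    tail = b""
    while True:
        block = reader.read(1 << 16)
        if not block:
            return
        buf = tail + block
        for m in STORE_RE.finditer(buf):
            yield m.group(1).decode()
        tail = buf[-OVERLAP:]

# Usage: with open(path, "rb") as f: refs = set(scan_references(f))
```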
I marked this as stale due to inactivity.
Still interested.
@Ericson2314 thoughts on using bup instead of git? https://stackoverflow.com/a/19494211
@wmertens Well, two things:
From a brief look, bup has some interesting qualities, but I don't want to "choose a winner" -- a single best NAR successor.
@Ericson2314 is the backing store based on git that I describe at https://gist.github.com/wmertens/eceebe0fc05461ebdc8fb106d90a6871 what you have in mind or something different?
Reading that SO answer I do worry that git might be a poor backing store on embedded systems, at least when adding things to the store. There are also some derivations that put 300MB ISO files in the store, I wonder how much memory git needs for that.
Furthermore, I wonder whether I'm prematurely optimizing the store-path patching in my proposal, since bup will surely find tons of unchanged chunks in binaries that differ only in embedded store paths, thanks to the rolling hash. Git, OTOH, needs to find a previous file to start from.
Given that bup uses git packfiles, I wonder if they can be fetched using git, limiting to only what's needed for a certain closure.
As a quick test I'll throw my store through bup to see what ratios I am getting (but not at my computer now)
I asked a question on their mailing list https://groups.google.com/g/bup-list/c/WSROvfjwz3M
@wmertens What I have is just https://github.com/NixOS/nix/pull/3635. There are no packfiles or other implementation changes; it just does the normative part of allowing a git tree/blob hash as a content address. It also works at the level of individual "store objects" (the things store paths point to), rather than being an entire-store design.
This I think is the right "beachhead", after which further work improving the implementation can be done transparently. It will at least allow us to start improving the way we ingest git sources right away.
NAR has gotten us a long way, but one limitation is that it cannot support deduplication, because only the outermost directory gets a hash.
The git tree object is perhaps not the greatest format, but because git is so widely used, it is likely to have better support among external tools. I think it makes a fine de facto standard.