NixOS / nix

Nix, the purely functional package manager
https://nixos.org/
GNU Lesser General Public License v2.1

git tree object as alternative to NAR #1006

Open Ericson2314 opened 8 years ago

Ericson2314 commented 8 years ago

NAR has gotten us a long way, but one limitation is that it cannot support deduplication, because only the outermost directory gets a hash.

The git tree object is perhaps not the greatest format, but because git is so widely used, it is likely to have better support among external tools. I think it makes a fine de facto standard.
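To illustrate why git's scheme deduplicates where NAR cannot, here is a minimal Python sketch of git's blob hashing (an illustration, not Nix code): git hashes each file individually, so identical files in different trees share one object, whereas NAR hashes only the serialization of the whole outermost directory.

```python
import hashlib

def git_blob_hash(data: bytes) -> str:
    """Hash file contents the way git does: a "blob <size>\\0" header
    prepended to the content, then SHA-1 over the result."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()

# Two store paths containing the same file yield the same blob hash,
# so the file is stored (and transferred) only once.
a = git_blob_hash(b"#!/bin/sh\necho hello\n")
b = git_blob_hash(b"#!/bin/sh\necho hello\n")
assert a == b
```

This matches what `git hash-object` computes for a file's contents.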

vcunat commented 8 years ago

Well, if you store NARs as files in a (true) deduplicating storage/FS, e.g. in IPFS, you will get the insides deduplicated. AFAIK git packfiles aren't that efficient for binary files which is our main focus.

Ericson2314 commented 8 years ago

@vcunat Yeah, I am more interested in the hashing scheme than the exact representation for exchange. I also figured git had enough critical mass that if IPFS or anything else wanted to do transport for it, it would want to special-case git's hashing scheme.

On the other hand, git uses SHA-1, which is dubiously secure, and last I checked there was no worked-out plan for migration. This makes me less sure whether this is a good idea.

Ericson2314 commented 8 years ago

https://github.com/ipfs/specs/issues/130 IPFS may soon support git.

spacekitteh commented 8 years ago

What about something like SquashFS?

Ericson2314 commented 8 years ago

@spacekitteh that would be, uh, squashed? That means no space/bandwidth saving on identical files.

rht commented 7 years ago

If NAR files are preferably stored with block-level deduplication, I wonder whether /nix/store requires only file-level deduplication?

NAR: I haven't benchmarked to see the order-of-magnitude range, but btrfs/zfs likely has the fastest block-level deduplication, while IPFS (whether accessed via the POSIX API by being mounted with FUSE, or through a POSIX-like interface via `ipfs files`) is still much slower. But since NAR files are used mainly for archival purposes (only occasionally accessed), IPFS-for-dedup could, as of now, almost be used right away.

/nix/store (unpacked NARs?): I think a git packfile could potentially be used here, but only for switching nix-envs. IPFS-for-dedup would be too slow.

(... #859 is too crowded >_<)

Ericson2314 commented 7 years ago

@rht, this is good stuff, but I'd encourage you to be careful about the data model vs. concrete representations (on disk, wire protocols, or otherwise).

I'm interested in git trees+blobs because the data type is exactly what we use (e.g. yes executable bit, no setuid) and it's widely used. Remember, there is no way to convert hashes without access to the hashed data, so it's nice to request hashes that many computers in principle already have.

Yes, go-ipfs is probably slow as hell, and git is supposedly bad with large binaries (though I wouldn't be surprised if that is really just the lack of block-level deduplication combined with keeping history; we wouldn't have that problem, as there is no mandatory history to keep). But we need not use either implementation long term.
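For reference, the tree side of the data model mentioned above can be sketched in Python as well (a simplified illustration; real git additionally sorts directory entries as if their names had a trailing `/`). The point is that git's entry modes encode exactly the metadata Nix cares about: file vs. executable vs. directory vs. symlink, and nothing else.

```python
import hashlib

def git_tree_hash(entries) -> str:
    """entries: list of (mode, name, sha1_hex). Git permits only a few
    modes -- "100644" (file), "100755" (executable), "40000" (dir),
    "120000" (symlink) -- matching Nix's model: an executable bit, but
    no setuid, ownership, or timestamps."""
    body = b"".join(
        mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(sha)
        for mode, name, sha in sorted(entries, key=lambda e: e[1])
    )
    return hashlib.sha1(b"tree %d\0" % len(body) + body).hexdigest()
```

Because entries are sorted, the same directory contents always produce the same tree hash regardless of insertion order.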

arianvp commented 6 years ago

Another alternative could be the catar file format from casync which is a content addressable storage system by Lennart Poettering: https://github.com/systemd/casync

It seems very similar in goals to NAR (that is, a reproducible version of tar).

casync itself can then be used as a deduplication mechanism; that's what the project is for (storing chunked versions of catar files).

ebkalderon commented 6 years ago

@arianvp catar looks pretty great on paper, but it seems to have some Linux-specific bits in it (https://github.com/systemd/casync/issues/147), making it not quite viable on Darwin. desync is a project which attempts to reimplement as much of upstream casync and catar as possible to be compatible with Darwin, but the catar archives it produces could have slight incompatibilities on Linux and vice versa, according to the README.

Ericson2314 commented 6 years ago

I also don't like the way it doesn't chunk along logical boundaries. The proper solution to more reuse is finer logical boundaries. That heuristic would yield results which are harder to predict and therefore to rely on.

nixos-discourse commented 3 years ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/optimise-store-while-building-downloading-from-the-cache/11022/4

wmertens commented 3 years ago

@Ericson2314 I missed this when you mentioned it, and I had the same thought and worked it out in https://gist.github.com/wmertens/eceebe0fc05461ebdc8fb106d90a6871

The formatting isn't great yet, and I decided halfway through that it can work not only with $cas but also with $out, so that needs some changes.

Summary: by stripping store references from files, git tree objects can achieve maximum deduplication, more than would be achievable with runtime binding wrapper tricks. It would also be great for downloading updates.
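The reference-stripping idea could be sketched roughly as follows (a hypothetical helper, not the gist's actual code). The key detail is that the placeholder has the same width as the hash it replaces, so file offsets stay stable and a small rewrite list suffices to restore the original on substitution.

```python
import re

# Store paths use Nix's base-32 alphabet (no e, o, t, u) with a
# 32-character hash part.
STORE_PATH = re.compile(rb"/nix/store/([0-9abcdfghijklmnpqrsvwxyz]{32})-")

def strip_refs(data: bytes):
    """Zero out each embedded store hash (same width, so offsets are
    preserved) and return the rewrite list needed to restore them."""
    rewrites = []
    def repl(m):
        rewrites.append((m.start(1), m.group(1)))
        return b"/nix/store/" + b"0" * 32 + b"-"
    return STORE_PATH.sub(repl, data), rewrites
```

Hashing the stripped content means two builds that differ only in self-references produce identical blobs.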

masaeedu commented 3 years ago

@Ericson2314 It's a bit of a tradeoff I think. Logical boundaries will work only as well as your logic does, and your logic must work given an incomplete picture of the world.

You can of course do something very simple where there is no deduplication across slight modifications/patches of a file, but it's worth keeping in mind that something is being lost there.

wmertens commented 3 years ago

I had the same idea and wrote it up at https://gist.github.com/wmertens/eceebe0fc05461ebdc8fb106d90a6871

One difference is that I propose patching out store references so that there's more deduplication.

I once started a script to try it out but got a little stuck on streaming reference recognition.

EDIT: Argh, re-reading the thread I see I already commented on this 🙈. Shouldn't comment on phones in the morning.

stale[bot] commented 2 years ago

I marked this as stale due to inactivity. → More info

Ericson2314 commented 2 years ago

Still interested.

wmertens commented 2 years ago

@Ericson2314 thoughts on using bup instead of git? https://stackoverflow.com/a/19494211

Ericson2314 commented 2 years ago

@wmertens Well, two things:

  1. I want to un-hard-code NAR, so we can be less idiosyncratic and better integrate with other tools and communities.
  2. I want to support Git as part of that un-hardcoding because it is in wide use (for source code) today, regardless of the technical merits.

From a brief look, bup has some interesting qualities, but I don't want to "choose a winner" -- a single best NAR successor.

wmertens commented 2 years ago

@Ericson2314 is the backing store based on git that I describe at https://gist.github.com/wmertens/eceebe0fc05461ebdc8fb106d90a6871 what you have in mind or something different?

Reading that SO answer, I do worry that git might be a poor backing store on embedded systems, at least when adding things to the store. There are also some derivations that put 300 MB ISO files in the store; I wonder how much memory git needs for that.

Furthermore, I wonder whether I'm prematurely optimizing with the store-path patching in my proposal, since bup will surely find tons of unchanged chunks in binaries that differ only in embedded store paths, thanks to its rolling hash. Git, OTOH, needs to find a previous file to start from.
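The rolling-hash property referred to here can be illustrated with a toy content-defined chunker (a simple windowed sum, not bup's actual hashsplit algorithm): chunk boundaries depend only on a small window of content, not on byte offsets, so an insertion or a rewritten store path disturbs only nearby chunks.

```python
WIN = 64       # rolling window size in bytes
MASK = 0x3FF   # lower mask -> smaller average chunks (illustrative)

def rolling_chunks(data: bytes) -> list[bytes]:
    """Cut a chunk whenever the masked sum of the last WIN bytes hits
    an all-ones target, enforcing a minimum chunk size of WIN."""
    out, start, s = [], 0, 0
    for i, b in enumerate(data):
        s += b
        if i >= WIN:
            s -= data[i - WIN]  # slide the window forward
        if i + 1 - start >= WIN and (s & MASK) == MASK:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out
```

Chunks always reassemble to the original input; identical regions of two files tend to fall into identical chunks, which is what makes bup's packfile storage deduplicate binaries so well.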

Given that bup uses git packfiles, I wonder if they can be fetched using git, limiting to only what's needed for a certain closure.

As a quick test I'll throw my store through bup to see what ratios I get (but I'm not at my computer right now).

wmertens commented 2 years ago

I asked a question on their mailing list https://groups.google.com/g/bup-list/c/WSROvfjwz3M

Ericson2314 commented 2 years ago

@wmertens What I have is just https://github.com/NixOS/nix/pull/3635. There are no packfiles or other implementation changes. It just does the normative part of allowing a git tree/blob hash as a content address. It also works only on the level of individual "store objects" (the things store paths point to), rather than being an entire-store design.

This I think is the right "beachhead", after which further work improving the implementation can be done transparently. It will at least allow us to start improving the way we ingest git sources right away.