NixOS / nix

Nix, the purely functional package manager
https://nixos.org/
GNU Lesser General Public License v2.1

fetchGit/fetchTree reproducibility problems #5313

Open roberth opened 2 years ago

roberth commented 2 years ago

Describe the bug

Builtin fetchers have to be reproducible. Any change to their output either breaks the build because of a hash mismatch or, if no hash is provided, silently changes the sources. Implementation errors in the builtin fetchers invalidate Nix as a tool for reproducibility.

While a git commit hash appears solid at first, its translation to a store path is not unique and prone to impurities.
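
For concreteness, a minimal sketch of that translation (assuming a recent Nix with the `nix-command` feature enabled; the URL and rev below are placeholders):

```console
# Hypothetical invocation: the same rev should always map to the same outPath,
# but checkout-time impurities (filters, core.eol, ...) can break that.
$ nix eval --impure --expr '(builtins.fetchGit {
    url = "https://example.org/some/repo.git";
    rev = "0000000000000000000000000000000000000000";
  }).outPath'
```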

This shows in #5260 for example, where someone relied on Nix's previous behavior of applying git's smudge filters. This was arguably an implementation error that was recently corrected, but it is a very breaking change nonetheless.

For a more detailed description of the problem with git smudge filters I'll kindly refer to the motivation description of #4635, which was a PR to attempt to fix the problem, at least in most cases.
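
To make the smudge-filter problem concrete, here is a hypothetical repository where checkout output depends on local git configuration (the filter name and file contents are made up):

```console
# .gitattributes in the repo declares a filter; the local config decides what
# that filter does, so two machines can check out different contents for the
# same commit.
$ cat .gitattributes
version.txt filter=keyword
$ git config filter.keyword.smudge 'sed "s/@VERSION@/1.2.3/"'
# Plumbing reads the committed blob unchanged...
$ git cat-file blob HEAD:version.txt
@VERSION@
# ...but a fresh checkout applies the smudge filter:
$ rm version.txt && git checkout -- version.txt
$ cat version.txt
1.2.3
```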

roberth commented 2 years ago

As for the solution, perhaps it'd be best to make builtins.fetchGit behave exactly like Nix 2.3 and implement the reproducibility fixes only in builtins.fetchTree.

greedy commented 2 years ago

It's not just smudge filters either. There are at least the core.eol and core.symlinks config options that affect working copy contents. There's also the possibility of a global gitattributes file that can cause files to be iconv'ed at checkout and have RCS-style ID tokens substituted.

For true reproducibility it might be necessary to forgo the use of any “porcelain” and use the “plumbing” directly (either as commands or using libgit)
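
A rough sketch of what plumbing-only extraction could look like (the paths are illustrative); reading objects directly avoids all working-copy conversions:

```console
# List the commit's tree and stream raw blob contents; this bypasses smudge
# filters, core.eol/core.symlinks handling and gitattributes entirely.
$ git ls-tree -r HEAD
$ git cat-file blob HEAD:README.md
```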

lilyball commented 2 years ago

If ditching the "porcelain" tools, please make sure that whatever the replacement is still has some solution for authentication for fetching from private repositories. Authentication does not affect reproducibility (just reliability) so there's no issue with reading authentication information from the environment or filesystem.

thufschmitt commented 10 months ago

The GitHub tarball API suffers from a similar problem: It runs git archive behind the scenes, which honours the export-subst attribute (which can perform some nondeterministic substitutions, and gives a different tree than git clone). This isn't the case however if the API is used with a tree hash rather than a commit hash. GitLab has the same behaviour (and probably the other Git forges too since that's inherited from git archive).
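
The export-subst behaviour is easy to reproduce locally (the file name and attribute below are illustrative):

```console
# A path marked export-subst gets its $Format:...$ tokens expanded by
# `git archive`, while the committed blob keeps them verbatim:
$ cat .gitattributes
metadata.txt export-subst
$ git cat-file blob HEAD:metadata.txt
Last commit: $Format:%H$
$ git archive HEAD metadata.txt | tar -xOf -
Last commit: <full hash of HEAD, expanded at archive time>
```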

I looked at this with @tomberek, and it seems that the most reliable solution would be to fetch the archive by tree hash rather than by commit hash, since the substitutions don't apply when the archive is requested for a tree. However this is problematic because getting the tree hash from a commit hash requires hitting the GitHub API which is severely rate-limited for anonymous users.

An alternative for GitHub would be to just make it sugar for plain Git fetching, but add some clever blob filtering to get reasonable performance. I've no idea how feasible that would be.

roberth commented 10 months ago
  • However this is problematic because getting the tree hash from a commit hash requires hitting the GitHub API which is severely rate-limited for anonymous users.

This only applies during locking and impure use. Most of the fetching will happen on locked fetchTree calls that already have a copy of the tree hash. Only manually pinned fetchTree calls will be affected, and fallback is possible, as mentioned.

Note though that by making fetching tree-based, we solve the opposite problem, which is that archive-based locking cannot fall back to git operations, because in the status quo we'd have to invoke or emulate git archive, which is not desirable.

Another possible way around the rate limits, which doesn't involve cloning, is perhaps to use libgit2 to fetch only the relevant packfile and get the tree hash from there. I believe that's feasible in 2 requests to the git (non-"api") endpoint. Not sure if such functions are exposed though.

Finally I consider the false equivalence between the commit tarball and normal git fetching to be a serious bug.

thufschmitt commented 10 months ago
  • However this is problematic because getting the tree hash from a commit hash requires hitting the GitHub API which is severely rate-limited for anonymous users.

This only applies during locking and impure use. Most of the fetching will happen on locked fetchTree calls that already have a copy of the tree hash. Only manually pinned fetchTree calls will be affected, and fallback is possible, as mentioned.

If we store the tree hash as part of the lock file, then yes. But it's problematic because it means that the commit hash isn't the ultimate source of truth any more (the tree hash is). So in a Flake context I could have inputs.foo.url = "github:foo/bar?rev=abcde123", but a forged lockfile could map that to a totally unrelated tree hash. But indeed, we could fall back to plain Git if we get rate-limited.

Another possible way around the rate limits, which doesn't involve cloning, is perhaps to use libgit2 to fetch only the relevant packfile and get the tree hash from there. I believe that's feasible in 2 requests to the git (non-"api") endpoint. Not sure if such functions are exposed though.

@tomberek has been trying that (directly curl-ing the Git server, not through libgit2) with not much success. But it's probably theoretically possible indeed.

roberth commented 10 months ago

not much success. But it's probably theoretically possible

Can be done with the git CLI. Tested with https://github.com/NixOS/nixpkgs.git: master takes about 3 seconds to fetch the hashes for the first time, and about 3 MB of storage for all historic commits of master. Syncing up when needed (i.e. when a commit hash is not found locally) takes 1-3 seconds; network use is also about 3 MB.

Crucially we only incur the cost when using a truly new commit. E.g. if you have a flake with 12 versions of Nixpkgs, you only fetch the commits once.

```console
$ git init
$ git remote add origin https://github.com/NixOS/nixpkgs.git
$ time git fetch origin --filter=blob:none --depth=1 master
remote: Enumerating objects: 25463, done.
remote: Counting objects: 100% (25463/25463), done.
remote: Compressing objects: 100% (4729/4729), done.
remote: Total 25463 (delta 14), reused 24666 (delta 13), pack-reused 0
Receiving objects: 100% (25463/25463), 2.31 MiB | 3.46 MiB/s, done.
Resolving deltas: 100% (14/14), done.
From https://github.com/NixOS/nixpkgs
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> origin/master

real    0m1.808s
user    0m0.394s
sys     0m0.127s

$ time git fetch origin --filter=blob:none --depth=1 master
remote: Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
From https://github.com/NixOS/nixpkgs
 * branch            master     -> FETCH_HEAD

real    0m0.976s
user    0m0.112s
sys     0m0.044s

$ git rev-parse refs/remotes/origin/master^{tree}  # is instant
76a68875f4fbbe51864431a639f7413c15b2469b

$ du -shc .git
3.1M    .git
3.1M    total

$ time git fetch origin --filter=blob:none --depth=1 nixos-unstable
remote: Enumerating objects: 1029, done.
remote: Counting objects: 100% (1029/1029), done.
remote: Compressing objects: 100% (257/257), done.
remote: Total 519 (delta 174), reused 333 (delta 0), pack-reused 0
Receiving objects: 100% (519/519), 48.52 KiB | 1.13 MiB/s, done.
Resolving deltas: 100% (174/174), completed with 174 local objects.
From https://github.com/NixOS/nixpkgs
 * branch            nixos-unstable -> FETCH_HEAD
 * [new branch]      nixos-unstable -> origin/nixos-unstable

real    0m1.299s
user    0m0.143s
sys     0m0.054s
```

problematic because it means that the commit hash isn't the ultimate source of truth any more (the tree hash is). So in a Flake context I could have inputs.foo.url = "github:foo/bar?rev=abcde123", but a forged lockfile that maps that to a totally unrelated tree hash.

That's already a problem with rev and narHash. Don't trust lockfile updates from untrusted sources. Nonetheless, we could always check that tree and rev match. It's cheap.
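
A sketch of that check, assuming the commit object is available locally and that $rev and $lockedTreeHash come from the lock file:

```console
# Cheap consistency check: the tree recorded in the locked commit must equal
# the tree hash stored in the lock file.
$ test "$(git rev-parse "$rev^{tree}")" = "$lockedTreeHash" \
    && echo "lock file tree matches rev" \
    || echo "mismatch: possibly forged or corrupted lock file"
```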

thufschmitt commented 10 months ago

Can be done with the git CLI.

Oh, that is great! I didn't expect there would be such an easy (and cheap-ish) solution.

Ericson2314 commented 10 months ago

Right, that's why I very much want to get my git hashing changes in at the same time we do this redesign: being able to

is very useful, especially as we approach a world where there are quite a lot of different ways to fetch git things.

Ultimately, in a world with signed commits being the norm, we should be writing down commit hashes and public keys in the input spec, and then everything is verified via merkle inclusion proofs from there.
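
As a small illustration of why a commit hash plus a trusted key is enough ($rev is a placeholder): the signature covers the commit object, and the commit object already names its tree.

```console
# Verify the signature on the commit, then show that the signed payload pins
# the tree hash (the first line of the commit object):
$ git verify-commit "$rev"
$ git cat-file commit "$rev" | head -n 1
tree <tree hash pinned by the signed commit>
```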

roberth commented 10 months ago
git fetch origin --filter=blob:none --depth=1 master

Probably should be in git_fetch_options in libgit2, but I don't see how to do it.

It doesn't support sparse checkouts, so it wouldn't surprise me if clone filtering (fetch filtering?) isn't implemented yet either. Partial cloning is also still an open feature request. (And the code suggests that the filter options are handled by the partial cloning feature.) Neither remote.h, nor fetch.h, nor clone.h mentions "blob" either.

I assume that it's just not implemented yet. We might want to use the CLI for this procedure until it is implemented in libgit2.

roberth commented 5 months ago

New commands with --filter=tree:0 instead of --filter=blob:none

We can improve on https://github.com/NixOS/nix/issues/5313#issuecomment-1781773057

```console
$ time git fetch origin --filter=blob:none --depth=1 master
[...]
real    0m1.808s
```

Re-running it today gives 0m1.299s, but tree:0 is even faster:

```console
$ rm -rf .git/; git init; git remote add origin https://github.com/NixOS/nixpkgs.git
$ time git fetch origin --filter=tree:0 --depth=1 master
[...]
real    0m0.556s
```

Getting or checking an arbitrary revision is slower, but this is only needed when we don't have a lock file to cache the tree hash.

```console
$ time git fetch origin --filter=tree:0 --depth=1 ca09999d7a6d9e789e00078e660e90bd63acc7a4
[...]
real    0m2.140s
```

For comparison, the GitHub API responds in 0.12 to 0.4 seconds for an arbitrary commit.

```console
$ curl https://api.github.com/repos/NixOS/nixpkgs/commits/ca09999d7a6d9e789e00078e660e90bd63acc7a4
```
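
If one does go through the API, the tree hash can be extracted from that response (assuming jq is available):

```console
$ curl -s https://api.github.com/repos/NixOS/nixpkgs/commits/ca09999d7a6d9e789e00078e660e90bd63acc7a4 \
    | jq -r '.commit.tree.sha'
```
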
thufschmitt commented 5 months ago

@roberth that's sweet :) Does that mean that for fetching github:nixos/nixpkgs/{commitHash} we can

  1. Use git to fetch the commit object
  2. Use git to get the tree hash from the commit
  3. Use the /archive GitHub endpoint to fetch the archive corresponding to it?

roberth commented 5 months ago

@thufschmitt That's the plan. We could elaborate that a bit:

  1. Use the git command to fetch the commit object (including ref resolution if needed)
  2. Use libgit2 to get the tree hash from the commit
  3. Use the /archive GitHub endpoint to fetch the archive corresponding to the tree hash
  4. Verify the tree hash while unpacking the tarball (into a git repo?)
  5. (contingency) fall back to normal git-command fetch.
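
Roughly, as a shell sketch of that plan (owner/repo, $rev and the cache directory are placeholders; the archive-by-tree-hash URL is the behaviour discussed above):

```console
# 1. fetch just the commit, cheaply
$ git init --bare cache.git
$ git -C cache.git remote add origin https://github.com/owner/repo.git
$ git -C cache.git fetch origin --filter=tree:0 --depth=1 "$rev"
# 2. read the tree hash from the commit object (no tree/blob objects needed)
$ tree=$(git -C cache.git log -1 --format=%T "$rev")
# 3. fetch the archive for that tree from GitHub
$ curl -L -o source.tar.gz "https://github.com/owner/repo/archive/$tree.tar.gz"
# 4. unpack and verify that the content hashes back to $tree
# 5. on any failure, fall back to a plain `git fetch` of $rev
```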

With @DavHau we discussed two implementation strategies

The latter appears more elegant, as we can re-frame the new fetching strategy as an alternate "git transport", somewhat similar to how git itself can deal with multiple protocols.

Even submodule support seems within reach that way, although for that we do need the slightly slower blob:none, which fetches a "wintery" tree (trees without their blob "leaves") that does contain the gitlink references to submodules. But well, that's extra anyway, because github: doesn't support submodules yet. (Also probably a similar story for subtree fetching, like the nixpkgs/lib flake.)
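
For what it's worth, the submodule (gitlink) entries that such a blobless tree exposes can be listed directly (a sketch; $rev is a placeholder):

```console
# Submodules show up as mode-160000 ("gitlink") entries in the tree listing,
# even though no blobs have been fetched:
$ git ls-tree -r "$rev" | grep '^160000'
```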