NixOS / nixpkgs

Nix Packages collection & NixOS

pkgs.fetchgit not deterministic when fetching submodules #100498

Open jakubgs opened 3 years ago

jakubgs commented 3 years ago

Description

When I use pkgs.fetchgit with fetchSubmodules = true and deepClone = true on a repo whose submodule references a submodule that the root repo also references, I get a hash mismatch error, and the resulting hash differs between nix-build runs:

hash mismatch in fixed-output derivation '/nix/store/lyis1l5zbsi9kfhlsg16pqxs12y518jn-nix-fetchgit-debug-f0e389d':
  wanted: sha256:0qmrg2a79ynjdwcw8jgkd0qdy880rwn366p78qr9yj7hrnznpzci
  got:    sha256:0brc139p24m1b4hrxkkcra5fjlv8dl1662a9r355jlhpp41km033
error: build of '/nix/store/zz6cbs4440yiqdb007ddq7yldkzwgd38-nix-fetchgit-debug-f0e389d.drv' failed

Reproduction

You can reproduce this behavior by building the derivation in this debug repo: https://github.com/status-im/nix-fetchgit-debug

Additional context

Comparing the outputs of two builds of the same derivation shows that the Git packfiles differ, which seems to be the cause:

 > diff -r \
     /nix/store/lyis1l5zbsi9kfhlsg16pqxs12y518jn-nix-fetchgit-debug-f0e389d \
     /nix/store/9vylzwxxi52bvyr8dcrsaaxj5ppwrid2-nix-fetchgit-debug-f0e389d

diff -r /nix/store/lyis1l5zbsi9kfhlsg16pqxs12y518jn-nix-fetchgit-debug-f0e389d/nim-faststreams/.git/objects/info/packs /nix/store/9vylzwxxi52bvyr8dcrsaaxj5ppwrid2-nix-fetchgit-debug-f0e389d/nim-faststreams/.git/objects/info/packs
1c1
< P pack-2a516976bb18daa498c8296b7de9c3bca3a61dba.pack
---
> P pack-348a1af74c1229307cad55ddc9af3b06f4496d20.pack
Only in /nix/store/lyis1l5zbsi9kfhlsg16pqxs12y518jn-nix-fetchgit-debug-f0e389d/nim-faststreams/.git/objects/pack: pack-2a516976bb18daa498c8296b7de9c3bca3a61dba.idx
Only in /nix/store/lyis1l5zbsi9kfhlsg16pqxs12y518jn-nix-fetchgit-debug-f0e389d/nim-faststreams/.git/objects/pack: pack-2a516976bb18daa498c8296b7de9c3bca3a61dba.pack
Only in /nix/store/9vylzwxxi52bvyr8dcrsaaxj5ppwrid2-nix-fetchgit-debug-f0e389d/nim-faststreams/.git/objects/pack: pack-348a1af74c1229307cad55ddc9af3b06f4496d20.idx
Only in /nix/store/9vylzwxxi52bvyr8dcrsaaxj5ppwrid2-nix-fetchgit-debug-f0e389d/nim-faststreams/.git/objects/pack: pack-348a1af74c1229307cad55ddc9af3b06f4496d20.pack
diff -r /nix/store/lyis1l5zbsi9kfhlsg16pqxs12y518jn-nix-fetchgit-debug-f0e389d/nim-waku/vendor/nim-faststreams/.git/objects/info/packs /nix/store/9vylzwxxi52bvyr8dcrsaaxj5ppwrid2-nix-fetchgit-debug-f0e389d/nim-waku/vendor/nim-faststreams/.git/objects/info/packs
1c1
< P pack-7a7a30973ad008980ac4954722689eb55b1941df.pack
---
> P pack-b56d89b8eafd21f484dc6af50a7771bbf28ed92c.pack
Only in /nix/store/lyis1l5zbsi9kfhlsg16pqxs12y518jn-nix-fetchgit-debug-f0e389d/nim-waku/vendor/nim-faststreams/.git/objects/pack: pack-7a7a30973ad008980ac4954722689eb55b1941df.idx
Only in /nix/store/lyis1l5zbsi9kfhlsg16pqxs12y518jn-nix-fetchgit-debug-f0e389d/nim-waku/vendor/nim-faststreams/.git/objects/pack: pack-7a7a30973ad008980ac4954722689eb55b1941df.pack
Only in /nix/store/9vylzwxxi52bvyr8dcrsaaxj5ppwrid2-nix-fetchgit-debug-f0e389d/nim-waku/vendor/nim-faststreams/.git/objects/pack: pack-b56d89b8eafd21f484dc6af50a7771bbf28ed92c.idx
Only in /nix/store/9vylzwxxi52bvyr8dcrsaaxj5ppwrid2-nix-fetchgit-debug-f0e389d/nim-waku/vendor/nim-faststreams/.git/objects/pack: pack-b56d89b8eafd21f484dc6af50a7771bbf28ed92c.pack

The cause seems to be that both the root repo and the nim-waku submodule reference the same nim-faststreams repo. Both also use the same nim-faststreams commit: 5df69fc6.
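To confirm that the two outputs differ only inside .git directories, as the diff above suggests, the comparison can be repeated with those directories excluded. This is a minimal sketch: `same_outside_git` is a hypothetical helper name, and the store paths in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: exits 0 when two directory trees are identical once
# every file or directory named ".git" is ignored, non-zero otherwise.
same_outside_git() {
    local -r a="${1}" b="${2}"
    # -x PATTERN skips entries whose base name matches PATTERN.
    diff -r -x .git "${a}" "${b}" > /dev/null
}

# Usage (placeholder paths for two outputs of the same derivation):
#   same_outside_git /nix/store/<hash-1>-src /nix/store/<hash-2>-src \
#       && echo "trees match outside .git"
```

If this succeeds while a plain `diff -r` fails, the nondeterminism is confined to Git metadata rather than the checked-out sources.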

Metadata

 - system: `"x86_64-linux"`
 - host os: `Linux 5.7.19, NixOS, 20.03.git.95d979819fa (Markhor)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.3.6`
 - channels(sochan): `"nixos-20.03.3107.067d8e6c9f4, nixos-unstable-21.03pre246062.420f89ceb26"`
 - channels(root): `"nixos-20.03.3112.08d429920bc"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nix

Pings

@Mic92 @bhipple @jtojnar @MarcWeber @bjornfor

FRidh commented 3 years ago

I think that's because .git needs to be included in case of a deepClone, and so it cannot be guaranteed to be reproducible. If this is the case, such an option should not exist in a Nixpkgs builder and should, in my opinion, be removed.

jakubgs commented 3 years ago

Another option would be to remove the .git/objects/info/packs and .git/objects/pack files here: https://github.com/NixOS/nixpkgs/blob/d74573d8ae6f02ef0ac1299c862d1b762ba0aad1/pkgs/build-support/fetchgit/nix-prefetch-git#L244-L247
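To see which paths that suggestion would touch in a fetched tree, here is a small sketch that only lists them. `list_pack_paths` is a hypothetical helper, not part of nix-prefetch-git, and it deliberately deletes nothing, since objects stored only in those packs would be lost.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: prints every .git/objects/pack directory and
# .git/objects/info/packs file below the given root, one per line.
list_pack_paths() {
    local -r root="${1}"
    # -prune stops find from descending into a matched pack directory.
    find "${root}" \
        \( -path '*/.git/objects/pack' -o -path '*/.git/objects/info/packs' \) \
        -prune -print | sort
}

# Usage (placeholder path):
#   list_pack_paths /nix/store/<hash>-src
```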

stale[bot] commented 3 years ago

I marked this as stale due to inactivity.

jakubgs commented 3 years ago

This is still an issue and not stale.

Nihlus commented 2 years ago

I've been working on some reproducible builds in Debian which use submodules, and have based that code off of the prefetch routine referenced here. The following script successfully creates a usable and deterministic clone for a repo with submodules of any depth and reference location in my tests - feel free to test and adapt as needed.

#!/usr/bin/env bash

set -euf -o pipefail

#
# Enumerates the various nondeterministic data objects in a git repository that 
# should be deleted.
#
declare -ra NONDETERMINISTIC_DATA=(
    "logs"
    "hooks"
    "index"
    "FETCH_HEAD"
    "ORIG_HEAD"
    "config"
    "refs/remotes/origin/HEAD"
)

#
# Cleans the submodules in the given directory, removing nondeterministic data.
# @param MODULES_DIR The submodule directory to clean
#
function clean_submodules() {
    local -r MODULES_DIR="${1}"

    for module_dir in $(find "${MODULES_DIR}" -mindepth 1 -maxdepth 1 -type d)
    do
        for d in "${NONDETERMINISTIC_DATA[@]}"; do
            rm -rf "${module_dir}/${d}"
        done

        # Clean nested submodules
        if [[ -d "${module_dir}/modules" ]]; then
            clean_submodules "${module_dir}/modules"
        fi
    done
}

#
# Cleans the given git repository, removing nondeterministic data and repacking
# its objects.
# @param REPO The local git repository to clean.
# 
function clean_repo() {
    local -r REPO="${1}"

    cd "${REPO}"

    # Clean submodules, if any
    if [[ -d .git/modules ]]; then
        clean_submodules .git/modules
    fi

    # Remove files that contain timestamps or other nondeterministic properties
    for d in "${NONDETERMINISTIC_DATA[@]}"; do
        rm -rf ".git/${d}"
    done

    # Remove remote branches
    git branch -r | while read branch; do
        git branch -rD "${branch}" >&2
    done

    # Remove unreachable tags
    local -r MAYBE_TAG="$(git tag --points-at HEAD)"
    git tag --contains HEAD | while read tag; do
        if [[ "${tag}" != "${MAYBE_TAG}" ]]; then
            git tag -d "${tag}" >&2
        fi
    done

    # Do a full repack. Must run single-threaded or determinism is lost.
    git -c pack.threads=1 repack -A -d -f
    rm -f .git/config

    # Garbage collect unreferenced objects
    git gc --prune=all --keep-largest-pack
}

#
# The main entry point of the program.
# @param REPO The local git repository to clean.
#
function main() {
    local -r REPO="${1}"

    # Run in subshell to not touch working directory
    ( clean_repo "${REPO}" )
}

main "${@}"

I elected not to delete .git/objects/info/packs and .git/objects/pack for the sake of backwards compatibility, and it appears to work fine without deleting them.
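One way to check whether a cleaned clone really is deterministic is to clean two independent clones and compare a hash of each tree. The sketch below assumes GNU coreutils (`sha256sum`, `sort -z`); `tree_hash` is a hypothetical helper. It covers only regular-file paths and contents, not permissions or symlinks, so it is a weaker check than the NAR hash Nix computes.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: prints one SHA-256 digest summarising the relative
# paths and contents of all regular files below the given directory.
tree_hash() {
    local -r dir="${1}"
    (
        cd "${dir}"
        # Sort the file list so the result does not depend on readdir order.
        find . -type f -print0 \
            | LC_ALL=C sort -z \
            | xargs -0 -r sha256sum \
            | sha256sum \
            | cut -d ' ' -f 1
    )
}

# Usage (placeholder paths for two clones cleaned with the script above):
#   [ "$(tree_hash clone-a)" = "$(tree_hash clone-b)" ] && echo "identical"
```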

tobiasBora commented 1 year ago

Any news on this? It would be great to have reproducible submodule fetches. If backward compatibility is an issue, we could add an option such as enforceReproducibility or removeDotGit to force the result to be reproducible.

kjeremy commented 10 months ago

@Nihlus when in the build process do you call that script?

Nihlus commented 9 months ago

@kjeremy Hey, sorry about the long wait - things have been busy here and I kept forgetting to check.

Essentially, I call it right after the clone from a specific tag and branch, and then I pack the results into a deterministic tarball. It boils down to this (assuming UPSTREAM_BUNDLE is a git bundle made from a detached head checkout with submodules updated and initialized):

$(UPSTREAM_TARBALL): $(UPSTREAM_BUNDLE)
    git clone ${UPSTREAM_BUNDLE} -b ${PACKAGE_VERSION} --single-branch ${TMP_CLONE}
    make-deterministic.sh ${TMP_CLONE}
    tar -C ${TMP_CLONE}/../ --sort=name --mtime=@${SOURCE_DATE_EPOCH} --owner=0 --group=0 --numeric-owner --pax-option=exthdr.name=%d/PaxHeader/%f,delete=atime,delete=ctime -cf - ${BUILD_NAME} | gzip -n > ${UPSTREAM_TARBALL}
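The tarball step can be tried on its own with a sketch like the following. It reuses the `--sort`, `--mtime`, `--owner`, `--group`, and `--numeric-owner` flags from the rule above together with `gzip -n`, and leaves out the `--pax-option` setting for brevity. `make_tarball` and its arguments are hypothetical names, and a GNU tar recent enough to support `--sort` is assumed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: packs SRC_DIR into the gzip-compressed tarball OUT,
# with entry order, timestamps, and ownership fixed so that repeated runs
# on the same content produce byte-identical output.
make_tarball() {
    local -r src_dir="${1}" out="${2}" epoch="${3}"
    tar -C "$(dirname "${src_dir}")" \
        --sort=name \
        --mtime="@${epoch}" \
        --owner=0 --group=0 --numeric-owner \
        -cf - "$(basename "${src_dir}")" \
        | gzip -n > "${out}"
}

# Usage (placeholder names):
#   make_tarball /tmp/clone upstream.tar.gz "${SOURCE_DATE_EPOCH}"
```

`gzip -n` matters here because gzip otherwise records a timestamp in its header, which would make two otherwise identical archives differ.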

Enzime commented 7 months ago

If anyone is using leaveDotGit or deepClone to avoid the "Server does not allow request for unadvertised object" error, you can maintain determinism by adding a postFetch that deletes the .git folder:

src = fetchgit {
  ...
  fetchSubmodules = true;
  leaveDotGit = true;
  postFetch = ''
    rm -rf $out/.git
  '';
}

hraban commented 3 months ago

This also breaks flakes that use submodules in one of their inputs.

nixos-discourse commented 2 months ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/handling-git-submodules-in-flakes-from-nix-2-18-to-2-22-nar-hash-mismatch-issues/45118/1