easybuilders / easybuild-framework

EasyBuild is a software installation framework in Python that allows you to install software in a structured and robust way.
https://easybuild.io
GNU General Public License v2.0
152 stars 204 forks source link

More reliable checksums using NAR hash #4200

Open Micket opened 1 year ago

Micket commented 1 year ago

Summarizing discussion from slack:

Background: Changing datestamps in auto-generated tarballs. Change in compression level (like github did recently, but reverted). Change in tar.gz vs tar.xz. Sources from git_config lacking stable checksums. Our checksum implementation is lacking.

NAR Hash would solve this by serializing the file tree and its contents ( https://gist.github.com/jbeda/5c79d2b1434f0018d693 )

Ref. https://nixos.wiki/wiki/Nix_Hash

I assume this could even be made opt-in? If sha256sum is used, then apply to binary (still relevant in many cases, where we e.g. have iso-files for MATLAB etc.). i.e.

checksums = ['nar:a_narred_sha256sum', 'a_normal_sha_256sum']

Ref. https://github.com/easybuilders/easybuild-easyconfigs/issues/5151

tgamblin commented 1 year ago

If you do this, you are vulnerable to malformed tarball CVEs (like nix is), and EB does not have a sandboxed environment to mitigate this risk (somewhat) like nix does.

So, please don’t do this in EB.

see https://github.com/orgs/community/discussions/45830#discussioncomment-4824886

boegel commented 1 year ago

If you do this, you are vulnerable to malformed tarball CVEs (like nix is), and EB does not have a sandboxed environment to mitigate this risk (somewhat) like nix does.

So, please don’t do this in EB.

see https://github.com/orgs/community/discussions/45830#discussioncomment-4824886

That's a valid point of course, since we basically need to start playing with an unchecked tarball (unzipping/unpacking it) before we can actually obtain a checksum for it...

Does that mean we'll just need to suffer the pain if the tarballs served by GitHub start changing (again, since this also happened in Sept'17, see https://github.com/easybuilders/easybuild-easyconfigs/issues/5151 + https://github.com/spack/spack/issues/5411)? :-/

grahamc commented 1 year ago

Please also know that it the way we've done it in Nix isn't amazing. It would be better to not do it this way. We would much rather know that the archives produced by GitHub are hash-stable over time.

Flamefire commented 1 year ago

checksums = ['nar:a_narred_sha256sum', 'a_normal_sha_256sum']

Side note: This syntax would required both checksums to match. So we'd need a tuple and likely better invert the order, so the tar checksum is checked first and only on failure the NAR is used.

mboisson commented 1 year ago

I understand Todd's point about a vulnerability, but regardless of how the Github issue evolves, having a way to check whether the content has changed seem to go through NAR hashes. Implementing it as a second step, after having checked the tarball's checksum, would get us one step closer to being ready for a possible future change in Git.

I don't see the downfall (other than a bit more time to build stuff) to add NAR hashes to reduce our work in the future. If checksums from github change down the line, we will be one step closer to being able to validate that new archives are indeed the same as previous ones.

Note that I would not automatically use the NAR one if the tarball's hash fails. I would error out, provide an expert flag to do check the NAR and suggest a way to add a secondary checksum for the tarball through a PR if the NAR matches.

mboisson commented 1 year ago

Also, moving checksums outside of EasyConfigs, into checksums.json (which admittedly would need to become a bit more complete), would make an eventual future massive update of checksums a lot easier to handle (because it would require updating a lot fewer files).

boegel commented 1 year ago

checksums = ['nar:a_narred_sha256sum', 'a_normal_sha_256sum']

Side note: This syntax would required both checksums to match. So we'd need a tuple and likely better invert the order, so the tar checksum is checked first and only on failure the NAR is used.

That doesn't seem wise taking into account @tgamblin's very valid remark above, to put it mildly.

If we get the wrong SHA256 checksum, either it's a benign change (like GitHub suddenly serving a different source tarball, but with the same contents), or something malicious is going on (someone manages to serve you a zip bomb via MITM). In the latter case, you most definitely don't want to unpack the tarball to compute the NAR hash, I think it's clear now that that would be a mistake.

That doesn't mean a NAR hash is entirely useless: if we have two source tarballs that have a different SHA256 checksum, and we're pretty sure they're both safe (both served by GitHub), we could use a function that computes the NAR hash (in an isolated environment) to easily check whether the contents of the tarball are still the same. I'm not sure that's a good enough reason to implement support for checking a NAR hash though, since the output produced by diff -ru is usually more informative, since it not only tells you whether there is a change, but also what has changed (see https://github.com/easybuilders/easybuild-easyconfigs/issues/17233 for example).

mboisson commented 1 year ago

I'm not sure that's a good enough reason to implement support for checking a NAR hash though, since the output produced by diff -ru is usually more informative, since it not only tells you whether there is a change, but also what has changed (see easybuilders/easybuild-easyconfigs#17233 for example).

diff -ru implies that you still have the "original" copy, which may not be the case. Computing the NAR hash and storing it would allow you to validate with confidence without needing to have the old archive.

tgamblin commented 1 year ago

Computing the NAR hash and storing it would allow you to validate with confidence without needing to have the old archive.

I think this is a nice benefit because you don't have to maintain a mirror (or fetch from it). I added number 5 here after @mboisson mentioned it. You still have to keep around two hashes though, and you have to train your contributors to NAR stuff (or automate that in CI)

mboisson commented 1 year ago

The way I could see NAR hash used is:

  1. Automate their calculation and storage when adding checksums (similar to --inject-checksums & co)
  2. Always test the tarball's checksum first. If it matches, and you already have the NAR hash, no need to do anything else.
  3. If the tarball's checksum does not match, error out. Someone has to make a decision on re-validating the NAR hash or not, through some expert flag.
  4. If someone decided it is ok to compute the NAR hash:
  5. If the NAR hash match, inject a new alternate checksum for that tarball
  6. If the NAR hash don't match, someone has to decide again, investigate with diff -ru to figure out what has changed
boegel commented 1 year ago

One potential safe (?) use of NAR hashes could be stuff that we download via git_config, since then a SHA256 checksum of the tarball is not possible (cfr. #2726)