alire-project / alire

Command-line tool from the Alire project and supporting library
GNU General Public License v3.0

Integrity field for downloaded repositories #198

Closed mosteo closed 4 years ago

mosteo commented 5 years ago

See last conclusions in #66. The idea is to use git archive | sha512sum or equivalent.
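
A minimal sketch of that pipeline (URL and commit are placeholders):

$ git clone https://some.host/crate.git crate_src
$ cd crate_src
$ git archive <commit> | sha512sum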

Open questions:

mosteo commented 5 years ago

I'm sorry for being this obtuse, but after a night's sleep:

By this reasoning, the whitelist is the only thing that is useful.

Fabien-Chouteau commented 5 years ago

Here's a scenario I have in mind: someone hacks a GitHub account and changes the content of a repository used in a crate. If we don't have a hash of the content, no one will be able to detect that the sources have been tampered with. In this case the whitelist is not enough.

Another scenario: I write a crate and host it in a git repo on my private server with my domain name. After a couple of years, I forget about this project, and I forget about the server and domain name. Someone can hack this server, or acquire the domain name if I stop paying for it. This someone is now able to replace the code in the repo or provide another repo with different sources. Again, no one will be able to detect that the sources have been tampered with.

The hash is here to guarantee that the sources that have been checked out are really the ones specified by the author of the crate.

Which makes me also think that it should be forbidden to change the hash of a version already published in a crate.

mosteo commented 5 years ago

Thanks for these examples. They help me focus the argument.

Here's a scenario I have in mind: someone hacks a GitHub account and changes the content of a repository used in a crate. If we don't have a hash of the content, no one will be able to detect that the sources have been tampered with. In this case the whitelist is not enough.

It's my understanding that the commit hash we use as origin is enough to detect this during cloning.

Another scenario: I write a crate and host it in a git repo on my private server with my domain name. After a couple of years, I forget about this project, and I forget about the server and domain name. Someone can hack this server, or acquire the domain name if I stop paying for it. This someone is now able to replace the code in the repo or provide another repo with different sources. Again, no one will be able to detect that the sources have been tampered with.

Same as the previous one; you cannot change/serve sources that do not match the commit hash, and the commit hash comes from our trusted index.

The hash is here to guarantee that the sources that have been checked out are really the ones specified by the author of the crate.

Which is what the origin commit hash is doing (and that's why we added it for source archives that lack it). Unless I have some fundamental misunderstanding of how git works, I don't see these as feasible attacks. Now, my only doubt is whether we need to perform an additional git fsck after git clone, or not even that. According to this we shouldn't need it.

At present we are only using commit hashes in origins (for this very concern); if we allowed branches or tags that would be another matter.
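
For illustration, the pinning amounts to roughly this (a sketch using an example origin, not necessarily alr's exact invocation):

$ git clone https://github.com/alire-project/eagle-lander.git crate_src
$ cd crate_src
$ git checkout 5a3bcc61eff4d60d2b741add7841410ce539d0b8   # fails if the repo cannot provide this exact commit
$ git fsck --full                                         # the optional extra check mentioned above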

Which makes me also think that it should be forbidden to change the hash of a version already published in a crate.

Yes, that seems advisable.

Fabien-Chouteau commented 5 years ago

It's my understanding that the commit hash we use as origin is enough to detect this during cloning.

Just to be clear, we are talking about the origin field that is provided in the manifest:

origin = "git+https://github.com/alire-project/eagle-lander.git@5a3bcc61eff4d60d2b741add7841410ce539d0b8"`

At present we are only using commit hashes in origins (for this very concern); if we allowed branches or tags that would be another matter.

Ok, I didn't know that branches and tags were forbidden. It would be convenient to have them, but if this solves the hash problem then it is a good deal.

Then I would say that we have to check whether git's SHA-1 can be trusted; I remember reading about forged SHA-1 collisions.

@senier do you mind giving us your opinion on the matter?

mosteo commented 5 years ago

Yes, that's the field.

While we await senier's input, I read a bit about git and the SHA-1 attack, and the general consensus seemed to be that git is not readily vulnerable due to some extra fields in its internal structures. They're working on superseding SHA-1 too (not sure of the state of this).

senier commented 5 years ago

Since git 2.13.0, a hardened version of SHA-1 has also been in use. So in theory, referencing a full SHA-1 commit ID should be safe. You never know which future attacks will be found that cannot be mitigated, though. There are also those statements that the additional metadata in git prevents an attack, but honestly, I'm skeptical and wouldn't base my security decisions on that.

In contrast, the initial git archive | sha512sum is straightforward and arguably secure. What's the rationale for favoring the VCS hash over this solution?

mosteo commented 5 years ago

In contrast, the initial git archive | sha512sum is straightforward and arguably secure. What's the rationale for favoring the VCS hash over this solution?

My concerns (probably minor, but I wouldn't want to spend time on a non-solution):

senier commented 5 years ago

I was suspecting something like your first point, and I see the issues you bring up in the other points. Thinking about it, a git-specific solution sounds problematic: either we restrict upstream repos to a few blessed VCSs, or we have the same security discussion for every new VCS. Some VCSs may just be unsuitable. I'd try to avoid those issues.

How about completely format-agnostic content hashing? The shell pseudo-code would look like this:

$ your_favorite_vcs pull http://some.where/source/reference/ output_dir
$ cd output_dir
$ tar -cp . | sha512sum

The only system I have closer insight into that does this kind of content security is the Genode OS Framework. They don't use tar and SHA-2, but it's similar:

$(_DST_HASH_FILE): $(HASH_INPUT) $(MAKEFILE_LIST)
    @$(MSG_GENERATE)$(notdir $@)
    $(VERBOSE)cat $(HASH_INPUT) | sha1sum | sed "s/ .*//" > $@

The benefit of this scheme is that it's transport/VCS agnostic and works with anything that gets the source onto your disk. Also, we separate fetching and validating the source and have the security discussion only once.

Caveat: I have no idea how well the tar construct works on other OSes like Windows or OS X. We may have to leave out the -p option and potentially add some other options to make the result stable for different platforms (something like --numeric-owner comes to mind). Other archivers like cpio or some random zip implementation could also be an option.
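
As a hedged example, GNU tar can pin the varying metadata explicitly (these options are GNU-specific and may be missing from the bsdtar shipped on OS X):

$ tar --sort=name --mtime=@0 --owner=0 --group=0 --numeric-owner -cf - . | sha512sum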

Fabien-Chouteau commented 5 years ago

* unless we configure git's handling of text files at checkout, it returns a different hash on Linux vs Windows (CR/LF differences, is my guess). Also, we would need a similar solution for Hg/SVN (or deprecate those).

Is it possible to configure git locally or temporarily to disable autocrlf? E.g., read the value of the autocrlf config, change it to false for the checkout, and then revert it to the initial value.
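
Something like this, perhaps (a sketch; the save/restore handling is only illustrative):

$ old=$(git config --get core.autocrlf || echo unset)   # remember the current setting
$ git config core.autocrlf false
$ git checkout <commit>
$ if [ "$old" = unset ]; then git config --unset core.autocrlf; else git config core.autocrlf "$old"; fi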

For the git archive vs tar question, there is the possibility of implementing, in Ada, a function that recursively hashes a directory.
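
In shell terms, the idea would be roughly this (a sketch, GNU userland assumed; the real implementation would do the traversal and hashing inside the Ada function):

$ find . -type f -print0 | LC_ALL=C sort -z | xargs -0 sha512sum | sha512sum   # hash of the sorted per-file hashes (paths included)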

mosteo commented 5 years ago

Is it possible to configure git locally or temporarily to disable the autocrlf

It should be possible to override it using .gitattributes. I would still prefer it if we can think of something usable for all VCSs...

Fabien-Chouteau commented 5 years ago

The directory hash function will be usable for all VCSs; it's just that git has a special case during checkout because of this autocrlf thing.

mosteo commented 5 years ago

The directory hash function will be usable for all VCSs; it's just that git has a special case during checkout because of this autocrlf thing.

Yes, I agree that's simple enough to go that way. I'm exclusively thinking of the end-of-line issue. I actually don't know what hg and svn do in that regard.

senier commented 5 years ago

I like the idea of calculating the hash in an Ada function. Regarding the CR/LF issue: how about converting every occurrence of CR/LF to LF for the purpose of hashing (inside this Ada function)? This would avoid special treatment/configuration of the various VCSs.
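
In shell terms, something like this (a sketch, GNU userland assumed; only line-ending CRs are stripped before hashing):

$ find . -type f -print0 | LC_ALL=C sort -z | xargs -0 cat | sed -e 's/\r$//' | sha512sum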

mosteo commented 5 years ago

How about converting every occurrence of CR/LF to LF for the purpose of hashing

We feared that this might open the door to some attack... if this is safe it would certainly be straightforward.

For reference, I found this about svn/hg:

http://svnbook.red-bean.com/en/1.7/svn.advanced.props.file-portability.html
https://www.mercurial-scm.org/wiki/EolExtension#Configuring_a_repository_using_the_.hgeol_file

In short, SVN has something similar to git. The Mercurial website is down for me at the moment.

senier commented 4 years ago

We discussed this once more and I now feel uneasy about translating CR/LF to LF. While we see no specific attack, giving an attacker the capability to insert CRs arbitrarily before LFs sounds exactly like the kind of thing that's going to blow up in our faces later.

The unproblematic but annoying solution is to provide two hashes, one for systems with CR/LF and one for systems with LF. You could either check both or select the one that is appropriate for the current platform. Of course, the tools should then enable the maintainer to create the hash for the other platform easily (e.g. I never use Windows, but my Ada software very likely also works there). This unfortunately brings us back to the point where we need to know exactly how to configure the VCS.

@mosteo You've looked into other package managers very closely. I wonder, how do other systems that run on Windows and UNIX-like systems cope with that? Are there even any that are cross-platform and care about integrity? cargo?

Fabien-Chouteau commented 4 years ago

The unproblematic but annoying solution is to provide two hashes, one for systems with CR/LF and one for systems with LF.

I still think that disabling the autocrlf feature before doing the checkout is the way to go.

mosteo commented 4 years ago

I wonder, how do other systems that run on Windows and UNIX-like systems cope with that? Are there even any that are cross-platform and care about integrity? cargo?

From what I've seen, both opam and cargo rely on downloading archives instead of checking out commits (but note that some sites like GitHub provide a way of downloading an archive for a particular commit -- this is used extensively in opam). opam allows supplying several hashes, instead of only one as we have now. Still, it's common to see only MD5 (in old packages, I guess). Cargo uses a single SHA-256 from what I've been able to find.

Here's a survey of other PMs' trust features (I wasn't aware of it until today; very interesting): https://github.com/package-community/discussions/issues/5

This one I guess is a must-read too: https://github.com/rust-lang/crates.io/issues/75

Which led me to: https://theupdateframework.github.io/

As a short summary from my cursory read of these issues, it seems the only entirely safe way is to go with GPG signatures, and not many packagers are doing it yet... and sanely managing that is what the above project addresses. I still have to read through the website in detail, though.

senier commented 4 years ago

I still think that disabling the autocrlf feature before doing the checkout is the way to go.

I just re-read the autocrlf description and think I misunderstood your intention earlier. So the purpose is (correct me if I'm wrong) to force all checkouts to UNIX-style endings, also on Windows. If we can ensure this, hashes should be identical on all platforms. I agree that this should be the preferred solution.

Will there be any issue on Windows with UNIX-style line endings?

mosteo commented 4 years ago

Will there be any issue on Windows with UNIX-style line endings?

There might be if one opens source files. Depending on the editor, you might see no line breaks but a single huge line. If we go this route, I'd propose to check out once for hash computation, reset, and check out normally; the time should be negligible compared to download time.
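
Roughly (a sketch; the renormalization step is one known recipe, not necessarily what alr would end up doing):

$ git -c core.autocrlf=false checkout -f <commit>   # LF-only checkout, used for hashing
$ tar --exclude=.git -cf - . | sha512sum            # hash the work tree (stable tar options omitted)
$ git rm -q --cached -r . && git reset --hard       # redo the checkout with the user's settings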

I'll experiment with this.

mosteo commented 4 years ago

And additionally I just saw that git archive accepts remote repositories. Depending on how that behaves we could use it instead of a regular clone and have our cake and eat it.

senier commented 4 years ago

Here's a survey of other PMs' trust features (I wasn't aware of it until today; very interesting): package-community/discussions#5

This one I guess is a must-read too: rust-lang/crates.io#75

Which led me to: https://theupdateframework.github.io/

As a short summary from my cursory read of these issues, it seems the only entirely safe way is to go with GPG signatures, and not many packagers are doing it yet... and sanely managing that is what the above project addresses. I still have to read through the website in detail, though.

Very interesting links, especially the first one. After skimming over it, it seems like everything is leading to the Update Framework. Coincidentally, I met someone at a conference last Sunday who ported a complex system (Genode) to TUF. I could get in touch with him if we have any practical questions.

Regarding signatures versus a central trusted index: IIRC we discussed that earlier and deliberately left out signatures for simplicity reasons. Eventually, a TUF-like approach where developers can opt in to provide a signature would be nice. While some people don't bother or want to avoid the hassle, we would certainly sign our own software.

senier commented 4 years ago

Will there be any issue on Windows with UNIX-style line endings?

There might be if one opens source files. Depending on the editor, you might see no line breaks but a single huge line. If we go this route, I'd propose to check out once for hash computation, reset, and check out normally; the time should be negligible compared to download time.

I'll experiment with this.

Maybe this is a dumb question: is ALIRE meant to be a tool I use during development (then, the above would be an issue), or do you consider pulling via git just a step in the software distribution process (and no human touches the code thereafter)? If the latter is true, wouldn't the only issue be whether GNAT, gprbuild etc. can cope with the wrong line endings on Windows?

senier commented 4 years ago

And additionally I just saw that git archive accepts remote repositories. Depending on how that behaves we could use it instead of a regular clone and have our cake and eat it.

Unfortunately this approach is susceptible to a TOCTOU attack. If an attacker can change the source location between hashing and the real checkout, we could check the correct version but then use a modified one. We must hash a downloaded copy that is under our control.

I'm unsure whether the checkout/reset/checkout procedure is safe in that respect. It depends on whether we check out from a trusted local copy. This makes me think of a modification that should be safe:

$ git clone --bare https://remote.url/repo local_repo
$ check_hash local_repo/               # placeholder for the hashing step
$ git clone local_repo local_workdir   # second checkout from the trusted local copy

The local_repo does not have the CRLF issue (we don't even have to configure anything) and the second checkout comes from a trusted local location. I cannot tell whether other VCSs have a similar mechanism.

Fabien-Chouteau commented 4 years ago

I just re-read the autocrlf description and think I misunderstood your intention earlier. So the purpose is (correct me if I'm wrong) to force all checkouts to UNIX-style endings, also on Windows.

Actually, if we disable autocrlf it means that git will not mess with the endings and will just check out the files as they are in the repo.

mosteo commented 4 years ago

Unfortunately this approach is susceptible to a TOCTOU attack. If an attacker can change the source location between hashing and the real checkout, we could check the correct version but then use a modified one. We must hash a downloaded copy that is under our control.

I don't think that's the case; with git archive we obtain a local copy that is then hashed and unpacked (kind of similar to the bare clone approach -- which I prefer -- or what we already do with source archives).

The local_repo does not have the crlf issue (we don't even have to configure anything) and the second check out comes from a trusted local location. I cannot tell whether other VCS' have a similar mechanism.

So you mean to hash the git internal structures instead of the actual source files? I wonder if the compression/optimizations that git does on blocks can affect that (I will experiment). We could even reuse the bare repository (you can simply do a checkout in it after hashing and we'd be done).
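
For reference, checking out directly from the bare repository would look something like this (paths illustrative):

$ mkdir local_workdir
$ git --git-dir=local_repo --work-tree=local_workdir checkout -f <commit>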

it seems like everything is leading to the Update Framework.

So it seems, and it seems a hard wheel to reinvent. I still have to read through that, but IIUC with TUF we would ensure decentralized index integrity, which in turn guarantees that hashes for the local check are also trustable, right?

Is ALIRE meant to be a tool I use during development (then, the above would be an issue)

When I started working on it, this was certainly one of the use cases: that's the reason that when you do an alr get <crate>, that crate is "privileged" over its dependencies and treated as the "working folder". You enter it, and are in a clone of upstream. From another perspective, doing an alr init and an alr get both leave you with a folder in which you can immediately work (all of this has ties with the "sandbox" idea that was thrown around -- which basically removes the "root" crate from the picture).

Perhaps the alr get use case is more contrived, since developers will have their own workflows, and in the end it turns out it's not useful. Neither opam nor cargo have a similar command (which is puzzling, since I got the get idea from somewhere -- but can't locate it).

To conclude something from all this: I will experiment with bare cloning, overriding crlf settings and the other ideas that have arisen, and will post back my findings. I guess the TUF stuff (haha) is a more long-term issue.

mosteo commented 4 years ago

I'm back with some results. It doesn't matter whether it's done in a bare or a normal repo; doing this:

git -c core.autocrlf=false archive <commit> | sha256sum

produces the same hash on both Linux and Windows (-c overrides any global or repository configuration). So this seems the simplest approach at the moment for our current requirements.

Fabien-Chouteau commented 4 years ago

git -c core.autocrlf=false archive <commit> | sha256sum

produces the same hash on both Linux and Windows (-c overrides any global or repository configuration). So this seems the simplest approach at the moment for our current requirements.

Awesome! It's even simpler than I thought.

senier commented 4 years ago

Cool! Good to know the -c trick.

mosteo commented 4 years ago

Closing since we decided in #212 to rely on source archives only, which are integrity-checked since #195.