Integrity hash field in archive sources

senier commented 6 years ago

I had an offline discussion with @pmderodat about ALIRE and I'm delighted that a source package manager for Ada projects is being worked on! I'm happy to join security-related discussions of the project.

I agree that going for SHA-512 is a good idea. Depending on who you ask, shorter hash lengths may become an issue sooner or later. Supporting multiple hash algorithms (as proposed in #33) is probably a very good idea to enable a transition to newer algorithms in the future. Btw, have you considered using SHA-3 (here is a SPARK 2014 implementation)? It's probably not necessary right now, but when building a new system this may be a consideration...

I'm not sure whether support for multiple source files is anticipated in ALIRE (e.g. a tarball and a number of patches, or a project consisting of multiple tarballs). If so, of course all source files need their own hash value.
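To make the multi-algorithm and per-file points concrete, here is a minimal sketch, not Alire's actual implementation. The `<algorithm>:<hex digest>` string format, the `SUPPORTED` table, and the function names are assumptions made for illustration only.

```python
from __future__ import annotations

import hashlib
import hmac

# Hypothetical allow-list of algorithms an index could accept.
SUPPORTED = {"sha512": hashlib.sha512, "sha3_512": hashlib.sha3_512}


def verify(data: bytes, tagged_hash: str) -> bool:
    """Check data against a string of the assumed form '<algorithm>:<hex digest>'."""
    algorithm, sep, expected = tagged_hash.partition(":")
    if not sep or algorithm not in SUPPORTED:
        raise ValueError(f"unsupported hash: {tagged_hash!r}")
    actual = SUPPORTED[algorithm](data).hexdigest()
    # Constant-time comparison of the two hex strings.
    return hmac.compare_digest(actual, expected.lower())


def verify_all(files: dict[str, bytes], hashes: dict[str, str]) -> bool:
    """Every source file must have its own hash entry, and each must match."""
    return files.keys() == hashes.keys() and all(
        verify(files[name], hashes[name]) for name in files
    )
```

Tagging each digest with its algorithm is what would let an index move to a newer hash later without invalidating existing entries.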

While the SHAttered attack is relevant to git, mercurial and SVN, all of those projects have implemented mitigations, transitioned or plan a transition to a modern cryptographic hash function. IMHO a package manager is not the right place to fix security issues of upstream repositories. It would need to calculate a hash over a directory tree (e.g. using a Merkle tree) - but what would be the gain? If the upstream repository got compromised, we may just calculate our hash on a compromised tree...
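For reference, a hash over a directory tree in the Merkle style mentioned above could look roughly like the sketch below. The entry encoding (kind, name, NUL byte, child digest) and the name `tree_hash` are assumptions chosen for illustration. It ignores permissions, follows symlinks, and hashes everything under the root, including any VCS metadata directory.

```python
import hashlib
import os
from pathlib import Path


def tree_hash(root: Path) -> str:
    """SHA-512 of a directory; each directory digest covers its children's digests."""
    digest = hashlib.sha512()
    # Sort entries by name so the result does not depend on listing order.
    for child in sorted(root.iterdir(), key=lambda p: p.name):
        if child.is_dir():
            kind, child_digest = b"dir", tree_hash(child)
        else:
            kind = b"file"
            child_digest = hashlib.sha512(child.read_bytes()).hexdigest()
        # The NUL byte after the name keeps entry boundaries unambiguous.
        digest.update(
            kind + b" " + os.fsencode(child.name) + b"\0"
            + child_digest.encode("ascii") + b"\n"
        )
    return digest.hexdigest()
```

As the comment points out, such a hash only pins the tree that was present when the hash was computed; it says nothing about whether that tree was already compromised.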

onox commented 5 years ago

> IMHO a package manager is not the right place to fix security issues of upstream repositories.

Usually the idea of providing a checksum is to protect against compromised mirrors from which packages are downloaded.

> If the upstream repository got compromised we may just calculate our hash on a compromised tree...

Git commits can be signed with GPG (signatures can be shown with git log --show-signature). So Alire could verify a repository if it has a list of valid public keys.
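A rough sketch of that check, assuming GPG-signed commits and a caller-supplied set of trusted fingerprints. The function names and the `trusted` parameter are hypothetical. It relies on `git verify-commit --raw`, which prints GnuPG status lines to standard error, and on the `VALIDSIG` status line, whose first argument is the fingerprint of the key that made the signature.

```python
from __future__ import annotations

import subprocess


def valid_signers(gpg_status: str) -> set[str]:
    """Collect the fingerprints reported on GnuPG 'VALIDSIG' status lines."""
    fingerprints = set()
    for line in gpg_status.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "[GNUPG:]" and fields[1] == "VALIDSIG":
            fingerprints.add(fields[2])
    return fingerprints


def commit_is_trusted(repo: str, commit: str, trusted: set[str]) -> bool:
    """True if git accepts the signature and its key is in the trusted set."""
    result = subprocess.run(
        ["git", "-C", repo, "verify-commit", "--raw", commit],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and bool(valid_signers(result.stderr) & trusted)
```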

senier commented 5 years ago

> Usually the idea of providing a checksum is to protect against compromised mirrors from which packages are downloaded.

I should clarify what data I was referring to; maybe we're talking about different things. For a source package manager I see three types of data that need to be considered:

1. Package metadata: defines where software can be downloaded and how it's built
2. Upstream sources: the actual upstream source code
3. Binary packages: the outcome of a build installed on end-user systems

If I understand correctly, Alire's current focus is on type 1, i.e. an index of software sources that can be downloaded, built and installed easily. I would not expect these files to be distributed among many mirrors, but rather assume some trusted central repository to get those build recipes. For this scenario, pulling the index using a TLS-secured connection would probably be sufficient (assuming the site has a valid certificate and you trust the operator).

Once you have the authentic software index, the question is how to verify the authenticity of the upstream sources. As sources come from many parties, the above model is not feasible. However, as we trust the operator of the software index, he/she can provide the binding between the external sources and the index. The simplest mechanism for this is a set of cryptographic hashes that come with the index.

I'm not sure whether binary packages, i.e. type 3, are anticipated for Alire. For those, digital signatures are probably the most practical solution. Maybe @mosteo can comment on this.

> Git commits can be signed with GPG (signatures can be shown with git log --show-signature). So Alire could verify a repository if it has a list of valid public keys.

Of course that could be done and, as I mentioned above, for binary packages signatures would be the mechanism of choice. For source repositories I see a number of problems. Not every upstream project uses git or is willing/capable/interested to sign their repositories. Sometimes the software just comes as a zip archive.

I can imagine another issue: if you establish trust through the possession of a PGP private key, the holder of that key can alter the software at will and it will get accepted by clients. When the version is fixed by the hash in the index, only the maintainers of the index define what is an accepted software version.

mosteo commented 5 years ago

Yup, @senier, your observations pretty much summarize the main points. I don't think that scenario 3 is on the radar at all; for that we have the portability of Ada.

mosteo commented 5 years ago

@Fabien-Chouteau, just to be sure we are on the same page in regard to the features we want for the first beta, and in the context of the above discussion. My current understanding is that:

- We want the whitelist for places that provide indexes (hence we trust their operators -- basically github at this point). This ensures index integrity.
  - In our discussion we talked about the whitelist for sources of crates, but see next point.
- If the index source is trusted, the hashes it provides are trusted. Thus:
  - Integrity hashes work for source archives (the archive-hash field).
  - Commits work for git/hg/svn, since these check integrity during retrieval (the origin field).

The thing is, and sorry for re-raising the issue -- it seemed clear at the time but not anymore --, I'm not sure why we wanted to hash directory contents, when the VCS is going to check it for us. If someone is messing with the network to try a MITM, as long as the index comes from a trusted site, we would detect the corruption on retrieval.
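For Git specifically (in its SHA-1 object format), the reason a commit ID pins content is that every object ID is a hash of the object's own bytes: a commit names a tree, and a tree names its blobs. A minimal illustration for blobs follows. The function name is made up for this example, and nothing here covers hg or svn.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID git assigns to a file's content in the SHA-1 object format."""
    # Git hashes a small header ("blob <size>" plus a NUL byte) and then the content.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The well-known ID of the empty blob:
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Any change to a file changes its blob ID, hence the tree ID, hence the commit ID recorded in the index.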

Also, have you some reference on hand for what you told me about Arduino? I'm going in circles with high-level trusted partners which I don't think is what you alluded to.

Fabien-Chouteau commented 5 years ago

> - We want the whitelist for places that provide indexes (hence we trust their operators -- basically github at this point).

If we do a hash of the repo content (git archive | sha512sum) I am not sure that we need the whitelist.
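The `git archive | sha512sum` idea could be sketched as below. The function names are hypothetical, and the sketch assumes that the tar stream git produces for a given commit is byte-for-byte stable across the git versions involved, which would need to be confirmed before relying on it.

```python
import hashlib
import subprocess
from typing import BinaryIO


def sha512_of_stream(stream: BinaryIO, chunk_size: int = 1 << 16) -> str:
    """Hash a binary stream in chunks so large archives need not fit in memory."""
    digest = hashlib.sha512()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def archive_hash(repo: str, commit: str) -> str:
    """Rough equivalent of `git archive <commit> | sha512sum` run inside repo."""
    with subprocess.Popen(
        ["git", "-C", repo, "archive", "--format=tar", commit],
        stdout=subprocess.PIPE,
    ) as proc:
        result = sha512_of_stream(proc.stdout)
    # Leaving the `with` block waits for git to exit, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError("git archive failed")
    return result
```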

> This ensures index integrity.

This was more for crates integrity than the integrity of the index itself.

> The thing is, and sorry for re-raising the issue -- it seemed clear at the time but not anymore --, I'm not sure why we wanted to hash directory contents, when the VCS is going to check it for us. If someone is messing with the network to try a MITM, as long as the index comes from a trusted site, we would detect the corruption on retrieval.

I guess it is one or the other, whitelist or content hash.

> Also, have you some reference on hand for what you told me about Arduino? I'm going in circles with high-level trusted partners which I don't think is what you alluded to.

Have a look here: https://github.com/arduino/Arduino/wiki/Library-Manager-FAQ#how-can-i-add-my-library-to-library-manager

They accept GitHub, BitBucket or GitLab.

mosteo commented 5 years ago

> I guess it is one or the other, whitelist or content hash.

In that case I'd go for the content hash, which I see as more general.

Fabien-Chouteau commented 5 years ago

Me too.

Also because, even with a trusted hosting platform, someone can hack into an account and push different content to a branch.
