facebook / buck2

Build system, successor to Buck
https://buck2.build/
Apache License 2.0

Suggestion: SRI-hashes for content integrity APIs like `download_file` #439

Open thoughtpolice opened 11 months ago

thoughtpolice commented 11 months ago

Consider the following parameter to the download_file call:

```python
ctx.actions.download_file(sha1 = "deadbeef")
```

The sha1 attribute has been superseded by attributes like sha256, but remains for backwards compatibility. In the future, there could be other hashes like blake3. But this is awkward because every single rule that wants to mention a content hash has to have every possible hash option exposed in its interface, which is a lot of plumbing for nothing.

Many tools suffer from this problem; Nix, for instance, had a similar design for a long time. However, Subresource Integrity ("SRI") in browsers, and more recently Nix itself, instead use a "self-describing" hash scheme that attaches the name of the hash algorithm directly to a base64-encoded copy of the hash. So the above example would look more like this, in a hand-wavy way:

```python
ctx.actions.download_file(hash = "sha1-" + base64enc([0xde, 0xad, 0xbe, 0xef]))
```

That is, every function that checks content integrity simply takes a generic hash parameter, and the name is instead embedded into the literal directly, followed by the bytes of the hash as a base64-encoded string. This means that supporting new hash algorithms is centralized completely inside the implementation of download_file and does not need to be reflected inside Starlark at all. The algorithm which consists of "split on - and decode base64 into bytes" is simple enough to not meaningfully increase the chance of errors in the implementation.
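To make the parsing step concrete, here's a minimal sketch in Python (buck2's actual implementation would live in Rust; `parse_sri` is just an illustrative name):

```python
import base64

def parse_sri(sri):
    """Split an SRI-style string like "sha256-<base64>" into (algorithm, digest bytes)."""
    # The base64 alphabet never contains "-", so splitting on the first one is safe.
    algorithm, _, encoded = sri.partition("-")
    if not algorithm or not encoded:
        raise ValueError("not an SRI hash: %r" % sri)
    return algorithm, base64.b64decode(encoded, validate=True)

# Example: the SHA-1 digest of the empty string.
algo, digest = parse_sri("sha1-2jmj7l5rSw0yVb/vlWAYkK/YBwk=")
assert algo == "sha1" and len(digest) == 20
```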

I suggest that this is implemented for all buck2-core functions that (currently) take sha1 or sha256 parameters. I'd like this to be used in my own Prelude, for instance.

Note that this only needs to be implemented in the core buck2 executable in a few key spots; all existing APIs that are exposed from buck2-prelude can be ported to use this core API without breaking any client code (assuming a base64 encoding builtin.)

ndmitchell commented 10 months ago

This API means you can't provide both a sha1 and a blake3 hash, which can be useful so the download hits the right fast path regardless of whether your Buck2 is working with sha1 or blake3.

Is there anything stopping you from writing a wrapper that does this transformation in the Starlark layer?

How come the move to base64 rather than hex? Generally hashes are always written in hex from what I've seen.

thoughtpolice commented 10 months ago

Hrm, that's an interesting-sounding feature, but I'm not sure I understand it. Is the idea that you provide both hashes and, once the file is downloaded, race them against each other and take the first winner? I would assume that with CPU/thread parallelization, BLAKE3 would be the best choice in ~all cases where the files aren't extremely, ridiculously tiny, right? Would SHA1 ever beat BLAKE3 on any non-tiny input? Even then SHA1 might still lose. I'm just pointing this out from a pragmatic POV; maybe we should ignore the performance and say that BLAKE3 isn't a good hypothetical example to use. :]

But even then, you could of course just have multiple hashes for that as a list, right? e.g. `hashes = [ "blake3-...", "sha1-..." ]` is the interface?

Also, from a forwards/backwards compatibility standpoint, I think the usage of the SRI version is actually better for users? Consider these two API calls, compared:

```python
ctx.actions.download_file(sha1 = "...", blake4 = "...")
# versus
ctx.actions.download_file(hashes = [ "sha1-...", "blake4-..." ])
```

Let's say there's a version of buck2 that adds support for a hypothetical blake4. The first example will not work unless your version of buck2 supports the blake4 parameter to download_file; it will immediately error during analysis. In the second example, the implementation of download_file itself can decide which hash to use based on what it supports, making a migration more forwards- and backwards-compatible for your users. Don't support blake4? Use SHA1. Once all your users are migrated, you can delete (or leave) the old hash.
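To sketch what that could look like inside the implementation (hypothetical names and preference order, not buck2's actual code):

```python
# Hypothetical: the algorithms this particular buck2 binary can verify,
# in preference order.
SUPPORTED = ["blake3", "sha256", "sha1"]

def pick_hash(hashes):
    """Pick the first SRI hash whose algorithm this binary supports."""
    by_algo = {h.partition("-")[0]: h for h in hashes}
    for algo in SUPPORTED:
        if algo in by_algo:
            return algo, by_algo[algo]
    raise ValueError("no supported hash among %r" % (hashes,))

# An older binary without blake4 support quietly falls back to sha1:
pick_hash(["sha1-...", "blake4-..."])  # -> ("sha1", "sha1-...")
```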

> Is there anything stopping you from writing a wrapper that does this transformation in the Starlark layer?

No, I don't think there's anything preventing it (well, except missing base64 functions in Starlark, I think?), but it would just be nice because other people have to write rules too, so they'll get exposed to functions like download_file that use the existing API, whether or not I have wrappers.
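For what it's worth, the wrapper itself would be tiny if the primitives existed; a hand-wavy Starlark sketch, where `base64_decode` and `hex_encode` are hypothetical builtins (they don't exist today, which is the blocker):

```python
# Hand-wavy Starlark; base64_decode and hex_encode are made up.
def download_file_sri(ctx, output, url, hash):
    algo, _, encoded = hash.partition("-")
    digest_hex = hex_encode(base64_decode(encoded))
    if algo == "sha256":
        return ctx.actions.download_file(output, url, sha256 = digest_hex)
    elif algo == "sha1":
        return ctx.actions.download_file(output, url, sha1 = digest_hex)
    fail("unsupported hash algorithm: " + algo)
```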

> How come the move to base64 rather than hex? Generally hashes are always written in hex from what I've seen.

Size optimization, I guess? The W3C spec doesn't actually justify the base64 decision, but that's my best rationalization: a 50% increase in information per character, since base64 encodes 6 bits per character versus hexadecimal's 4. I'd also probably just use hex if deciding myself, but if we wanted to implement this, I'd suggest sticking to the spec and using base64 as written.
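The arithmetic is easy to check; for a 32-byte SHA-256 digest, hex takes 64 characters and base64 takes 44 (padding included):

```python
import base64, hashlib

digest = hashlib.sha256(b"example").digest()  # 32 raw bytes
print(len(digest.hex()))                      # 64 hex characters
print(len(base64.b64encode(digest)))          # 44 base64 characters
```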

thoughtpolice commented 10 months ago

Just a nit I realized now and should have said earlier: if we did do this, we'd probably want to add a buck2 download command that could download some file from a given URL and then give you the SRI hashes for it, e.g.

```
$ buck2 download --hashes=sha1,blake3 https://githubusercontent.com/...
Downloading $URL...

Blake3 SRI hash: blake3-...
SHA-1 SRI hash: sha1-...
```

This is because if a user wants to add a hash for a file, typical tools like sha1sum or blake3sum won't produce SRI strings without extra piping shenanigans; the alternative is running an action, having it fail, and copying the good hash out of the error message.
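As a rough sketch of what such a helper would need to do (plain Python, not an existing buck2 command; BLAKE3 is omitted because hashlib doesn't ship it and it would need the third-party blake3 package):

```python
import base64
import hashlib
import sys
import urllib.request

def sri_hashes(url, algorithms=("sha1", "sha256")):
    """Download url once and return {algorithm: SRI string} for each algorithm."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with urllib.request.urlopen(url) as resp:
        for chunk in iter(lambda: resp.read(64 * 1024), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: name + "-" + base64.b64encode(h.digest()).decode()
            for name, h in hashers.items()}

if __name__ == "__main__":
    for name, sri in sri_hashes(sys.argv[1]).items():
        print("%s SRI hash: %s" % (name, sri))
```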

ndmitchell commented 10 months ago

I think we have a feature where, if the RE system is set to SHA1, we can check whether the download is already in the RE server and skip the download. That only works if the RE system uses the same hash.

I like following standards. I don't love that basically everything generates hex values easily, while these are base64. That probably makes it worse for many users in a practical sense.

thoughtpolice commented 10 months ago

I'd be interested in what people at Meta do in the example I'm about to give, but in practice my workflow looks like this, basically 100% of the time:

1. Write the `download_file` call with a dummy (or empty) hash.
2. Run the build and let the download action fail its integrity check.
3. Copy the real hash out of the error message and paste it into the rule.

This is identical to what I do when using/writing Nix code. In other words, the format of the hash doesn't matter in practice; I simply use the hash the tool returns to me, "trust-on-first-use" style. It is pretty much never more efficient for me to manually download the file and run a separate hashing tool[^1]. So whether it was base62, base64, or base38, it would all be the same at the end of the day. (I have another bug to file about the error messages too, but that's for another time.)

I guess upstream could have attached e.g. a SHA256SUMS file, in which case you could copy/paste the hashes from there, which would be convenient. But you can also convert between formats in that case, if it's really so onerous. I can't say I've ever copied hashes from a source that way; downloading and failing the hash check is the fastest way, 99% of the time.
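The conversion really is trivial; e.g. turning a hex digest copied from a SHA256SUMS file into an SRI string:

```python
import base64

hex_digest = "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"  # sha256 of "hello"
sri = "sha256-" + base64.b64encode(bytes.fromhex(hex_digest)).decode()
print(sri)  # sha256-LPJNul+wow4m6DsqxbninhsWHlwfp0JecwQzYpOLmCQ=
```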

The `buck2 download` suggestion is so you have a simple utility for e.g. writing scripts, and there are other tools that can produce SRI-compliant hashes. But the TOFU-style "let it fail, then copy/paste it" approach is what people would actually do when writing BUCK files day to day.

In any case, this one is probably a bit unquantifiable without some actual user testing, I'd guess...

[^1]: Unless the file is ridiculously, ridiculously large, in which case a "pre-flight" check might be useful, but in Buck's case, the configuration hash is part of the local file path for a download, so unless you knew what the filepath was in advance, you'd have to download it all over again anyway.

ndmitchell commented 10 months ago

We have plenty of things that automatically generate hashes, e.g. the Reindeer tool for Rust dependencies. It's those which would be inconvenienced by base64 (although it's not the end of the world, they could be adapted). I'll ask around internally - SRI does seem like a reasonable direction to go though.

thoughtpolice commented 10 months ago

Ah, yeah, I forgot about Reindeer and other tools. FWIW, Nix still allows the old-style hashes it used, along with SRI hashes, and doesn't throw warnings or anything like that because — yeah, in practice you're basically chained to backwards compatibility here forever, and it's not the end of the world to support both APIs. So a third option is just "live with it forever" I guess, which works for us.

cormacrelf commented 2 months ago

I think SRI is pretty annoying, entirely because the special characters in base64 make it hard to double-click select them, which is about the only thing you ever have to do with a hash. The spec does seem to allow for extensions though. So better than nothing.

My more important comment is this:

I am testing a little Buck version of Aspect's rules_js, and buckifying anything with NPM is horrifically slow because of having to query NPM's API for the shasum (i.e. SHA-1), as the `sha512-...` SRI hashes stored in all Node lockfile formats can't be used. Compared to the 30-to-120-second install times for the kibana benchmark on https://blog.aspect.build/rulesjs-npm-benchmarks, I'm already up to 15 minutes and it has only queried 15% of the 4400 hashes. These are cached in an SQLite database, of course. But still: 100 minutes projected! And I will probably exhaust the rate limiter at some point, or maybe the serial nature of the Python script here will save me from that.
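For context, those hashes are already sitting in the lockfile; a small sketch (plain Python) of pulling the `integrity` fields out of an npm v2/v3 `package-lock.json`:

```python
import json

with open("package-lock.json") as f:
    lock = json.load(f)

# npm lockfile v2/v3: "packages" maps node_modules paths to entries,
# each carrying an SRI string in its "integrity" field.
for path, pkg in lock.get("packages", {}).items():
    if "integrity" in pkg:
        print(path or ".", pkg["integrity"])  # e.g. node_modules/lodash sha512-...
```

If `download_file` accepted these directly, none of those NPM API round-trips would be needed.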

So whatever format you want to encode them in, I think Buck needs to support more hash algorithms.

Couple of notes about using SRI, regardless:

- [Bazel's `http_file` rule](https://bazel.build/rules/lib/repo/http) calls the attribute `integrity = "sha512-..."`, which matches e.g. the HTML `integrity` attribute that SRI defines for browsers.