V3 idea: Let the CAS be encrypted + the AC be encrypted/signed

EdSchouten commented 4 years ago

Right now the CAS and the AC are not encrypted and/or signed. This means that systems that store the data have full access to the entire data set. This is bad for confidentiality. Even though the CAS is immutable, the AC can easily be tampered with.

At first glance, it would be trivial to encrypt the CAS: simply apply some symmetrical encryption on top of it and only let clients and workers have access to the key. Unfortunately, this wouldn't allow storage infrastructure to implement GetActionResult() anymore, as ActionResult messages reference Tree objects. GetActionResult() is supposed to touch everything referenced by the Tree. The Tree would need to be decomposed into an encrypted and a non-encrypted portion.

Encrypting the CAS also doesn't allow us to build schedulers that don't have the encryption key, as those need to parse the Action and the Command to be able to extract do_not_cache and platform properties to route the request properly.

EricBurnett commented 4 years ago

Interesting idea. Can you start by being clear about the threat model you want to defend against? To evaluate a set of changes and say whether they'll be sufficient to remove an attack vector, we'll need to be more specific about who should/shouldn't be trusted to do what.

From what you're describing, client-side encryption would probably suffice for the CAS - encrypt blobs and store those in the CAS, and decrypt post-retrieval. Then the server in the middle has no access to the data, and can only see the blobs you upload unencrypted for server use (e.g. Command messages, tree nodes, etc).

For ActionResult / action cache, I could see attaching some sort of signature from wherever the action was actually executed. I don't exactly know how to implement that, since in most Remote Execution contexts the worker doesn't see any REAPI messages to be able to sign them, but seems solvable. Some sort of scheme to sign command+inputs+outputs with context and signer? I think something of that sort could be added to V2 probably, since it's mostly optional?

EdSchouten commented 4 years ago

Interesting idea. Can you start by being clear about the threat model you want to defend against? To evaluate a set of changes and say whether they'll be sufficient to remove an attack vector, we'll need to be more specific about who should/shouldn't be trusted to do what.

There are a couple different cases:

Assume that you operate a build cluster that makes use of third-party storage infrastructure (e.g., a cloud provider, but it could also be a different department within your organisation). You may not necessarily want to disclose the contents of the CAS and AC to the team maintaining your storage infrastructure.
Assume that you set up multiple build clusters, each for a separate tenant. From a cost perspective, it may be attractive to move all data into a single CAS. Workers you can easily scale up/down, but storage is harder. By using a single CAS, per-tenant peaks may be flattened out. Still, you don't want everybody to see each others'. You could virtualise storage to have per-tenant namespaces, but another solution would be to just encrypt the blobs.
Assume you have a remote build cluster somewhere, but accessing it is slow for people in a given location. You decide to speed up the builds for people in that location by installing an office cache. Theft of such an office cache could be a disaster. Furthermore, the office cache may currently mutate ActionResults.

From what you're describing, client-side encryption would probably suffice for the CAS - encrypt blobs and store those in the CAS, and decrypt post-retrieval. Then the server in the middle has no access to the data, and can only see the blobs you upload unencrypted for server use (e.g. Command messages, tree nodes, etc).

The problem then becomes how you identify these objects then.

Do you still use the original digests as identifiers for them? If so, you are somewhat disclosing the contents of the unencrypted files. Just take a whole bunch of commonly used Open Source projects, SHA-256 all of their source files and you've got a huge catalogue that would allow you to (dis)prove whether a build action depends on certain codebases.
Do you compute a new digest based on the encrypted data? If so, how will you be able to find these objects again?

For ActionResult / action cache, I could see attaching some sort of signature from wherever the action was actually executed. I don't exactly know how to implement that, since in most Remote Execution contexts the worker doesn't see any REAPI messages to be able to sign them, but seems solvable. Some sort of scheme to sign command+inputs+outputs with context and signer? I think something of that sort could be added to V2 probably, since it's mostly optional?

A worker could sign the ActionID+ActionResult, allowing a client to validate that the action was indeed executed on a kind of worker that was intended to run these actions.

ulfjack commented 4 years ago

You can encrypt the digests with a symmetric algorithm and use that as identifiers to prevent a dictionary attack. It seems preferable to layer encryption on top of storage than to embed it into the CAS protocol. Use a CAS proxy that has the encryption keys and encrypts in and out of the underlying infrastructure.

EricBurnett commented 4 years ago

Do you compute a new digest based on the encrypted data? If so, how will you be able to find these objects again?

Assuming a stable encryption scheme (same input bytes + same key = same output bytes), it'd work pretty much as now: to download a blob, you get a reference to the bytes by hash, download them, only now you also decrypt post-download. To upload, you encrypt, hash the encrypted bytes, then upload those. To check existence of a file you know you encrypt, hash the encrypted file, and check existence based on that.

I'm no security expert, but I'll note offhand this misses at least some desirable properties: no session keys and so no forward security, same key across all clients and so high-risk secret, needs to be fast encryption because it's on the critical path, etc. You could maybe do better than that if you had a secondary system for managing keys that the client talks to ("I want the key to use for unencrypted blob X" -> per-blob key Y, plus arbitrary identifier Z to embed in the encrypted blob; then on the decryption side "this encrypted blob embeds arbitrary id Z; please give me the key to use to decrypt it" -> Y), but that also sounds like it's have severe performance issues at scale, and requires yet another system, so maybe that's a bad idea.

If you're willing to trust your Cloud provider to see unencrypted bytes in-flight but want them encrypted at rest, you could also ask they implement something equivalent to Customer-supplied or customer-managed encryption keys (Google; I'm sure other Clouds have equivalents). But that doesn't sound like it addresses the attack model you're interested in here.

A worker could sign the ActionID+ActionResult

That requires making ActionID and ActionResult known to the worker. For RBE at least we use the RWAPI and don't use REAPI messages with the worker, so this is net new coupling. I'm also not sure it's quite correct - the ActionID implies things like Platform which are scheduling details that the worker is not necessarily able to validate. All the worker can confidently assert is who it is, what properties it has, and what it ran; Platform is more about whether this worker was supposed to run the action, which should be separated out and signed by the scheduler instead I think?

sluongng commented 11 months ago

Reading this from the V3 doc, I think we should split the problem into 2 halves:

Support for RBE encryption.
Action Cache provenance.

For (1), my current thinking is biasing toward a new RPC similar to rpc PrepareWorkspace(ppwRequest) ppwResponse. The idea is that the client should be able to notify the scheduler regarding upcoming action executions, and the requirements for the workers that will pick up these executions, among which are security encryption/decryption keys and unhermetic secrets inputs. The server will prepare these workers and return an affinity ID to the client to use in ExecuteAction RPC. This is a formalized version of what people already doing today: encoding a big blob of JSON into Platforms' exec_properties. Except that with this dedicated API, these "side effects" could be opt-out from the client's computation of action keys for AC.

For (2), I am specifically interested in integration with SigStore as it is the current emerging solution in this space with many container and npm registries already started to adopt the solution. At least, the underlying storage and verification model based on Google's Trillian project is very appealing as it comes with existing tooling for independent audits and monitoring of AC entries. Additional reading here: https://docs.sigstore.dev/about/threat-model/ for those who are curious.

Splitting the 2 problems into separate issues will allow us to ship these independently.

bazelbuild / remote-apis

V3 idea: Let the CAS be encrypted + the AC be encrypted/signed #133