deislabs / bindle

Bindle: Object Storage for Collections

Apache License 2.0

262 stars 37 forks source link

Improving the Cryptographic Integerity of a Canonical Bindle Invoice #292

Open fibonacci1729 opened 2 years ago

fibonacci1729 commented 2 years ago

The Problem

TL;DR: The canonical invoice omits semantic metadata relevant to the integrity of a bindle thus losing the guarantee of preserving the signed intent.

Currently, from the bindle spec, the merkle root of an invoice is computed by hashing the concatenation of the following pieces of data together in a line-separated (\n) UTF-8 string:by, name, version, role, at and the label.sha256 of each parcel.

E.g:

Gnome Grumpsky <gnome.grumpsky@example.com>
mybindle
0.1.0
creator
1611960337
~
e1706ab0a39ac88094b6d54a3f5cdba41fe5a901
098fa798779ac88094b6d54a3f5cdba41fe5a901
5b992e90b71d5fadab3cd3777230ef370df75f5b

What is not included in this construction is the integrity of the metadata that is semantically relevant (i.e. groups, features, conditions, annotations, and parcel metadata).

A possible attack vector arises for injecting/deselecting parcels from a bindle whereby an attacker that compromises a Bindle can change a bindle's invoice to augment how parcels are selected without corrupting the bindle's merkle root.

To preserve the cryptographic integrity of the invoice, all relevant semantic information should be included in the canonical invoice before computing the merkle root digest.

Proposed Solution

TL;DR When creating the canonical invoice of a bindle, include all semantically relevant metadata when computing the merkle root digest.

As currently defined by the spec, the canonical invoice of a bindle is structured as the concatenation of:

<by>
<name>
<version>
<role>
<timestamp>
~
<parcel digest 1>
...
<parcel digest N>

Omitting top-level metadata groups and annotations, as well as the following parcel metadata: annotations, conditions, features, media-type, name, and origin.

To guarantee integrity of an invoice, we should include these omitted pieces of metadata.

I.e,

<by>
<name>
<version>
<role>
<timestamp>
*<annotations>
*<groups>
~
e1706ab0a39ac88094b6d54a3f5cdba41fe5a901
*<annotations> 
*<conditions> 
*<media-type> 
*<name>
*<origin>
...
~
098fa798779ac88094b6d54a3f5cdba41fe5a901
*<annotations>
*<conditions>
*<media-type>
*<name>
*<origin>

Where lines marked with * are the proposed additions.

Reproducibility & Ordering

Reconstructing the canonical invoice from a bindle's invoice.toml should be reproducible. To achieve this, the ordering of properties in each "block" (delimited by \n~\n) should be included in lexicographic order. As well, the ordering of blocks themselves is according to the lexicographic ordering of each parcel's content-digest, e.g. "DEADBEEF...".

Example

WIP

technosophos commented 2 years ago

This makes total sense to me. Do you want to add an example? Also, I think we might want to define how to determine the order of parcel fields. You show them in alpha order... that seems right to me.

itowlson commented 2 years ago

@technosophos As an implementer, I would find an example useful!

@fibonacci1729 When you talk about "the merkle root," am I right in understanding you are referring to the input of the invoice signing specification (https://github.com/deislabs/bindle/blob/main/docs/signing-spec.md#signing-on-the-invoice)? So this is describing an improvement to the signing algorithm, but not to any other part of the invoice structure or wire protocol? (Forgive me if this is obvious!)

fibonacci1729 commented 2 years ago

@itowlson That is correct! The invoice structure is unchanged. I am just proposing we include certain omitted invoice metadata into the signing operation (specifically the construction of the logical canonical invoice).

fibonacci1729 commented 2 years ago

@technosophos Do you have any strong objections to dropping the signing information from the canonical invoice, namely by, role, and timestamp? Perhaps i missed this in the spec, but is it the case that a canonical invoice is defined such that it has already been signed prior to construction and thus required semantically relevant info?

technosophos commented 2 years ago

I'll summarize here, and if we need to go into more details I can do so. But the idea with including that info was that I want to assert not just something about the content, but about the relationship between the content and the entity that signed it.

As pretext: When I verify a bindle, I can choose a variety of methods to apply. I can choose, for example, that I will only accept packages that were created by an entity that I trust. Or I will only trust packages that were hosted by an entity that I trust. Or I could make a really stringent claim like "I will only accept bindles where i can verify every single signature on the bindle". But it is up to the end user to decide what their verification rules are.

The idea of a role, in Bindle, is to allow an entity to assert (with constraints) that they are acting in a particular role. A creator is the one who first created the bindle. So they sign with that key. When a bindle is uploaded to a host, the host signs as a host role (and also verifies that there is exactly one creator whose public key is known to the host). [Note that the creator key verification is currently disabled, but will get enabled once we have all of this ironed out]

That brings us to the question you asked. When I sign, I am making the assertion that "I, Matt Butcher, am signing as a Proxy at , and this is what the package looks like when I sign." During verification, you should be able to make the assertion that "When Matt Butcher signed at X with role Y, the package looked the same as it looks right now to me"

It is probably the case that timestamp could be dropped. I'm not sure that buys us too much when it's in the signature block. The point there was to make it possible to construct an audit trail where one said "Matt's key was compromised at X, so don't trust any packages that Matt signed after X." But admittedly that is a leaky case anyway, and probably shouldn't be considered a good idea.

npmccallum commented 2 years ago

I understand the need for a canonical invoice. What I don't understand is Bindle wants to have non-canonical invoices.

Clients will need to grok the canonical invoice in order to validate the crypto. And they should never trust the semantic data without validating the crypto. This means that all clients need to understand the canonical invoice plus some other format. And further, because Bindle doesn't distribute the canonical invoice, the clients first need to parse another format before validating the cryptographic authenticity of the format. This means clients who have parsers with security issues are parsing untrusted input. This is the cause of nearly all the most serious security issues.

A much stronger cryptographic model is for Bindle to only, ever distribute the canonical invoice. Clients can validate its cryptographic integrity before parsing, reducing the exposure to security sensitive parser bugs.

fibonacci1729 commented 2 years ago

Hey @npmccallum! Thank you for the input!

This proposal is simply trying to formalize the ordering and set of relevant metadata that imply the semantic identity of a bindle, i.e. the canonical invoice.

You can't achieve semantic identity of a structured input that can't be normalized (e.g.. json, toml, etc.). Because, for example, consider the following json documents:

D1 {"x":42} and the it's white-space delimited version D2 { "x" : 42 }.

It's obvious these are semantically equivalent however structurally different byte-for-byte, i.e. H(D1) != H(D2)

Therefore, we can't rely on the structural identity of an invoice.toml to imply integrity.

To achieve semantic identity we need to define an ordering of elements of which comprises the semantically relevant metadata. The semantically relevant metadata of D1 and D2 is "x" and "2". Naturally, H("x", 2) == ID(D1) == ID(D2). Now we can establish an equivalence relation over the set of json documents of form{"x": <int>}.

Could you open a separate issue about the client working with the canonical invoice rather than the invoice.toml. This deserves it own discussion as currently the canonical invoice isn't really a file in the "on-disk" sort of way. As I understand the spec, it serves to illustrate the inputs into an identity function (a la secure hashing function).

npmccallum commented 2 years ago

@fibonacci1729 What use case requires semantic identity?

technosophos commented 2 years ago

I think we discussed the semantic identity thing at length on a call a while back, but here's the summary:

We want the server to be able to negotiate content type (JSON, TOML, YAML, etc). We have existing use cases for TOML and JSON
Invoice has a combination of immutable data (see above) and mutable data (yanked, yank signatures, signatures)
One floated use case has been to store the actual invoice data inside of an RDBMS not as a static document, but as tables.
We have seen the "hash the bytes of a file" thing fail repeatedly with OCI and other systems, and are trying to figure out a way around this. Semantic identity preserves what we actually care about (semantic immutability) without forcing us to care about irrelevant information (whitespace, hash ordering, etc).

To flip the question the other way... how do you envision yanking, signing, storage, etc. working if we hash the bytes? I don't see a way to do those things without introducing a new set of content types. (Yanking, signing, and verifying are not optional for our use cases, so we can't simply punt on them.)

npmccallum commented 2 years ago

I think we discussed the semantic identity thing at length on a call a while back, but here's the summary:

We want the server to be able to negotiate content type (JSON, TOML, YAML, etc). We have existing use cases for TOML and JSON

If all the clients must understand the canonical invoice content type in order to validate the negotiated content type then the negotiated content type doesn't have any value and adds additional complexity.

Do the existing use cases need to learn about the canonical encoding in order to validate the negotiated content type? If so, then they don't need TOML/JSON.

Invoice has a combination of immutable data (see above) and mutable data (yanked, yank signatures, signatures)

Combining mutable and immutable data in the same file seems to me like a choice fraught with security issues. It makes it very difficult for clients to know which data is reliable and which isn't. Security issues will arise from this as clients misunderstand which data is which. IMHO, these should be cleanly delineated to avoid confusion and compromise.

One floated use case has been to store the actual invoice data inside of an RDBMS not as a static document, but as tables.

This is an implementation detail. The RDBMS data can always be generated from the on disk invoice. It is an optimization cache, not the canonical record.

We have seen the "hash the bytes of a file" thing fail repeatedly with OCI and other systems, and are trying to figure out a way around this. Semantic identity preserves what we actually care about (semantic immutability) without forcing us to care about irrelevant information (whitespace, hash ordering, etc).

Today, an invoice is a mixture of semantic and non-semantic data (SHA256 hashes are non-semantic). The invoice also contains semantic data that confounds semantic equivalence. For example, two bindles that are semantically identical except for their name/version appear from Bindle as two distinct entities. This strikes me as actually contrary to the goal of determining semantic equivalence.

Researchers have been studying semantic equivalence and isomorphism to cryptography for decades now. This remains an unsolved problem. Deterministic encodings are the best solution we have. But trying to force semantic equivalence on a mixed semantic and non-semantic data doesn't actually solve this problem.

To flip the question the other way... how do you envision yanking, signing, storage, etc. working if we hash the bytes? I don't see a way to do those things without introducing a new set of content types. (Yanking, signing, and verifying are not optional for our use cases, so we can't simply punt on them.)

Signing is just: sign(name:version = invoice_hash). This creates an association between the name and version (i.e. the tag) of a bindle and the contents of an unnamed bindle. Use any of the existing signature types, including JWS. I'm not picky about the format. Doing it this way also means that the invoice is extensible over time.

Names and versions don't belong in the invoice especially if you're trying to do semantic equivalence. Two bindles with differing names, versions and signatures that are otherwise identical are semantically equivalent. And legal entities can, and have, forced renaming. If I've been deploying a bindle for years and a court orders me to change its name, under Bindle's current design I can no longer test for historical semantic equivalence. On the other hand, the signing as I've proposed it retains this property.

We should distinguish between yanking and deleting. Both need to be supported.

Yanking is the disassociation between a tag and the invoice hash. However, the invoice, and its parcels, can still be fetched by hash. This implies that systems attempting to deploy a bindle by name won't get the broken version. But the invoice/parcels that were previously associated with that tag are still available for forensic reconstruction. Yanking is typically done by the owner of the content in Bindle (i.e. the user). If signatures are required, then you basically have the user submit a sign(name:version = NULL). This revokes the previous association.
Deletion is the complete removal of all objects associated with a bindle. Examples of this include DCMA takedown letters, court orders and distribution of other legally complicated data (think: child pornography). Deletion is typically performed by the administrator of the Bindle server and is not available via the public API.

squillace commented 2 years ago

Just catching up here. @npmccallum some elaborations.

For example, two bindles that are semantically identical except for their name/version appear from Bindle as two distinct entities. This strikes me as actually contrary to the goal of determining semantic equivalence.

This I need to hear more about. I like your explanation of how you see it when you say:

Signing is just: sign(name:version = invoice_hash)

but when you get to

Names and versions don't belong in the invoice especially if you're trying to do semantic equivalence.

For example, I do not foresee forced renaming; a court may decide to argue that an invoice be deleted and recreated for naming reasons, but for renaming a thing that can only be recreated differently seems fine by me.

When you want semantic equivalence, you are arguing that naming should be a contrivance; I disagree, at least right now. It is true that otherwise the bindle would be "semantically equivalent" but I'm trying to understand what feature that gets you when you determine that by ignoring the names/versions_. I'm assuming that the feature there is "I can do the semantic equivalence without parsing a subtree of some sort?"

If so, I would think that we need to rethink the relationship between the bindle identity and computational semantic equivalence. If I understand correctly, the current proposal seeks to make the bindle the object of semantic equivalence, whereas you're arguing we should remove the name and version from that calculation.

Is that ☝️ understanding correct, to your way of thinking?

squillace commented 2 years ago

as to deletion, deletion is done by anyone who owns an instance of the server, and therefore must be a public API. Deleting is a thing anyone may choose to do with their own instance. Or may wish to offer. We might profitably assert that a delete call that could trigger a cascading effect might error out, but have an override (say, if someone didn't care whether another user broke).

Most prominent hosted versions might not expose this, I entirely agree. But a producer must not be prevented by the oss version from deleting their material. If someone else wanted it, they can use it another way. I was not aware that users of other people's code must be protected against the other producer deciding enough is enough and deleting (not merely yanking). With great power of usage comes great responsibility, IMHO. LeftPad 4 evah.

Is that an objective? I find DCMA a great example of external force being applied in a "lawful" way. But a single developer deciding that the artifact they built and control must be deleted now for no reason at all I find a completely reasonable scenario. Or perhaps I'm misunderstanding the deletion issue, which is also possible.

I would imagine, however, that a prominent service might have different terms of service, and that's where I think that division of responsibility lies, or should lie.

squillace commented 2 years ago

I'm not sure in the end whether these arguments are about this PR, or about how this project should work more generally. As name and version are already there, this PR doesn't modify that objection in any way that I can see. And as there were mutable and immutable values in the invoice over which bindle worked anyway, I'm not clear how that is modified by this PR.

If these thoughts are correct, I'd support implementing this approach and then opening issues on where mutable values should go if they should not live in the invoice along with immutable values as well as whether name/version should be handled a different way.