automerge / automerge-classic

A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.
http://automerge.org/
MIT License

consider identifying changes via multihashes as opposed to SHA-256 #310

Closed Gozala closed 3 years ago

Gozala commented 3 years ago

Multihashes tag hashes so that it is easy to upgrade or change hashing algorithms without breaking the protocol / data format. I think it might be a good idea for Automerge 1.0 to use them instead of baking in the SHA-256 decision, for all the reasons the linked page goes into.

There are implementations of it available in both JS and Rust.

ept commented 3 years ago

Hi @Gozala, this is an interesting idea, thanks for pointing it out. We definitely need the ability to change hash functions, because it could happen that one day SHA-256 is found to be broken.

Automerge already has a version tag in its data format, which allows us to make future changes to the format. However, a version number bump breaks forward compatibility: that is, older software cannot read data in a newer version. Thus, I understand the desire to be able to change the hash function without having to change the format, and thus retaining forward and backward compatibility.
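To illustrate the forward-compatibility problem with a hypothetical header (not Automerge's actual binary layout): a reader that only understands version 1 must reject version-2 data outright, while a version-2 reader can still accept version-1 data.

```javascript
// Hypothetical format header: first byte is the version tag.
function readHeader(bytes, supportedVersions = [1]) {
  const version = bytes[0];
  if (!supportedVersions.includes(version)) {
    // Older software cannot read newer data: forward compatibility breaks.
    throw new Error(`unsupported format version ${version}`);
  }
  return { version, body: bytes.subarray(1) };
}

console.log(readHeader(Buffer.from([1, 0xca, 0xfe])).version); // 1
try {
  readHeader(Buffer.from([2, 0xca, 0xfe])); // newer data, older reader
} catch (e) {
  console.log(e.message); // unsupported format version 2
}
```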

However, I don't think the approach taken by multihashes is really the right way. In particular, I believe having a long list of supported hash functions is a bad idea, for reasons explained by Filippo Valsorda: the more options are supported, the more opportunities there are for things to go wrong. A cryptographic system should support only as many options as absolutely necessary, and keep those options well maintained. Too many choices can easily lead to downgrade attacks, where an attacker chooses an insecure combination of algorithms to break the security of a system.

The multihash approach tags every hash with an algorithm identifier and length. In theory, this allows changing the hash function without having to change the data format. However, in practice, if you want to change the hash function, you also have to change the software to support that hash function (simply treating the hash as an opaque string of bytes without computing the hash function is not sufficient for our purposes). Thus, supporting a new hash function unavoidably requires upgrading the software anyway, so you might as well bump the data format version and declare support for the new hash function in the updated version. Having each hash individually tagged with the type of hash function doesn't really help.

Tagging each hash with a hash function type would only make sense if we preemptively add support for multiple hash functions from the start. However, as long as there are no known weaknesses in SHA-256, supporting multiple hash functions just adds complexity for no benefit. If one day we need to replace the hash function, I think we should do it by bumping the data format version number (and that new format can, if necessary, include support for using multiple hash functions side-by-side for a transition period). For this reason, I don't think we should adopt multihashes.

Hope that makes sense — but please let me know if you think this reasoning is flawed.

alexjg commented 3 years ago

I agree with this. I have a few thoughts to add that I hope might be interesting. Multihashes are designed to allow generic traversal of hash-linked graphs of data on IPFS (via IPLD). Whilst I agree with Valsorda's "registries are a code smell" stance, I think having a registry makes sense in that context because the goal is explicitly to allow many different kinds of hash-identified data to link to each other: you've sort of accepted the security risk as a basic premise of the entire system, and hopefully it can be mitigated by sharing a small number of well-audited multihash implementations between all applications. This is a very different design goal from the Automerge data format.

Not using multihashes internally does slightly constrain us if we ever want to put Automerge data on IPLD. We can imagine writing an IPLD multicodec which parses Automerge files and allows traversal through to dependent changes stored on IPFS. In this scenario we would depend on all the dependent changes having been uploaded to IPFS with a SHA-256 CID. IMO this isn't really much of an issue, though, because a) even if we used multihashes internally we would depend on the changes having been uploaded using the same hash as in the document, which seems very error-prone, and b) uploading each change individually is probably not going to be feasible for many applications, so this is sort of an edge case of an edge case.

Gozala commented 3 years ago

> However, I don't think the approach taken by multihashes is really the right way. In particular, I believe having a long list of supported hash functions is a bad idea, for reasons explained by Filippo Valsorda: the more options are supported, the more opportunities there are for things to go wrong. A cryptographic system should support only as many options as absolutely necessary, and keep those options well maintained. Too many choices can easily lead to downgrade attacks, where an attacker chooses an insecure combination of algorithms to break the security of a system.

@ept I absolutely agree with this, however I think there is a bit of misunderstanding here. Multihashes do not mean supporting all the hashing functions that are in that table. They just provide an abstraction in which you can supply a set of implementations without having to change the system. E.g. in the IPFS ecosystem you can have nodes deployed with several common hashing algorithms configured, but there are also custom setups where entirely different hashing algorithms are used.

In terms of Automerge, I see it as a way to let each system decide what algorithm(s) to support without having to alter the library or extend the format. I imagine the default table would contain just SHA-256.

ept commented 3 years ago

Yes, I understand that it's not necessary to adopt the default table of 461 hash functions. But my arguments still hold: even supporting two hash functions is, in my opinion, too many at this stage. When in the future we need to add support for a second hash function, we'll bump the format version number, since the software will need to be updated in any case.

Gozala commented 3 years ago

> When in the future we need to add support for a second hash function, we'll bump the format version number, since the software will need to be updated in any case.

The argument I was making is that multihashes enable individual deployments to:

  1. Update hashing algorithms
  2. Swap hash algorithms (for reasons other than SHA-256 becoming inadequate)
  3. Support multiple algorithms (to ease upgrade or support interop with other systems)

All of the above would otherwise require either consensus plus a format update, or forking. Additionally, a future format change would either imply supporting the older hashing algorithm for interop, or upgrading all prior data, which tends to be difficult in decentralized systems.

In my view, the value of multihashes is enabling users of Automerge to make choices about when to upgrade and what algorithms to support, although admittedly it creates room for incompatibility, as groups can make different choices.

ept commented 3 years ago

It comes down to a difference in philosophy, I guess. I prefer to keep the number of user-configurable options as small as possible, since every option makes the documentation longer and the system harder to use. (For example, many CRDT libraries give users a choice between add-wins and remove-wins semantics, but Automerge only provides add-wins, because I have not seen any compelling need for remove-wins.) This is especially true when it comes to cryptographic primitives, which require a bit of expertise to configure correctly. There have been lots of insecurely configured HTTPS servers because OpenSSL supports too many insecure options (e.g. weak export ciphers), and some administrators don't know how to disable them. This is something I want to avoid. For this reason I don't think the hash function should be configurable.