Blockstream / asset_registry

registry for liquid assets
5 stars 2 forks source link

Remove requirement for lexigraphic ordering of JSON file. #9

Open allenpiscitello opened 4 years ago

allenpiscitello commented 4 years ago

There's no reason to have a strict requirement of how someone constructs the JSON file as long as it has the valid fields. This leads to errors where people create assets that cannot be registered due to arbitrarily strict rules.

stevenroose commented 4 years ago

Well it IS very important that the fields be sorted lexicographically when calculating the contract hash.

If the asset registry now requires submitting contracts in correct ordering (I don't know if it does), it would be to avoid users calculating the wrong contract hash. If they would hash the JSON they inputted but it's not ordered correctly, they would have a different contract hash (and thus asset ID) as the registry.

The Elements RPC currently takes a contract hash as input. An alternative would be for the registry to have some /validateContract endpoint like I detailed at the bottom of the asset decimals document that returns the correct contract hash for the user to input.

allenpiscitello commented 4 years ago

The contract hash needs to be calculated on the full contract's hash, byte for byte. The only requirement is that the registry is able to store the full text that is hashed, and that it follows JSON standards.

Adding arbitrary undocumented requirements reduces usability.

shesek commented 4 years ago

If the asset registry now requires submitting contracts in correct ordering (I don't know if it does)

It does not, the lexicographical ordering requirement is only necessary for the purpose of calculating the hash. You can submit the JSON contract itself in whatever order you want. (maybe there are some instructions somewhere that say otherwise? I didn't write these so I'm not sure.)

The reason for this requirement is that JSON doesn't normally have any defined field ordering (from the JSON specs, RFC 7159: "An object is an unordered collection of zero or more name/value pairs").

In practice the ordering varies between different JSON parser/serializer implementation. Some of them order them lexicographically, some retain insertion order, and for some its just undefined and arbitrary. It's possible that de-serializing and re-serializing the same JSON object with the same implementation will spit back different ordering, which makes this very prone to errors when (de)serializing and moving JSON objects around.

Requiring an explicit well-defined order avoids these issues by having one canonical serialization format that everyone should use, regardless of their specific environment and parser. Without a canonical format, people trying to validate the contract commitment are very likely to get tripped by this.

Another consideration related to the registry implementation itself is that we use serde to represent JSON objects, which orders fields lexicographically and doesn't retain their original order. This would mean that the registry API would have to expose two contract fields on the asset JSON object, one with an actual JSON object that gets sorted by serde, and another one that contains the original JSON object as a nested serialized JSON string, i.e. {"asset_id":...,"contract":{"ticker":"FOO",...}, "committed_contract":"{\"ticker\":\"FOO\",...}"}, which doesn't seem very elegant.

An alternative would be for the registry to have some /validateContract endpoint like I detailed at the bottom of the asset decimals document that returns the correct contract hash for the user to input.

This repo actually has a CLI utility that can do that, documented here. liquid-asset-registry contract-json <json> would give you the canonicalized JSON string, adding --hash would give you its sha256 hash (in reverse). But this hasn't been open-sourced yet so its not really available for issuers.

Another option is to use a standalone library like json-stable-stringify, like so: node -p 'require("json-stable-stringify")({"ticker":"foo","name":"bar"})'

allenpiscitello commented 4 years ago

It does not, the lexicographical ordering requirement is only necessary for the purpose of calculating the hash. You can submit the JSON contract itself in whatever order you want. (maybe there are some instructions somewhere that say otherwise? I didn't write these so I'm not sure.)

When the hash does not match what the user created, then we have an issue.

Requiring an explicit well-defined order avoids these issues by having one canonical serialization format that everyone should use, regardless of their specific environment and parser.

There is no need to serialize, only deserialize. As long as the contract is saved in its original format, and the useful information is extracted.

This adds compelxity and burden to the user for the benefit of potentially simpler code (until a newer version of the library breaks all of this anyway by making some minor but valid JSON change).

Having a command line tool doesn't help use cases where people are generating themselves and are not calling from the command line tool.

This behavior means everyone must match EXACTLY what serde is doing for serialization (and all versions of serde need to match it in the future, something I highly doubt they guarantee since JSON is not intended to be canoicalized.

shesek commented 4 years ago

There is no need to serialize, only deserialize.

When only considering specifically what the server-side registry has to do during asset registration, yes. But I'm also considering third-parties writing code to register assets or to validate them, in which case they'll be dealing with JSON objects that move around, and are very likely to get tripped by a JSON implementation that uses different sorting.

Retaining the issuer's original JSON ordering would basically mean that everyone will always have to store and move the JSONs around as strings rather than as actual objects. The moment you try to deserialize the string into a JSON object, it'll lose the original ordering information with no way to recover it.

So for example if you're storing the contract JSON in a database, you would have to use string fields and not the specialized json field type, meaning that you can't utilize any special json capabilities. Or you'll have to store it twice and make sure the two versions are in-sync. And as mentioned in my previous comment, the same goes for the API layer, or any other layer that deals with these JSONs.

This adds compelxity and burden to the user for the benefit of potentially simpler code

The benefit is not simpler code, it is having canonical encoding to reduce the foot-gun potential of undefined field ordering leading to errors.

This behavior means everyone must match EXACTLY what serde is doing for serialization (and all versions of serde need to match it in the future, something I highly doubt they guarantee since JSON is not intended to be canoicalized.

No, it means that everyone has to use the typical lexicographical sorting (which is quite common with JSON for purposes like these and has implementations in many languages, like json-stable-stringify for javascript), and that if serde ever changes to do something different then we'll have to change the registry implementation to do lexicographical sorting using some other way.

allenpiscitello commented 4 years ago

This behavior has already resulted in footguns, with users getting unexpected behavior, or issuing assets that are perfectly valid JSON contracts that cannot be registered.

So for example if you're storing the contract JSON in a database, you would have to use string fields and not the specialized json field type, meaning that you can't utilize any special json capabilities. Or you'll have to store it twice and make sure the two versions are in-sync.

Since the JSON can have arbitrary extra fields created by the user, there's no need to store this as a database - we only need the 5 or so fields that we actually care about.

Right now this tool is unusable by most people who deviate from our simple example.

stevenroose commented 4 years ago

@shesek, serde actually doesn't do lexicographical ordering at all. For Value, it usesHashMapwhich scrambles ordering. But you can deserialize into aBTreeMap` (Riccardo pointed out to me) and then reserialize to canonicalize it.

But f.e. if you have a Rust struct, it might just use the struct's field order, so you'd have to serialize, deserialize as BTreeMap and reserialize to get the canonical JSON.

For the record, I'm quite indifferent on this. I think there's merit to either side.

Regardless of what we decide, I think a /validateContract endpoint is really valuable for other reasons as well. But remember that we want Liquid to be an ecosystem. Our registry might not stay the predominant registry. Other registries can have laxer/stricter rules. But this is also not a rule set in stone, we can always relax our rules later if we see fit.

shesek commented 4 years ago

Since the JSON can have arbitrary extra fields created by the user, there's no need to store this as a database - we only need the 5 or so fields that we actually care about.

Again, I was thinking about this more broadly, not just our own usage but also third parties dealing with this contract json in ways that are unknown to us.

Right now this tool is unusable by most people who deviate from our simple example.

Right now there are clear and explicit instructions that people can follow to make sure they use the same serialization as everyone else will be using. There are plenty of ways to lexicographically order JSON objects. IMHO the alternative introduces more complexities.

serde actually doesn't do lexicographical ordering at all

As far as I can tell, serde (when compiled without the preserve_order feature) uses a BTreeMap as its map implementation for the inner Value::Object type.

shesek commented 4 years ago

To summarize: the current implementation lets you store/process/transport the contract as an actual json object, with a convention that allows everyone to serialize it in the exact same way using information available from the object representation alone.

The alternative you're proposing would require to always store/process/transport the contract as a plain string and be aware that you're not treating it as an object in places where you shouldn't, causing it to unintentionally lose the ordering information. This would either result in having two duplicated fields (in storage, in structs definitions, in the API layer, etc) or in additional implementation complexity/inefficiency/inconvenience to go from strings to objects on-the-fly instead of having the object representation directly available for use.

alessandro-saglimbeni commented 4 years ago

Right now there are clear and explicit instructions that people can follow to make sure they use the same serialization as everyone else will be using.

I think here's where the problem lies. We don't have a clear documentation of what the registry does and checks, we only have an illustrated bash example. I think we need some registry technical documentation, if we want people to be able to issue and register assets for their own arbitrary applications

shesek commented 4 years ago

We don't have a clear documentation of what the registry does and checks

I gave a shot at documenting the current expectations here, including some snippets for canonicalizing the json and getting the contract hash.

shesek commented 4 years ago

Ah, I just realized I didn't document details about the domain ownership proof and asset deletion. I can complete that too if needed.

alessandro-saglimbeni commented 4 years ago

the current expectations here

@shesek there's no mention to the requirement of ticker uniqueness under each domain, or did I miss it?

details about the domain ownership proof and asset deletion. I can complete that too if needed.

I think it's worth having those too, if you can add them. Then we should publish these docs, either with a monkey publishing the docs on to docs.blockstream.com (could be me), or by opensourcing this repo.

shesek commented 4 years ago

@alessandro-saglimbeni Added to https://gist.github.com/shesek/94ae7774906107c2369d3c577a0d70c3

shesek commented 4 years ago

Another thing not documented there is how to independently verify that the asset issuance commits to the metadata and that the domain ownership proof exists. This can be done with the CLI utility, but its not open-sourced yet...

alessandro-saglimbeni commented 4 years ago

I'd prefer from our sample registered asset doc to reference directly to this repo and specifically to that document that Shesek prepared, rather than replicating info, so that issuers can also understand what's the registry at all.

@allenpiscitello @greenaddress is there any blocker to open-sourcing this repo, other than reviewing before release and perfecting a deployment process?