ethereum / consensus-spec-tests

Common tests for the Ethereum proof-of-stake consensus layer
MIT License
69 stars 22 forks source link

Write binary data in yaml canonical format instead of hex encoded strings #5

Open prestonvanloon opened 5 years ago

prestonvanloon commented 5 years ago

We're finding that unmarshalling yaml is very cumbersome with hex encoded strings as binary data.

Example: ssz: '0x00000000' cannot be inherently unmarshaled to a byte slice in go. go playground link.

Looking at the spec for yaml, this data should be represented as a base64 binary string with the prefix !!binary.

Edit: I no longer support adding !!binary as it is entirely unnecessary with modern libraries understanding that Base64 encoded strings are binary data. However, the tag resolution in the yaml spec for this scenario is unclear. It's probably safer to add !!binary explicitly.

paulhauner commented 5 years ago

We configured our parser to handle this quite easily, so it's not a problem for us.

I'm not opposed to switching if it's a common problem, but my preference is that we keep the YAML and JSON representations of binary data the same, as to reduce total system complexity.

EDIT: When I refer to JSON, I'm referring to the minimal validator API spec. Here signatures, roots and other binary data are represented as 0x00...

prestonvanloon commented 5 years ago

Still investigating this proposal. We have been using hex encoded strings in eth1 for some time now and have existing functionality for that with JSON encoding in go-ethereum.

Example block returned from json-rpc in eth1.

{
    "number": 3,
    "hash": "0xef95f2f1ed3ca60b048b4bf67cde2195961e0bba6f70bcbea9a2c4e133e34b46",
    "parentHash": "0x2302e1c0b972d00932deb5dab9eb2982f570597d9d42504c05d9c2147eaf9c88",
    "nonce": "0xfb6e1a62d119228b",
    "sha3Uncles": "0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347",
    "logsBloom": "0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",
    "transactionsRoot": "0x3a1b03875115b79539e5bd33fb00d8f7b7cd61929d5a3c574f507b8acf415bee",
    "stateRoot": "0xf1133199d44695dfa8fd1bcfe424d82854b5cebef75bddd7e40ea94cda515bcb",
    "receiptsRoot: '0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421',
    "miner": "0x8888f1f195afa192cfee860698584c030f4c9db1",
    "difficulty": '21345678965432',
    "totalDifficulty": '324567845321',
    "size": 616,
    "extraData": "0x",
    "gasLimit": 3141592,
    "gasUsed": 21662,
    "timestamp": 1429287689,
    "transactions": [
        "0x9fc76417374aa880d4449a1f7f31ec597f00b1f6f3dd2d66f4c9c6c445836d8b"
    ],
    "uncles": []
}

Suggesting a shift from hex encoded strings to base64 encoding would be annoying to say the least.

Will follow up with a bit more before closing this out, but it's looking like it should be hex encoded.

prestonvanloon commented 5 years ago

It looks like go-ethereum uses type overrides (ex) which uses gencodec to create custom json/yaml parsers to map to go structs (ex). So, it looks like we can generate some extra code to support hex encoded strings to represent binary data and close out this issue.

The follow up, if its worth following up, is why are we using hex encoded strings to represent binary data in the first place instead of the seemingly more intuitive Base64 encoding?

arnetheduck commented 5 years ago

one reason to use one encoding over another is that it should ideally match whatever is used in canonical value printers so it's easy to compare logs with yaml files..

prestonvanloon commented 5 years ago

Good thinking. Clients are more likely to print hex encoded strings for binary data.

prestonvanloon commented 5 years ago

Reopening this. We've found it too infeasible to writing special decoding/encoding support for these byte arrays as hex encoded strings. Instead, we overwrite the hex strings to base64 in the yaml, then ingest that data to the tests.

With regards to "do what the JSON API does", I recant my earlier statement and I don't support hex encoded strings in JSON going forward. I see little benefit to hex encoded over base64 and it means we have to write a special work arounds to support it.

The latter comment is probably for a different discussion topic.

prestonvanloon commented 5 years ago

Another note in support of base64, we've found that these yaml files are saving about 30% space on disk as base64 rather than hex encoded.

protolambda commented 5 years ago

For size, we could also compress, encode suites in SSZ (or other), or split mainnet suites (huge) from minimal suites (reasonable) in another repo. If there's an issue with size, clients should put that forward, and describe their resources / caching. But I don't think there's much of a problem (although yes, yaml encoded mainnet tests are very large).

Generally I think it's perfectly fine to load tests in a convenient way, i.e. load into a map, translate python names and hex values, and load into the appropriate types. Better than describing all new types just for testing. In Go specifically, I just load them through https://github.com/mitchellh/mapstructure and do some translation, then put them directly in the types I use for state-transitioning. No extra code, or per-type conversion. And while I agree it's far from perfect, it's affordable (tests loading doesn't take that long) and easy.

You could generate structs with decoder definitions or use a custom yaml decoder, but I really just think running (and then passing) the tests at this stage is worth much more than (over-)engineering the test runner. Eventually, yes, it would be very nice. But first things first.

prestonvanloon commented 5 years ago

It's not necessarily an issue with size, but difficulty of using existing tools with hex encoded strings for seemingly no benefit.

protolambda commented 5 years ago

If other clients are in for a change too, we can change it. But don't want to break the working setup of anyone right before spec freeze.

arnetheduck commented 5 years ago

The point of yaml seems to be to optimize utility for humans, where hex is more commonly used - ie echo hello | sha256sum prints in hex.

Else we might as well go with a binary format that can be memory-mapped instead of parsed - that would at least net a simplification and speedup of test-running.

From nimbus, we're fine with keeping things interoperable and simple with hex too, and not too much yaml magic.

prestonvanloon commented 5 years ago

Thanks for the context. I now understand some of the reasoning that went into choosing hex for yaml, but I disagree with this decision. When we need to read the data as humans, we would likely do that in some form of test runner. I.e. Method failed. Wanted 0xf2d, got 0xabb. This formatting could be done with whatever stdout printer is being used without subjecting others to the cost having yaml data as hex.

We (and others) have had to write custom yaml processing logic or workarounds to support hex encoded strings in over 1 million of lines of yaml in this repository.

Maybe it is too late to undo this decision before spec freeze, but we think it is a technical debt that we'll be paying for a long time.

protolambda commented 5 years ago

@prestonvanloon Can you form a concrete counter proposal, to be executed after spec freeze, if it works for everyone?

prestonvanloon commented 5 years ago

This issue is my proposal. Base64 is the canonical format for binary data representation in yaml. Nearly every library supports ingesting base64 as binary.

Sources: https://yaml.org/spec/1.2/spec.html https://yaml.org/type/binary.html

Canonical: Base64 without white space and line breaks.

protolambda commented 5 years ago

Was thinking of something more radical. If we give up on readability, there should be much better options.

prestonvanloon commented 5 years ago

We have no preference on readability or format as long as we don't have to waste time implementing things that already exist. YAML is fine if we use YAML as intended. JSON is fine because the JSON spec suggests Base64 as well.

This isn't giving up on readability. YAML is readable.

djrtwo commented 5 years ago

We don't intend to use any of the fancy !! YAML notation due to the more simple parsers not supporting this usage. The 0x is super common in ethereum and human readable and speakable.

I think we give something up if fields in tests such as pubkeys become difficult to discuss due to the base64 notation, especially considering pubkeys will almost certainly be represented in hex to users.

prestonvanloon commented 5 years ago

@djrtwo For some additional context, we have wasted at least one eng week to write tooling to support non-canonical formatting with our standard libraries. I think it is a bad idea to use hex for everything.

It makes sense when you need representation to be human readable for end users, like a wallet address in UI, but binary data for test ingestion is not one of those use cases. This data is not end user facing and is causing a lot of pain for little/no benefit.

protolambda commented 4 years ago

So pretty much all data, unless it's not a SSZ object, will also be available as ssz-encoded raw bytes in the next testing release. See https://github.com/ethereum/eth2.0-specs/pull/1320 An example can be found here: https://github.com/ethereum/eth2.0-spec-tests/tree/tests-v082-reorg/tests/minimal/phase0/operations/deposit/pyspec_tests/success_top_up

So effectively no yaml / hex encoded data, unless for a small test metadata detail (which shouldn't touch ssz/production code anyway). Does this work for Prysm @prestonvanloon ?

prestonvanloon commented 4 years ago

unless for a small test metadata detail

What does this mean? There will still be some required yaml?

Switching to ssz encoded data will most likely work for us. Thanks for the recent improvements.

protolambda commented 4 years ago

@prestonvanloon there are a few small files that just say things like "there are 10 deposits to read from ssz files", "there are 6 blocks to read", "BLS required here". Etc. etc. Nothing hard to parse really.