Rationale for RLP alternatives in Discovery v5?

FrankSzendzielarz commented 5 years ago

A couple of people have commented that it would be somehow more convenient to use SSZ at the Discovery layer. Right now I don't see any reason to switch away from RLP, but I'm curious anyway: does anyone agree with this notion, and what is the motivation/rationale?

hwwhww commented 5 years ago

1. SSZ has many better features for the consensus layer, see: https://notes.ethereum.org/s/rkhCgQteN#SSZ and https://github.com/ethereum/eth2.0-specs/issues/582#issuecomment-461605143
2. For the discovery layer and network layer, the question becomes what we want to use for messaging when the underlying layer is libp2p. Some implementers advocate using protobuf (https://github.com/ethereum/eth2.0-specs/issues/129, https://github.com/ethereum/eth2.0-specs/issues/503). Personally, I don't see a convincing argument for adding another serialization scheme that makes the protocol stack more complicated.

raulk commented 5 years ago

> 2. For the discovery layer and network layer, the question becomes what we want to use for messaging when the underlying layer is libp2p. Some implementers advocate using protobuf (ethereum/eth2.0-specs#129, ethereum/eth2.0-specs#503). Personally, I don't see a convincing argument for adding another serialization scheme that makes the protocol stack more complicated.

libp2p doesn't impose a particular serialisation format. It exposes plain byte-level readers and writers, so you are free to choose whichever wire format you prefer ;-)

EDIT P.S.: libp2p protocols like gossipsub, kademlia, etc. use length-delimited protobuf, but to the eyes of the rest of the libp2p stack, that's just an implementation detail.

FrankSzendzielarz commented 5 years ago

Most eth implementations already have RLP so is there any reason to add a further serialization format? I am told SSZ is rather bloated....with somewhat differing aims....

Mikerah commented 5 years ago

> Most eth implementations already have RLP so is there any reason to add a further serialization format? I am told SSZ is rather bloated....with somewhat differing aims....

In Phase 0, I see keeping RLP as a short term solution so that the v5 nodes are backwards compatible with the v4 nodes. However, in the long term, this is not necessary. After all, ETH2.0 is a separate chain.

FrankSzendzielarz commented 5 years ago

OK...in that case what would you propose as a wire serialization format? SSZ? If so, why?

Mikerah commented 5 years ago

There's already a proposal for the Wire API that uses SSZ. There has been a little bit of discussion lately about the design rationale of SSZ and whether it should be changed. @karalabe suggested an alternative serialization scheme, SOS (Simple Offset Serialization).

Perhaps a game plan that might be reasonable is the following:

- Keep RLP in order to have backwards compatibility with the legacy ETH chain to facilitate validator deposits
- Once we have settled on whether to keep SSZ or use some other scheme, then we can redefine the Wire protocol using that serialization format.

Thoughts?

karalabe commented 5 years ago

Seems to me that both RLP and SSZ are kind of arguing that "hey, we have this existing code, let's use it instead of figuring out what the best solution is for this particular task".

SSZ was just a quick idea from @vbuterin for how to do trie hashing. I really don't see why we'd want to enshrine something meant for a completely different purpose - and even not finalized - into the networking layer.
RLP is a bit better because it's proven to work + it was mostly designed to be compact, but it's again arguing existing code vs. figuring out why it's good.

The only meaningful way forward I see is to write up a list of requirements for the discovery wire format, and then we can pick a solution from there. My initial thoughts would be:

- The discovery protocol is running on UDP. We need a message format that is tiny, since we're limited by the roughly 1500-byte MTU on the routers. Anything larger and the protocol messages get dropped.
- Since the messages are tiny, I'm unsure adding compression is possible. Any compression algo will add some boilerplate of its own, and when you only have 1500 bytes, I'd need to see exactly what data which algo adds before I'd accept that compressing messages is viable.
- If compression is not viable, then we need a format that packs data as tightly as possible. This IMHO automatically rules out SSZ, which pads integers to their canonical sizes and uses 4-byte lengths for dynamic types. We need at least varints or some similar alternatives.
- Lastly, if someone says that implementing a binary format encoder/decoder is too hard so let's stick to some existing code... they shouldn't be implementing Ethereum 2.0 in the first place.
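
To make the size argument above concrete, here is a minimal sketch comparing a variable-length integer against a fixed 4-byte length field. The varint flavour shown is unsigned LEB128, which is an assumption for illustration only; nobody in this thread has proposed that exact scheme, and the function names are hypothetical.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    if n < 0:
        raise ValueError("this sketch only covers non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F          # take the low 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_fixed_u32(n: int) -> bytes:
    """Encode an integer as a fixed 4-byte little-endian field."""
    return n.to_bytes(4, "little")
```

For example, a length of 300 takes two bytes as a varint and four bytes as a fixed-width field. The saving per field is small, but it adds up when a whole packet has to fit in one datagram.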

FrankSzendzielarz commented 5 years ago

Discv5 allows multiple messages to be sent as a message "stream" (each message carries an "M of N" counter) if the response is expected to span MTUs. E.g. FindNode -> Neighbors may result in multiple messages with lists of ENRs. This means some compression can be used in some places, but on the whole it is not of much use.
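
As a rough illustration of the "M of N" idea, the sketch below splits a list of already-encoded records into chunks that each fit a payload budget and tags every chunk with its index and the total. The function name, the tuple layout, and the budget parameter are all hypothetical; the actual Discv5 packet layout is not reproduced here.

```python
from typing import List, Tuple


def chunk_records(
    records: List[bytes], max_payload: int
) -> List[Tuple[int, int, List[bytes]]]:
    """Group records into chunks of at most max_payload bytes each.

    Returns a list of (index, total, records) tuples, with index
    counting from 1, so a receiver knows when it has seen all parts.
    """
    chunks: List[List[bytes]] = []
    current: List[bytes] = []
    size = 0
    for rec in records:
        if len(rec) > max_payload:
            raise ValueError("a single record exceeds the payload budget")
        # Start a new chunk when adding this record would overflow.
        if current and size + len(rec) > max_payload:
            chunks.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        chunks.append(current)
    total = len(chunks)
    return [(i + 1, total, chunk) for i, chunk in enumerate(chunks)]
```

This ignores per-packet framing overhead, which a real implementation would have to subtract from the budget.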

Discv5 (currently) aims to be agnostic as to whether the transport is streamed or not. RLP does help with that in that it offers read look-ahead hints.

On the whole I think it is now up to people to propose alternatives to RLP and say why. Right now it's RLP by default.

fjl commented 5 years ago

My perspective: we don't gain much from changing serialization formats for discovery. AFAIK SSZ was proposed because decoding RLP is annoying in the EVM. But those concerns with RLP in consensus layer don't apply to p2p because there is no need to process network packets inside the EVM.

RLP has advantages for networking because it is a free form format that can be decoded without a schema. It also allows forward-compatible encodings where we can say "just skip over this part, we'll define what goes here later".
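
A minimal sketch of what "decoded without a schema" means in practice: the decoder below turns any RLP payload into nested Python lists of byte strings using only the prefix bytes. It follows the standard RLP prefix ranges but omits the canonical-form checks a production decoder would need, and the function names are hypothetical.

```python
from typing import Tuple, Union

RLPItem = Union[bytes, list]


def _decode_item(data: bytes, pos: int) -> Tuple[RLPItem, int]:
    """Decode one item at pos; return (item, position after the item)."""
    if pos >= len(data):
        raise ValueError("unexpected end of input")
    prefix = data[pos]
    if prefix < 0x80:                      # single byte, its own encoding
        return data[pos:pos + 1], pos + 1
    if prefix <= 0xB7:                     # short string
        length, start, is_list = prefix - 0x80, pos + 1, False
    elif prefix <= 0xBF:                   # long string
        n = prefix - 0xB7
        length = int.from_bytes(data[pos + 1:pos + 1 + n], "big")
        start, is_list = pos + 1 + n, False
    elif prefix <= 0xF7:                   # short list
        length, start, is_list = prefix - 0xC0, pos + 1, True
    else:                                  # long list
        n = prefix - 0xF7
        length = int.from_bytes(data[pos + 1:pos + 1 + n], "big")
        start, is_list = pos + 1 + n, True
    end = start + length
    if end > len(data):
        raise ValueError("truncated payload")
    if not is_list:
        return data[start:end], end
    items = []
    cursor = start
    while cursor < end:
        item, cursor = _decode_item(data, cursor)
        items.append(item)
    if cursor != end:
        raise ValueError("list payload length mismatch")
    return items, end


def rlp_decode(data: bytes) -> RLPItem:
    """Decode a complete RLP payload into bytes or nested lists."""
    item, end = _decode_item(data, 0)
    if end != len(data):
        raise ValueError("trailing bytes after RLP item")
    return item
```

Forward compatibility falls out of this: a reader that expects a two-element list can decode a three-element one and simply ignore the extra trailing element, since the list length prefix tells it where the message ends.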

The disadvantage of both RLP and SSZ is that they aren't "standard" encodings (i.e. they're not included in programming language standard libraries). RLP is widely supported though and has implementations in 15+ programming languages.

pipermerriam commented 5 years ago

I don't have the expertise to weigh in on the networking level components/reasons for choosing one over another.

I agree with Peter's assertion that we should have a list of things that we care about and make a decision using that as a framework. Here's a starting point.

- Efficiency: Is the serialization format compact?
- Language support: Are there quality libraries available across the many languages clients are being written in?
- Datatype support: What data types/formats do we need/want to express, and how easily can we express them? (Is the list below accurate/complete?)
  - integers
  - dynamic length byte strings
  - fixed length byte strings
  - fixed size lists of things
  - dynamic sized lists of things

arnetheduck commented 5 years ago

here's a few things I'm missing in a wire encoding for eth2:

- forward/backward compatibility / upgradeability - there are 3 levels one can consider: none, adding fields, adding and removing fields - each level adds complexity but allows low-impact changes to be deployed without breaking the world and requiring the larger community to upgrade (hardware wallets, monitoring tools etc etc). note that having version numbers is not sufficient for good compatibility - when a new message type/version is added, both ends need to be upgraded for it to work - the objective of this point is to avoid that
- fixed offsets for fixed-length fields - the ability to know the offset of fixed-length fields without parsing - generally solved like @karalabe suggests in SOS with an offset table (flatbuffers / capnproto are prominent examples to study), handy for efficient reading
- schema - unambiguous and machine-readable specification of the message content, such that it's possible to write tooling for reading the data - this helps build a thriving community around the specification
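
To illustrate the fixed-offset point, here is a toy layout in which fixed-length fields live at known positions and a variable-length field is reached through a stored offset. The record shape (`record_id`, `payload`, `flag`) and the 4-byte little-endian offset are assumptions made for this sketch; it is not the actual SOS, flatbuffers, or capnproto layout.

```python
import struct

# Fixed section: 4 bytes id + 4 bytes payload offset + 1 byte flag.
FIXED_SIZE = 9


def encode_record(record_id: int, payload: bytes, flag: int) -> bytes:
    """Write the fixed section first, then the variable-length payload."""
    # The payload starts right after the fixed section.
    fixed = struct.pack("<IIB", record_id, FIXED_SIZE, flag)
    return fixed + payload


def read_flag(buf: bytes) -> int:
    """Read the flag at its fixed offset without touching the payload."""
    return buf[8]


def read_payload(buf: bytes) -> bytes:
    """Follow the stored offset to the variable-length payload."""
    (offset,) = struct.unpack_from("<I", buf, 4)
    return buf[offset:]
```

The trade-off is visible even in this toy: reading `flag` costs a single index operation regardless of payload size, but every variable-length field pays four extra bytes for its offset.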

one possibility is to have two levels of support - one being a subset of the other. the more strict version would be used for consensus whereas the other would be used for wire. of the standard "formats" I've seen in discussions, flatbuffers comes close. The advantage of doing the subset/superset approach is that it allows accessing the data without reencoding, and with a single decoder at that. it should be fairly easy to turn ssz into a subset of flatbuffers, it's very close already. this would solve the "standard tooling" question for anyone wanting to just consume the data (implementations can easily code up custom encoders, while promoting easy consumption)

protobuf was discussed and discarded several times (in the eth2 repo / issues) for several reasons, including its poor support for the data types we often use, most notably hashes / fixed-length arrays, and poor encoding determinism.

pipermerriam commented 5 years ago

Just did a quick read of @karalabe 's SOS proposal

I'm only just starting to think about this so maybe someone else already knows. It seems like the following might be loosely mutually exclusive.

- streaming encodability/decodability
- O(log(N)) access times for arbitrary nested data

Alternatively, my understanding is that we're talking about wire serialization protocol. Can we not use one serialization scheme for wire transport and a completely different serialization scheme for hashing?

For wire we probably want: streaming, compact, first class support for our desired data types

For hashing we probably want: fast access times, compact, first class support for our desired data types

SOS seems to fit the bill for our hashing serialization needs. I'm not yet aware of a candidate for our wire needs.

FrankSzendzielarz commented 5 years ago

@arnetheduck I just realized I should edit the title. The question is aimed at working out if we need to change off RLP for Discovery "v5" and if so why. The premise is that Eth 2.0 will need to talk to Eth 1.X for quite some time anyway.

IMHO, the Discovery protocol is not just likely to change across versions in terms of message format, but also in terms of message exchange pattern. Clients implementing Discovery should consider the use of Strategy-like design-patterns, I think.

For Discovery v5 I am leaning towards the following scheme (though this is still a topic for discussion, and I will update this comment with a link to an issue on this):

- Nodes are discovered and described as ENRs (Ethereum node records)
- ENRs can hold various key/value pairs providing info on the node
- The Discovery wire protocol supported by a node can be described in its ENR
- Newer versions of nodes will decide on their own cut-off points on how many versions of a protocol to support. During upgrades, there always needs to be a transition period where nodes should support at least 1 version earlier than the upgrade in order to be able to join the network.
- For incoming messages from as-yet unknown nodes, the Discovery protocol mandates that the recipient send back a "WhoAreYou" message, which returns the ENR of the sender. This one WhoAreYou message type must be pretty much unchanged across versions for this scenario to work (though some RLP list suffix padding could offer some forward compatibility).
- All the above allows for complete changes to the serialization format. So eventually, Discovery message wire format can be changed to any other type of encoding or compression. If in future Eth 2.0 clients would like to have a single wire format this could be accommodated.

ENRs may assist with higher level protocols in a similar way.

pipermerriam commented 5 years ago

I did some research and I can't find anything that has the following three properties:

- compatible with streaming encoding/decoding
- efficient/compact serialization
- support for the data types we need
  - fixed size integers up to 256 bits
  - booleans
  - dynamic length byte strings
  - fixed length byte strings
  - dynamic length arrays
  - fixed length arrays
  - heterogeneous Struct-like types with ordered fields.

So I made one:

https://github.com/ethereum/bimini/blob/master/spec.md

I'd be curious to get some feedback on it. I'm working to get it to a point where I can provide some comparison numbers between it, RLP, and SSZ.

FrankSzendzielarz commented 5 years ago

@karalabe ^^^^ @pipermerriam FYI https://github.com/ethereum/eth2.0-specs/issues/692

pipermerriam commented 5 years ago

I've opened up this EIP with a more formal proposal for the SSS serialization scheme. It includes rationale for why Protobuf, MessagePack, and CBOR are not suitable to our needs as well as a breakdown of RLP vs SSZ vs SSS on the various axes that I think a networking serialization scheme should be evaluated.

https://github.com/ethereum/EIPs/blob/71098b1c2760f2ae557a7bab91770eb8cf72fed5/EIPS/eip-sss_serialization.md

I also did some very detailed analysis of SSS vs RLP vs SSZ, which can be found here:

https://github.com/ethereum/bimini/blob/7c26efec585742ef870bf58ea5d96e2deb242775/report.md#sss-vs-rlp-summary

pipermerriam commented 5 years ago

Further evolution of this topic: https://github.com/ethereum/eth2.0-specs/issues/754

jannikluhn commented 5 years ago

The disadvantages of requiring a schema are becoming very apparent in the discussions on the wire protocol. With SSZ, whenever a node tries to decode a message they received, they need to know the schema already. As different message types contain different data, we need a single envelope schema, embed the body as a data blob, and deserialize it in two steps. We might even need multiple levels, e.g.:

Message {
    type: uint8
    serialized_body: bytes
}

GetHeadersResponse {
    id: uint64
    success: bool
    response_body: bytes
}

HeadersSuccessfulResponse {
    headers: []BlockHeader
}

HeadersFailedResponse {
    error_code: uint8
}

I don't really like this. With RLP, we would avoid this to some extent because we can deserialize everything in a single step, then walk through the different elements we got, and only update our interpretation of the data at every step. And, if the message structure contains information about the message type, we can even get rid of nesting (e.g. distinguish between HeadersSuccessfulResponse and HeadersFailedResponse depending on if it contains a list or not).
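
The multi-level decoding described above can be sketched roughly as follows. The byte layout here (little-endian fields, 4-byte length prefixes, the message type constant) is a toy stand-in chosen for illustration, not the actual SSZ encoding, and only the failed-response path is implemented.

```python
import struct
from typing import Tuple

MSG_GET_HEADERS_RESPONSE = 1  # hypothetical type tag


def write_bytes(blob: bytes) -> bytes:
    """Prefix a blob with its length as a 4-byte little-endian integer."""
    return struct.pack("<I", len(blob)) + blob


def read_bytes(buf: bytes, pos: int) -> Tuple[bytes, int]:
    """Read a length-prefixed blob at pos; return (blob, next position)."""
    (length,) = struct.unpack_from("<I", buf, pos)
    start = pos + 4
    end = start + length
    if end > len(buf):
        raise ValueError("truncated blob")
    return buf[start:end], end


def encode_failed_response(request_id: int, error_code: int) -> bytes:
    inner = struct.pack("<B", error_code)                # HeadersFailedResponse
    middle = struct.pack("<Q?", request_id, False) + write_bytes(inner)
    return struct.pack("<B", MSG_GET_HEADERS_RESPONSE) + write_bytes(middle)


def decode_message(packet: bytes) -> dict:
    # Level 1: the envelope tells us which schema applies to the body.
    (msg_type,) = struct.unpack_from("<B", packet, 0)
    body, _ = read_bytes(packet, 1)
    if msg_type != MSG_GET_HEADERS_RESPONSE:
        raise ValueError("unknown message type")
    # Level 2: the response header tells us which schema the inner blob uses.
    request_id, success = struct.unpack_from("<Q?", body, 0)
    response_body, _ = read_bytes(body, 9)
    # Level 3: only now can the actual result be decoded.
    if success:
        raise NotImplementedError("header list decoding omitted in this sketch")
    (error_code,) = struct.unpack_from("<B", response_body, 0)
    return {"id": request_id, "success": success, "error_code": error_code}
```

Each level needs the previous one decoded before it knows which schema to apply, and each nested blob carries its own length prefix, which is the overhead being objected to.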

FrankSzendzielarz commented 5 years ago

Yes, concerns have been raised by different people about SSZ on the wire, but regardless, it is still included in the draft protocols there. For Discovery we're just going with RLP for now, and upgrade mechanisms are simple once the ENRs are in place. The wire protocol for Eth 2.0 conflates message format with encoding/serialization. What wire formatter (media formatter in the web world) is used could easily be something established by rules in the ENR and/or handshake. If client implementers want to make a private network using BSON, why should they not be able to?

pipermerriam commented 5 years ago

@jannikluhn after the last call I'm leaning towards defaulting to any/one-of-the minimal wire protocol proposals that were talked about which treat the Message part as raw bytes and delegate to a second layer of decoding to decode the actual message.

So no SSZ at the wire level but I still like the idea of using an SSZ variant of some sort for the message component.

ethresearch / p2p

Rationale for RLP alternatives in Discovery v5? #15