ethereum / beacon_chain


simple_serialise - not possible to parse the structure without type information #94

Open AlexeyAkhunov opened 5 years ago

AlexeyAkhunov commented 5 years ago

Due to the special encoding of 'hash32' and 'address' without any prefix, it will not be possible to parse any simple_serialise messages without knowing the type. This might seem OK, but it precludes writing generic parsers that do not depend on the type being parsed. The consequence is that any tools/libraries that look at traffic and analyse it have to be type-dependent and will have to be updated any time the types change. That creates unnecessarily tight coupling between the serialisation format and the tooling.

Also, list and 'type' encodings look like length-prefixed blobs, and without type information it is impossible to see how many elements/fields they contain or where the boundaries between those elements/fields lie.
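As a hedged illustration (a simplified sketch of the 2018-era SSZ draft rules under discussion, not the current spec; the `serialize` helper and the `list:` type notation are invented here), a raw byte blob and a list of hash32 values can serialize to byte-for-byte identical output:

```python
# Simplified sketch of the draft rules: fixed-size types carry no prefix,
# variable-size values get a 4-byte big-endian length prefix.
def serialize(value, typ):
    if typ in ('hash32', 'address'):      # raw bytes, no tag, no length
        return value
    if typ == 'bytes':                    # length-prefixed blob
        return len(value).to_bytes(4, 'big') + value
    if typ.startswith('list:'):           # prefix = total byte length of body
        inner = typ.split(':', 1)[1]
        body = b''.join(serialize(v, inner) for v in value)
        return len(body).to_bytes(4, 'big') + body
    raise ValueError(typ)

# Two different logical values, identical wire bytes:
blob = serialize(b'\x00' * 32 + b'\x01' * 32, 'bytes')
as_list = serialize([b'\x00' * 32, b'\x01' * 32], 'list:hash32')
assert blob == as_list
```

Without the schema, a generic parser sees the same 68 bytes either way.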

RLP does not have this drawback, for example. But it has its own issues, of course.

Another concern is that serialised messages can mean different things depending on the schema. That is more of a safety/security concern: if we have software of two slightly incompatible versions, we can get very hard-to-find bugs due to misinterpreted messages. Generally, it should be possible to have high assurance that, after a schema change, messages produced under a different schema (or at least under the previous version of the schema) will not parse successfully.
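To make the misinterpretation risk concrete, here is a hypothetical fixed-layout decoder for the draft rules above (the `deserialize` helper and `FIELD_SIZES` table are assumptions for illustration, not part of any spec). Any 52-byte payload parses cleanly under two incompatible schemas:

```python
FIELD_SIZES = {'hash32': 32, 'address': 20}

def deserialize(data, schema):
    """Split `data` into fields according to `schema`; no tags to check."""
    fields, offset = [], 0
    for typ in schema:
        size = FIELD_SIZES[typ]
        fields.append(data[offset:offset + size])
        offset += size
    if offset != len(data):
        raise ValueError('trailing bytes')
    return fields

payload = bytes(52)                                  # any 52-byte message
v1 = deserialize(payload, ['hash32', 'address'])     # schema version 1
v2 = deserialize(payload, ['address', 'hash32'])     # incompatible version 2
# Both parses succeed silently; neither side can detect the mismatch.
```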

paulhauner commented 5 years ago

Hi @AlexeyAkhunov, thanks for the input.

I think it's fine that you cannot parse a SSZ message without already knowing its intended type.

We're sending Blocks and AttestationRecords across the network, so there aren't many different types we need to accommodate. Also, the nature of this application (a blockchain) means that all nodes already have very strong agreement on the data types and serialization format. Any change to the types or their serialization format is a "big deal" and all nodes will need to explicitly agree on it -- if a node changes either of these at any point it will hash to different values and we will lose consensus. This is distinct from something like protobuf on some RPC, where it's fine to have different byte-level representations across different versions as long as they describe the same object.

The message on the network is going to have some byte(s) to identify the message type as a block, attestation record, etc. Off the top of my head, accidentally mislabeling one of these objects and having it de-serialize as a valid, consensus-influencing object of a different type would either be impossible or similar in difficulty to finding several hash collisions.

I think it's important to note that there are at least two different requirements for serialization we're discussing:

What I have been discussing relates to consensus serialization. I think you're discussing p2p serialization @AlexeyAkhunov, which could be distinct from consensus serialization, but I don't think there are many gains in that -- especially considering that one of the first things a node will do with a block when it receives one is hash it according to its consensus serialization.

There's also a serialization format that will be involved with RPC endpoints in the node and I think that should be something like protobuf. However, that's not really being discussed at this stage.

jannikluhn commented 5 years ago

We're sending Blocks and AttestationRecords across the network, so there's not many different types we need to accommodate for.

This is true for now, but probably not in the long term. Additional types, off the top of my head:

mratsim commented 5 years ago

The alternatives we considered all have their trade-offs:

I'm not familiar with the field, but network analysis tools should allow plugins, so we could add that to a bounty/wishlist: SSZ plugins for common open-source network analysis tools.

I.e. have a robust base layer and build tooling on top.

AlexeyAkhunov commented 5 years ago

@paulhauner Thank you for your comment! I had the impression from the Eth 2.0 call on 13.09.2018 that simple_serialise had been proposed for p2p serialisation (since p2p serialisation was on the agenda).

As for consensus serialisation (serialising things before they get hashed and signed -- and I assume you are saying that deserialisation will not actually happen), one of the main requirements there is to avoid excessive dynamic memory allocations during serialisation. To achieve that, serialisers might want to compute the resulting serialisation size efficiently before starting the serialisation. Alternatively, we could use sponge-based hash functions (like Keccak) that can consume streams of data as input. In the latter case, prepending the serialisation with a length prefix actually makes things harder, because you cannot start the stream until you have computed the whole serialised string.

Therefore, if simple_serialise is to be used for one-way serialisation (input to hash functions, with no deserialisation required), then I suggest dropping the length prefixes, because that would allow streaming input into sponge hash functions like Keccak.
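A sketch of the streaming argument, using Python's `hashlib.sha3_256` as a stand-in for a Keccak sponge (Ethereum's legacy Keccak-256 differs from standardized SHA-3 in padding, so this is illustrative only; both helper names are invented here):

```python
import hashlib

def hash_stream(fields):
    """With no leading length prefix, absorb each field as it is produced."""
    h = hashlib.sha3_256()
    for field in fields:
        h.update(field)              # incremental, no buffering required
    return h.digest()

def hash_prefixed(fields):
    """A length-prefixed encoding forces us to assemble the whole body
    (or at least know its total size) before the first byte is hashed."""
    body = b''.join(fields)          # buffer everything first
    prefix = len(body).to_bytes(4, 'big')
    return hashlib.sha3_256(prefix + body).digest()
```

Streaming the fields produces the same digest as hashing their concatenation in one shot, which is exactly the property the prefix destroys.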

AlexeyAkhunov commented 5 years ago

@mratsim Yes, I agree: making the schema mandatory for serialisation/deserialisation has many problematic consequences. It is not memory efficient, as you said; it precludes generic analysis tools and libraries; it introduces potential for misinterpretation; and it makes reasoning about backwards compatibility much harder. It also mingles the syntax and semantics of messages into one whole, inseparable into different layers of coders/decoders.

AlexeyAkhunov commented 5 years ago

RLP is actually a good serialisation format for wire encoding. I do not agree that it is overly complicated -- but I have seen some implementations that make it look complicated.

I do agree that RLP is not great for generating input for hashing. Therefore, I suggest using RLP for wire messages, but, for making inputs to hashes, using simple serialisation streams without length prefixes (because length prefixes force us to buffer the input instead of simply starting to hash it). Since no one is going to undo the hashes and deserialise, length prefixes are not important there.
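For reference, the core of RLP really is small; a minimal encoder following the canonical rules (helper names here are mine) fits in a few lines:

```python
def rlp_encode(item):
    """Encode bytes or (nested) lists of bytes per the RLP rules."""
    if isinstance(item, bytes):
        if len(item) == 1 and item[0] < 0x80:
            return item                          # single small byte: as-is
        return _with_length(item, 0x80)          # string offset
    payload = b''.join(rlp_encode(x) for x in item)
    return _with_length(payload, 0xc0)           # list offset

def _with_length(payload, offset):
    if len(payload) < 56:
        return bytes([offset + len(payload)]) + payload
    n = len(payload).to_bytes((len(payload).bit_length() + 7) // 8, 'big')
    return bytes([offset + 55 + len(n)]) + n + payload
```

The standard test vectors check out, e.g. `rlp_encode(b'dog')` gives `b'\x83dog'` and `rlp_encode([b'cat', b'dog'])` gives `b'\xc8\x83cat\x83dog'`.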

hwwhww commented 5 years ago

@paulhauner Do you have a benchmark of SSZ vs. RLP?

mkalinin commented 5 years ago

Do we want to support two serializations? Is the benefit of introducing e.g. SSZ big enough to justify maintaining both RLP and SSZ, or Protobuf and SSZ?

paulhauner commented 5 years ago

@jannikluhn good point about future data types. Thanks.

@AlexeyAkhunov no worries, happy to talk about this. With regards to the ssz length prefixes, I don't see a need to remove them before we hash. I think the entire SSZ byte-array should be hashed.

@hwwhww we updated the serialization_sandbox notes with RLP message sizes :)

@mkalinin I am in favor of using SSZ for both p2p serialization and consensus serialization because it's pretty small and I think one of the very first things we want to do with some block/attestation record when we get it off the wire is hash it. Having two formats would require some translation. I'm totally open to discussion on this though.

mkalinin commented 5 years ago

We are currently using RLP as it has already been implemented in EthereumJ. I agree that RLP is overly complicated in terms of implementation, and I am open to changing it. But supporting several serializations makes no sense to me; it sounds even more complicated than implementing RLP. So it's fine to change the serialization algorithm, but the new one should handle all the cases and replace RLP all over the codebase. IMHO, if something we want to switch to doesn't fit some of the cases, then we should consider another algorithm.

paulhauner commented 5 years ago

@mkalinin I totally agree.

paulhauner commented 5 years ago

Hi @AlexeyAkhunov, I didn't make the connection between yourself and TurboGeth earlier. Knowing this, I'll give your proposal much more thought as your knowledge on the subject is more extensive than mine.

jannikluhn commented 5 years ago

The two arguments in favor of using the same serialization format for networking and hashing are

I'm wondering if both arguments are valid if we go with the libp2p daemon: I assume libp2p has its own transport serialization format, and considering that it's implemented in Go, this should not be our bottleneck either. It's also fully abstracted away from us, so it doesn't affect the simplicity of the Eth2.0 client itself.

So the relevant question becomes: which serialization format do/should we use to communicate with the daemon? There we have different trade-offs, of course; in particular, we don't need to care about size so much, as the messages aren't transmitted over the network. @raulk Has this been specified already?

mkalinin commented 5 years ago

@AlexeyAkhunov could you please outline your thoughts on what requirements both algorithms should satisfy? Maybe you even have some algorithms to take a look at? It would help us understand why, from your point of view, SSZ is not suitable for network serialization, and would drive further discussion on alternatives.

raulk commented 5 years ago

@jannikluhn libp2p is agnostic of the wire format of messages. The control API of the daemon does use protobuf, but that should be well encapsulated inside the binding, e.g. py-libp2p-binding. In other words, it's only used to exchange commands and control signals like: "open connection", "open stream", "new incoming stream", DHT operations, etc. between the binding and the daemon.

The application itself (Eth2.0 implementations) can select whatever wire format is best for them. The daemon just pipes raw reads and writes done on the local unix socket representing the stream, into reads and writes on the underlying libp2p stream.

jannikluhn commented 5 years ago

Thanks for clearing that up, @raulk! Disregard my post then as we still have to decide on a wire format by ourselves.

fjl commented 5 years ago

IMHO the network should relay consensus objects without re-encoding them. These objects can be represented as byte arrays in any serialization format. A network message defined as a protocol buffer can include a 'block' object as a blob.

To solve the issue that started this discussion, it would be best to define consensus objects as an ssz (or rlp) blob with a known header including a version number, e.g. a block would be encoded as "block-v1-..." with ... being the ssz blob.
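A minimal sketch of this envelope idea (the `wrap`/`unwrap` helpers and the exact tag layout are assumptions for illustration, not a proposed spec): a relay only ever handles the tagged, opaque blob and never re-encodes the consensus bytes inside it.

```python
def wrap(kind, version, blob):
    """Prepend a human-readable type tag and version to an opaque blob,
    e.g. b'block-v1-' + <ssz bytes>."""
    return f"{kind}-v{version}-".encode() + blob

def unwrap(msg):
    """Split off the tag and version; the blob passes through untouched
    (maxsplit=2, so '-' bytes inside the blob are preserved)."""
    kind, ver, blob = msg.split(b"-", 2)
    return kind.decode(), int(ver[1:]), blob
```

A receiver can dispatch on `(kind, version)` and reject unknown schemas before any consensus deserialization happens, which directly addresses the misinterpretation concern raised at the top of this thread.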