celestiaorg / celestia-core

A fork of CometBFT
Apache License 2.0

Data availability fields in block header #16

Closed liamsi closed 4 years ago

liamsi commented 4 years ago

This issue drafts a first iteration of changes to the block header to support data availability proofs.

Basically, this would be a variant of what we spec'ed out in https://github.com/LazyLedger/lazyledger-specs, with as few changes as possible to tendermint core.

As a side note, we should track and document these decisions and changes in the repository itself. We can track them in the code directly via so-called ADRs (we used these at tendermint: https://github.com/tendermint/tendermint/tree/master/docs/architecture).

Note that the spec in its current form expects immediate execution (ref: #3). This doesn't really affect this issue, other than that the mentioned intermediate state roots included in the block of height h+1 reflect the state induced by the transactions included in the block of height h (as described in https://github.com/LazyLedger/lazyledger-core/issues/3#issuecomment-614530279).

liamsi commented 4 years ago

Status quo

Block and Header

Current Block and Header: https://github.com/LazyLedger/lazyledger-core/blob/da745371227f54aa90c609845cd4cc2f36a152f1/types/block.go#L38-L44 https://github.com/LazyLedger/lazyledger-core/blob/da745371227f54aa90c609845cd4cc2f36a152f1/types/block.go#L323-L348

We can use the existing Data field for what is called availableData in the current LazyLedger spec.

Current Data type: https://github.com/LazyLedger/lazyledger-core/blob/da745371227f54aa90c609845cd4cc2f36a152f1/types/block.go#L809-L819

Proposed changes:

Block

Changes to the Block are simple: move the Evidence field into Data, as it needs availability too, and add the AvailableDataHeader:

type Block struct {
    Header
    AvailableDataHeader [][]byte
    Data
    LastCommit *Commit
}

Header

The Header does not need to change at all 🎉 (for now at least)

There are some fields that can be removed if we switch to immediate execution. As this proposal tries to keep the changes as small and contained as possible, we will keep the field names as well as some fields used in the current tendermint (light client) verification logic. With the same reasoning we keep EvidenceHash in the Header.

I added NOTEs where this deviates from the spec (cc @adlerjohn).

type Header struct {
    Version version.Consensus
    ChainID string
    Height  int64
    Time    time.Time

    LastBlockID BlockID

    LastCommitHash tmbytes.HexBytes
    // NOTE: becomes the available data root
    // (not renaming for now)
    DataHash tmbytes.HexBytes

    // NOTE: keep the current hashes / pointers
    // to keep the current light client logic intact
    // (will address removal in separate issues):
    ValidatorsHash     tmbytes.HexBytes       // validators for the current block
    NextValidatorsHash tmbytes.HexBytes       // validators for the next block
    ConsensusHash      tmbytes.HexBytes       // consensus params for current block
    // NOTE: different from the current spec!
    // this is not the `stateCommitment` for the
    // Tx in this block but for the Tx of the previous block:
    AppHash            tmbytes.HexBytes           // state after txs from the previous block
    // NOTE: this field is part of the deferred
    // execution model
    // (DeliverTx results on the previous blocks merkelized)
    LastResultsHash tmbytes.HexBytes

    // NOTE: this is kinda duplicate now because the `availableDataRoot`
    // basically includes a commitment to the evidence data
    // (will address removal in a separate issue)
    EvidenceHash    tmbytes.HexBytes     // evidence included in the block
    ProposerAddress Address              // original proposer of the block
}

Data

Data will contain everything that requires data availability. Adding fields does not cause any trouble with the current logic. The semantics of that hash change, though (currently it is the merkle root of all fields).

Moving over evidence should be a simple change (and could be done separately later).

The newly introduced IntermediateStateRoots can probably be computed using the ResponseDeliverTx.Data without changing the current logic inside tendermint at all.

Note: The only thing I'm not sure about yet is how to handle the separation of "Transactions" and "Messages" in the spec's sense. ~~Maybe we can also include this in ResponseDeliverTx.Data: executing a Tx returns the intermediate state root associated with that Tx as well as the corresponding message.~~ Maybe we can include this in the response of CheckTx in ResponseCheckTx.Data (DeliverTx would be one off again). Note that tendermint does not know about the distinction we make between Transaction and Message. They would be sent to a node as a single (network) message / (tendermint) Tx. Somewhere these need to be split up to be encoded in the block as described in the spec, at least if we don't want to add another separate pool for messages and a separate reactor.

For the other fields in data we can hard code namespace ids.

type Data struct {
    // Txs that will be applied by state @ block.Height+1.
    // NOTE: not all txs here are valid.  We're just agreeing on the order first.
    // This means that block.AppHash does not include these txs.
    Txs Txs

    // Note: As Tendermint does not execute the
    // Tx itself (the app does), we need a way to
    // compute these intermediate state roots.
    // Simply using the `Data` field in `ResponseDeliverTx`
    // for this looks promising.
    IntermediateStateRoots []tmbytes.HexBytes

    // NOTE: moved over from Block
    Evidence   EvidenceData

    // Note: Tendermint Txs are just blobs. Tendermint core itself does
    // not understand how to interpret a Tx or its validity.
    // That is part of the App logic.
    //
    // Maybe we do not need a separate message field
    // and can leverage the generality of Tx in
    // tendermint instead?
    Messages   []Message // TODO: This needs more thought!

    // Volatile
    hash tmbytes.HexBytes
}

Message (new struct)

type Message struct {
    // TODO: figure out constraints and
    // introduce a dedicated type instead of just []byte.
    // The namespace ID is necessary as it dictates where
    // in the namespace merkle tree this will go.
    NamespaceID []byte
    Data        []byte
}
adlerjohn commented 4 years ago

This seems reasonable. The availableDataHeader (which contains a matrix of commitments to the data in RSMT2D form) looks like it's missing from the block though.

Note: The only thing I'm not sure yet is how to handle the separation of "Transactions" and "Message" in the spec's sense.

For how they're encoded in the commitment(s), I was thinking reserve different namespace IDs for 1) transactions 2) intermediate state roots 3) evidence. Then Messages can have any other namespace ID.

For how they're handled at the networking layer, yeah you're right it looks like it'll take a bit of effort to make changes required to pass them around as a single blob then split them up. But shouldn't be unreasonably difficult (I hope!).

liamsi commented 4 years ago

The availableDataHeader (which contains a matrix of commitments to the data in RSMT2D form) looks like it's missing from the block though.

Ah, thanks. Good catch!

For how they're handled at the networking layer, yeah you're right it looks like it'll take a bit of effort to make changes required to pass them around as a single blob then split them up. But shouldn't be unreasonably difficult (I hope!).

Exactly, not yet sure how to handle this part. I'm confident we can use checkTx for that. Haven't thought that through yet though.

liamsi commented 4 years ago

Regarding the availableDataHeader: does it matter to you whether the underlying data type is []byte like in the spec or [][]byte for the row/column semantics? We should align the spec here too (but I guess that was on your radar anyway).

adlerjohn commented 4 years ago

does it matter to you whether the underlying data type is []byte like in the spec or [][]byte for the row/column semantics?

Is there any inherent downside to using a 2D matrix representation instead of a 1D array? I guess with the 2D matrix you'd need to specify two indices all the time. It would also be more work to Merkleize a matrix instead of an array (maybe not, if the iterator of the matrix is implemented nicely). Given that, I'd lean towards using a 1D array, and just doing pointer arithmetic to map {row, col} <-> {index}. Thoughts?

liamsi commented 4 years ago

It would also be more work to Merkleize a matrix instead of an array (maybe not, if the iterator of the matrix is implemented nicely)

Not really. An iterator as you mentioned is kinda equivalent to that (row, col) -> index bijection you mentioned. I'm leaning towards 2-dim (for the encoding) and defining a type AvailableDataHeader that hides away that detail (in the implementation) and returns an iterator / ordering and methods (`RowRoot(i)`, `ColumnRoot(j)`, or whatever they need to be) to access them. In the end it doesn't matter too much.

adlerjohn commented 4 years ago

Fair enough. Will there be any serialization issue with matrices? If not then 2D is fine. Will be easier to reason about too.

liamsi commented 4 years ago

Will there be any serialization issue with matrices?

Not that I'm aware of. For binary/proto serialization this will be:

message Block {
    // [...]
    repeated bytes AvailableDataHeader = 5; // or whatever field number
}

Or we can flatten the matrix for serialization (and have it be only bytes), basically the way you described.

liamsi commented 4 years ago

Updated above

adlerjohn commented 4 years ago

Ugh, it looks like protobuf doesn't have native support for multidimensional arrays. So it would have to be encoded as a 1D array for serialization. It doesn't actually matter how the Go client stores this internally, so long as it serializes it properly.

liamsi commented 4 years ago

It doesn't actually matter how the Go client stores this internally, so long as it serializes it properly.

True, still would like the spec and the code match as much as possible (where it makes sense).