ethereum / beacon_chain

MIT License
211 stars 65 forks source link

Considering the database scheme design - ActiveState part #81

Closed hwwhww closed 6 years ago

hwwhww commented 6 years ago

Thanks to @jannikluhn for discussing it with me.

I was thinking about how to implement and test the fork choice rule in the Python PoC codebase, hmm, it would be nicer if we have a database to rebuild/rollback the state instead of just having some dicts. 😬

The first thing occurred to me is AttestationRecord; the story I think is:

  1. The attesters create AttestationRecords and then broadcast them.
  2. The proposer collects the AttestationRecord messages, creates a block, and broadcasts the Block.
  3. Other nodes receive the Block, apply the state transition, and update local ActiveState. (Maybe also CrystallizedState, up to the slot number and other conditions in the spec)
    • post_active_state.pending_attestations = prev_active_state.pending_attestations + block.attestations

1 - How to store AttestationRecord and ActiveState?

It makes sense to change active_state.pending_attestations: [AttestationRecord] to active_state.pending_attestation_hashes: [Hash32] if the AttestationRecord have already been stored in the local DB.

↑ It affects how we serialize ActiveState and how we generate active_state_root.

2 - How to store Block?

For the DB storage design, we can consider treating the AttestationRecord like the PoW main chain Transaction for avoiding double-storing from both Block and ActiveState. That said, the Block consists of BlockHeader and Attestations (i.e., BlockBody). There are those k-v pairs:

↑ The clients can implement DB scheme in various ways, it doesn't need protocol changes; but, maybe we can optimize some structure in protocol to make great improvement?

BlockHeader is not a perfect name here...


Haven’t dug into CrystallizedState yet.

rauljordan commented 6 years ago

Working on this in https://github.com/prysmaticlabs/prysm/pull/440 at Prysm

rauljordan commented 6 years ago

So just to recap, we're not storing active state details because they can be computed from blocks anyways. Since the crystallized state transition only occurs on each new cycle and involves more computation than determining an active state, I don't think it's a big problem to persist most of it. At the very least, should we be persisting the validator set with a lookup key?

djrtwo commented 6 years ago

I'm leaning towards limiting dependencies in this poc repo and not going in the direction of bringing in a db if possible. Instead just having clear data structures and pure functions.

That said, if the forkchoice becomes to messy to implement, I could be convinced, but would maybe rather just start working on a fork of pyevm

jannikluhn commented 6 years ago

We could copy the db interface from PyEVM and just use their MemoryDB. Then we would not have any additional dependencies and integrating it at some point with PyEVM or switching to a proper db would be more or less trivial.

That said, it probably wouldn't make much of a difference compared to just using a dict directly.

hwwhww commented 6 years ago

@rauljordan Ah, I saw your point, AcitveState is not hard to be rebuilt by CrystallizedState and Blocks. (I thought PryLabs stores ActiveState when I read this line two days ago: https://github.com/prysmaticlabs/prysm/blob/366e5168ba03832a6b423d300524b76268af8242/beacon-chain/blockchain/core.go#L168, but you only stored the head ActiveState, and will change to only in-memory?)

mkalinin commented 6 years ago

EthereumJ stores all data structures in key/value DB. It has an interface that abstracts storage from specific KV engines. For instance, RocksDB is used to store real data on disk. But in tests EthJ uses in-memory DB which implementation is pretty similar to MemoryDB. I agree and support @jannikluhn https://github.com/ethereum/beacon_chain/issues/81#issuecomment-417226082: some abstract interface + in-memory DB is a good thing to start from.

Schema. This is a database schema definition that EthereumJ has so far: https://github.com/ethereum/ethereumj/blob/research/sharding/ethereumj-core/src/main/java/org/ethereum/sharding/config/BeaconConfig.java#L149-L204.

Keys. This is either a hash of encoded object or some index, it depends on the way the data are accessed. Sometimes, e.g. for blocks, there should be an opportunity to access data by both hash and index, depending on the case. It's achieving by introducing another one kv-storage that maps object index on object hash: [index: hash] + [hash: object].

Encoding. EthereumJ uses RLP. But it can be easily replaced with any other encoder when we will be agree on some.

Tries. Each trie is backed by own kv-storage that is a source of trie nodes: [node_hash: node].

P.S. the above came from eth1.0 and works well for the beacon chain and shards.

terencechain commented 6 years ago

prysmatic labs has implemented saving attestation and attestationHashes to local DB in this PR. Feel free to check it out

https://github.com/prysmaticlabs/prysm/pull/472

rawfalafel commented 6 years ago

Hey, I have a few questions about this suggestion!

The attesters create AttestationRecords and then broadcast them. The proposer collects the AttestationRecord messages, creates a block, and broadcasts the Block.

Is there an intermediate step here, where some entity (possibly the proposer) collects individual attestations and bundles them into the aggregated attestations, aka AttestationRecord?

if the AttestationRecord have already been stored in the local DB.

Will this necessarily be true? If proposers are responsible for aggregating attestations and including them in the block, a node that's processing the block won't have those attestations in the DB.

Finally, is this motivation for this proposed schema change to save on storage space, specifically to avoid duplication of AttestationRecord in the block and the active state?

In general, I'm wondering if the difference between this schema and the naive schema (where attestations are duplicated in the block and active state), is about trade-offs in storage size and lookup time. What I mean is that your schema would prevent duplication of AttestationRecord in the block and active state, but with the additional cost where scanning the attestations in a block or active state will cost N+1 lookups, where N is the # of attestations.

The attestation hash itself doesn't seem useful. When looking for duplicate attestations, the hash isn't sufficient because 2 attestations with the same slot and shard but different oblique_parent_hashes are still a duplicate, and 2 attestations with the same slot, shard, oblique_parent_hashes but overlapping yet different attester_bitfield values have to be handled somehow too. Correct me if I'm wrong here!

The naive schema would allow the attestation to be stored within the block/active state, and would only require one lookup. Of course this requires extra storage space in the database since the entirety of the attestation is stored in both the block and the active state.

I'm still a little confused about the distinction between the individual attestations that are signed by attesters, and the aggregated attestations that are included in blocks. From my understanding, it makes sense to store the individual attestations by its hash! But this suggestion seems to be about storing the aggregated attestations by its hash, which I'm having a hard time finding a strong justification for.

Thanks!

hwwhww commented 6 years ago

Feedback appreciated! I'm so sorry for that I missed @terenc3t and @rawfalafel ’s tag in https://github.com/prysmaticlabs/prysm/pull/472!!!

Is there an intermediate step here, where some entity (possibly the proposer) collects individual attestations and bundles them into the aggregated attestations, aka AttestationRecord?

By AttestationRecord here do you mean the in wire protocol format? I think it’s possible, maybe should be named it AttestationRecords in wire protocol, and it could be >=1 attestation(s).

Will this necessarily be true? If proposers are responsible for aggregating attestations and including them in the block, a node that's processing the block won't have those attestations in the DB. oblique_parent_hashes

Agree with the significant flaw of this scheme as you pointed out! There may be a dirty method by replacing “attestation hash” with sig as the query key and only having oblique_parent_hashes in Block, but it's still bad since the I/O overhead of re-wrapping block is TOO HUGE as you mentioned, especially while syncing nodes. Treating attestations like transaction pool probably will be a better way to handle the attestations. What do you think about that?

Finally, is this motivation for this proposed schema change to save on storage space, specifically to avoid duplication of AttestationRecord in the block and the active state?

Yes, if as PryLabs’ suggestion, just keeping the ActiveState in memory, then storing attestations separately is meaningless. The possibility of recalculating ActiveState depends on the forking rate, but it should be fine if comparing it to the query times of re-wrapping a block.

Thank you for looking into this!

rawfalafel commented 6 years ago

By AttestationRecord here do you mean the in wire protocol format?

Yeah, I'm pointing to a distinction between an "individual attestation", which is what a single attester signs, and contains:

and an "aggregated attestation" aka AttestationRecord, which is the wire protocol format defined in the spec. I keep on coming back to this distinction because I feel like the two definitions are being incorrectly conflated.

Treating attestations like transaction pool probably will be a better way to handle the attestations. What do you think about that?

Hmm are you talking about a pool for individual attestations or aggregated attestations? I think a pool for individual attestations make sense.

Yes, if as PryLabs’ suggestion, just keeping the ActiveState in memory, then storing attestations separately is meaningless.

Gotcha! thanks for the clarification.

One simple suggestion is to sort block.attestations and active_state.pending_attestations by slot and shard_id. This way we can do a binary search instead of a scan when searching for existing attestations.

hwwhww commented 6 years ago

individual attestation

Got it. Yes, another data class + name makes sense.

Hmm are you talking about a pool for individual attestations or aggregated attestations? I think a pool for individual attestations make sense.

Yes, for the individual attestations.

One simple suggestion is to sort block.attestations and active_state.pending_attestations by slot and shard_id. This way we can do a binary search instead of a scan when searching for existing attestations.

The current spec says: Extend the list of AttestationRecord objects in the active_state, ordering the new additions in the same order as they came in the block. (Should be pending_attestations) And it's natually the order of making cross-link updated crosslinks: https://github.com/ethereum/beacon_chain/blob/dddee7781386b1c553ea72ad556af4bde09b10c7/beacon_chain/state/state_transition.py#L195

Do you have an idea of what's the use case that we have to search for existing attestations except for a blockchain explorer?

Thanks!

rawfalafel commented 6 years ago

Do you have an idea of what's the use case that we have to search for existing attestations except for a blockchain explorer?

Say the beacon chain processes slot N which contains attestation X. Next, the beacon chain attempts to process slot N+1 which contains the exact same attestation X. Should the beacon chain detect that there's a duplicate attestation?

As another example, what if slot N+1 contains an attestation that is similar to attestation X except for a different value for shard_block_hash. Should that trigger a slashing condition?

If those two scenarios need to be considered, attestations can be queried more quickly if block.attestations and activeState.pending_attestations were ordered by slot and shard.

hwwhww commented 6 years ago

@rawfalafel

Talked with V and Danny, recapping from Danny a little: the current spec doesn't prevent the proposer from accepting and including redundant (partial or full overlaps) across from different blocks of the same chain. It may happen but the proposer won't get the reward from including redundant attestation. (← it's not defined in the spec yet.) And in the state transition, we don't double count for FFG or rewards, so there is no incentive for the proposer to include full overlaps attestations. But we probably can prevent it with other mechanisms like a two-stage protocol that worthy to think about more. Right now we think ordering the attestations for binary search is too complicated for this purpose.

hwwhww commented 6 years ago

deactivate this discussion