Robustness of eth2 clients to the eth1 node

ethers commented 4 years ago

Would like to suggest this reminder to (all) eth2 clients: be robust against whatever the eth1 node/infrastructure does.

Here are some recent observations without calling out clients directly:

In one testnet, new beacon nodes couldn't sync to the testnet because the provided goerli node was overloaded with all the clients it is serving.
On a different client, given a running beacon node, validator, and eth1 node, when the eth1 goerli node started having issues (such as losing peers), after some time the beacon node was not able to continue functioning, and so the validator also stopped working. It's my understanding that the validator should continue unaffected, if an eth1 node goes down. This was not the case as the goerli node took down the beacon node. [To this client's credit, the validator node never had to be restarted: when the goerli node and beacon nodes are restarted and function again, the validator node resumes nicely.]

(I recall there was a pre-launch checklist of some sort [by @djrtwo] but I haven't been able to find it again. I suggest this testing be explicitly added.)

prestonvanloon commented 4 years ago

Chiming in on the Prysm side of things since I believe you may have experienced the above issues in our testnet.

> new beacon nodes couldn't sync to the testnet because the provided goerli node was overloaded with all the clients it is serving.

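In order to determine the genesis state, the beacon node must have access to all of the deposits involved in creating this state, which is why a new node depends on the eth1 node at startup. One idea is to hardcode the genesis state into the application post-launch, so that new nodes would not need to fetch those deposits.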

> On a different client, given a running beacon node, validator, and eth1 node, when the eth1 goerli node started having issues (such as losing peers), after some time the beacon node was not able to continue functioning, and so the validator also stopped working. It's my understanding that the validator should continue unaffected, if an eth1 node goes down.

I am not sure if this was Prysm or not, but eth1 connection only affects block proposals. The beacon node should be able to continue without an eth1 connection. In Prysm, we have recently implemented a timeout of 2 seconds when requesting eth1 information during block proposals (https://github.com/prysmaticlabs/prysm/pull/5583). This is to help mitigate any issues where an eth1 node is slow to respond and a block proposal must be created in a timely fashion to maximize the validator's reward. If the timeout is exceeded, a random vote is used.

Thanks

paulhauner commented 4 years ago

> I am not sure if this was Prysm or not, but eth1 connection only affects block proposals.

I think this was with Lighthouse, but I can't find the issue anymore. There was nothing to suggest that the eth1 node going down was linked to the beacon node losing peers, apart from one happening some time after the other. I've never observed this, nor can I figure out how it might happen, so I closed the issue.

dapplion commented 9 months ago

pre-genesis issue

ethereum / consensus-specs

Robustness of eth2 clients to the eth1 node #1759