EspressoSystems / espresso-sequencer

Chaos monkey testing: Node failure/partition #1027

Open nomaxg opened 7 months ago

nomaxg commented 7 months ago

Test the following:

  1. A node gets disconnected, the network makes progress, and the node catches up.
  2. Two partitions form, neither side makes progress, and they then reconnect.

nomaxg commented 7 months ago

@Ancient123 I assume this is best handled on the terraform side?

Ancient123 commented 7 months ago

Probably more AWS CLI than terraform but yeah.

sveitser commented 1 month ago

@Ancient123 @jbearer what's the status on this?

Ancient123 commented 1 month ago

We have tested individual nodes, and a decent-sized group of nodes (10 of 100) at the same time, being dropped off the network and recovering.
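
For reference, here is a minimal sketch of which side of a split would be expected to make progress in the scenarios discussed in this issue. It assumes equally weighted nodes and a BFT-style quorum of strictly more than two thirds of all nodes. That threshold is an assumption, as nothing in this thread states it, and the function names are hypothetical.

```python
def can_make_progress(partition_size: int, total_nodes: int) -> bool:
    """Return True if a partition holds strictly more than 2/3 of all nodes.

    Assumption: equally weighted nodes and a >2/3 quorum, as in typical
    BFT protocols. The real threshold may differ.
    """
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return 3 * partition_size > 2 * total_nodes


def partition_report(partition_sizes: list[int]) -> list[bool]:
    """For each partition, report whether it could make progress alone."""
    total = sum(partition_sizes)
    return [can_make_progress(size, total) for size in partition_sizes]


if __name__ == "__main__":
    # 10 of 100 nodes dropped: the remaining 90 still form a quorum.
    print(partition_report([90, 10]))
    # An even split: neither side reaches a quorum.
    print(partition_report([50, 50]))
```

Under these assumptions, dropping 10 of 100 nodes leaves the other 90 able to continue, while an even two-way split leaves both sides stalled until they reconnect.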

Ancient123 commented 1 month ago

I haven't tested a full network disconnect while keeping the nodes running. We should test that.