casper-network / casper-node

Reference client for CASPER protocol
https://casper.network
Apache License 2.0
389 stars 219 forks source link

Non-det failure of `run_equivocator_network` test. #1859

Open goral09 opened 2 years ago

goral09 commented 2 years ago

https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/1453/3/3 failed with

failures:

---- reactor::participating::tests::run_equivocator_network stdout ----
thread 'reactor::participating::tests::run_equivocator_network' panicked at 'assertion failed: `(left == right)`
  left: `[]`,
 right: `[PublicKey::Ed25519(5bc5542a73a3fd4e53c2f3c84838156cf8ce5de0db828a5644b1c6d883056aa6)]`', node/src/reactor/participating/tests.rs:319:13

source: https://github.com/casper-network/casper-node/blob/dev/node/src/reactor/participating/tests.rs#L319

4.11.23 can we create equivocation through the diagnostic port? Goal: make this deterministic and then address. Use the equivocation mechanism available in Zug

afck commented 2 years ago

As discussed, I can't reproduce this, but added more asserts to the test in #1862 in case it happens again.

afck commented 2 years ago

Reopening since it happened again:

---- reactor::participating::tests::run_equivocator_network stdout ----
thread 'reactor::participating::tests::run_equivocator_network' panicked at 'Alice should have been evicted.', node/src/reactor/participating/tests.rs:304:5

==============================================================
Thread: reactor::participating::tests::run_equivocator_network
To reproduce failure, try running with env var:
CL_TEST_SEED=fd02296c9f70cb27da4531fd5207b884
==============================================================

That means the test ran for 20 eras and Alice didn't equivocate. I guess having two nodes with the same key simply doesn't guarantee an equivocation (which is a good thing!). For this test we'll somehow need to ensure that it actually happens.

afck commented 2 years ago

I think the least invasive way to make Alice equivocate is to delay all messages to and from her clone by one round length: She'd then send a witness unit citing the first proposed block, and the clone should send one that doesn't. In the second round, the clone's messages are delivered and all nodes detect the equivocation. The tests could be simplified to always expect an equivocation in the first era.

One way to implement this is to wrap the reactor in a new FilteringReactor that in dispatch_event applies a configurable filter

Box<dyn Fn(Event) -> Either<Effects<Event>, Event>

and either returns the effects directly or forwards to the inner reactor's dispatch_event.

In run_equivocator_network the filter would wrap all events containing IncomingMessage or NetworkRequest in

self.effect_builder.set_timeout(round_len).event(move |_| event)

to delay them.

goral09 commented 2 years ago

I think the least invasive way to make Alice equivocate is to delay all messages to and from her clone by one round length: She'd then send a witness unit citing the first proposed block, and the clone should send one that doesn't. In the second round, the clone's messages are delivered and all nodes detect the equivocation. The tests could be simplified to always expect an equivocation in the first era.

One way to implement this is to wrap the reactor in a new FilteringReactor that in dispatch_event applies a configurable filter

Box<dyn Fn(Event) -> Either<Effects<Event>, Event>

and either returns the effects directly or forwards to the inner reactor's dispatch_event.

In run_equivocator_network the filter would wrap all events containing IncomingMessage or NetworkRequest in

self.effect_builder.set_timeout(round_len).event(move |_| event)

to delay them.

I like the idea. Similar but a different approach would be to expose map/flat_map method on the Network itself that would forward the function to the map/flat_map methods on the Effect type..

EDIT: My approach wouldn't actually work as it would operate on the Effects but we need to access Event.

afck commented 2 years ago

Yes, Effects are futures, and you can't match on them.

goral09 commented 2 years ago

Yes, Effects are futures, and you can't match on them.

There's already a map on Effect type so what I was proposing was adding a map on Network call call each individual Effect::map and pass in the function from the outer Network. That still wouldn't be equivalent to your proposal though.

afck commented 2 years ago

Ah, right, then I think it could work. It might be a bit more complicated, though.

afck commented 2 years ago

Reopening, as it happened again in https://github.com/casper-network/casper-node/pull/1935: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/1744/3/3

afck commented 2 years ago

We doubled the test's era height in https://github.com/casper-network/casper-node/pull/1935, so nodes have more time to detect the equivocation.

afck commented 2 years ago

Looks like that didn't work; it failed again.

TomVasile commented 2 years ago

Experienced this as well, just fyi: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/1988

piotr-dziubecki commented 2 years ago

@goral09 is this fixed?

afck commented 2 years ago

Closing this, until it shows up again.

afck commented 2 years ago

…and it did: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/2692/3/3 So Alice failed to equivocate again: reactor/participating/tests.rs#L361

TomVasile commented 2 years ago

FYI, Joe is seeing this also fail non-deterministically in the 1.4.0 release branch: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/2748/3/3

darthsiroftardis commented 2 years ago

another failure: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/2760/3/3

afck commented 2 years ago

And another one: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/3082/3/3

rafal-ch commented 2 years ago

This test is going to be temporarily disabled. It fails so often, that we tend to just retry the build and don't investigate if it uncovered a real problem.

@Fraser999 @piotr-dziubecki @EdHastingsCasperLabs We think this test is very important, though, so please consider increasing priority of the ticket.

afck commented 2 years ago

As discussed I will remove the #[ignore] and temporarily disable only the flaky assertion for now.

afck commented 1 year ago

To give some more context about the test in general: This is meant to simulate an equivocation, by having Alice run two nodes, as well as inactivity (on feat-fast-sync-v2), by having Bob run none.

The difficulty is to make Alice's nodes actually double-sign something despite the precautions in the code that defend against just that: The test tries to keep the two nodes (both configured with Alice's secret key) separated until they both actually signed something without knowing about the other. To do that, it delays incoming and outgoing messages in one of her nodes. It also needs to recognize messages that actually can't be part of a pair of conflicting signatures, like Highway's Pings: These still need to be delayed, so that the nodes don't learn about each other, but they mustn't be taken as a sign that the desired equivocation has been achieved.

Equivocation is possible:

ghost commented 1 year ago

Maybe add a bounty for such time-intensive/frustrating/sporadic bugs.

marc-casperlabs commented 4 months ago

This may be addressed now, depending on how #4551 shakes out. If that makes it into dev and no more failures occur, we can probably consider this addressed.

Potentially relevant commits: