Open goral09 opened 2 years ago
As discussed, I can't reproduce this, but added more asserts to the test in #1862 in case it happens again.
Reopening since it happened again:
---- reactor::participating::tests::run_equivocator_network stdout ----
thread 'reactor::participating::tests::run_equivocator_network' panicked at 'Alice should have been evicted.', node/src/reactor/participating/tests.rs:304:5
==============================================================
Thread: reactor::participating::tests::run_equivocator_network
To reproduce failure, try running with env var:
CL_TEST_SEED=fd02296c9f70cb27da4531fd5207b884
==============================================================
That means the test ran for 20 eras and Alice didn't equivocate. I guess having two nodes with the same key simply doesn't guarantee an equivocation (which is a good thing!). For this test we'll somehow need to ensure that it actually happens.
I think the least invasive way to make Alice equivocate is to delay all messages to and from her clone by one round length: She'd then send a witness unit citing the first proposed block, and the clone should send one that doesn't. In the second round, the clone's messages are delivered and all nodes detect the equivocation. The tests could be simplified to always expect an equivocation in the first era.
One way to implement this is to wrap the reactor in a new FilteringReactor
that in dispatch_event
applies a configurable filter
Box<dyn Fn(Event) -> Either<Effects<Event>, Event>
and either returns the effects directly or forwards to the inner reactor's dispatch_event
.
In run_equivocator_network
the filter would wrap all event
s containing IncomingMessage
or NetworkRequest
in
self.effect_builder.set_timeout(round_len).event(move |_| event)
to delay them.
I think the least invasive way to make Alice equivocate is to delay all messages to and from her clone by one round length: She'd then send a witness unit citing the first proposed block, and the clone should send one that doesn't. In the second round, the clone's messages are delivered and all nodes detect the equivocation. The tests could be simplified to always expect an equivocation in the first era.
One way to implement this is to wrap the reactor in a new
FilteringReactor
that indispatch_event
applies a configurable filterBox<dyn Fn(Event) -> Either<Effects<Event>, Event>
and either returns the effects directly or forwards to the inner reactor's
dispatch_event
.In
run_equivocator_network
the filter would wrap allevent
s containingIncomingMessage
orNetworkRequest
inself.effect_builder.set_timeout(round_len).event(move |_| event)
to delay them.
I like the idea. Similar but a different approach would be to expose . map
/flat_map
method on the Network
itself that would forward the function to the map
/flat_map
methods on the Effect
type.
EDIT: My approach wouldn't actually work as it would operate on the Effect
s but we need to access Event
.
Yes, Effect
s are futures, and you can't match
on them.
Yes,
Effect
s are futures, and you can'tmatch
on them.
There's already a map
on Effect
type so what I was proposing was adding a map
on Network
call call each individual Effect::map
and pass in the function from the outer Network
. That still wouldn't be equivalent to your proposal though.
Ah, right, then I think it could work. It might be a bit more complicated, though.
Reopening, as it happened again in https://github.com/casper-network/casper-node/pull/1935: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/1744/3/3
We doubled the test's era height in https://github.com/casper-network/casper-node/pull/1935, so nodes have more time to detect the equivocation.
Looks like that didn't work; it failed again.
Experienced this as well, just fyi: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/1988
@goral09 is this fixed?
Closing this, until it shows up again.
…and it did: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/2692/3/3 So Alice failed to equivocate again: reactor/participating/tests.rs#L361
FYI, Joe is seeing this also fail non-deterministically in the 1.4.0 release branch: https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/2748/3/3
This test is going to be temporarily disabled. It fails so often, that we tend to just retry the build and don't investigate if it uncovered a real problem.
@Fraser999 @piotr-dziubecki @EdHastingsCasperLabs We think this test is very important, though, so please consider increasing priority of the ticket.
As discussed I will remove the #[ignore]
and temporarily disable only the flaky assertion for now.
To give some more context about the test in general: This is meant to simulate an equivocation, by having Alice run two nodes, as well as inactivity (on feat-fast-sync-v2
), by having Bob run none.
The difficulty is to make Alice's nodes actually double-sign something despite the precautions in the code that defend against just that: The test tries to keep the two nodes (both configured with Alice's secret key) separated until they both actually signed something without knowing about the other. To do that, it delays incoming and outgoing messages in one of her nodes. It also needs to recognize messages that actually can't be part of a pair of conflicting signatures, like Highway's Ping
s: These still need to be delayed, so that the nodes don't learn about each other, but they mustn't be taken as a sign that the desired equivocation has been achieved.
Equivocation is possible:
Unit
s with the same sequence number.Echo
s with a different hash, or both a Vote(true)
and Vote(false)
, in the same round.Maybe add a bounty for such time-intensive/frustrating/sporadic bugs.
This may be addressed now, depending on how #4551 shakes out. If that makes it into dev
and no more failures occur, we can probably consider this addressed.
Potentially relevant commits:
https://drone-auto-casper-network.casperlabs.io/casper-network/casper-node/1453/3/3 failed with
source: https://github.com/casper-network/casper-node/blob/dev/node/src/reactor/participating/tests.rs#L319
4.11.23 can we create equivocation through the diagnostic port? Goal: make this deterministic and then address. Use the equivocation mechanism available in Zug