cosmos / interchain-security

Interchain Security is an open sourced IBC application which allows cosmos blockchains to lease their proof-of-stake security to one another.
https://cosmos.github.io/interchain-security/

Test Validator Downtime in Integration Tests #256

Closed shaspitz closed 2 years ago

shaspitz commented 2 years ago

We need integration tests that check the voting power of validators after downtime has occurred. This should be tested both for downtime observed on a consumer chain and for downtime observed on a provider chain.

shaspitz commented 2 years ago

Documenting difficulties...

There are three ways Jehan and I considered to invoke downtime for a validator within the integration test docker container.

  1. Cause the node to panic
  2. Store PID of node, kill and spawn node processes arbitrarily
  3. Block/Censor IP of the node

Solution 1 is easy to implement (and is used for #257), but it limits our tests in that a validator cannot be brought back online cleanly.

Solution 2 could make sense, but would require a fair amount of refactoring in the integration test scripts. There may also be complexities in node/peer configuration when spawning, killing, or respawning a node process arbitrarily.
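For illustration only, a minimal sketch of what Solution 2 could look like in a test script. The function names, `PID_DIR`, and `NODE_CMD` are all hypothetical, and `sleep` stands in for the real node binary; it does not address the peer configuration concerns mentioned above.

```shell
#!/bin/sh
# Sketch of Solution 2: track each node by PID so it can be killed and respawned.
# NODE_CMD is a placeholder; a real test would launch the chain binary here.
NODE_CMD="${NODE_CMD:-sleep 300}"
PID_DIR="${PID_DIR:-$(mktemp -d)}"

start_node() {   # usage: start_node <name>
    # Run the node in the background and remember its PID.
    $NODE_CMD >/dev/null 2>&1 &
    echo $! > "$PID_DIR/$1.pid"
}

stop_node() {    # usage: stop_node <name>
    pid=$(cat "$PID_DIR/$1.pid")
    kill "$pid" 2>/dev/null || true
    # Reap the child so a later liveness check does not see a zombie.
    wait "$pid" 2>/dev/null || true
    rm -f "$PID_DIR/$1.pid"
}

node_is_up() {   # usage: node_is_up <name>; succeeds if the PID is alive
    [ -f "$PID_DIR/$1.pid" ] && kill -0 "$(cat "$PID_DIR/$1.pid")" 2>/dev/null
}
```

A downtime test would then call `stop_node val1`, wait for the downtime window to pass, and call `start_node val1` again. Note that `wait` only works because the node is a child of the same shell, which is one reason the existing scripts would need refactoring.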

Solution 3 allows for the most flexibility and requires the least refactoring. I attempted to use both ip and iptables to achieve this. Examples:

iptables -A INPUT -s 7.7.7.2 -j DROP
iptables -A OUTPUT -s 7.7.7.2 -j DROP
ip addr del 7.7.7.1/32 dev eth0

In all cases, I was not able to isolate a single node from the rest of its peers without blocking TCP communication altogether. Even when a node's IP was deleted, it was still able to connect to its peers after a short recovery time. I have a feeling that the assumptions we made about conventional networking may just not apply to a localhost environment, i.e., a node may not need a source IP to connect to a (destination) localhost IP. Localhost routing also seems to bypass iptables rules for output traffic.

Is there possibly some networking trick that I'm missing here? Or some functionality within the p2p layer of Tendermint that defeats this approach?

It may also be a good idea to try placing each node within its own network namespace, using ip netns, to aid in blocking traffic.
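A rough sketch of the namespace idea, assuming one namespace per node joined to a shared bridge by a veth pair. The names (`br-ics`, `ns-<node>`, `veth-<node>`, `eth-<node>`) and the helper functions are made up for illustration and are not taken from the test scripts. By default the script only prints the `ip` commands (`DRY_RUN=1`); running them for real needs root and iproute2.

```shell
#!/bin/sh
# Sketch: isolate each node in its own network namespace.
# All names here are hypothetical. DRY_RUN=1 prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
BRIDGE="${BRIDGE:-br-ics}"

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

setup_bridge() {
    # Shared bridge that every node's veth endpoint plugs into.
    run ip link add name "$BRIDGE" type bridge
    run ip link set "$BRIDGE" up
}

setup_node_ns() {   # usage: setup_node_ns <node> <ip/prefix>
    node=$1; cidr=$2
    run ip netns add "ns-$node"
    # veth pair: one end stays on the host bridge, the other moves into the namespace.
    run ip link add "veth-$node" type veth peer name "eth-$node"
    run ip link set "eth-$node" netns "ns-$node"
    run ip link set "veth-$node" master "$BRIDGE"
    run ip link set "veth-$node" up
    run ip netns exec "ns-$node" ip addr add "$cidr" dev "eth-$node"
    run ip netns exec "ns-$node" ip link set "eth-$node" up
    run ip netns exec "ns-$node" ip link set lo up
}

# Simulate downtime by taking the host side of the veth pair down, then restore it.
node_offline() { run ip link set "veth-$1" down; }
node_online()  { run ip link set "veth-$1" up; }
```

For example, `setup_node_ns val1 7.7.7.2/24` followed later by `node_offline val1` and `node_online val1`. The node process itself would be started inside the namespace with `ip netns exec`, so its traffic has to cross the veth link rather than being routed locally.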

Any input here is much appreciated.

danwt commented 2 years ago

Hey, I don't have experience simulating network partitions locally but a quick search turned up

I'm just putting these here for curiosity's sake; I'm not necessarily advocating adding a dependency.

shaspitz commented 2 years ago

Nice finds! There are probably multiple ways this issue could have been solved, and those links seem like viable tools for the problem. Jehan had used network namespaces previously, and found this article which ended up walking us through the exact steps we needed.