Closed shaspitz closed 2 years ago
Documenting difficulties...
There are three ways Jehan and I considered to invoke downtime for a validator within the integration test docker container.
Solution 1 is easy to implement (and is for #257), but limits our tests in that a validator cannot be brought back online elegantly.
Solution 2 could make sense, but would require a good amount of refactoring in the integration test scripts. There may also be complexities in node/peer configuration when spawning, killing, or respawning a node process arbitrarily.
Solution 3 allows for the most flexibility and the least refactoring. I attempted to use both `ip` and `iptables` to achieve this. Examples:
iptables -A INPUT -s 7.7.7.2 -j DROP
iptables -A OUTPUT -s 7.7.7.2 -j DROP
ip addr del 7.7.7.1/32 dev eth0
In all cases, I was not able to isolate a single node from the rest of its peers without blocking TCP communication altogether. Even when a node's IP was deleted, it was still able to connect to its peers after a short recovery time. I have a feeling that the assumptions we made about conventional networking may just not apply to a localhost environment, i.e. a node may not need a source IP to connect to a (destination) localhost IP. Localhost routing also seems to bypass output traffic rules from `iptables`.
Is there possibly some networking trick that I'm missing here? Or some cool functionality within the p2p layer of tendermint that disallows this idea?
It may also be a good idea to try placing each node within its own network namespace to aid in blocking traffic, using `ip netns`.
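To make the namespace idea concrete, here is a rough sketch of what it could look like: one namespace per node, each attached to a shared bridge through a veth pair, with downtime simulated by taking the host side of one pair down. Everything named here is an assumption for illustration (the bridge `br-val`, the `veth-`/`peer-` prefixes, node names like `val0`, the `/24` prefix, and the `DRY_RUN` switch); it has not been tested against our containers, and the real commands need root (or `CAP_NET_ADMIN` inside Docker).

```shell
#!/usr/bin/env bash
# Sketch only: isolate each validator in its own network namespace so one
# node can be cut off (and later restored) without touching the others.
# All names (br-val, veth-*, peer-*, DRY_RUN) are hypothetical.
set -euo pipefail

# With DRY_RUN=1 commands are only printed; otherwise they run (needs root).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

BRIDGE=br-val

# Create the bridge that all node namespaces attach to.
setup_bridge() {
  run ip link add name "$BRIDGE" type bridge
  run ip link set "$BRIDGE" up
}

# add_node <name> <cidr>: create namespace <name>, wire it to the bridge
# with a veth pair, and assign <cidr> to the end inside the namespace.
add_node() {
  local name=$1 cidr=$2
  run ip netns add "$name"
  run ip link add "veth-$name" type veth peer name "peer-$name"
  run ip link set "peer-$name" netns "$name"
  run ip link set "veth-$name" master "$BRIDGE"
  run ip link set "veth-$name" up
  run ip netns exec "$name" ip addr add "$cidr" dev "peer-$name"
  run ip netns exec "$name" ip link set "peer-$name" up
  run ip netns exec "$name" ip link set lo up
}

# Simulate downtime by taking the host side of the veth pair down;
# bring it back up to restore connectivity.
node_down() { run ip link set "veth-$1" down; }
node_up()   { run ip link set "veth-$1" up; }
```

With these functions sourced, something like `setup_bridge`, `add_node val0 7.7.7.1/24`, `add_node val1 7.7.7.2/24` would build the topology, each node process would be started with `ip netns exec <name> ...`, and `node_down val1` / `node_up val1` would take that validator offline and back online. Since the nodes would no longer share a loopback interface, traffic between them would have to cross the veth links, which is the part that seemed impossible to block on localhost.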
Any input here is much appreciated
Hey, I don't have experience simulating network partitions locally but a quick search turned up
- https://wiki.linuxfoundation.org/networking/netem linux util
- https://github.com/jepsen-io/jepsen jepsen distr sys network sim
- https://github.com/worstcase/blockade blockade docker network partition sim
I'm just putting these here for curiosity's sake, I'm not necessarily advocating adding a dependency
Nice finds! There are probably multiple ways this issue could have been solved, and those links seem like viable tools for the problem. Jehan had used network namespaces previously, and found this article which ended up walking us through the exact steps we needed.
We need to integration test the voting power of validators when a downtime has occurred. This should be tested for both a downtime observed on a consumer chain, and a downtime observed on a provider chain.