Open rach-id opened 2 months ago
Thank you for sharing it @rach-id
It seems to be a great framework to work with. I don't think it would hurt to integrate it, or at least the parts we are interested in, after investigating it. Even though BitTwister does some traffic shaping, we can also investigate chaos-mesh and what we can get out of it.
Yes, I am thinking of adding it gradually. We can start with a simple Kubernetes cluster where we can specify the celestia-app versions, the number of validators, etc. Then, we integrate chaos-mesh into it so that we're able to run experiments manually.
If the framework gets used enough and people are interested, we can start integrating it in Knuu.
Still in testing
Proposal
Currently, Knuu allows for introducing delays using tc for testing purposes. However, we can enhance its capabilities by integrating chaos-mesh into it.
Chaos-mesh is a CNCF incubating project, is stable, and is used in production by multiple companies for testing their workloads. It allows the following faults to be injected into the Kubernetes cluster with fine-grained control over them:
For example, the network attacks contain the following:
Also, we're able to stress the resources of validators to see how they behave, change the clock for a set of validators to see the impact on consensus, inject IO faults, etc.
More can be found in the docs.
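To make the stress and clock faults more concrete, here is a minimal Go sketch that builds a StressChaos and a TimeChaos resource as plain maps and prints them as JSON. The helper names (chaosResource, podSelector, CPUStress, ClockSkew), the resource names, the test namespace, and the app label are all assumptions made for illustration. The field names follow the chaos-mesh v1alpha1 resources as I understand them and should be checked against the docs for the installed version.

```go
package main

import (
	"encoding/json"
	"os"
)

// chaosResource wraps a spec in the envelope shared by chaos-mesh resources.
func chaosResource(kind, name, namespace string, spec map[string]any) map[string]any {
	return map[string]any{
		"apiVersion": "chaos-mesh.org/v1alpha1",
		"kind":       kind,
		"metadata":   map[string]any{"name": name, "namespace": namespace},
		"spec":       spec,
	}
}

// podSelector targets pods in one namespace that carry the given labels.
func podSelector(namespace string, labels map[string]string) map[string]any {
	return map[string]any{
		"namespaces":     []string{namespace},
		"labelSelectors": labels,
	}
}

// CPUStress loads the CPU of the selected pods for the given duration.
func CPUStress(name, namespace string, labels map[string]string, workers, load int, duration string) map[string]any {
	return chaosResource("StressChaos", name, namespace, map[string]any{
		"mode":     "all",
		"selector": podSelector(namespace, labels),
		"stressors": map[string]any{
			"cpu": map[string]any{"workers": workers, "load": load},
		},
		"duration": duration,
	})
}

// ClockSkew shifts the clock seen by the selected pods by offset (e.g. "-10m").
func ClockSkew(name, namespace string, labels map[string]string, offset, duration string) map[string]any {
	return chaosResource("TimeChaos", name, namespace, map[string]any{
		"mode":       "all",
		"selector":   podSelector(namespace, labels),
		"timeOffset": offset,
		"duration":   duration,
	})
}

func main() {
	labels := map[string]string{"app": "validator-0"} // assumed label, adjust to your pods
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	for _, r := range []map[string]any{
		CPUStress("validator-cpu", "test", labels, 2, 80, "1m"),
		ClockSkew("validator-clock", "test", labels, "-10m", "1m"),
	} {
		if err := enc.Encode(r); err != nil {
			panic(err)
		}
	}
}
```

Each printed object is a standalone manifest. Since kubectl accepts JSON as well as YAML, one of them can be saved to a file and applied to a local cluster.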
What's interesting about it is that devs will be able to spin up their testnets, then they can start introducing different faults into the cluster and watch how it will respond. Alternatively, they can create YAML config files to create the attacks. Example latency YAML:
Which they can apply to the cluster.
A third approach, which is the most interesting for Knuu, is to integrate this into the framework and allow devs to programmatically define a set of faults in the cluster and build tests using them. This should be possible since the framework uses CustomResourceDefinitions (CRDs) for defining the attacks.
Note: I haven't tried integrating it programmatically. If we decide to integrate it, then I can deep dive into it.
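As a rough illustration of what a programmatic definition could look like, here is a minimal Go sketch that describes a latency fault with a small struct and renders it as a NetworkChaos custom resource. DelayFault and its Manifest method are hypothetical and are not part of Knuu or chaos-mesh. The resource name and the app label are also assumptions, while the 100ms latency, 30s duration, and test namespace mirror the manual experiment in this issue. The field names follow the chaos-mesh v1alpha1 NetworkChaos resource as I understand it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// DelayFault is a hypothetical Knuu-side description of a latency fault.
// It is not an existing Knuu or chaos-mesh type.
type DelayFault struct {
	Name      string            // name of the NetworkChaos resource
	Namespace string            // namespace the target pods run in
	Labels    map[string]string // label selector for the target pods
	Latency   string            // e.g. "100ms"
	Duration  string            // e.g. "30s"
}

// Manifest renders the fault as a NetworkChaos custom resource.
func (f DelayFault) Manifest() map[string]any {
	return map[string]any{
		"apiVersion": "chaos-mesh.org/v1alpha1",
		"kind":       "NetworkChaos",
		"metadata": map[string]any{
			"name":      f.Name,
			"namespace": f.Namespace,
		},
		"spec": map[string]any{
			"action": "delay",
			"mode":   "all", // apply to every pod matched by the selector
			"selector": map[string]any{
				"namespaces":     []string{f.Namespace},
				"labelSelectors": f.Labels,
			},
			"delay":    map[string]any{"latency": f.Latency},
			"duration": f.Duration,
		},
	}
}

func main() {
	fault := DelayFault{
		Name:      "validator-delay", // assumed name
		Namespace: "test",
		Labels:    map[string]string{"app": "validator-0"}, // assumed label
		Latency:   "100ms",
		Duration:  "30s",
	}
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	if err := enc.Encode(fault.Manifest()); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

The sketch only produces the manifest. Submitting it to the cluster, through kubectl or a Kubernetes client, is left out because that is the part that still needs investigation.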
Example
To get a feel for this framework, we can spin up a simple celestia-app e2e test and inject faults into it.
Important: Make sure you're not running any of the commands below on an existing shared/production Kubernetes cluster, so that you don't end up disrupting the shared environment. Running this locally is safer.
Run the E2ESimple test using the command make test-e2e. Follow these instructions for doing so.
Note: If you haven't created the minikube cluster yet, start it using the docker driver and enough resources. I personally used this for this test:
Install chaos-mesh: the framework is a set of workloads that need to be running in the cluster. The following command will do that:
E2ESimple test: I personally add a time.Sleep(60 * time.Minute) at the end of E2ESimple so that the testnet doesn't shut down while I'm conducting tests.
This will allow accessing the dashboard at localhost:2333.
First, follow the logs of the validator that we'll be targeting in the test:
The validator pod name can be obtained from:
Then, open the dashboard and do the following:
New experiment
100ms
test, it's the same namespace used in the E2ESimple.
Label Selectors
30s
Now if you check the validator's logs, you will see that it's taking multiple rounds to reach consensus, or even missing blocks, depending on the latency used. Then, once the experiment ends, you will see the validator catching up again.
Estimation
Integrating this framework will require investigating whether it has an existing programmatic API for executing the attacks. If so, the integration will be easy: we won't need to support everything, so we can select a set of attacks and start with them. If not, it will take more time to integrate.
But first, we will need to decide whether we want to use an existing framework that injects faults or we want to add support for them ourselves natively in Knuu.