Integrate chaos-mesh into Knuu

rach-id commented 2 months ago

Proposal

Currently, Knuu allows for introducing delays using tc for testing purposes. However, we can enhance its capabilities by integrating chaos-mesh into it.

Chaos-mesh is a CNCF incubating project, is stable, and is used in production by multiple companies for testing their workloads. It allows many kinds of faults to be injected into the Kubernetes cluster with fine-grained control over them.

For example, the network attacks include delay, packet loss, duplication, corruption, partition, and bandwidth limiting.

Also, we're able to stress the resources of validators to see how they would behave, change the clock for a set of validators to see the impact on consensus, IO injections etc.

More can be found in the docs.

What's interesting about it is that devs will be able to spin up their testnets, then they can start introducing different faults into the cluster and watch how it will respond. Alternatively, they can create YAML config files to create the attacks. Example latency YAML:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay2
  namespace: chaos-mesh
spec:
  action: delay 
  mode: one 
  selector: 
    namespaces:
      - test
    labelSelectors:
      'app': 'val0-135317fe'
  delay:
    latency: '20000ms'
  duration: '1m'

They can then apply it to the cluster using kubectl apply.

A third approach, which is the most interesting for Knuu, is to integrate this into the framework and allow devs to programmatically define a set of faults in the cluster and build tests using them. This should be possible since chaos-mesh uses CustomResourceDefinitions (CRDs) to define the attacks.

Note: I haven't tried integrating it programmatically. If we decide to integrate it, I can dive deeper into it.
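As a rough idea of what "programmatically" could look like, here is a minimal Go sketch that builds the same NetworkChaos object as the YAML above and prints it as JSON. It uses only the standard library and does not talk to the cluster. The type names and the NewDelay helper are hypothetical (they are not part of Knuu or chaos-mesh), and only the fields from the YAML example are modelled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metadata mirrors the metadata block of the YAML example.
type Metadata struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

// Selector picks the target pods by namespace and label.
type Selector struct {
	Namespaces     []string          `json:"namespaces"`
	LabelSelectors map[string]string `json:"labelSelectors"`
}

// Delay holds the latency to inject.
type Delay struct {
	Latency string `json:"latency"`
}

// Spec covers only the fields used in the YAML example.
type Spec struct {
	Action   string   `json:"action"`
	Mode     string   `json:"mode"`
	Selector Selector `json:"selector"`
	Delay    Delay    `json:"delay"`
	Duration string   `json:"duration"`
}

// NetworkChaos is a hand-written, partial view of the resource.
type NetworkChaos struct {
	APIVersion string   `json:"apiVersion"`
	Kind       string   `json:"kind"`
	Metadata   Metadata `json:"metadata"`
	Spec       Spec     `json:"spec"`
}

// NewDelay is a hypothetical helper that fills in a delay experiment
// targeting one pod carrying the given app label.
func NewDelay(name, targetNamespace, appLabel, latency, duration string) NetworkChaos {
	return NetworkChaos{
		APIVersion: "chaos-mesh.org/v1alpha1",
		Kind:       "NetworkChaos",
		Metadata:   Metadata{Name: name, Namespace: "chaos-mesh"},
		Spec: Spec{
			Action: "delay",
			Mode:   "one",
			Selector: Selector{
				Namespaces:     []string{targetNamespace},
				LabelSelectors: map[string]string{"app": appLabel},
			},
			Delay:    Delay{Latency: latency},
			Duration: duration,
		},
	}
}

func main() {
	// Same values as the YAML example above.
	chaos := NewDelay("network-delay2", "test", "val0-135317fe", "20000ms", "1m")
	out, err := json.MarshalIndent(chaos, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Since kubectl accepts JSON as well as YAML, the output could be piped into kubectl apply -f - to try it out. A real integration would more likely create the resource through a Kubernetes client, which I haven't tried.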

Example

To get a feel for this framework, we can spin up a simple celestia-app e2e test and inject faults into it.

Important: Make sure you're not running any of the commands below on an existing shared/production Kubernetes cluster, so that you don't end up disrupting the shared environment. Running this locally is safer.

First, make sure you can run the E2ESimple test using the command make test-e2e. Follow these instructions to do so.

Note: If you haven't created the minikube cluster yet, start it using the docker driver and enough resources. I personally used this for this test:

minikube start --driver docker --network socket_vmnet --nodes 1 --cpus=8 --memory=10g

Install chaos-mesh: the framework is a set of workloads that need to be running in the cluster. The following command will do that:

curl -sSL https://mirrors.chaos-mesh.org/v2.6.3/install.sh | bash

Run the E2ESimple test:

make test-e2e

I personally add a time.Sleep(60 * time.Minute) at the end of E2ESimple so that the testnet doesn't shut down while I'm conducting tests.

Once the testnet validators are running, we can expose the chaos-mesh dashboard:

kubectl port-forward --namespace chaos-mesh service/chaos-dashboard 2333:2333

This makes the dashboard accessible at localhost:2333.

Access the dashboard and run an experiment:

First, follow the logs of the validator that we'll be targeting in the test:

kubectl logs --follow --namespace test <validator_pod_name>

You can get the validator pod name with:

kubectl get pods --namespace test

Then, open the dashboard and do the following:

Click New experiment
Choose Kubernetes
Select Network Attack
Select Delay
Set Latency to 100ms
Click Submit
Set the namespace selector to test, the same namespace used in E2ESimple
Give the experiment a name
Select a validator from Label Selectors
Set the duration to 30s
Click Submit
Finally, submit the experiment

Now, if you check the validator's logs, you will see that it's taking multiple rounds to reach consensus, or even missing blocks, depending on the latency used. Once the experiment ends, you will see the validator catching up again.

Estimation

Integrating this framework will require investigating whether it has an existing programmatic API for executing the attacks. If so, the integration will be easy, since we won't need to support everything: we can select a set of attacks and start with them. If not, it will take more time to integrate.
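To illustrate the "select a set of attacks" part, here is a minimal Go sketch of an allowlist check. The chosen subset (delay, loss, partition) is only an assumption for the example, not a decision, and validateAction is a hypothetical helper.

```go
package main

import "fmt"

// supportedActions is a hypothetical initial subset of NetworkChaos actions
// that an integration could start with.
var supportedActions = map[string]bool{
	"delay":     true,
	"loss":      true,
	"partition": true,
}

// validateAction rejects any action outside the supported subset.
func validateAction(action string) error {
	if !supportedActions[action] {
		return fmt.Errorf("action %q is not supported yet", action)
	}
	return nil
}

func main() {
	for _, action := range []string{"delay", "bandwidth"} {
		if err := validateAction(action); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", action)
	}
}
```

Supporting another attack later would then only mean adding an entry to the map, plus whatever fields that attack needs.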

But first, we will need to decide whether we want to use an existing framework that injects faults or we want to add support for them ourselves natively in Knuu.

mojtaba-esk commented 2 months ago

Thank you for sharing it, @rach-id. It seems to be a great framework to work with. I think it does no harm to integrate it, or at least the parts that we are interested in, after investigating it. Even though BitTwister does some traffic shaping, we can also investigate chaos-mesh and what we can get out of it.

rach-id commented 2 months ago

Yes, I am thinking of adding it gradually. We can start with a simple Kubernetes cluster where we can specify the celestia-app versions, the number of validators, etc. Then, integrate chaos-mesh into it so that we're able to run experiments manually.

If the framework gets used enough and people are interested, we can start integrating it in Knuu.

smuu commented 1 week ago

Still in testing
