srteam2020 opened 3 years ago
This issue might be hard to reproduce, as it only manifests under a particular interleaving (e.g., when the reconcile goroutine runs slowly). We have recently been building a tool, https://github.com/sieve-project/sieve, to reliably detect and reproduce bugs like this one in various k8s controllers. This bug can be reproduced automatically with just a few commands using our tool. Please give it a try if you would like to reproduce the bug.
To use the tool, please first run the environment check:
python3 check-env.py
If the checker passes, build the Kubernetes and controller images using the commands below; our tool needs them to set up a kind cluster to reproduce the bug:
python3 build.py -p kubernetes -m obs-gap -d DOCKER_REPO_NAME
python3 build.py -p casskop-operator -m obs-gap -d DOCKER_REPO_NAME -s f87c8e05c1a2896732fc5f3a174f1eb99e936907
Here -d DOCKER_REPO_NAME should be a Docker repo that you have write access to, and -s f87c8e05c1a2896732fc5f3a174f1eb99e936907 is the commit where the bug was found.
Finally, run
python3 sieve.py -p casskop-operator -t nodesperrack -d DOCKER_REPO_NAME
It will try to scale the Cassandra cluster from 2 -> 1 -> 0 and also manipulate the goroutine interleaving of the operator to trigger the corner-case interleaving mentioned above. The bug is successfully reproduced if you see the message below:
[ERROR] pod SIZE inconsistency: 2 seen after learning run, but 3 seen after testing run
Ideally there should be 2 pods at the end (1 operator pod and 1 Cassandra replica), but after the test run we observe 3 pods (1 operator pod and 2 Cassandra replicas, as mentioned above).
For more information please refer to https://github.com/sieve-project/sieve#sieve-testing-datacenter-infrastructures-using-partial-histories and https://github.com/sieve-project/sieve/blob/main/docs/reprod.md#orange-opensource-casskop-342
Please let me know if you encounter any problems when reproducing the bug with the tool.
@srteam2020 is it something you'd like to work on?
@cscetbon Thank you for the reply. We are looking for a clean fix for this problem. I will get back and send a PR if we find a solution.
Bug Report
The casskop operator has the following logic to invalidate a user request that scales the cluster down to 0; the replica count is then determined by the previous cluster state stored in the annotations of the operator instance. Here oldCRD refers to the last-applied-configuration for the CRD, which is dumped and stored in the operator's annotation field during every reconcile. We ran a workload that changes the replica count (NodesPerRacks) from 2 -> 1 -> 0. Ideally, the replicas should remain at 1, since scaling down to 0 is not allowed. However, we observed 2 replicas at the end.
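A minimal sketch of the pattern described above, with hypothetical type and function names (this is not the actual casskop code; it only illustrates how a scale-to-0 request is replaced by the value recovered from the last-applied-configuration annotation):

```go
// Hypothetical sketch of the scale-to-0 invalidation pattern (names invented
// for illustration; the real check lives in casskop's reconcile path).
package main

import "fmt"

// CassandraClusterSpec is a reduced stand-in for the real CRD spec.
type CassandraClusterSpec struct {
	NodesPerRacks int32
}

// checkNonAllowedChanges reverts a scale-to-0 request using oldSpec, the
// spec recovered from the last-applied-configuration annotation (oldCRD).
func checkNonAllowedChanges(newSpec, oldSpec *CassandraClusterSpec) {
	if newSpec.NodesPerRacks == 0 {
		// Scaling down to 0 is not allowed: fall back to the value the
		// operator stored during the previous reconcile.
		newSpec.NodesPerRacks = oldSpec.NodesPerRacks
	}
}

func main() {
	oldSpec := &CassandraClusterSpec{NodesPerRacks: 1}
	newSpec := &CassandraClusterSpec{NodesPerRacks: 0}
	checkNonAllowedChanges(newSpec, oldSpec)
	fmt.Println(newSpec.NodesPerRacks) // prints 1: the annotated value wins
}
```

Note that the correctness of this pattern depends entirely on the annotation being up to date when the 0 arrives, which is exactly what the interleaving below violates.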
The reason is that the goroutine performing reconciliation (g1) and the goroutine updating the cassandraCluster object (g2) run concurrently, and certain interleavings of the two can lead to the unexpected scaling behavior. Ideally, the following interleaving leads to the correct result:
1. g2 updates NodesPerRacks from 2 to 1.
2. g1 reconciles, sees 1, and stores 1 in the last-applied-configuration annotation.
3. g2 updates NodesPerRacks from 1 to 0.
4. g1 reconciles, sees the disallowed 0, and reverts to the annotated value, 1.
And we found that the following interleaving leads to the unexpected scaling behavior mentioned above:
1. g2 updates NodesPerRacks from 2 to 1.
2. Before g1 gets a chance to reconcile, g2 updates NodesPerRacks from 1 to 0.
3. g1 finally reconciles, sees the disallowed 0, and reverts to the annotated value, which is still 2.
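The two schedules can be modeled with a small toy program (this is not controller-runtime code; the struct fields and function below are invented to simulate the annotation-based revert under the two interleavings):

```go
// Toy model of the interleaving. lastApplied stands in for the
// last-applied-configuration annotation that reconcile rewrites on every
// run; assignments to nodesPerRacks stand in for g2's spec updates.
package main

import "fmt"

type cluster struct {
	nodesPerRacks int32 // current desired replicas in the spec
	lastApplied   int32 // spec stored in the annotation by the last reconcile
}

// reconcile rejects a scale-to-0 by restoring the annotated value, then
// records the (possibly corrected) spec back into the annotation.
func reconcile(c *cluster) {
	if c.nodesPerRacks == 0 {
		c.nodesPerRacks = c.lastApplied
	}
	c.lastApplied = c.nodesPerRacks
}

func main() {
	// Correct interleaving: reconcile fires after each update, so the
	// annotation already holds 1 when the 0 arrives.
	good := &cluster{nodesPerRacks: 2, lastApplied: 2}
	good.nodesPerRacks = 1
	reconcile(good) // annotation now 1
	good.nodesPerRacks = 0
	reconcile(good)
	fmt.Println("correct interleaving:", good.nodesPerRacks) // 1

	// Buggy interleaving: reconcile runs slow, both updates land first,
	// and the annotation still holds 2 when the 0 is seen.
	bad := &cluster{nodesPerRacks: 2, lastApplied: 2}
	bad.nodesPerRacks = 1
	bad.nodesPerRacks = 0 // reconcile never observed the intermediate 1
	reconcile(bad)
	fmt.Println("buggy interleaving:", bad.nodesPerRacks) // 2
}
```

The second schedule reproduces the observed outcome: the cluster ends with 2 replicas instead of 1.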
What did you do? As mentioned above, we changed the replica (NodesPerRacks) from 2 -> 1 -> 0.
What did you expect to see? There should be 1 replica at the end.
What did you see instead? Under which circumstances? We observed that it ended up with 2 replicas.
Environment
casskop version: f87c8e05c1a2896732fc5f3a174f1eb99e936907 (master branch)
Kubernetes version information: 1.18.9
Cassandra version: 3.11
Possible Solution This issue is probably hard to fix from the operator side: how the operator reads events from users (e.g., scaling up/down) and how the different goroutines interleave with each other are decided by the controller-runtime package, not by the operator code. We have not found a good way to solve this issue so far. We have also observed similar behavior in other controllers. We are still working on a good solution to this kind of issue and will send a PR for it.