The main focus of this PR was to enable the balancer tests and make them more reliable:
Split single-app / large deployment into two separate CI runs for parallelisation
Made use of the profile config option to greatly reduce the time needed to "train" Cruise Control during testing by passing addition properties to cruisecontrol.properties and changing the goals.
During this PR, I noticed and addressed some additional issues:
Fixed full rebalance action, which used the wrong CC endpoint, e.g. /rebalance instead of /full
Fixed add and remove rebalance action, which were sending True/False instead of an integer for the brokerid parameter (missing parenthesis around a walrus assignment)
Found an issue with how we pass broker-capacities in the peer cluster relation. We dumped a set to a json string, meaning that the order was constantly changing, though the actual content was identical. This made the wait_for_idle checks in the tests extremely likely to timeout because the chances of getting the same 4 brokers order over multiple charm invocations in a 30 seconds period are low.
Fixed an issue with large deployment balancer checking for brokerids on the peer relation instead of the peer cluster
Specific K8s: Sped up integration tests setup by re-using built artifact from previous step. First I wrote my own fixture to do that, but then I realised that there is a dp-workflow plugin for that.
Specific K8s: since we use a node port, adding or removing a broker from the cluster does not change the bootstrap-server field in the configuration files, thus not triggering file changes leading to restarting cruise control
Specific K8s: Fixed JMX configuration on K8s (missing jar in pebble layer). Edit: actually, it seems we have a similar issue with VM with large deployments, because the JMX jar is not in CC opt/bin path
Specific K8s, not sure if fixed: So, it is not directly related to the CC feature, but we regularly have an issue where kafka fails on deployment on the leader-elected hook with the following error:
lightkube.core.exceptions.ApiError: services "kafka-k8s-bootstrap" not found
This comes from the bootstrap_server property on the balancer side accessing the nodeport too soon. I added a tiny retry, we'll see if it fixes for good the issue
Misc.
I moved the existing scenario-based test file into its own submodule. I believe this will give us more structure and control over what we want to run.
I now split the two (by 2) tests "full" versus "add/remove" by deployment topology. Not only this make the tests more reliable (with the trade-off of a slightly smaller feature coverage), they were still well over an hour long.
"Full" rebalance after unit removal test was removed. Discussed with Marc:
The main issue is that the full rebalance does not take that [broker 3 removed] into account, and do not move the partitions from 3 (avoiding dead brokers is not a goal). The solution would be to use /kafkacruisecontrol/fix_offline_replicas according to the wiki, not /rebalance
We now have this hideous pattern where we set the action result to an error message. Juju always returns 0 for an action, even if we use event.fail. This took me a huge amount of time on this PR because this makes tests way harder to debug than they should be.
Changes
The main focus of this PR was to enable the balancer tests and make them more reliable:
profile
config option to greatly reduce the time needed to "train" Cruise Control during testing by passing addition properties tocruisecontrol.properties
and changing the goals.During this PR, I noticed and addressed some additional issues:
/rebalance
instead of/full
broker-capacities
in the peer cluster relation. We dumped a set to a json string, meaning that the order was constantly changing, though the actual content was identical. This made thewait_for_idle
checks in the tests extremely likely to timeout because the chances of getting the same 4 brokers order over multiple charm invocations in a 30 seconds period are low.bootstrap-server
field in the configuration files, thus not triggering file changes leading to restarting cruise controlleader-elected
hook with the following error:This comes from the
bootstrap_server
property on the balancer side accessing the nodeport too soon. I added a tiny retry, we'll see if it fixes for good the issueMisc.
event.fail
. This took me a huge amount of time on this PR because this makes tests way harder to debug than they should be.