kosstennbl closed 1 month ago
Lgtm
lgtm
There is a good spec test for node_drain, https://github.com/cnti-testcatalog/testsuite/blob/main/spec/workload/resilience/node_drain_spec.cr, and it behaves correctly (verified).
The real reason the issue was not detected earlier in GitHub Actions is that the workflows use a single-node kind setup for testing. The node_drain test needs a multi-node setup, so the spec test "passes" only because the test is "skipped". Example: https://github.com/cnti-testcatalog/testsuite/actions/runs/9025631332/job/24802869636
⏭️ 🏆SKIPPED: [node_drain] node_drain chaos test requires the cluster to have atleast two schedulable nodes 🗡️💀♻
The spec test accepts such a skip as a pass:

```crystal
if KubectlClient::Get.schedulable_nodes_list.size > 1
  (/(PASSED).*(node_drain chaos test passed)/ =~ result[:output]).should_not be_nil
else
  (/(SKIPPED).*(node_drain chaos test requires the cluster to have atleast two)/ =~ result[:output]).should_not be_nil
end
```
So I propose adapting the GitHub Actions workflows to run on kind with two schedulable nodes. Since this is a more generic change, I suggest handling it in a separate ticket.
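To make the masking effect concrete, here is a minimal Python sketch (not the testsuite's Crystal code) of the spec's branching, working on a `kubectl get nodes -o json` style document. It assumes a node counts as schedulable when it is not cordoned and has no `NoSchedule`/`NoExecute` taint, which may differ from what `KubectlClient::Get.schedulable_nodes_list` actually checks. The function names and the `strict` flag are hypothetical; `strict` shows how a CI run could refuse to treat a skip as a pass.

```python
import re
from typing import Any, Dict, List

# Log phrases copied from the spec test (including its "atleast" spelling).
PASSED_RE = re.compile(r"(PASSED).*(node_drain chaos test passed)")
SKIPPED_RE = re.compile(
    r"(SKIPPED).*(node_drain chaos test requires the cluster to have atleast two)"
)


def is_schedulable(node: Dict[str, Any]) -> bool:
    """Assumed rule: not cordoned and no NoSchedule/NoExecute taint."""
    spec = node.get("spec", {})
    if spec.get("unschedulable", False):
        return False
    taints = spec.get("taints") or []
    return not any(t.get("effect") in ("NoSchedule", "NoExecute") for t in taints)


def schedulable_node_names(node_list: Dict[str, Any]) -> List[str]:
    """Names of schedulable nodes in a `kubectl get nodes -o json` style document."""
    return [
        node["metadata"]["name"]
        for node in node_list.get("items", [])
        if is_schedulable(node)
    ]


def node_drain_output_ok(output: str, schedulable: int, strict: bool = False) -> bool:
    """Mirror the spec's two branches; with strict=True a skip counts as a failure."""
    if schedulable > 1:
        return PASSED_RE.search(output) is not None
    if strict:
        # Too few nodes: the skip means node_drain was never actually exercised.
        return False
    return SKIPPED_RE.search(output) is not None
```

On a single-node kind cluster the only node is the control plane, so `schedulable_node_names` returns one name and `node_drain_output_ok(log, 1)` accepts the SKIPPED line, which is exactly how the regression slipped through. Assuming kind's default of keeping the control-plane taint whenever worker nodes exist, a cluster with one control-plane and two workers would report two schedulable nodes and force the `PASSED` branch.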
@martin-mat I have verified that the node_drain fix works both as a single test and in the cert command. However, the cert command never finishes, which appears to be a separate issue from node_drain. It may pertain to sig_term_handled, zombie_handled, or specialized_init_system. The logging doesn't indicate where it gets stuck. For this ticket, though, I think the node_drain fix should go in.
```yaml
---
name: cnf testsuite
testsuite_version: node-drain-fix-2024-05-15-142132-3258c691
status:
command: /home/dwilmes/.mtx/konstruxx/working/tests/testCHF/cnf-testsuite cert essential
points: 100
exit_code: 0
items:
```
@daniel-wilmes please open a new issue for:

> However, I will say the cert command does not ever finish, which appears to be a separate issue from node drain. It may pertain to either sig_term_handled, zombie_handled, or specialized_init_system. The logging doesn't appear to indicate where we are stuck.
You can try disabling those tests with `~testname` (e.g. `~sig_term_handled`) to isolate which one needs to be investigated. Add that info to the new ticket you create.
cc: @Smitholi67
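The isolation approach suggested above can be scripted by excluding one suspect at a time with the `~testname` syntax and checking which exclusion lets `cert` finish. This is a minimal Python sketch under stated assumptions: the binary is invoked as `./cnf-testsuite` from the current directory, only one suspect hangs, and the 30-minute timeout is arbitrary. The helper names are hypothetical; `subprocess.run` kills the direct child when the timeout expires, but processes it spawned may need manual cleanup.

```python
import subprocess
from typing import Callable, List, Sequence

# Hypothetical location of the testsuite binary; adjust to your checkout.
TESTSUITE = "./cnf-testsuite"

# Suspects named in the comment above.
SUSPECTS = ["sig_term_handled", "zombie_handled", "specialized_init_system"]

# A runner takes a command line and a timeout and reports whether it finished.
Runner = Callable[[Sequence[str], float], bool]


def cert_without(test_name: str) -> List[str]:
    """Build a `cert` invocation that skips one test via the ~testname syntax."""
    return [TESTSUITE, "cert", f"~{test_name}"]


def finishes_in_time(args: Sequence[str], timeout: float) -> bool:
    """True if the command exits before `timeout` seconds, False if it hangs."""
    try:
        subprocess.run(args, timeout=timeout, check=False)
        return True
    except subprocess.TimeoutExpired:
        return False


def suspects_causing_hang(suspects: Sequence[str], timeout: float = 1800.0,
                          runner: Runner = finishes_in_time) -> List[str]:
    """Suspects whose exclusion lets `cert` complete; one culprit gives one name."""
    return [name for name in suspects if runner(cert_without(name), timeout)]
```

Calling `suspects_causing_hang(SUSPECTS)` runs up to three full cert passes, so it is slow but needs no changes to the testsuite itself. The `runner` parameter exists so the selection logic can be exercised without a cluster.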
@martin-mat good catch on the kind setup in GitHub Actions. https://github.com/cnti-testcatalog/testsuite/pull/2024#issuecomment-2112309884
In 69dedb30dd686ab2d1edffb4178ddb3f2c94e7e4, the Litmus version was updated and all experiment links were changed accordingly, except for node_drain's. This commit fixes that.
ref: #2022