kosstennbl commented 1 month ago

In 69dedb30dd686ab2d1edffb4178ddb3f2c94e7e4, the Litmus version was updated and all links for experiments were changed accordingly, except for node_drain. This commit fixes that.

ref: #2022

taylor commented 1 month ago

~~Lgtm~~

martin-mat commented 1 month ago

lgtm

martin-mat commented 1 month ago

there is a good spec test for node_drain https://github.com/cnti-testcatalog/testsuite/blob/main/spec/workload/resilience/node_drain_spec.cr which behaves correctly (verified).

The real reason the issue was not detected earlier in GitHub Actions is that they use a one-node kind setup for testing. The node_drain test needs a multi-node setup, and the spec test "passes" because the test is "skipped". Example: https://github.com/cnti-testcatalog/testsuite/actions/runs/9025631332/job/24802869636

⏭️ 🏆SKIPPED: [node_drain] node_drain chaos test requires the cluster to have atleast two schedulable nodes 🗡️💀♻

The spec test is happy with such skipping:

      if KubectlClient::Get.schedulable_nodes_list.size > 1
        (/(PASSED).*(node_drain chaos test passed)/ =~ result[:output]).should_not be_nil
      else
        (/(SKIPPED).*(node_drain chaos test requires the cluster to have atleast two)/ =~ result[:output]).should_not be_nil
      end

So I propose adapting GitHub Actions so that they run on kind with 2 schedulable nodes. Since this is a more generic adaptation, I suggest handling it in a separate ticket.
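For illustration, a kind cluster config along these lines should give two schedulable nodes. This is only a sketch, not the config the workflows actually use: it assumes kind's usual multi-node behaviour, where the control-plane node keeps its NoSchedule taint and only the workers are schedulable, and the file name `kind-config.yaml` is hypothetical.

```yaml
# kind-config.yaml (hypothetical file name)
# One control-plane node plus two workers. Assuming the control-plane
# taint is left in place, the two workers are the schedulable nodes
# that node_drain needs.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
```

It could be passed to `kind create cluster --config kind-config.yaml`. With such a cluster the spec would take the `PASSED` branch instead of the `SKIPPED` one.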

2026

daniel-wilmes commented 1 month ago

@martin-mat I have verified that the fix for node_drain works as a single test and in the cert command. However, I will say the cert command never finishes, which appears to be a separate issue from node_drain. It may pertain to either sig_term_handled, zombie_handled, or specialized_init_system. The logging doesn't appear to indicate where we are stuck. But for this ticket I think the fix for node_drain should go in.

      ---
      name: cnf testsuite
      testsuite_version: node-drain-fix-2024-05-15-142132-3258c691
      status:
      command: /home/dwilmes/.mtx/konstruxx/working/tests/testCHF/cnf-testsuite cert essential
      points: 100
      exit_code: 0
      items:
      - name: increase_decrease_capacity
        status: passed
        type: essential
        points: 100
      - name: volume_hostpath_not_found
        status: passed
        type: essential
        points: 100
      - name: node_drain
        status: passed
        type: essential
        points: 100
      - name: privileged_containers
        status: passed
        type: essential
        points: 100
      - name: non_root_containers
        status: failed
        type: essential
        points: 0
      - name: cpu_limits
        status: passed
        type: essential
        points: 100
      - name: memory_limits
        status: passed
        type: essential
        points: 100
      - name: hostpath_mounts
        status: passed
        type: essential
        points: 100
      - name: container_sock_mounts
        status: passed
        type: essential
        points: 100
      - name: selinux_options
        status: na
        type: essential
        points: 0
      - name: hostport_not_used
        status: passed
        type: essential
        points: 100
      - name: hardcoded_ip_addresses_in_k8s_runtime_configuration
        status: passed
        type: essential
        points: 100
      - name: latest_tag
        status: passed
        type: essential
        points: 100
      - name: log_output
        status: passed
        type: essential
        points: 100
      - name: specialized_init_system
        status: failed
        type: essential
        points: 0

taylor commented 1 month ago

@daniel-wilmes please open a new issue for

> However, I will say the cert command does not ever finish, which appears to be a separate issue from node drain. It may pertain to either sig_term_handled, zombie_handled, or specialized_init_system. The logging doesn't appear to indicate where we are stuck.

You can try disabling those with ~testname (e.g. ~sig_term_handled) to isolate which test needs to be investigated. Add that info to the new ticket you create.

cc: @Smitholi67

taylor commented 1 month ago

@martin-mat good catch on the kind in github actions. https://github.com/cnti-testcatalog/testsuite/pull/2024#issuecomment-2112309884

cnti-testcatalog / testsuite

node-drain: Fix link to experiment #2024

2026