cnti-testcatalog / testsuite

Cloud Native Telecom Initiative (CNTI) Test Catalog is a tool to check for and provide feedback on the use of K8s + cloud native best practices in networking applications and platforms
https://wiki.lfnetworking.org/display/LN/Test+Catalog
Apache License 2.0

Container ready status clarification during sig_term_handled check #2068

Closed sysarch-repo closed 3 weeks ago

sysarch-repo commented 3 weeks ago

Describe the bug: The sig_term_handled check fails because it skips a pod whose container is reported as not ready, even though both the container and the pod are up and running.

To Reproduce:
$ cnf-testsuite version
CNF TestSuite version: v1.2.0

The issue seems to be specific to a particular pod. When I remove the pod from the AUT, the test PASSes. On the other hand, there is nothing special about this pod among the other pods of the AUT. The only difference in the debug log is that the problematic pod shows restartCount = 1 while others show 0.

My question is why the container status logged by the testsuite shows ready set to false, while the containerStatuses entry in the pod's YAML status shows ready: true. Is this related to the restartCount, or how exactly does the check work?

  1. Run the sig_term_handled check, which fails
  2. Check the debug info and compare the pod resource status with the information logged by the testsuite
  3. Note the discrepancy in the container ready parameter:
INFO -- cnf-testsuite: pod_name: <rel-name>-dns-sig-7c44f78ff-hf2tl
INFO -- cnf-testsuite: wait_for_resource_availability kind, name: pod <rel-name>-dns-sig-7c44f78ff-hf2tl
INFO -- cnf-testsuite: resource_desired_is_available? command: kubectl get pod <rel-name>-dns-sig-7c44f78ff-hf2tl -o=yaml -n <namespace>
DEBUG -- cnf-testsuite: resource_desired_is_available? output: apiVersion: v1
kind: Pod
...
  containerStatuses:
  - containerID: containerd://15e7bd812a80eb9357522519f1f8ae271e31dff8698f87cffd477224267261b9
    image: <image>
    imageID: <imageId>
    lastState:
      terminated:
        containerID: containerd://bfc9125d916a23e1d3650b30b6d0769b2c49d2e8c78762d0f2b9524712e88f1a
        exitCode: 0
        finishedAt: "2024-06-06T21:56:34Z"
        reason: Completed
        startedAt: "2024-06-06T21:38:45Z"
    name: dns-sig
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: "2024-06-06T21:56:35Z"

INFO -- cnf-testsuite: container status: {"containerID" => "containerd://15e7bd812a80eb9357522519f1f8ae271e31dff8698f87cffd477224267261b9", "image" => "<image>", "imageID" => "<imageId>", "lastState" => {"terminated" => {"containerID" => "containerd://bfc9125d916a23e1d3650b30b6d0769b2c49d2e8c78762d0f2b9524712e88f1a", "exitCode" => 0, "finishedAt" => "2024-06-06T21:56:34Z", "reason" => "Completed", "startedAt" => "2024-06-06T21:38:45Z"}}, "name" => "dns-sig", "ready" => false, "restartCount" => 1, "started" => true, "state" => {"running" => {"startedAt" => "2024-06-06T21:56:35Z"}}}

INFO -- cnf-testsuite: not ready! skipping: containerStatuses pod:<release-name>-dns-sig-7c44f78ff-hf2tl container:dns-sig
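
The discrepancy can be cross-checked outside the testsuite by reading the ready flag directly from the pod status. The following is only a minimal Python sketch of that comparison, not the testsuite's own logic (which is written in Crystal); the helper name `not_ready_containers` and the trimmed sample document are assumptions made for illustration.

```python
import json


def not_ready_containers(pod: dict) -> list[str]:
    """Return the names of containers whose status does not report ready == True."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return [s["name"] for s in statuses if not s.get("ready", False)]


# Hypothetical sample, trimmed from the pod status shown in the debug output above.
sample_pod = json.loads("""
{"status": {"containerStatuses": [
  {"name": "dns-sig", "ready": true, "restartCount": 1, "started": true}
]}}
""")

# An empty list means every container in the pod reports ready.
print(not_ready_containers(sample_pod))
```

For a live pod, the input could come from `kubectl get pod <pod-name> -n <namespace> -o json`, parsed with `json.loads`. If this reports no not-ready containers while the testsuite still skips the pod, the two are likely looking at different objects or at different points in time.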

Expected behavior: If this is a bug, the container status should be correctly picked up by the CNTI testsuite.

Device (please complete the following information):
$ uname -a
Linux ip-10-0-17-74 6.5.0-1020-aws #20~22.04.1-Ubuntu SMP Wed May 1 16:10:50 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

martin-mat commented 3 weeks ago

Thanks for the report.

Obviously something unexpected is happening with the 'ready' attribute.

I reviewed the code but found no obvious cause of this behavior. The cause is probably somewhere here: https://github.com/cnti-testcatalog/testsuite/blob/v1.2.0/src/tasks/workload/microservice.cr#L482-L513

Can you attach complete debug logs from the sig_term_handled execution? If nothing is found there then perhaps I can add some more debug logs and ask you for re-execution.
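
When going through a complete debug log, it may help to first list every pod and container the check skipped. Below is a rough Python sketch that assumes the skip message has exactly the format of the "not ready! skipping" line quoted in the report; the function name `skipped_containers` is hypothetical and not part of the testsuite.

```python
import re

# Pattern assumed from the log line quoted in this issue:
# "not ready! skipping: containerStatuses pod:<pod> container:<container>"
SKIP_RE = re.compile(r"not ready! skipping: containerStatuses pod:(\S+) container:(\S+)")


def skipped_containers(log_text: str) -> list[tuple[str, str]]:
    """Return (pod, container) pairs for every skip message found in the log text."""
    return SKIP_RE.findall(log_text)


sample_log = (
    "INFO -- cnf-testsuite: not ready! skipping: containerStatuses "
    "pod:<release-name>-dns-sig-7c44f78ff-hf2tl container:dns-sig\n"
)

for pod, container in skipped_containers(sample_log):
    print(f"skipped pod={pod} container={container}")
```

Each reported pair can then be compared against the pod status that `kubectl` returns for the same pod name.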

sysarch-repo commented 3 weeks ago

@martin-mat I believe this issue is closely related to https://github.com/cnti-testcatalog/testsuite/issues/2062, where you provided your initial thoughts. It is the same AUT, and there seems to be something wrong with the selection of the objects used in the test. I will close this ticket, and we will follow up in #2062 with the specialized_init_system test, which is faster to rerun.