Label filtering on StatefulSets is bringing additional pods into the test

sysarch-repo commented 3 weeks ago

Describe the bug If there are multiple pods in the namespace of the AUT, tests like specialized_init_system or sig_term_handled are triggered multiple times for the same pod or the tests are triggered for pods of other objects that should not be applied the test. The problem seems to be with the label filtering on StatefulSets. The label filtering on Deployments works as expected.

To Reproduce Steps to reproduce the behavior:

$ cnf-testsuite version CNF TestSuite version: v1.2.0

Include one Deployment and StatefulSet object in the AUT. Set the pod selection and the pod labels as follows:


Deployment  release-dns-sig:
selector.matchLabels.app/pod-group: release-dns-sig
template.metadata.labels.l1: v1
template.metadata.labels.l2: v2
template.metadata.labels.app/pod-group: release-dns-sig
template.metadata.labels.l3: v3

StatefulSet release-dns-prov: selector.matchLabels.app/pod-group: release-dns-prov template.metadata.labels.l1: v4 template.metadata.labels.l2: v5 template.metadata.labels.app/pod-group: release-dns-prov template.metadata.labels.l3: v3

2. deploy the AUT with 1 replica of the release-dns-sig pod and 1 replica of the release-dns-prov pod
3. run specialized_init_system test and verify the test was executed twice for the  release-dns-sig pod
4. inspect the debug log

Deployment release-dns-sig (seems to be using selector.matchLabels in label filtering): INFO -- cnf-testsuite: resource kind: Deployment INFO -- cnf-testsuite: pods_by_resource name: release-dns-sig DEBUG -- cnf-testsuite: resource_labels kind: Deployment resource_name: release-dns-sig DEBUG -- cnf-testsuite: resource_labels: {"app/pod-group" => "release-dns-sig"} INFO -- cnf-testsuite: pods_by_resource labels: {"app/pod-group" => "release-dns-sig"} DEBUG -- cnf-testsuite: pods_by_label labels: {"app/pod-group" => "release-dns-sig"} ... pod selection INFO -- cnf-testsuite-specialized_init_system: Pod count for resource Deployment/release-dns-sig in ns: 1 <----- OK!

StatefulSet release-dns-prov (seems to be using template.metadata.labels in label filtering): INFO -- cnf-testsuite: resource kind: StatefulSet INFO -- cnf-testsuite: pods_by_resource name: release-dns-prov DEBUG -- cnf-testsuite: resource_labels kind: StatefulSet resource_name: release-dns-prov DEBUG -- cnf-testsuite: resource_labels: {"l1" => "v4", "l2" => "v5", "app/pod-group" => "release-dns-prov", "l3" => "v3"} INFO -- cnf-testsuite: pods_by_resource labels: {"l1" => "v4", "l2" => "v5", "app/pod-group" => "release-dns-prov", "l3" => "v3"} DEBUG -- cnf-testsuite: pods_by_label labels: {"l1" => "v4", "l2" => "v5", "app/pod-group" => "release-dns-prov", "l3" => "v3"} ... pod selection INFO -- cnf-testsuite-specialized_init_system: Pod count for resource StatefulSet/release-dns-prov in ns1: 2 <---- WRONG



As a result, the test is executed twice for the release-dns-sig Deployment which makes for example the sig_term_handled test FAIL (probably because the pod is restarted from the previous test run making the test logic consider the container as not ready).

**Expected behavior**
Label filtering on StatefulSets only selects the relevant pods, i.e. the selector.matchLabels (not the template.metadata.labels) are utilized in the pod selection.

**Device (please complete the following information):**
$ uname -a
Linux ip-10-0-17-74 6.5.0-1020-aws #20~22.04.1-Ubuntu SMP Wed May  1 16:10:50 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

martin-mat commented 3 weeks ago

Good point. However identifying resources deployed by a helm chart is probably non-trivial task. It needs to be analyzed/explored if/how this would be possible.

sysarch-repo commented 3 weeks ago

@martin-mat, Below are additional insights that point to an issue with the label filtering on StatefulSets.

I am deploying two objects as AUT:

Deployment/release-dns-sig
StatefulSet/release-dns-prov The number of replicas is 1 pod for each AUT object.

There are also other Deployments / StatefulSets deployed in the same namespace (without those, the AUT would not become ready and work properly).

For the Deployment, the test correctly logs: INFO -- cnf-testsuite-specialized_init_system: Pod count for resource Deployment/release-dns-sig in ns: 1

But for the StatefulSet,the test incorrectly determines: INFO -- cnf-testsuite-specialized_init_system: Pod count for resource StatefulSet/release-dns-prov in ns: 8

So instead of identifying one replica of StatefulSet/release-dns-prov the test checks 8 objects including itself, the already checked object Deployment/release-dns-sig, and 6 other objects in the namespace.

I have updated the title and the bug description accordingly and am staying tuned for any follow-up thoughts or support.

sysarch-repo commented 1 week ago

@martin-mat, is there anything else I can add to the ticket to move it forward? Currently, the issue is extremely extending the duration of the tests and I also believe that the sig_term_handled test fails because of the multiple involvement of the same pod in the test. Thanks a lot for your support.

martin-mat commented 1 week ago

@sysarch-repo sure you can; this is open community project, you are free and welcome to figure out how to design a fix and and go ahead with implementation. If you are not into digging into the code, you need to wait until someone has some time to work on it. The fix of this does not look to be trivial and needs some time. Meanwhile you can think about a workaround - depending on your use case, for example splitting to 2 different namespaces.

Other than that that: can you get in contact with me via LFN slack? https://github.com/cnti-testcatalog/testsuite?tab=readme-ov-file#communication-and-community-meetings

martin-mat commented 1 week ago

the issue might point here https://github.com/cnf-testsuite/kubectl_client/blob/v1.0.6/kubectl_client.cr#L1310

sysarch-repo commented 1 week ago

@martin-mat thanks a lot. Very helpful and much appreciated. IMO the simple fix would be adding an if statement for statefulset to use the labels in the pod selector - exactly as done for the deployment type.

    def self.resource_spec_labels(kind : String, resource_name : String, namespace : String | Nil = nil) : JSON::Any
      Log.debug { "resource_labels kind: #{kind} resource_name: #{resource_name}" }
      if kind.downcase == "service"
        resp = resource(kind, resource_name, namespace: namespace).dig?("spec", "selector")
      elsif kind.downcase == "deployment" 
        resp = resource(kind, resource_name, namespace: namespace).dig?("spec", "selector", "matchLabels")
      elsif kind.downcase == "statefulset" 
        resp = resource(kind, resource_name, namespace: namespace).dig?("spec", "selector", "matchLabels")
      else
        resp = resource(kind, resource_name, namespace: namespace).dig?("spec", "template", "metadata", "labels")
      end
      Log.debug { "resource_labels: #{resp}" }
      if resp
        resp
      else
        JSON.parse(%({}))
      end
    end

I am not skilled wrt git merges (yet) and crystal but I am happy to join the team on Slack. Thanks again!

martin-mat commented 1 week ago

@sysarch-repo can you please test this patch on your cnf? https://github.com/cnf-testsuite/kubectl_client/pull/14

sysarch-repo commented 1 week ago

@martin-mat I can only test by downloading a new CNTI testsuite tar gz, i.e. here is how I install it on AWS EC2: curl -sLO "https://github.com/cnti-testcatalog/testsuite/releases/download/v1.2.0/cnf-testsuite-v1.2.0.tar.gz" I assume for the test you ask I need to master a different procedure but I have no skills / experience with that. Thoughts?

martin-mat commented 1 week ago

Thoughts?

joining the lfn slack for a quicker support?

martin-mat commented 3 days ago

The issue is addressed by cnf-testsuite/kubectl_client#14 and #2085 and released as a part of v1.3.0.

cnti-testcatalog / testsuite

Label filtering on StatefulSets is bringing additional pods into the test #2062