aws-samples / eks-anywhere-addons

https://aws-samples.github.io/eks-anywhere-addons/
MIT No Attribution

Komodor EKS anywhere add-on #117

Closed nirbenator closed 11 months ago

nirbenator commented 11 months ago

Komodor EKS Anywhere Add-on

Description of changes:

Adding Komodor helm onboarding to EKS-A common

Installation instructions:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

elamaran11 commented 11 months ago

@nirbenator Thank you for submitting a PR to the Conformance framework. The functional test job submitted does not meet the criteria based on our requirements. Please check our Functional Job requirements page for more details and resubmit a qualified functional job that meets our requirements. Also, please share the secret for Komodor via a secured channel. Please reach out with any questions.

nirbenator commented 11 months ago

@elamaran11 - thanks for your reply:)

elamaran11 commented 11 months ago

@nirbenator I checked the test job again; it is still a technical test that only verifies the watcher pod is running, so it does not qualify as a functional test. A functional test should satisfy the functional specifications of the product under test, not its technical specifications. Please check our Functional Job requirements page for more details and resubmit a qualified functional job that meets our requirements. Refer to the example here for a functional test job.

elamaran11 commented 11 months ago

@nirbenator One more piece of feedback, on the ExternalSecret: your ExternalSecret script requires two AWS secrets, which is unnecessary. Please follow the approach below for referencing a single AWS Secrets Manager secret via property for multiple secure strings.

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: komodor-external-secret
  namespace: komodor
spec:
  refreshInterval: 1m
  secretStoreRef:
    name: eksa-secret-store #The secret store name we have just created.
    kind: ClusterSecretStore
  target:
    name: k8s-watcher-secret-flux
  data:
  - secretKey: k8s-watcher-apiKey 
    remoteRef:
      key: komodor-secrets
      property: k8s-watcher-apiKey
  - secretKey: k8s-watcher-clusterName 
    remoteRef:
      key: komodor-secrets
      property: k8s-watcher-clusterName 
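
For illustration only (not part of the submission above): the ExternalSecret materializes a Kubernetes Secret named k8s-watcher-secret-flux in the komodor namespace, which a workload can then consume, for example as environment variables. The pod below is a hypothetical consumer, purely to show the data flow.

apiVersion: v1
kind: Pod
metadata:
  name: secret-consumer-example   # hypothetical pod, for illustration only
  namespace: komodor
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "test -n \"$K8S_WATCHER_APIKEY\" && echo secret is available"]
    env:
    - name: K8S_WATCHER_APIKEY
      valueFrom:
        secretKeyRef:
          name: k8s-watcher-secret-flux   # target Secret created by the ExternalSecret above
          key: k8s-watcher-apiKey
    - name: K8S_WATCHER_CLUSTERNAME
      valueFrom:
        secretKeyRef:
          name: k8s-watcher-secret-flux
          key: k8s-watcher-clusterName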
nirbenator commented 11 months ago

@elamaran11 - Added an agent functional connectivity test, and used your reference for the ExternalSecrets.

I have a question regarding updating the chart's version: if we update the chart version, how do we notify you? Would you like to receive a PR every time we update the chart version?

elamaran11 commented 11 months ago

@nirbenator The ExternalSecret change looks good to me. The test job, however, still looks like a technical test rather than a functional test; agent connectivity is a technical check, not a functional one. As I discussed with Amit, we want to see at least one specific observability component flowing to your SaaS platform, to show that the product is functionally working. The one you have now still does not qualify as a valid functional test.

The Conformance framework (Conformitron) is not one-time testing but continuous, consistent testing that runs across different environments on an ongoing basis. The ISV is therefore expected to manage the versions of their Helm charts and releases so the product does not fail in our environments. We will reach out to the ISV via various channels if the product fails to install or a test fails. We also appreciate ISVs who proactively submit chart version changes, like NewRelic. Happy to answer any other questions you have.

elamaran11 commented 11 months ago

@nirbenator The test job now looks very reasonable and aligns with the functional requirements. Could you please confirm that you have completely tested and validated the test job in your environment? Once you confirm, I can start validations in my different EKS-A environments.

elamaran11 commented 11 months ago

@nirbenator The test job fails in our EKS Local Cluster on Outposts environment. Please check the logs from the test job below. Once the job works here, I can test in the other environments:

❯ k logs komodor-tester-7bzh7 -n komodor                                                     ─╯
1. Checking readiness probe for watcher
Waiting
Found watcher pod: komodor-k8s-watcher-565578746d-ljbv6
2. Creating test configmap
Error from server (NotFound): error when replacing "STDIN": configmaps "komodor-test-configmap" not found
3. Checking if komodor identified the configmap
Waiting for Komodor to identify the configmap
jq: error (at <stdin>:0): null (null) has no keys
latest_value in komodor =
latest_value in configmap = 1691597412
Waiting for Komodor to identify the configmap
jq: error (at <stdin>:0): null (null) has no keys
latest_value in komodor =
latest_value in configmap = 1691597412
Waiting for Komodor to identify the configmap
jq: error (at <stdin>:0): null (null) has no keys
latest_value in komodor =
latest_value in configmap = 1691597412
Waiting for Komodor to identify the configmap
jq: error (at <stdin>:0): null (null) has no keys
latest_value in komodor =
latest_value in configmap = 1691597412
Waiting for Komodor to identify the configmap
jq: error (at <stdin>:0): null (null) has no keys
latest_value in komodor =
latest_value in configmap = 1691597412
Waiting for Komodor to identify the configmap
jq: error (at <stdin>:0): null (null) has no keys
latest_value in komodor =
latest_value in configmap = 1691597412
Waiting for Komodor to identify the configmap
jq: error (at <stdin>:0): null (null) has no keys
latest_value in komodor =
latest_value in configmap = 1691597412
elamaran11 commented 11 months ago

@nirbenator One more observation about the Komodor pods while running in the local cluster: the 3 pods scheduled on control plane nodes are stuck in Init state and never progress. I believe the pods are assigned to control plane nodes that are in NotReady state, which is by design for EKS local clusters on Outposts, so the pods never start there; they should not be scheduled onto control plane nodes at all. Please check on this and let us know.

❯ kgp -n komodor                                                                             ─╯
NAME                                   READY   STATUS     RESTARTS   AGE
komodor-k8s-watcher-565578746d-ljbv6   1/1     Running    0          7m2s
komodor-k8s-watcher-daemon-2t8zn       0/1     Init:0/1   0          7m2s
komodor-k8s-watcher-daemon-5tgq4       1/1     Running    0          7m2s
komodor-k8s-watcher-daemon-jfgwh       0/1     Init:0/1   0          7m2s
komodor-k8s-watcher-daemon-rcj2f       1/1     Running    0          7m2s
komodor-k8s-watcher-daemon-t8422       0/1     Init:0/1   0          7m2s
komodor-k8s-watcher-daemon-tfqqf       1/1     Running    0          7m2s
komodor-tester-7bzh7                   1/1     Running    0          5m6s
❯ kgno                                                                                       ─╯
NAME                                       STATUS     ROLES           AGE   VERSION
ip-10-0-4-147.us-west-2.compute.internal   Ready      <none>          8d    v1.27.3-eks-a5565ad
ip-10-0-4-210.us-west-2.compute.internal   NotReady   control-plane   8d    v1.27.1-eks-61789d8
ip-10-0-4-233.us-west-2.compute.internal   NotReady   control-plane   8d    v1.27.1-eks-61789d8
ip-10-0-4-99.us-west-2.compute.internal    Ready      <none>          8d    v1.27.3-eks-a5565ad
ip-10-0-5-214.us-west-2.compute.internal   NotReady   control-plane   8d    v1.27.1-eks-61789d8
ip-10-0-5-238.us-west-2.compute.internal   Ready      <none>          8d    v1.27.3-eks-a5565ad
amit9192 commented 11 months ago

@elamaran11 It seems you are correct; these pods are scheduled on those control-plane nodes. If I understand you correctly, they are NotReady by design, which means they are not supposed to allow any pods to run on them. Should we just update the DaemonSet configuration to not run on nodes of that type/role?

elamaran11 commented 11 months ago

@amit9192 That is exactly right for EKS Local Clusters (on Outposts). The pods do not necessarily need to run there unless they collect events from the control plane. In deployment models other than EKS in the cloud, the control plane is also managed by the user, so if you think the pods have no need to run on the control plane, you should update your deployment configuration accordingly. Please see the labels and annotations on a control plane node:

Name:               ip-10-0-4-210.us-west-2.compute.internal
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.large
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-west-2
                    failure-domain.beta.kubernetes.io/zone=us-west-2b
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-4-210.us-west-2.compute.internal
                    kubernetes.io/os=linux
                    node-role.eks-local.amazonaws.com/control-plane=
                    node-role.kubernetes.io/control-plane=
                    node.kubernetes.io/exclude-from-external-load-balancers=
                    node.kubernetes.io/instance-type=m5.large
                    topology.kubernetes.io/region=us-west-2
                    topology.kubernetes.io/zone=us-west-2b
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.0.4.210
                    cluster.x-k8s.io/cluster-name: elamaras-conformitron-rover-dont-delete
                    cluster.x-k8s.io/cluster-namespace: aws-601017151385-elamaras-conformitron-rover-dont-e3c11427
                    cluster.x-k8s.io/labels-from-machine:
                    cluster.x-k8s.io/machine: elamaras-conformitron-rover-dont-delete-control-plane-xlxsl
                    cluster.x-k8s.io/owner-kind: KubeadmControlPlane
                    cluster.x-k8s.io/owner-name: elamaras-conformitron-rover-dont-delete-control-plane
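
For illustration, a minimal sketch (not Komodor's actual chart templates) of how a DaemonSet can be kept off control-plane nodes by matching the node-role.kubernetes.io/control-plane label shown above; the names and image below are placeholders.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: komodor-k8s-watcher-daemon
  namespace: komodor
spec:
  selector:
    matchLabels:
      app: k8s-watcher-daemon
  template:
    metadata:
      labels:
        app: k8s-watcher-daemon
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist   # skip nodes carrying the control-plane role label
      containers:
      - name: k8s-watcher-daemon
        image: <daemon-image>            # placeholder for the chart's actual image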
amit9192 commented 11 months ago

@elamaran11 Thanks for the explanation, I did see the labels. The DaemonSet pods are used to collect metrics from the nodes; I'm debating whether or not we should collect control plane metrics as well. WDYT?

elamaran11 commented 11 months ago

@amit9192 If the product is capable of collecting control plane metrics that customers can use, and that is one of your differentiating features, then by all means keep it. The caveat is that the pod sits in Init state and cannot deploy on control plane nodes in the Local Cluster, probably because of the NotReady state. In that case we would need to exclude Komodor from Local Cluster (or, in Local Cluster alone, you exclude the control plane piece) and submit for the three other deployment models (bare metal, vSphere, and Snowball), like our NewRelic partner. These are some options to explore, but I will leave it to you to decide how to position the conformance validation to best suit your marketing and product needs. @shapirov103 Thoughts?

amit9192 commented 11 months ago

Thanks for the detailed response. I'll take it up with @nirbenator tomorrow morning and we'll loop back here. My general thought is to drop it.

nirbenator commented 11 months ago

@elamaran11 Hey! FYI, I fixed both the issue with the initial failure to create the configmap and the control plane scheduling for the DaemonSet.

elamaran11 commented 11 months ago

@nirbenator The job completed, but I still see some errors in the logs. Do you need to introduce any time delays? I would also recommend converting the Job to a CronJob so it runs on a schedule. Once you fix this, I will run it on the rest.

❯ k logs komodor-tester-2f6pv -n komodor                                                     ─╯
1. Checking readiness probe for watcher
Waiting
Found watcher pod: komodor-k8s-watcher-565578746d-ljbv6
2. Creating test configmap
Error from server (NotFound): configmaps "komodor-test-configmap" not found
configmap/komodor-test-configmap created
3. Checking if komodor identified the configmap
Waiting for Komodor to identify the configmap
jq: error (at <stdin>:0): null (null) has no keys
latest_value in komodor =
latest_value in configmap = 1691682200
Waiting for Komodor to identify the configmap
latest_value in komodor = "1691682200\n"
latest_value in configmap = 1691682200
elamaran11 commented 11 months ago

@nirbenator @amit9192 While your team continues to make the recommended changes above, I tried to test the add-on on EKS-A on bare metal and I'm seeing the issue below: pods stuck in an AppArmor status. When I bounced the pods they moved to Running, but we don't want that experience for customers. Please check this and let me know.

❯ kga -n komodor                                                                             ─╯
NAME                                       READY   STATUS     RESTARTS   AGE
pod/komodor-k8s-watcher-68558787c7-dpk5n   1/1     AppArmor   0          4m25s
pod/komodor-k8s-watcher-daemon-gxb65       1/1     AppArmor   0          4m25s
pod/komodor-k8s-watcher-daemon-r9vng       1/1     Running    0          4m25s
pod/komodor-k8s-watcher-daemon-whwxt       1/1     Running    0          4m25s
pod/komodor-tester-ssprw                   0/1     Completed   0          3m32s

NAME                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/komodor-k8s-watcher-daemon   3         3         3       3            3           <none>          4m25s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/komodor-k8s-watcher   1/1     1            1           4m25s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/komodor-k8s-watcher-68558787c7   1         1         1       4m25s

NAME                       COMPLETIONS   DURATION   AGE
job.batch/komodor-tester   0/1           57s        57s
❯ k logs komodor-tester-ssprw -n komodor                                                     ─╯
1. Checking readiness probe for watcher
Waiting
Found watcher pod: komodor-k8s-watcher-68558787c7-dpk5n
2. Creating test configmap
Error from server (NotFound): configmaps "komodor-test-configmap" not found
configmap/komodor-test-configmap created
3. Checking if komodor identified the configmap
Waiting for Komodor to identify the configmap
latest_value in komodor = "1691682200\n"
latest_value in configmap = 1691762487
Waiting for Komodor to identify the configmap
latest_value in komodor = "1691682200\n"
latest_value in configmap = 1691762487
Waiting for Komodor to identify the configmap
latest_value in komodor = "1691682200\n"
latest_value in configmap = 1691762487
Waiting for Komodor to identify the configmap
latest_value in komodor = "1691682200\n"
latest_value in configmap = 1691762487
Waiting for Komodor to identify the configmap
latest_value in komodor = "1691762487\n"
latest_value in configmap = 1691762487
NAME                                       READY   STATUS      RESTARTS   AGE
pod/komodor-k8s-watcher-68558787c7-5svh2   1/1     Running     0          24s
pod/komodor-k8s-watcher-daemon-5jt7h       1/1     Running     0          33s
pod/komodor-k8s-watcher-daemon-r9vng       1/1     Running     0          9m11s
pod/komodor-k8s-watcher-daemon-whwxt       1/1     Running     0          9m11s
pod/komodor-tester-ssprw                   0/1     Completed   0          5m43s
nirbenator commented 11 months ago

@elamaran11 I introduced a better sleep mechanism that waits 30 seconds before querying Komodor's backend, which should avoid jq errors like the one you hit.

I don't understand the CronJob requirement; all the other tests are run as a Job. Can you please provide a reference or an example? How often do you want the test to run?

Regarding the AppArmor pod status: this is a first, and it is also quite strange, since the test validates that the k8s-watcher is in "Running" status before proceeding with the functional checks. How can I reproduce it? Are there any events for this pod?

elamaran11 commented 11 months ago

@nirbenator Thank you for your response and for the job changes; I will test them once you convert the Job to a CronJob. Please check this NewRelic PR for a CronJob that runs each day. We are making this mandatory for all test jobs because, as I said, this is an ongoing activity and not a one-time validation.

I don't have any thoughts on how to reproduce the AppArmor issue, because I only saw it in the bare metal environment and it didn't recur once I bounced the pod. I don't know if it is an environment issue; I will test in vSphere and Snowball too once you have the CronJob changes. From your end, you could try deploying via Flux pointing at this repo into EKS a few times to see if you can reproduce it. Thanks for your patience; we can get this to the finish line soon.
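
For reference, a minimal sketch of wrapping the existing tester Job in a CronJob with a daily schedule, in line with the requirement above; the image is a placeholder for whatever the current Job already uses (the linked NewRelic PR remains the authoritative example).

apiVersion: batch/v1
kind: CronJob
metadata:
  name: komodor-tester-cron
  namespace: komodor
spec:
  schedule: "0 0 * * *"             # run once a day
  concurrencyPolicy: Forbid         # never overlap test runs
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: komodor-tester
            image: <tester-image>   # placeholder for the existing functional test image
            # command/args of the existing functional test script go here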

elamaran11 commented 11 months ago

@nirbenator Thank you for all the changes (the CronJob, tweaking the sleep time, etc.). I was able to successfully run the Komodor product with the functional job on three deployment models: bare metal, Local Cluster, and vSphere. Please check the job log below. I did not see the AppArmor issue on any of the three when I retested, so I think it was transient. Thanks again. I should be able to get to Snowball by tomorrow and close out the validation as well. Thanks for your patience.

❯ k logs komodor-tester-cron-001-59wns -n komodor
1. Checking readiness probe for watcher
Waiting
Found watcher pod: komodor-k8s-watcher-565578746d-p4smd
2. Creating test configmap
configmap/komodor-test-configmap created
3. Checking if komodor identified the configmap
Configmap value matches the desired timestamp. Exiting with code 0.
elamaran11 commented 11 months ago

@nirbenator Everything looks good and we are merging the PR. Please check for Komodor in all 4 deployment models in the Validated Partners folder.