Added default tolerations.

musa-asad commented 6 months ago

Description of changes: As indicated in https://github.com/aws/containers-roadmap/issues/2195, the Amazon CloudWatch Observability EKS add-on currently has no default tolerations for the cloudwatch-agent and fluent-bit daemonsets, so those pods are not scheduled on tainted nodes. This change adds default tolerations to the deployments and daemonsets and lets customers override them.

Test output:

Nodes:

% kubectl get nodes
NAME                             STATUS   ROLES    AGE   VERSION
ip-192-168-33-152.ec2.internal   Ready    <none>   8h    v1.29.3-eks-ae9a62a

Taint:

% kubectl taint nodes ip-192-168-33-152.ec2.internal key=value:NoSchedule
node/ip-192-168-33-152.ec2.internal tainted

When running helm upgrade --install amazon-cloudwatch-observability helm-charts/charts/amazon-cloudwatch-observability --values helm-charts/charts/amazon-cloudwatch-observability/values.yaml --set clusterName=my-cluster --set region=us-east-1 --set 'tolerations[0].operator=Exists' --set 'tolerations[0].effect=NoExecute':

% kubectl get pods -o wide
NAME                                                              READY   STATUS    RESTARTS   AGE   IP              NODE                             NOMINATED NODE   READINESS GATES
amazon-cloudwatch-observability-controller-manager-6df65767gwnt   1/1     Running   0          48m   192.168.38.37   ip-192-168-33-152.ec2.internal   <none>           <none>
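
The two tolerations flags in the command above can equivalently be written as a values-file fragment. This is a sketch that assumes the top-level tolerations key implied by the --set 'tolerations[0]...' paths; the file name override-values.yaml is hypothetical.

```yaml
# override-values.yaml (hypothetical file name)
# Equivalent of:
#   --set 'tolerations[0].operator=Exists' --set 'tolerations[0].effect=NoExecute'
tolerations:
  - operator: Exists   # no key given, so any taint key matches
    effect: NoExecute  # but only taints with the NoExecute effect are tolerated
```

Because this override tolerates only NoExecute taints, it does not match the key=value:NoSchedule taint applied to the node, which is consistent with the daemonset pods being absent from the output above.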

When running helm upgrade --install amazon-cloudwatch-observability helm-charts/charts/amazon-cloudwatch-observability --values helm-charts/charts/amazon-cloudwatch-observability/values.yaml --set clusterName=my-cluster --set region=us-east-1:

% kubectl get pods -o wide
NAME                                                              READY   STATUS    RESTARTS   AGE   IP               NODE                             NOMINATED NODE   READINESS GATES
amazon-cloudwatch-observability-controller-manager-6df65767gwnt   1/1     Running   0          50m   192.168.38.37    ip-192-168-33-152.ec2.internal   <none>           <none>
cloudwatch-agent-2s4td                                            1/1     Running   0          56s   192.168.49.133   ip-192-168-33-152.ec2.internal   <none>           <none>
dcgm-exporter-47gpz                                               1/1     Running   0          56s   192.168.46.197   ip-192-168-33-152.ec2.internal   <none>           <none>
fluent-bit-gdtkn                                                  1/1     Running   0          56s   192.168.33.152   ip-192-168-33-152.ec2.internal   <none>           <none>
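
The override mechanism could also be used with a toleration narrower than the defaults. A minimal sketch, assuming the same top-level tolerations key used in the --set flags earlier, that matches only the test taint key=value:NoSchedule:

```yaml
tolerations:
  - key: key            # taint key from the kubectl taint command in this test
    operator: Equal     # require an exact key/value match
    value: value
    effect: NoSchedule  # tolerate only the NoSchedule effect of that taint
```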

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

musa-asad commented 6 months ago

Could you add steps in the PR overview on how these changes were tested?

Adding.

musa-asad commented 6 months ago

Why is the indentation different for every YAML file? For Neuron monitor it's 2, but for the daemon-sets it's 6?

This was because the indentation of the relevant spec was greater for the other daemon-sets than for neuron monitor. For instance, volumes:

  volumes:

and

      volumes:

wonko commented 5 months ago

I believe this resulted in the daemonsets trying to schedule onto fargate nodes, which will never work. This breaks the addon upgrade, as the daemonset never rolls out completely.
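
One common way to keep a daemonset off Fargate is a node affinity rule on the eks.amazonaws.com/compute-type label that EKS sets on Fargate nodes. The fragment below is a generic pod spec sketch, not the fix adopted in this chart; whether the chart exposes an affinity value is an assumption to verify.

```yaml
# Generic pod template fragment (not taken from this chart)
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: eks.amazonaws.com/compute-type
              operator: NotIn   # also matches nodes that lack the label
              values:
                - fargate
```

With a rule like this, the scheduler never counts Fargate nodes as targets for the daemonset, so a rollout is not left waiting on pods that cannot start there.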

aws-observability / helm-charts

Added default tolerations. #41