aws-observability / helm-charts

The AWS Observability Helm Charts repository contains Helm charts that make it easy to set up the CloudWatch Agent and other collection agents, which collect telemetry data such as metrics, logs, and traces and send it to AWS monitoring services.
Apache License 2.0

Added default tolerations. #41

Closed musa-asad closed 6 months ago

musa-asad commented 6 months ago

Description of changes: As indicated in https://github.com/aws/containers-roadmap/issues/2195, the Amazon CloudWatch Observability EKS add-on currently has no default tolerations for the cloudwatch-agent and fluent-bit daemonsets, so these daemonsets do not run on tainted nodes. I updated the deployments and daemonsets to have default tolerations and added the ability for customers to override them.

Test output:

Nodes:

% kubectl get nodes                                     
NAME                             STATUS   ROLES    AGE   VERSION
ip-192-168-33-152.ec2.internal   Ready    <none>   8h    v1.29.3-eks-ae9a62a

Taint:

% kubectl taint nodes ip-192-168-33-152.ec2.internal key=value:NoSchedule
node/ip-192-168-33-152.ec2.internal tainted

When running helm upgrade --install amazon-cloudwatch-observability helm-charts/charts/amazon-cloudwatch-observability --values helm-charts/charts/amazon-cloudwatch-observability/values.yaml --set clusterName=my-cluster --set region=us-east-1 --set 'tolerations[0].operator=Exists' --set 'tolerations[0].effect=NoExecute':

% kubectl get pods -o wide
NAME                                                              READY   STATUS    RESTARTS   AGE   IP              NODE                             NOMINATED NODE   READINESS GATES
amazon-cloudwatch-observability-controller-manager-6df65767gwnt   1/1     Running   0          48m   192.168.38.37   ip-192-168-33-152.ec2.internal   <none>           <none>
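
The two tolerations[0] flags passed with --set above can also be kept in a values file. The following is a minimal sketch, assuming the chart reads a top-level tolerations list (as the --set paths suggest); the file name override-values.yaml is hypothetical.

```yaml
# override-values.yaml (hypothetical file name)
# Equivalent to:
#   --set 'tolerations[0].operator=Exists' --set 'tolerations[0].effect=NoExecute'
clusterName: my-cluster
region: us-east-1
tolerations:
  - operator: Exists    # no key given, so any taint key matches
    effect: NoExecute   # but only taints with the NoExecute effect are tolerated
```

Because this toleration covers only NoExecute taints, it does not match the key=value:NoSchedule taint applied earlier. That is consistent with the output above, which lists no cloudwatch-agent, dcgm-exporter, or fluent-bit pods on the tainted node.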

When running helm upgrade --install amazon-cloudwatch-observability helm-charts/charts/amazon-cloudwatch-observability --values helm-charts/charts/amazon-cloudwatch-observability/values.yaml --set clusterName=my-cluster --set region=us-east-1:

% kubectl get pods -o wide
NAME                                                              READY   STATUS    RESTARTS   AGE   IP               NODE                             NOMINATED NODE   READINESS GATES
amazon-cloudwatch-observability-controller-manager-6df65767gwnt   1/1     Running   0          50m   192.168.38.37    ip-192-168-33-152.ec2.internal   <none>           <none>
cloudwatch-agent-2s4td                                            1/1     Running   0          56s   192.168.49.133   ip-192-168-33-152.ec2.internal   <none>           <none>
dcgm-exporter-47gpz                                               1/1     Running   0          56s   192.168.46.197   ip-192-168-33-152.ec2.internal   <none>           <none>
fluent-bit-gdtkn                                                  1/1     Running   0          56s   192.168.33.152   ip-192-168-33-152.ec2.internal   <none>           <none>
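
With no override, the chart defaults tolerate the test taint, so all three daemonset pods run. Customers who want something narrower than the defaults could override with a toleration that matches one specific taint. A sketch for the test taint used here, assuming the top-level tolerations key implied by the --set paths in the earlier command:

```yaml
# Hypothetical values override: tolerate only the taint created by
#   kubectl taint nodes <node> key=value:NoSchedule
tolerations:
  - key: "key"
    operator: Equal
    value: "value"
    effect: NoSchedule
```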

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

musa-asad commented 6 months ago

Could you add steps in the PR overview on how these changes were tested?

Adding.

musa-asad commented 6 months ago

Why is the indentation different for every YAML? For Neuron monitor it's 2, but for the daemon-sets it's 6?

This is because the relevant spec is indented more deeply in the other daemon-sets than in Neuron monitor. For instance, compare the indentation of volumes:

  volumes:

and

      volumes:

wonko commented 5 months ago

I believe this resulted in the daemonsets trying to schedule onto Fargate nodes, which will never work. This breaks the add-on upgrade, as the daemonset never rolls out completely.
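
One possible way to keep daemonset pods off Fargate while still using broad tolerations is a node affinity rule in the pod spec. This is only a sketch, not the fix adopted by the chart. It assumes Fargate nodes carry the label eks.amazonaws.com/compute-type=fargate, and where such a fragment would go in the chart templates is also an assumption.

```yaml
# Pod spec fragment (illustrative only)
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # NotIn also matches nodes that lack this label entirely,
            # so ordinary EC2 nodes remain eligible.
            - key: eks.amazonaws.com/compute-type
              operator: NotIn
              values:
                - fargate
```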