aws-observability / helm-charts

The AWS Observability Helm Charts repository contains Helm charts that make it easy to set up the CloudWatch Agent and other collection agents, which collect telemetry data such as metrics, logs, and traces and send it to AWS monitoring services.
Apache License 2.0

Add Sagemaker instances to node-affinity #36

Closed sam6134 closed 5 months ago

sam6134 commented 6 months ago

Issue #, if available:

Description of changes:

This change adds SageMaker instances to the node affinity in the Helm chart so that monitoring can be enabled on SageMaker nodes.
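For readers unfamiliar with node affinity, the fragment below is a minimal sketch of the kind of rule involved, not the chart's actual template. It assumes the affinity matches on the standard `node.kubernetes.io/instance-type` label, which the SageMaker nodes in the test output below report as `ml.g5.xlarge`. The `g5.xlarge` entry and the placement of the stanza are illustrative assumptions.

```yaml
# Hypothetical pod-spec fragment; not copied from the chart.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Standard Kubernetes label carrying the node's instance type.
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                # Assumed pre-existing EC2 instance type, for illustration only.
                - g5.xlarge
                # SageMaker nodes carry an "ml."-prefixed instance type,
                # as seen in the node labels in the test output.
                - ml.g5.xlarge
```

With a rule of this shape, a DaemonSet pod is scheduled only on nodes whose instance-type label appears in `values`, so an `ml.`-prefixed type has to be listed explicitly before pods can land on SageMaker nodes.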

Testing: Manually updated the config for a SageMaker cluster.

miconeil@80a997366eb0 EpsilonDataStoreReaderService % kubectl get nodes --show-labels=true
NAME                           STATUS   ROLES    AGE   VERSION               LABELS
hyperpod-i-01b0c5ad8bcc02027   Ready    <none>   13h   v1.29.0-eks-5e0fdde   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=hyperpod-i-01b0c5ad8bcc02027,kubernetes.io/os=linux,node.kubernetes.io/instance-type=ml.g5.xlarge,sagemaker.amazonaws.com/cluster-name=jenna-test-gpu,sagemaker.amazonaws.com/instance-group-name=group1
hyperpod-i-09d28431dfd94e184   Ready    <none>   13h   v1.29.0-eks-5e0fdde   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=hyperpod-i-09d28431dfd94e184,kubernetes.io/os=linux,node.kubernetes.io/instance-type=ml.g5.xlarge,sagemaker.amazonaws.com/cluster-name=jenna-test-gpu,sagemaker.amazonaws.com/instance-group-name=group1

miconeil@80a997366eb0 EpsilonDataStoreReaderService % kubectl get all -n amazon-cloudwatch
NAME                                                                  READY   STATUS              RESTARTS   AGE
pod/amazon-cloudwatch-observability-controller-manager-65bcd4bxp28r   1/1     Running             0          11m
pod/cloudwatch-agent-4zcrg                                            1/1     Running             0          10h
pod/cloudwatch-agent-cb8sb                                            1/1     Running             0          10h
pod/dcgm-exporter-kmqc6                                               0/1     ContainerCreating   0          6s
pod/dcgm-exporter-r2nmh                                               0/1     ContainerCreating   0          6s
pod/fluent-bit-gssdn                                                  1/1     Running             0          10h
pod/fluent-bit-r6xng                                                  1/1     Running             0          10h

NAME                                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/amazon-cloudwatch-observability-webhook-service   ClusterIP   172.20.143.125   <none>        443/TCP                      10h
service/cloudwatch-agent                                  ClusterIP   172.20.37.233    <none>        4315/TCP,4316/TCP,2000/TCP   10h
service/cloudwatch-agent-headless                         ClusterIP   None             <none>        4315/TCP,4316/TCP,2000/TCP   10h
service/cloudwatch-agent-monitoring                       ClusterIP   172.20.11.163    <none>        8888/TCP                     10h
service/cloudwatch-agent-windows                          ClusterIP   172.20.190.75    <none>        4315/TCP,4316/TCP,2000/TCP   10h
service/cloudwatch-agent-windows-headless                 ClusterIP   None             <none>        4315/TCP,4316/TCP,2000/TCP   10h
service/cloudwatch-agent-windows-monitoring               ClusterIP   172.20.120.137   <none>        8888/TCP                     10h
service/dcgm-exporter-service                             ClusterIP   172.20.196.216   <none>        9400/TCP                     10h
service/neuron-monitor-service                            ClusterIP   172.20.48.129    <none>        8000/TCP                     10h

NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR              AGE
daemonset.apps/cloudwatch-agent           2         2         2       2            2           kubernetes.io/os=linux     10h
daemonset.apps/cloudwatch-agent-windows   0         0         0       0            0           kubernetes.io/os=windows   10h
daemonset.apps/dcgm-exporter              2         2         2       2            2           kubernetes.io/os=linux     10h
daemonset.apps/fluent-bit                 2         2         2       2            2           kubernetes.io/os=linux     10h
daemonset.apps/fluent-bit-windows         0         0         0       0            0           kubernetes.io/os=windows   10h
daemonset.apps/neuron-monitor             0         0         0       0            0           <none>                     10h

Metrics flowing: [Screenshot 2024-05-09 at 15 45 17]

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.