Closed sam6134 closed 5 months ago
Issue #, if available:
Description of changes:
This change adds the SageMaker instances to the node affinity in the Helm chart so that monitoring can be enabled for SageMaker instances.
Testing: Manually updated the config for a SageMaker cluster.
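For illustration only, a node-affinity rule of this kind could look like the fragment below. It is a minimal sketch using the standard Kubernetes `nodeAffinity` schema, not the chart's actual diff: the label key and the `ml.g5.xlarge` value are taken from the node labels in the testing output, while the placement under a top-level `affinity` key and the use of a required (rather than preferred) rule are assumptions.

```yaml
# Hypothetical sketch: the real keys and values in the Helm chart may differ.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Label key as reported by `kubectl get nodes --show-labels`
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                # SageMaker HyperPod instance type seen in the test cluster
                - ml.g5.xlarge
```

With a rule like this, a DaemonSet pod is scheduled only on nodes whose instance-type label matches one of the listed values.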
miconeil@80a997366eb0 EpsilonDataStoreReaderService % kubectl get nodes --show-labels=true
NAME                           STATUS   ROLES    AGE   VERSION               LABELS
hyperpod-i-01b0c5ad8bcc02027   Ready    <none>   13h   v1.29.0-eks-5e0fdde   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=hyperpod-i-01b0c5ad8bcc02027,kubernetes.io/os=linux,node.kubernetes.io/instance-type=ml.g5.xlarge,sagemaker.amazonaws.com/cluster-name=jenna-test-gpu,sagemaker.amazonaws.com/instance-group-name=group1
hyperpod-i-09d28431dfd94e184   Ready    <none>   13h   v1.29.0-eks-5e0fdde   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=hyperpod-i-09d28431dfd94e184,kubernetes.io/os=linux,node.kubernetes.io/instance-type=ml.g5.xlarge,sagemaker.amazonaws.com/cluster-name=jenna-test-gpu,sagemaker.amazonaws.com/instance-group-name=group1

miconeil@80a997366eb0 EpsilonDataStoreReaderService % kubectl get all -n amazon-cloudwatch
NAME                                                                  READY   STATUS              RESTARTS   AGE
pod/amazon-cloudwatch-observability-controller-manager-65bcd4bxp28r   1/1     Running             0          11m
pod/cloudwatch-agent-4zcrg                                            1/1     Running             0          10h
pod/cloudwatch-agent-cb8sb                                            1/1     Running             0          10h
pod/dcgm-exporter-kmqc6                                               0/1     ContainerCreating   0          6s
pod/dcgm-exporter-r2nmh                                               0/1     ContainerCreating   0          6s
pod/fluent-bit-gssdn                                                  1/1     Running             0          10h
pod/fluent-bit-r6xng                                                  1/1     Running             0          10h

NAME                                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/amazon-cloudwatch-observability-webhook-service   ClusterIP   172.20.143.125   <none>        443/TCP                      10h
service/cloudwatch-agent                                  ClusterIP   172.20.37.233    <none>        4315/TCP,4316/TCP,2000/TCP   10h
service/cloudwatch-agent-headless                         ClusterIP   None             <none>        4315/TCP,4316/TCP,2000/TCP   10h
service/cloudwatch-agent-monitoring                       ClusterIP   172.20.11.163    <none>        8888/TCP                     10h
service/cloudwatch-agent-windows                          ClusterIP   172.20.190.75    <none>        4315/TCP,4316/TCP,2000/TCP   10h
service/cloudwatch-agent-windows-headless                 ClusterIP   None             <none>        4315/TCP,4316/TCP,2000/TCP   10h
service/cloudwatch-agent-windows-monitoring               ClusterIP   172.20.120.137   <none>        8888/TCP                     10h
service/dcgm-exporter-service                             ClusterIP   172.20.196.216   <none>        9400/TCP                     10h
service/neuron-monitor-service                            ClusterIP   172.20.48.129    <none>        8000/TCP                     10h

NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR               AGE
daemonset.apps/cloudwatch-agent           2         2         2       2            2           kubernetes.io/os=linux      10h
daemonset.apps/cloudwatch-agent-windows   0         0         0       0            0           kubernetes.io/os=windows    10h
daemonset.apps/dcgm-exporter              2         2         2       2            2           kubernetes.io/os=linux      10h
daemonset.apps/fluent-bit                 2         2         2       2            2           kubernetes.io/os=linux      10h
daemonset.apps/fluent-bit-windows         0         0         0       0            0           kubernetes.io/os=windows    10h
daemonset.apps/neuron-monitor             0         0         0       0            0           <none>                      10h
Metrics flowing -
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.