awslabs / data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
https://awslabs.github.io/data-on-eks/
Apache License 2.0
658 stars, 223 forks

feat: Add cloudwatch eks add on with enhanced monitoring for neuron #651

Closed by ratnopamc 2 months ago

ratnopamc commented 2 months ago

What does this PR do?

Adds the CloudWatch EKS add-on with enhanced Container Insights monitoring for Neuron.


Motivation

The Neuron Monitor container solution, built on Neuron Monitor, provides a monitoring framework for ML workloads running on Amazon EKS. This PR adds the CloudWatch EKS add-on to the trainium-inferentia blueprint, with enhanced Container Insights monitoring for Neuron.
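
For readers who want to see what installing the add-on amounts to at the API level, here is a minimal Python sketch. It is not the change made in this PR (the blueprint itself is Terraform, and the PR's diff is not shown here). It assumes the AWS-managed add-on name `amazon-cloudwatch-observability` and uses the boto3 EKS `create_addon` call. The helper names `build_addon_request` and `install_addon` are hypothetical, and no Neuron-specific configuration values are shown because the PR description does not list them.

```python
import json
from typing import Any, Dict, Optional

# Assumed name of the AWS-managed CloudWatch Observability add-on for EKS.
CLOUDWATCH_ADDON_NAME = "amazon-cloudwatch-observability"


def build_addon_request(
    cluster_name: str,
    configuration: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the EKS CreateAddon API (hypothetical helper)."""
    if not cluster_name:
        raise ValueError("cluster_name must be non-empty")
    request: Dict[str, Any] = {
        "clusterName": cluster_name,
        "addonName": CLOUDWATCH_ADDON_NAME,
    }
    # The API expects add-on configuration as a JSON string, not a dict.
    if configuration is not None:
        request["configurationValues"] = json.dumps(configuration)
    return request


def install_addon(cluster_name: str, region: str) -> None:
    """Install the add-on on an existing cluster (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    eks = boto3.client("eks", region_name=region)
    eks.create_addon(**build_addon_request(cluster_name))
```

Calling `install_addon("my-cluster", "us-west-2")` (both values are placeholders) would request the add-on with its default configuration. Any Container Insights or Neuron settings would go in through the `configuration` argument, using keys taken from the add-on's published configuration schema.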
