OCP-on-NERC / docs

possible goals/pov: IBM autopilot tools #66

Open schwesig opened 4 days ago

schwesig commented 4 days ago

Goal of this issue: preserve the emails, messages, and ideas on this topic. They may lead to a PoV or a goal; at the very least, they should not be forgotten and should stay open for discussion.

Starting point: https://github.com/IBM/autopilot

Starting Questions

status on 2024/10/08

Summary:

Baselining:

Autopilot can only run its checks before and after a job, not while the job is running (that feature is planned for the future). If this is enough for now, we can use it.

Open Source:

Yes, but it still relies on NVIDIA tools such as nvidia-smi and dcgmi, so it is not really a full alternative.

Limitations:

For now, jobs are not checked while they run, so errors cannot be detected in real time.

Integration:

It works fine with Prometheus and Grafana, so it fits into our system.
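
A minimal sketch of how Autopilot results could be read back from Prometheus, assuming the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The base URL `http://prometheus.example:9090` is a placeholder, and the metric name `autopilot_health_checks` is an assumption that is not confirmed anywhere in this thread; check the Autopilot docs before relying on it. The helpers only build the query URL and parse a response body, so they can be tried without a live cluster.

```python
import json
from urllib.parse import urlencode

# Assumption: placeholder URL, not the real NERC Prometheus endpoint.
PROMETHEUS_URL = "http://prometheus.example:9090"
# Assumption: metric name is not confirmed in this thread.
METRIC = "autopilot_health_checks"


def build_query_url(base_url, promql):
    """Return the URL for a Prometheus instant query."""
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_instant_vector(payload):
    """Extract (labels, value) pairs from an instant-query JSON response."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError("query failed: " + str(body.get("error", "unknown error")))
    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got " + data["resultType"])
    # Each sample carries its labels and a [timestamp, "value"] pair.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]
```

To use it, fetch `build_query_url(PROMETHEUS_URL, METRIC)` with any HTTP client and pass the response text to `parse_instant_vector`. Which labels come back depends on how Autopilot exports the metric.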

Setup:

Autopilot needs to run on every GPU node, and the NVIDIA tools must be installed there. The results can still be gathered.
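
A small preflight sketch for the tool requirement above. It is not part of Autopilot; it is a hypothetical helper that checks whether the two CLIs named in this thread (`nvidia-smi` and `dcgmi`) are on the `PATH` of the environment where the checks would run. The function name and the injectable `which` parameter are illustrative choices.

```python
import shutil

# Tool names taken from the summary above.
REQUIRED_TOOLS = ("nvidia-smi", "dcgmi")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("missing NVIDIA tools:", ", ".join(missing))
    else:
        print("all required NVIDIA tools found")
```

Passing `which` as a parameter keeps the lookup swappable, so the helper can be exercised on a machine without GPUs.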

Test Install

Image

Feedback Heidi

/CC @schwesig @computate @hpdempsey

computate commented 3 days ago

As a next step, I will integrate the provided Autopilot metrics and Grafana dashboards into NERC observability and then give a demo.