dstackai / dstack

dstack is an open-source alternative to Kubernetes, designed to simplify development, training, and deployment of AI across any cloud or on-prem. It supports NVIDIA, AMD, and TPU.
https://dstack.ai/docs
Mozilla Public License 2.0

Support AMD GPU metrics #1876

r4victor closed 16 hours ago

r4victor commented 17 hours ago

The metrics API and dstack stats implemented in #1827 collect metrics only for NVIDIA GPUs. Metrics for AMD GPUs should also be collected out of the box. Unlike nvidia-smi, which is always present in NVIDIA-supported Docker images, amd-smi may not be present. Still, it seems to be present in most production images; for example, it is available in the TGI ROCm image.
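
To illustrate the "may not be present" case, here is a rough sketch of how a collector could probe for amd-smi and skip AMD metrics when the tool is missing. This is not dstack's actual implementation: the function names `detect_gpu_tool` and `collect_amd_metrics` are hypothetical, and the `amd-smi metric --json` invocation is an assumption about the installed CLI version. The JSON layout differs between amd-smi releases, so the sketch returns the parsed output as-is without reading specific fields.

```python
import json
import shutil
import subprocess
from typing import Any, Callable, Optional


def detect_gpu_tool(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first vendor SMI tool found on PATH, or None (hypothetical helper)."""
    for tool in ("nvidia-smi", "amd-smi"):
        if which(tool) is not None:
            return tool
    return None


def collect_amd_metrics(
    which: Callable[[str], Optional[str]] = shutil.which,
    run: Callable[..., Any] = subprocess.run,
) -> Optional[Any]:
    """Return parsed amd-smi output, or None if the tool is missing or fails."""
    if which("amd-smi") is None:
        # The image ships without amd-smi: skip AMD metrics instead of failing.
        return None
    try:
        proc = run(
            # Assumed invocation; check `amd-smi metric --help` on the target image.
            ["amd-smi", "metric", "--json"],
            capture_output=True,
            text=True,
            timeout=10,
            check=True,
        )
        return json.loads(proc.stdout)
    except (OSError, subprocess.SubprocessError, ValueError):
        # Non-zero exit, timeout, or unparsable output: treat as "no metrics".
        return None
```

The `which` and `run` parameters are there only so the probe can be exercised without a GPU or the real binary; a caller would normally use the defaults.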

r4victor commented 17 hours ago

https://rocm.blogs.amd.com/software-tools-optimization/amd-smi-overview/README.html