NTHU-LSALAB / KubeShare

Share GPU between Pods in Kubernetes
Apache License 2.0

Prometheus metric gpu_capacity is not found. #23

Closed (jungyh0218 closed this issue 1 year ago)

jungyh0218 commented 1 year ago

I checked and found that my Prometheus does not export the metric 'gpu_capacity', and I suspect that is why my pods can never be scheduled. How can I query gpu_capacity? I installed Prometheus via NVIDIA DeepOps: https://github.com/NVIDIA/deepops

justin0u0 commented 1 year ago

Hi @jungyh0218, the kubeshare-collector component uses NVML to get GPU information. Do you see any warnings or errors in the collector's log? It is typically located at /kubeshare/log/kubeshare-collector on each of your GPU nodes.

jungyh0218 commented 1 year ago

Thank you @justin0u0 for your explanation. Are there any published papers about the new features of KubeShare 2.0? I've read the paper published at HPDC '20, and it seems to cover version 1.0.

StarCoral commented 1 year ago

Hi @jungyh0218, you could refer to this paper. It includes an introduction to KubeShare 2.0.

jungyh0218 commented 1 year ago

Sorry for the late reply. The cause of the issue was very simple: I had not enabled the kubeshare-aggregator and kubeshare-collector endpoints in Prometheus. They have to be added as custom entries in the /etc/prometheus/prometheus.yml file.
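
For anyone hitting the same problem, here is a minimal sketch of how one might check whether Prometheus actually has the gpu_capacity metric after editing prometheus.yml. It uses the standard Prometheus HTTP API instant-query endpoint (/api/v1/query). The base URL http://localhost:9090 is an assumption (the Prometheus default port) and may differ in a DeepOps install, and the helper names are hypothetical, not part of KubeShare.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, metric: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": metric})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def extract_samples(payload: dict) -> list:
    """Return (labels, value) pairs from an instant-query response.

    An empty list means Prometheus holds no series for the metric,
    which is what a missing scrape target would look like.
    """
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    results = payload.get("data", {}).get("result", [])
    # Each instant-vector sample looks like:
    # {"metric": {<labels>}, "value": [<timestamp>, "<value as string>"]}
    return [(r["metric"], r["value"][1]) for r in results]


def query_metric(base_url: str, metric: str, timeout: float = 5.0) -> list:
    """Run the instant query against a live Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, metric), timeout=timeout) as resp:
        return extract_samples(json.load(resp))


if __name__ == "__main__":
    # Assumed address; replace with the address of your Prometheus server.
    samples = query_metric("http://localhost:9090", "gpu_capacity")
    if not samples:
        print("gpu_capacity not found: check the scrape targets in prometheus.yml")
    for labels, value in samples:
        print(labels, value)
```

If the query returns nothing, the Status > Targets page of the Prometheus web UI shows whether the kubeshare endpoints are being scraped at all.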