AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0
1.39k stars 309 forks

Does this GPU sharing plugin support monitoring with dcgm-exporter? #211

Open db-root opened 1 year ago

db-root commented 1 year ago

Kubernetes version: v1.23.16

nvidia-docker info

Client: Docker Engine - Community
 Version: 24.0.2
 Context: default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
   Version: v0.10.5
   Path: /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
   Version: v2.18.1
   Path: /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 110
  Running: 56
  Paused: 0
  Stopped: 54
 Images: 40
 Server Version: 20.10.24
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: 3dce8...
 runc version: v1.1.7-0-g860f061
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-150-generic
 Operating System: Ubuntu 20.04.6 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 48
 Total Memory: 125.6GiB
 Name: node01
 ID:
 Docker Root Dir: /app/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

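I deployed gpushare on the cluster for GPU sharing, and I use dcgm-exporter (https://github.com/NVIDIA/dcgm-exporter) for monitoring. However, Prometheus shows no values for the GPU utilization metric, and I cannot monitor the GPU resource utilization of individual pods. Has anyone used this setup? Any help would be appreciated. [image]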

binz123 commented 1 year ago

Same question, +1

fenwuyaoji commented 1 year ago

I have received your email and will review it and reply as soon as possible. Thank you. 王鑫

ZhangSetSail commented 1 year ago

Same question, +1

ZhangSetSail commented 1 year ago

Too few metrics are being collected at the moment. How can I get the temperature and power metrics? [image]
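For anyone debugging the same symptoms, a first step is to check what the exporter endpoint actually exposes before looking at Prometheus. The following is a minimal Python sketch, not a confirmed fix for this issue. It parses Prometheus text-format output and reports which of the wanted metrics are absent and which carry no `pod` label. The metric names (`DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_GPU_TEMP`, `DCGM_FI_DEV_POWER_USAGE`) are assumed from dcgm-exporter's default counter list, the `pod` label name is an assumption about the exporter's Kubernetes mode, and `SAMPLE` is made-up illustrative text, not output from the cluster described above.

```python
import re
import urllib.request

# Assumed metric names (from dcgm-exporter's default counter list).
WANTED = (
    "DCGM_FI_DEV_GPU_UTIL",     # GPU utilization
    "DCGM_FI_DEV_GPU_TEMP",     # GPU temperature
    "DCGM_FI_DEV_POWER_USAGE",  # power draw
)

# Simplified matchers for "name{labels} value" lines. This sketch does not
# handle a '}' character inside a label value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url):
    """Download the raw metrics text from an exporter endpoint (URL is yours to supply)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def missing_metrics(text, wanted=WANTED):
    """Names from `wanted` that have no sample at all in the text."""
    present = {name for name, _, _ in parse_samples(text)}
    return sorted(name for name in wanted if name not in present)


def metrics_without_label(text, label="pod", wanted=WANTED):
    """Wanted metrics that have at least one sample lacking the given label."""
    lacking = {
        name
        for name, labels, _ in parse_samples(text)
        if name in wanted and label not in labels
    }
    return sorted(lacking)


# Illustrative input only: utilization without a pod label, temperature with
# one, and no power metric at all.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node01"} 37
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node01",pod="demo"} 54
"""

if __name__ == "__main__":
    print("missing:", missing_metrics(SAMPLE))
    print("no pod label:", metrics_without_label(SAMPLE))
```

To run this against a live exporter, pass the result of `fetch_metrics("http://<node>:9400/metrics")` instead of `SAMPLE`; port 9400 is assumed to be the exporter's listen port and may differ in your deployment. A metric reported as missing points at the exporter's counter configuration, while a metric present without a `pod` label points at the pod-to-GPU mapping rather than at Prometheus.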