Open db-root opened 1 year ago
Kubernetes version: v1.23.16

nvidia-docker info

Client: Docker Engine - Community
 Version: 24.0.2
 Context: default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
   Version: v0.10.5
   Path: /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
   Version: v2.18.1
   Path: /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 110
  Running: 56
  Paused: 0
  Stopped: 54
 Images: 40
 Server Version: 20.10.24
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: 3dce8...
 runc version: v1.1.7-0-g860f061
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-150-generic
 Operating System: Ubuntu 20.04.6 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 48
 Total Memory: 125.6GiB
 Name: node01
 ID:
 Docker Root Dir: /app/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support
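I deployed gpushare on the cluster for GPU sharing and use dcgm-exporter for monitoring: https://github.com/NVIDIA/dcgm-exporter. However, no GPU utilization values show up in Prometheus, and I cannot monitor GPU resource utilization per pod. Has anyone used this setup? Any help would be appreciated.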
Same question, +1.
Too few metrics are being collected right now. How do I get the temperature and power metrics?
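One way to narrow both questions down is to look at what the exporter itself serves before Prometheus scrapes it. The sketch below is only a diagnostic aid, not a fix: it parses the exporter's text output and counts samples for the utilization, temperature, and power metrics, and how many of those samples carry a non-empty `pod` label. The URL `http://localhost:9400/metrics` and the `pod` label name are assumptions about a typical dcgm-exporter setup, and the helper names (`fetch`, `summarize`, `parse_labels`) are made up for illustration.

```python
from urllib.request import urlopen

# Metric names as they appear in dcgm-exporter's default counters list.
WANTED = (
    "DCGM_FI_DEV_GPU_UTIL",     # GPU utilization
    "DCGM_FI_DEV_GPU_TEMP",     # GPU temperature
    "DCGM_FI_DEV_POWER_USAGE",  # power draw
)


def fetch(url="http://localhost:9400/metrics", timeout=5):
    """Download the exporter's metrics page (URL is an assumed default)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_labels(label_text):
    """Parse 'a="x",b="y"' into a dict.

    Simplification: assumes label values contain no commas or escaped quotes.
    """
    labels = {}
    for part in label_text.split(","):
        key, sep, value = part.partition("=")
        if sep:
            labels[key.strip()] = value.strip().strip('"')
    return labels


def summarize(metrics_text, wanted=WANTED):
    """Return {metric: {"samples": n, "with_pod": m}} for each wanted metric."""
    report = {name: {"samples": 0, "with_pod": 0} for name in wanted}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            label_text = line[line.index("{") + 1 : line.rindex("}")]
        else:
            name = line.split()[0]
            label_text = ""
        if name not in report:
            continue
        report[name]["samples"] += 1
        # An empty pod label means the sample is not attributed to a pod.
        if parse_labels(label_text).get("pod"):
            report[name]["with_pod"] += 1
    return report
```

Running `summarize(fetch())` on a GPU node (or through a port-forward to the exporter pod) shows where the data stops. If a metric has zero samples, it is missing at the exporter, and the list of fields the exporter is configured to collect is the first thing to check. If samples exist but `with_pod` is zero, the exporter is not attributing GPUs to pods. If everything is present here, the gap is more likely in the Prometheus scrape configuration.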