NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

container PID namespace isolation with NVML #63

Open zw0610 opened 4 years ago

zw0610 commented 4 years ago

We'd like to deploy NVML-based monitoring tools in each task container, providing GPU information so that ML engineers can do performance analysis.

However, if the task container's PID namespace is isolated from the host's, we found that NVML (specifically nvmlDeviceGetComputeRunningProcesses) returns host PIDs even when it is called from inside the container. That makes further processing of this information difficult, because only the container PID namespace is visible to users (ML engineers).

Is there any way to work around this PID namespace isolation? Or does NVML have any plan to extend nvmlDeviceGetComputeRunningProcesses so that it can return PIDs in the container's PID namespace?
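
One possible workaround, sketched here under stated assumptions rather than as an NVML feature: on kernels that expose the `NSpid` field in `/proc/<pid>/status` (Linux 4.1 and later; the 3.10 kernel reported later in this thread may not have it), a helper that can see the host's `/proc` could translate a host PID into the PID the process has in its innermost PID namespace. The helper names `parse_nspid` and `host_to_container_pid` are hypothetical, and the code assumes it runs somewhere the host PIDs are visible (for example on the host itself), not inside an isolated container.

```python
from typing import List, Optional


def parse_nspid(status_text: str) -> Optional[List[int]]:
    """Extract the NSpid values from the text of /proc/<pid>/status.

    The values are ordered from the reader's PID namespace down to the
    innermost namespace of the process. Returns None if the field is absent.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(token) for token in line.split()[1:]]
    return None


def host_to_container_pid(host_pid: int, proc_root: str = "/proc") -> Optional[int]:
    """Map a host PID to its PID in the innermost namespace, if possible.

    Assumption: proc_root is a procfs mounted in the host PID namespace.
    Returns None if the process is gone, unreadable, or NSpid is missing.
    """
    try:
        with open(f"{proc_root}/{host_pid}/status") as handle:
            pids = parse_nspid(handle.read())
    except OSError:
        return None
    return pids[-1] if pids else None
```

A host-side agent could apply `host_to_container_pid` to each PID reported by NVML and hand the translated values to the container. This does not help a container that has no view of the host's `/proc`.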

guptaNswati commented 4 years ago

How are you using the NVML API inside the container? Can you give more information about your system and what you are doing?

zw0610 commented 4 years ago

OS: CentOS Linux release 7.7.1908
Kernel: Linux kube-node-zw 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Docker Info:
Engine:
  Version: 19.03.5
  API version: 1.40 (minimum version 1.12)
  Go version: go1.12.12
  OS/Arch: linux/amd64
  Experimental: false
containerd:
  Version: 1.2.10
nvidia:
  Version: 1.0.0-rc8+dev
docker-init:
  Version: 0.18.0

Nvidia GPU:
  Model: 1080 Ti
  Driver: 440.44

We are simply trying to call nvmlDeviceGetComputeRunningProcesses and nvmlDeviceGetGraphicsRunningProcesses inside the container. The results show:

  1. Both functions, when called inside the container, return all GPU processes on the host.
  2. The returned pids in the nvmlProcessInfo_t array are host PIDs instead of container PIDs.
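
The calls described above can be reproduced with a short script. This is a minimal sketch that assumes the `pynvml` Python bindings are installed (the original report used the C API directly). The function name `list_gpu_pids` and its `nvml` parameter are hypothetical; the parameter exists only so the function can be exercised without a GPU.

```python
def list_gpu_pids(nvml=None):
    """Return (gpu_index, pid, kind) for every process NVML reports.

    nvml: any object exposing the pynvml function names. If None, the
    real pynvml module is imported (assumption: it is installed).
    """
    if nvml is None:
        import pynvml as nvml  # imported lazily so the sketch loads without it

    rows = []
    nvml.nvmlInit()
    try:
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            # Compute (CUDA) processes on this device.
            for proc in nvml.nvmlDeviceGetComputeRunningProcesses(handle):
                rows.append((index, proc.pid, "compute"))
            # Graphics processes on this device.
            for proc in nvml.nvmlDeviceGetGraphicsRunningProcesses(handle):
                rows.append((index, proc.pid, "graphics"))
    finally:
        nvml.nvmlShutdown()
    return rows
```

Calling `list_gpu_pids()` inside the container and comparing the returned PIDs with `os.getpid()` of a GPU process started in the same container is one way to observe the mismatch described in this thread.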

maleadt commented 4 years ago

Another instance here: we are using GitLab CI/CD to launch Docker containers (which have access to the host's NVML library). These containers fail to inspect their own process info because of the PID namespace mismatch, and there's no easy way to launch them with --pid=host.