CentaurusInfra / alnair

Intelligent platform for AI workloads
Apache License 2.0

Containerized vGPU server makes cgroup.procs content invisible (per-process utilization query always returns 0, compute control fails) #123

Closed Fizzbb closed 2 years ago

Fizzbb commented 2 years ago
  1. Investigate objdump output, compiler, and Makefile specifications to make sure all users can build the same .so file.
  2. May need to add debug messages in the intercept lib to check the following: GPU utilization by process, time interval, fill rate, and CUDA call token consumption.

Fizzbb commented 2 years ago

Mounting /sys/fs/cgroup/ from the host into the container causes the difference. It may overwrite the container's own cgroup info, so the process IDs needed from cgroup.procs are not retrieved correctly. As a result, the per-process GPU utilization query does not return the real GPU utilization.

Fizzbb commented 2 years ago

This issue must be solved. Manual installation requires too much work on the user side: 1) install the NVIDIA toolkit, 2) install Go, 3) copy the .so file, and 4) launch the vGPU server and device plugin processes on each GPU node.

Current plan:
  1. Add debug messages to check the process IDs obtained from "/var/lib/alnair/workspace/cgroup.procs" in the two different setups. Verify that the process IDs are wrong in the containerized vGPU server and in the user container.
  2. Change the /sys/fs/cgroup/ mount point in the vGPU server.
  3. Verify that the vGPU server gets the container process IDs correctly and that the user container loads them correctly from the file.
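The debug check in step 1 could look roughly like the sketch below. It is only an illustration, not the code in the vGPU server or intercept lib: the helper names `read_cgroup_procs` and `report` are hypothetical, and it assumes the file holds one PID per line, as a cgroup.procs file normally does.

```python
from pathlib import Path


def read_cgroup_procs(path):
    """Return the PIDs listed in a cgroup.procs-style file (one PID per line)."""
    text = Path(path).read_text()
    # Skip anything that is not a plain decimal PID.
    return [int(token) for token in text.split() if token.isdigit()]


def report(path):
    """Build a one-line debug message describing what the file contains."""
    pids = read_cgroup_procs(path)
    if not pids:
        return f"{path}: empty, no PIDs visible"
    return f"{path}: {len(pids)} PIDs: {pids}"


# Example call (path taken from the plan above):
# print(report("/var/lib/alnair/workspace/cgroup.procs"))
```

Running the same call in the manual setup and in the containerized setup makes it easy to compare the two PID lists side by side.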

Fizzbb commented 2 years ago

Confirmed: after mounting /sys/fs/cgroup, the cgroup.procs file is present in the container, but it is empty. No process IDs are visible in the container, unlike on the host. Mounting to a different location won't solve this problem. This kind of mounting creates a nested Docker hierarchy, which may not make sense. Docker-in-Docker could be researched further. For now, the solution is switched to mounting the Docker socket and querying through docker top <containerID> to obtain all the PIDs in the container.
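A minimal sketch of the `docker top` approach, assuming the Docker socket is mounted and the `docker` CLI is available inside the vGPU server container. The function names are hypothetical and this is not the project's actual implementation; it only shows how the PID column can be pulled out of the command's tabular output.

```python
import subprocess


def parse_docker_top(output):
    """Extract the PID column from the text printed by `docker top`."""
    lines = output.strip().splitlines()
    if not lines:
        return []
    header = lines[0].split()
    pid_col = header.index("PID")  # raises ValueError if there is no PID column
    pids = []
    for line in lines[1:]:
        # The last column (CMD) may contain spaces, so cap the number of splits.
        fields = line.split(None, len(header) - 1)
        if len(fields) > pid_col and fields[pid_col].isdigit():
            pids.append(int(fields[pid_col]))
    return pids


def container_pids(container_id):
    """Run `docker top <containerID>` and return the PIDs it lists."""
    result = subprocess.run(
        ["docker", "top", container_id],
        capture_output=True, text=True, check=True,
    )
    return parse_docker_top(result.stdout)
```

Keeping the parsing separate from the subprocess call lets the parser be checked against captured output without a running Docker daemon.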

Fizzbb commented 2 years ago

Closed by #130. #130 also adds support for parsing the container ID from cgroupfs paths. Now both the cgroupfs and systemd cgroup drivers are supported.
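For illustration, parsing a container ID under both drivers might look like the sketch below. This is not the code from #130. It assumes the usual path layouts: with cgroupfs the last path component is the bare 64-character hex container ID, while with systemd it is wrapped as `<runtime>-<id>.scope`. The function name and the example paths are hypothetical.

```python
import re

# A 64-character hex container ID at the end of the last path component,
# optionally followed by the ".scope" suffix used by the systemd driver.
_CONTAINER_ID = re.compile(r"([0-9a-f]{64})(?:\.scope)?$")


def container_id_from_cgroup_path(path):
    """Return the container ID embedded in a cgroup path, or None if absent."""
    last = path.rstrip("/").rsplit("/", 1)[-1]
    match = _CONTAINER_ID.search(last)
    return match.group(1) if match else None


# Example calls with made-up paths (cid stands for a 64-char hex ID):
# container_id_from_cgroup_path("/kubepods/besteffort/pod1234/" + cid)
# container_id_from_cgroup_path(
#     "/kubepods.slice/kubepods-besteffort.slice/"
#     "kubepods-besteffort-pod1234.slice/docker-" + cid + ".scope")
```

Matching only on the last path component keeps the parser independent of the QoS class and pod UID segments, which differ between the two drivers.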