binary-husky / hmp2g

Multiagent Reinforcement Learning Research Project
MIT License

Docker container fails to start with --gpus all option on WSL #7

Open TomPan-1901 opened 1 year ago

TomPan-1901 commented 1 year ago

I am developing a multi-agent reinforcement learning environment based on your framework, and I want to deploy your Docker image on WSL. However, when I enable the --gpus all option, the container fails to start with the following message:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/<layer hash>/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown.

I found this issue on nvidia-docker that helped me solve the problem: https://github.com/NVIDIA/nvidia-container-toolkit/issues/289. It says that WSL has its own CUDA runtime libraries, which are injected into the container when it is created, so the image cannot already contain its own copies of those libraries.

After starting the container in privileged mode without the GPU option, I ran the command below inside it and created a new image from the result. With that image, the hmp container can use the GPU normally:

rm -rf /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-*.so.1 /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
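
For anyone who would rather script this step, here is a minimal sketch of the same cleanup as a shell function. The function name `remove_baked_driver_libs` and its optional root-directory argument are my own additions for illustration (they are not part of hmp2g or the NVIDIA toolkit); the argument only exists so the function can be tried on a scratch directory before running it against `/`.

```shell
#!/bin/sh
# Sketch only: remove the driver libraries baked into the image so that the
# NVIDIA container runtime on WSL can mount its own copies over them.
# The optional first argument (hypothetical, defaults to "/") is the root
# under which usr/lib/x86_64-linux-gnu is searched.
remove_baked_driver_libs() {
    root="${1:-/}"
    libdir="${root%/}/usr/lib/x86_64-linux-gnu"
    for f in "$libdir"/libcuda.so.1 \
             "$libdir"/libnvidia-*.so.1 \
             "$libdir"/libnvcuvid.so.1; do
        # An unmatched glob stays literal, so check that the path exists
        # (the -L test also catches dangling symlinks).
        if [ -e "$f" ] || [ -L "$f" ]; then
            rm -f "$f"
            echo "removed $f"
        fi
    done
}
```

Called with no argument inside the container started without --gpus, it targets the same files as the command above. The container can then be saved as a new image, for example with docker commit, and started from that image with --gpus all.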

I hope you can add this solution to your documentation for future reference. This would make it easier for me and other users who encounter the same problem. Thank you!

binary-husky commented 1 year ago

thank you~