Docker container fails to start with --gpus all option on WSL

I am developing a multi-agent reinforcement learning environment based on your framework, and I want to deploy your Docker image on WSL. However, when I enable the --gpus all option, the container fails to start with the following message:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/<layer hash>/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown.

I found an issue in the nvidia-container-toolkit repository that helped me solve the problem: https://github.com/NVIDIA/nvidia-container-toolkit/issues/289. It says that WSL has its own CUDA runtime libraries, which are injected into the container when it is created, so the image must not already contain those library files. Otherwise the mount fails with the "file exists" error shown above.
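To check whether an image is affected, something like the following could be run inside a container started without --gpus. This is only a sketch: list_driver_libs is a hypothetical helper name, and the default directory is taken from the path in the error message above.

```shell
#!/bin/sh
# list_driver_libs [DIR]: print the NVIDIA driver library files found in DIR.
# DIR defaults to the library path that appears in the error message.
list_driver_libs() {
    dir="${1:-/usr/lib/x86_64-linux-gnu}"
    for f in "$dir"/libcuda.so.1 "$dir"/libnvidia-*.so.1 "$dir"/libnvcuvid.so.1; do
        # An unmatched glob stays literal and fails both tests, so it is skipped.
        # -L also catches dangling symlinks, which -e alone would miss.
        if [ -e "$f" ] || [ -L "$f" ]; then
            echo "$f"
        fi
    done
}

list_driver_libs "$@"
```

If this prints any paths, those files are already present in the image and are likely to collide with the ones the NVIDIA runtime tries to mount.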

After starting the container in privileged mode without GPU access, running the command below inside it, and creating a new image from the result, the hmp container can use the GPU normally:

rm -rf /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-*.so.1 /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
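For anyone who wants to repeat this, the whole sequence might look roughly like the sketch below. The image and container names are hypothetical placeholders (I am not quoting the real hmp image name here), and I am assuming the new image is created with docker commit and that the image provides sleep. By default the script only prints the commands; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Sketch of the workaround. IMAGE, FIXED_IMAGE and CONTAINER are hypothetical
# placeholders; substitute the names you actually use.
IMAGE="${IMAGE:-hmp-image:latest}"
FIXED_IMAGE="${FIXED_IMAGE:-hmp-image:wsl}"
CONTAINER="${CONTAINER:-hmp-wsl-fix}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 1. Start the container in privileged mode, without --gpus.
run docker run -d --privileged --name "$CONTAINER" "$IMAGE" sleep infinity

# 2. Remove the driver libraries that are baked into the image.
#    The glob is single-quoted so that it expands inside the container.
run docker exec "$CONTAINER" sh -c 'rm -rf /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-*.so.1 /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1'

# 3. Save the modified container as a new image, then clean up.
#    Note: docker commit also records "sleep infinity" as the default command.
run docker commit "$CONTAINER" "$FIXED_IMAGE"
run docker rm -f "$CONTAINER"

# 4. Check that the new image starts with GPU access.
run docker run --rm --gpus all "$FIXED_IMAGE" nvidia-smi
```

Because the committed image inherits the sleep command from step 1, the usual command has to be passed explicitly when running the new image.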

I hope you can add this solution to your documentation for future reference. This would make it easier for me and other users who encounter the same problem. Thank you!
