Closed: rperdon closed this issue 6 years ago
It seems you have the `nvidia-container-cli` binary installed, but you are missing the library. This is not supposed to happen given how our packaging is done.
Can you show me the output of `ldconfig -p | grep libnvidia-container`?
Thanks
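For context, `ldconfig -p` dumps the dynamic loader's cache, so filtering it shows whether the library is registered at all. A minimal check (the sample path in the comment is illustrative and varies by distro):

```shell
# On a healthy install this prints a line roughly like:
#   libnvidia-container.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-container.so.1
# The fallback echo keeps the command from silently producing nothing.
ldconfig -p | grep libnvidia-container || echo "libnvidia-container is not registered with the loader"
```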
I get no output from that.
What about `dpkg -l '*nvidia*'`?
```
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================
un  nvhealth-modul
```
I have DIGITS, the inference server (from compute cloud), and inference server clients currently running on the DGX.
You have nvidia-docker 1.0 installed; how did you run DIGITS and the inference server? With `nvidia-docker run`? Or with `docker run --runtime=nvidia`?
You seem to have `nvidia-container-cli`, but not installed through our packages. `which nvidia-container-cli` gives you `/usr/bin/nvidia-container-cli`, right? I guess `dpkg -S $(which nvidia-container-cli)` is empty too? Did you install anything on the system or was it already like that?
- Commands are run as per NGC compute, so it looks like DIGITS is run via `nvidia-docker`.
- `which nvidia-container-cli`: the response is blank.
- `dpkg -S $(which nvidia-container-cli)`:
```
dpkg-query: error: --search needs at least one file name pattern argument
Use --help for help about querying packages.
```
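That `dpkg-query` error is actually consistent with the blank `which` output: the command substitution `$(which nvidia-container-cli)` expands to nothing, so `dpkg -S` is invoked with no file pattern at all. A small sketch of the mechanics, using a made-up binary name:

```shell
# When the binary is not on PATH, the substitution expands to an empty
# string, leaving dpkg -S with zero arguments -- hence the
# "--search needs at least one file name pattern argument" error.
path=$(command -v no-such-binary-xyz || true)
if [ -z "$path" ]; then
  echo "lookup found nothing; dpkg -S would receive no argument"
fi
```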
The DGX system was already like this. The DGX is not set up like a normal Ubuntu install: it was initially set up by NVIDIA-approved personnel, and then we ran the update to 3.17 following the instructions on the NVIDIA corporate website for DGX enterprise support. At no point did we manually install NVIDIA drivers, nvidia-docker, Docker, etc.
I'm confused here, since your initial error message is:
```
exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --device=all --compute --utility --require=cuda>=9.0 --pid=74222 /var/lib/docker/overlay2/e31668318907b321251266b57f0cb86c15d9afe9151f5b9004d872bd370ff07b/merged]\\n/usr/bin/nvidia-container-cli: error while loading shared libraries: libnvidia-container.so.1: cannot open shared object file: No such file or directory\\n\""": unknown.
```
It seems to find `/usr/bin/nvidia-container-cli`, but not the library. Can you confirm you still have this error message?
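A quick way to see exactly which shared libraries the binary needs but cannot resolve is `ldd`, which marks unresolved dependencies as "not found". An illustrative check (guarded in case the binary is absent):

```shell
# ldd lists a binary's shared-library dependencies; missing ones show
# up as "not found". Run this against the cli binary on the affected host.
if [ -x /usr/bin/nvidia-container-cli ]; then
  ldd /usr/bin/nvidia-container-cli | grep "not found" \
    || echo "all shared-library dependencies resolved"
else
  echo "/usr/bin/nvidia-container-cli is not present or not executable"
fi
```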
```
/gpu-rest-engine$ docker run --runtime=nvidia --name=server --net=host --rm inference_server
docker: Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\"error running hook: exit status 127, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --device=all --compute --utility --require=cuda>=9.0 --pid=75349 /var/lib/docker/overlay2/81bcea256a8581c07c176465db7d9e16042793b2fbb9195b5064e07d64a10a14/merged]\\n/usr/bin/nvidia-container-cli: error while loading shared libraries: libnvidia-container.so.1: cannot open shared object file: No such file or directory\\n\\"\"": unknown.
```
Confirmed, I still have the error. This is why I am confused as to how the DGX is configured. On a Linux build I created (16.04 with the 384 drivers), I have the GRE working fine. I am leery of making any changes to the DGX OS for fear of breaking its compatibility with the Docker containers for it provided by NVIDIA compute.
OK, what about these commands? (With my output from a working install.)
```
$ ls -l /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 34832 Jun 11 13:39 /usr/bin/nvidia-container-cli
$ file /usr/bin/nvidia-container-cli
/usr/bin/nvidia-container-cli: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=e73db99dc04bd29ddd9047cb5d4a0954f1915fc1, stripped
$ dpkg -S /usr/bin/nvidia-container-cli
libnvidia-container-tools: /usr/bin/nvidia-container-cli
```
```
1: -rwxr--r-- 1 nobody nogroup 34832 Jan 10 2018 /usr/bin/nvidia-container-cli
2: /usr/bin/nvidia-container-cli: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=0adf342087db4e5453345d1d493024aae99052c9, stripped
3: dpkg-query: no path found matching pattern /usr/bin/nvidia-container-cli
```
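Those three results differ from a packaged install in telling ways: the owner is `nobody:nogroup` instead of `root:root`, the permissions differ, the BuildID does not match, and dpkg does not own the file. A generic check for dpkg ownership, with a helper name made up for illustration:

```shell
# Report whether a file is tracked by any dpkg package. A binary that
# dpkg -S cannot match was most likely copied onto the system by hand.
check_owner() {
  if dpkg -S "$1" >/dev/null 2>&1; then
    echo "$1 is owned by package: $(dpkg -S "$1" | cut -d: -f1)"
  else
    echo "$1 is not tracked by dpkg (possibly copied manually)"
  fi
}
check_owner /usr/bin/nvidia-container-cli
```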
Meh, it seems everything is broken here :) I don't know how that happened. Please contact NVIDIA support for the DGX.
You can also use `nvidia-docker run` instead of `docker run --runtime=nvidia`, if that works for you.
I have tried both; it seems that the DGX OS software is not configured like your typical Ubuntu system.
Ok, I don't have a way to test the same configuration as you, so please contact support :)
Thanks for the assistance. The DGX is a neat piece of hardware with some quirks. I have submitted a support ticket to DGX support and linked this thread to help make sense of what should work.
I had a thought about running it off a DGX container: if I used an nvidia/cuda container such as 9.0-cudnn7.2-devel-ubuntu16.04 with the exposed port, then, from within it, ran the commands in the Dockerfile to get the updates and Go, built the Go file, and ran it, all from the container, it should be more or less like the container you created.
I've gotten to the point where I can build the main.go file. I am a bit lost as to how the second part works, as I cannot locate the "caffe-server" binary built by Go. I checked the /usr/local/bin folder and could not find a generated binary.
Not sure what you are trying to do, why aren't you using the provided Dockerfile?
I'm still waiting on DGX support to advise on their whole setup, so I am looking into other options.
I got it working now. A weird bug in my SSH connection when running a command forced me to reboot the DGX; on retrying everything, I was able to get the GRE loaded. Thanks for the assistance.
I am on DGX OS 3.17, which includes the 384 drivers and CUDA 9.0 support. My Docker containers from the NVIDIA cloud work correctly and launch fine, but it seems the GRE does not function:
```
docker: Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\"error running hook: exit status 127, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --device=all --compute --utility --require=cuda>=9.0 --pid=74222 /var/lib/docker/overlay2/e31668318907b321251266b57f0cb86c15d9afe9151f5b9004d872bd370ff07b/merged]\\n/usr/bin/nvidia-container-cli: error while loading shared libraries: libnvidia-container.so.1: cannot open shared object file: No such file or directory\\n\\"\"": unknown.
```
I have read about installing libnvidia-container from git, but I am unsure whether this would be safe to do on the DGX OS, as I don't want to cause any issues with my NVIDIA cloud Docker images.