GPU REST Engine on nvidia DGX

rperdon commented 6 years ago

I am on DGX OS 3.17, which ships the 384 drivers with CUDA 9.0 support. My Docker containers from the NVIDIA cloud launch and work correctly, but it seems the GRE does not function:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\"error running hook: exit status 127, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --device=all --compute --utility --require=cuda>=9.0 --pid=74222 /var/lib/docker/overlay2/e31668318907b321251266b57f0cb86c15d9afe9151f5b9004d872bd370ff07b/merged]\\n/usr/bin/nvidia-container-cli: error while loading shared libraries: libnvidia-container.so.1: cannot open shared object file: No such file or directory\\n\\"\"": unknown.

I have read about installing libnvidia-container.git, but I am unsure whether this would be safe to do on the DGX OS, as I don't want to cause any issues with my NVIDIA cloud Docker images.

flx42 commented 6 years ago

It seems you have the nvidia-container-cli binary installed, but you are missing the library. This is not supposed to happen given how our packaging is done. Can you show me the output of ldconfig -p | grep libnvidia-container?

Thanks
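For reference, that check can be wrapped in a small helper. This is only a sketch: lib_in_cache is a hypothetical name, and it reads an `ldconfig -p` style listing from standard input rather than calling ldconfig itself, so it can be tried on any machine.

```shell
# lib_in_cache: hypothetical helper, not part of any NVIDIA tooling.
# Reads a linker-cache listing (the output of `ldconfig -p`) on stdin and
# succeeds only if the given library name appears in it.
lib_in_cache() {
    # -F matches the name literally, -q reports through the exit status only
    grep -qF -- "$1"
}

# Usage on a real system (needs ldconfig on PATH):
#   ldconfig -p | lib_in_cache libnvidia-container.so.1 && echo found || echo missing
```

An empty result from the original pipeline corresponds to the "missing" branch here.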

rperdon commented 6 years ago

I get no output from that.

flx42 commented 6 years ago

What about dpkg -l '*nvidia*'?

rperdon commented 6 years ago

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================
un  nvhealth-modul                           (no description available)
un  nvidia-375-dia                           (no description available)
ii  nvidia-384     384.145-0ubu amd64        NVIDIA binary driver - version 38
ii  nvidia-384-dev 384.145-0ubu amd64        NVIDIA binary Xorg driver develop
ii  nvidia-384-dia 384.145-0ubu amd64        NVIDIA driver diagnostics utiliti
un  nvidia-current                           (no description available)
un  nvidia-current                           (no description available)
ii  nvidia-docker  1.0.1-4      amd64        NVIDIA Docker container tools
un  nvidia-driver-                           (no description available)
ii  nvidia-libopen 384.145-0ubu amd64        NVIDIA OpenCL Driver and ICD Load
ii  nvidia-modprob 384.145-0ubu amd64        Load the NVIDIA kernel driver and
un  nvidia-opencl-                           (no description available)
ii  nvidia-opencl- 384.145-0ubu amd64        NVIDIA OpenCL ICD
ii  nvidia-peer-me 1.0-5        all          nvidia peer memory kernel module.
ii  nvidia-peer-me 1.0-5        all          DKMS support for nvidia-peer-memo
un  nvidia-persist                           (no description available)
un  nvidia-prime                             (no description available)
ii  nvidia-setting 384.145-0ubu amd64        Tool for configuring the NVIDIA g
un  nvidia-setting                           (no description available)
un  nvidia-sysinfo                           (no description available)
un  nvidia-sysinfo                           (no description available)

rperdon commented 6 years ago

I currently have DIGITS, the inference server (from compute cloud), and inference server clients running on the DGX.

flx42 commented 6 years ago

You have nvidia-docker 1.0 installed, how did you run DIGITS and the inference server? With nvidia-docker run? Or with docker run --runtime=nvidia?

You seem to have nvidia-container-cli, but not installed through our packages. which nvidia-container-cli gives you /usr/bin/nvidia-container-cli, right?

I guess dpkg -S $(which nvidia-container-cli) is empty too? Did you install anything on the system or was it already like that?

rperdon commented 6 years ago

- Commands are run as per NGC compute, so it looks like DIGITS is run via nvidia-docker.

- which nvidia-container-cli
Response is blank.

- dpkg -S $(which nvidia-container-cli)
dpkg-query: error: --search needs at least one file name pattern argument
Use --help for help about querying packages.

The DGX system was already like this. The DGX is not set up like a normal Ubuntu install: it was initially set up by NVIDIA-approved personnel when it was installed, and we then ran the update to 3.17 following the instructions on the NVIDIA corporate website for DGX enterprise support. At no point did we manually install NVIDIA drivers, nvidia-docker, Docker, etc.
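The dpkg-query error above is what dpkg -S prints when $(which ...) expands to nothing, so it says nothing yet about which package owns the file. A guarded variant might look like the sketch below; owner_of is a hypothetical helper name, and it assumes a Debian-style system with dpkg available.

```shell
# owner_of: hypothetical helper. Resolves a command name on PATH first and
# only calls dpkg -S when the lookup actually produced a path.
owner_of() {
    bin_path=$(command -v "$1") || {
        echo "$1: not found on PATH for this user" >&2
        return 1
    }
    # dpkg -S prints "<package>: <path>" when a package owns the file
    dpkg -S "$bin_path"
}

# Usage: owner_of nvidia-container-cli
```

When the lookup fails, the absolute path can still be passed to dpkg -S directly.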

flx42 commented 6 years ago

which nvidia-container-cli Response is blank

I'm confused here, since your initial error message is:

 exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --device=all --compute --utility --require=cuda>=9.0 --pid=74222 /var/lib/docker/overlay2/e31668318907b321251266b57f0cb86c15d9afe9151f5b9004d872bd370ff07b/merged]\\n/usr/bin/nvidia-container-cli: error while loading shared libraries: libnvidia-container.so.1: cannot open shared object file: No such file or directory\\n\""": unknown.

It seems to find /usr/bin/nvidia-container-cli, but not the library. Can you confirm you still have this error message?
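One way to see which libraries a binary fails to resolve is to filter the output of ldd, which marks unresolved dependencies as "not found". A minimal sketch, where missing_libs is a hypothetical helper name that reads ldd output on standard input:

```shell
# missing_libs: hypothetical helper. Prints the name of every dependency
# that ldd reported as "<name> => not found".
missing_libs() {
    awk '$2 == "=>" && $3 == "not" && $4 == "found" { print $1 }'
}

# Usage on a real system:
#   ldd /usr/bin/nvidia-container-cli | missing_libs
```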

rperdon commented 6 years ago

/gpu-rest-engine$ docker run --runtime=nvidia --name=server --net=host --rm inference_server
docker: Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 0 caused \\"error running hook: exit status 127, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --device=all --compute --utility --require=cuda>=9.0 --pid=75349 /var/lib/docker/overlay2/81bcea256a8581c07c176465db7d9e16042793b2fbb9195b5064e07d64a10a14/merged]\\n/usr/bin/nvidia-container-cli: error while loading shared libraries: libnvidia-container.so.1: cannot open shared object file: No such file or directory\\n\\"\"": unknown.
cpcmectech@dgx-server:/media/shared/GRE/gpu-rest-engine$

Confirmed, I still have the error. This is why I am confused about how the DGX is configured. On a Linux build I created (16.04 and 384 drivers), I have the GRE working fine. I am leery of making any changes to the DGX OS for fear of breaking its compatibility with the Docker containers NVIDIA compute provides for it.

flx42 commented 6 years ago

Ok ok, what about these commands? (with my output for a working install).

$ ls -l /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 34832 Jun 11 13:39 /usr/bin/nvidia-container-cli

$ file /usr/bin/nvidia-container-cli
/usr/bin/nvidia-container-cli: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=e73db99dc04bd29ddd9047cb5d4a0954f1915fc1, stripped

$ dpkg -S /usr/bin/nvidia-container-cli
libnvidia-container-tools: /usr/bin/nvidia-container-cli

rperdon commented 6 years ago

1: -rwxr--r-- 1 nobody nogroup 34832 Jan 10 2018 /usr/bin/nvidia-container-cli

2: /usr/bin/nvidia-container-cli: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=0adf342087db4e5453345d1d493024aae99052c9, stripped

3: dpkg-query: no path found matching pattern /usr/bin/nvidia-container-cli
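One possible reading of output 1, offered as an assumption rather than a confirmed diagnosis: mode -rwxr--r-- grants execute permission only to the owner (nobody), so for an ordinary user a PATH lookup such as which would skip the file even though it exists. That would be consistent with the blank response earlier. The sketch below separates the three cases; classify_path is a hypothetical helper name.

```shell
# classify_path: hypothetical helper. Reports whether a path is missing,
# executable by the current user, or present without execute permission.
classify_path() {
    if [ ! -e "$1" ]; then
        echo "missing"
    elif [ -x "$1" ]; then
        echo "executable"
    else
        echo "present-not-executable"
    fi
}

# Usage: classify_path /usr/bin/nvidia-container-cli
```

The result depends on which user runs it, since the execute test is evaluated for the calling user.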

flx42 commented 6 years ago

Meh, it seems everything is borken here :), I don't know how that happened. Please contact the NVIDIA support for DGX.

You can also use nvidia-docker run instead of docker run --runtime=nvidia, if that works for you.

rperdon commented 6 years ago

I have tried both. It seems that the DGX OS software is not configured like your typical Ubuntu system.

flx42 commented 6 years ago

Ok, I don't have a way to test the same configuration as you, so please contact support :)

rperdon commented 6 years ago

Thanks for the assistance. The DGX is a neat piece of hardware with some quirks. I have submitted a support ticket to DGX support and linked this thread to help make sense of what should work.

rperdon commented 6 years ago

I had a thought about running it from a container on the DGX. If I use an nvidia/cuda container such as 9.0-cudnn7.2-devel-ubuntu16.04 with the port exposed, and then, from within it, run the commands in the Dockerfile to get the updates and Go, build the Go file, and run it, all inside the container, it should be more or less like the container you created.

rperdon commented 6 years ago

I've gotten to a point where I can build the main.go file. I am a bit lost as to how the second part works, as I cannot locate the "caffe-server" binary built by Go. I checked the /usr/local/bin folder and could not find a generated binary there.

flx42 commented 6 years ago

Not sure what you are trying to do, why aren't you using the provided Dockerfile?

rperdon commented 6 years ago

I'm still waiting on DGX support to advise on their whole setup, so I am looking into other options.

rperdon commented 6 years ago

I got it working now. A weird bug in my SSH connection while running a command forced me to reboot the DGX. On retrying everything after the reboot, I was able to get a GRE loaded. Thanks for the assistance.

NVIDIA / gpu-rest-engine

GPU REST Engine on nvidia DGX #35