Technica-Corporation / Tegra-Docker

Instructions and key files to enable Docker support on NVIDIA Tegra devices, specifically the TX-2.
Apache License 2.0

tx2-docker is not giving access of GPU to docker container #8

Open waseemkhan1989 opened 5 years ago

waseemkhan1989 commented 5 years ago

Hi,

I have created a Docker image on the Jetson TX2 that contains the NVIDIA drivers and the CUDA and cuDNN libraries. I am trying to give this image access to the GPU and CUDA drivers through the tx2-docker script, but with no success. I think tx2-docker itself is running successfully, as you can see below:

wkh@tegra-ubuntu:~/Tegra-Docker/bin$ ./tx2-docker run openhorizon/aarch64-tx2-cudabase
Running an nvidia docker image
docker run -e LD_LIBRARY_PATH=:/usr/lib/aarch64-linux-gnu:/usr/lib/aarch64-linux-gnu/tegra:/usr/local/cuda/lib64 --net=host -v /usr/lib/aarch64-linux-gnu:/usr/lib/aarch64-linux-gnu -v /usr/local/cuda/lib64:/usr/local/cuda/lib64 --device=/dev/nvhost-ctrl --device=/dev/nvhost-ctrl-gpu --device=/dev/nvhost-prof-gpu --device=/dev/nvmap --device=/dev/nvhost-gpu --device=/dev/nvhost-as-gpu openhorizon/aarch64-tx2-cudabase

But when I try to run deviceQuery inside my container, it gives me this result:

root@bc1130fc6be4:/usr/local/cuda-8.0/samples/1_Utilities/deviceQuery# ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL

Any comments? Why is this script not giving access?
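As a quick sanity check (a minimal sketch, not part of the original thread's commands), you can confirm from inside the container that the device nodes and library mounts passed by the docker run line above actually arrived:

ls -l /dev/nvhost-ctrl /dev/nvhost-ctrl-gpu /dev/nvhost-prof-gpu /dev/nvmap /dev/nvhost-gpu /dev/nvhost-as-gpu
ls /usr/lib/aarch64-linux-gnu/tegra | head
echo $LD_LIBRARY_PATH

If the nvhost/nvmap nodes are missing, or LD_LIBRARY_PATH does not include the tegra directory, deviceQuery is likely to fail in the way shown above.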

kgdad commented 5 years ago

Can you build and run the deviceQuery container that is in the samples directory?

How are you running commands inside of the openhorizon/aarch64-tx2-cudabase container? That container doesn't keep running for me; did you modify the base container?
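For reference, building and running the repo's deviceQuery sample container would look roughly like this (the paths below are illustrative; check the Tegra-Docker repository for the actual samples layout):

cd Tegra-Docker/samples/deviceQuery    # illustrative path
docker build -t devicequery .
cd ../../bin
./tx2-docker run devicequery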

waseemkhan1989 commented 5 years ago

Hi, thanks for the reply. I did not create a deviceQuery container as described at the link (https://github.com/Technica-Corporation/Tegra-Docker). My openhorizon/aarch64-tx2-cudabase container has the CUDA and cuDNN libraries inside it, and as you can see in my post, I am running deviceQuery from inside my container because all the CUDA files are inside it.

The problem I am facing is that I am unable to pass the CUDA drivers and GPU libraries to my openhorizon/aarch64-tx2-cudabase container through the tx2-docker script, which logically should work as described on Technica-Corporation/Tegra-Docker. Any help in this regard?

waseemkhan1989 commented 5 years ago

@kgdad: if I am missing something here, please guide me.

waseemkhan1989 commented 5 years ago

@kgdad: One more thing: Technica-Corporation/Tegra-Docker assumes that CUDA is installed on the Jetson TX2 outside of the Docker environment, and that you then create the deviceQuery container to check GPU access inside it. But my CUDA is installed within the Docker container. I am thinking about building a Docker image containing the CUDA drivers and GPU-related files on top of my openhorizon/aarch64-tx2-cudabase image. Do you think this would give my container GPU access?

kgdad commented 5 years ago

I'm not sure exactly why it isn't working for you. You are correct, the script currently assumes you are using the CUDA that is already on the system in /usr/local/bin. However, I was able to run deviceQuery successfully using the steps you outlined above: I used the openhorizon/aarch64-tx2-cudabase image as a base and created a docker container, ran that container, and then exec'ed a shell into it. I changed into the /usr/local/cuda/samples/1_Utilities/deviceQuery directory, built the executable, and ran it. However, mine was using CUDA 9 while yours seems to be CUDA 8.
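Spelled out as commands, that sequence would look roughly like this (the container ID is a placeholder, and the samples path may differ between CUDA 8 and CUDA 9 installs):

./tx2-docker run openhorizon/aarch64-tx2-cudabase
docker ps                                # note the running container's ID
docker exec -it <containerID> /bin/bash
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
make
./deviceQuery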

Can you give me more exact steps for what you are doing? What Dockerfile are you using? What commands, specifically, are you running to run and exec into your container? Something is different, but I'm not sure what exactly.

waseemkhan1989 commented 5 years ago

Hi! I am using JetPack 3.1 with CUDA 8.0 and cuDNN 6.0. I changed Dockerfile.cudabase (https://github.com/open-horizon/cogwerx-jetson-tx2/blob/jetpack3.1-L28-libs/Dockerfile.cudabase) a little to make it work for CUDA and cuDNN. The Dockerfile through which I built my container is attached. For docker, I am using the following commands:

docker build -f mydockerfile -t openhorizon/aarch64-tx2-cudabase .
docker run -it openhorizon/aarch64-tx2-cudabase /bin/bash
docker exec -i -t containerID /bin/bash

dockerfilecudacudnndrivers.txt

waseemkhan1989 commented 5 years ago

Another thing: I did the steps you mentioned in your last comment. To elaborate: I ran my image with "docker run -it openhorizon/aarch64-tx2-cudabase /bin/bash", which drops me into the container shell. There I changed into the /usr/local/cuda/samples/1_Utilities/deviceQuery directory and ran sudo make. Then I exited the container and ran "./tx2-docker run openhorizon/aarch64-tx2-cudabase". After that I entered my container again with "docker exec -i -t containerID /bin/bash", changed into /usr/local/cuda/samples/1_Utilities/deviceQuery, and ran ./deviceQuery, but it failed as I mentioned in my post.

kgdad commented 5 years ago

Since you are using everything as deployed in the container instead of using the system files, you need to make a small change to the tx2-docker script for this to work for you.
Comment out the two lines at the top of the file where NV_LIBS is set, and then add a new statement that sets NV_LIBS to empty. So the top would look like this:

#NV_LIBS="/usr/lib/aarch64-linux-gnu \
#    /usr/local/cuda/lib64"

NV_LIBS=""

Stop the current container if it is running and then start the container again using the tx2-docker command.
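For example (the container ID is a placeholder):

docker stop <containerID>
./tx2-docker run openhorizon/aarch64-tx2-cudabase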

Let me know if that works better for you.

waseemkhan1989 commented 5 years ago

@kgdad: Thank you so much, your solution worked. Seriously, you made my day :). deviceQuery now passes inside the container. But I am facing another problem. I am trying to build a Caffe image on top of openhorizon/aarch64-tx2-cudabase (the base image), but during the Caffe build it is unable to detect a CUDA-capable GPU device, because I am building the Caffe image through the docker build command and my Dockerfile starts like this:

FROM openhorizon/aarch64-tx2-cudabase
ENV ARCH=arm64
......

As I am new to Docker: according to my understanding, tx2-docker passes the missing files to the container and not to the image, and I am building Caffe from the openhorizon/aarch64-tx2-cudabase image. Any comment on this problem?

kgdad commented 5 years ago

Not sure exactly how this will work. I don't think docker build supports volumes or devices, which is what you need to get a container to run with GPU support. You could try building Caffe natively on the TX2, packaging it, and then installing that package as part of your docker build process.
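A rough sketch of that approach, assuming Caffe has been built natively on the TX2 and packed into a hypothetical archive named caffe-tx2.tar.gz (names and install prefix are illustrative):

FROM openhorizon/aarch64-tx2-cudabase
# caffe-tx2.tar.gz is a hypothetical archive of a Caffe build done natively on the TX2
COPY caffe-tx2.tar.gz /tmp/
RUN tar -xzf /tmp/caffe-tx2.tar.gz -C /opt && rm /tmp/caffe-tx2.tar.gz
ENV PYTHONPATH=/opt/caffe/python:$PYTHONPATH
ENV LD_LIBRARY_PATH=/opt/caffe/build/lib:$LD_LIBRARY_PATH

The CUDA-dependent parts will still only run correctly when the resulting container is started with tx2-docker, since docker build itself never sees the GPU devices.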

I'll see if I have some time today to take a look at this in more detail.