datamachines / cuda_tensorflow_opencv

DockerFile with GPU support for TensorFlow and OpenCV
Apache License 2.0

Problem on making tensorflow work with gpu (for 10.2_2.1.0_4.3.0-20200423) #2

Closed OkenKhuman closed 4 years ago

OkenKhuman commented 4 years ago

First, Thanks for helping me out last time.

While working with the "datamachines/cudnn_tensorflow_opencv:10.2_2.1.0_4.3.0-20200423" image, I have no problem enabling CUDA support, but when I try to use TensorFlow with the GPU it is unable to detect my GPU, i.e. running "import tensorflow as tf; print(len(tf.config.experimental.list_physical_devices('GPU')))" returns 0.

Is there a way to fix it, or do I need to download another image with CUDA 10.1?

Please help me out

If possible, please also mention a way to install darknetpy in any of the images (I think it would be a very good enhancement for ML Docker images like this).

mmartial commented 4 years ago

Hello Oken, sorry for the lag, I just saw this. How are you running the container? Are you using docker --gpus=all? As for Darknet, I see that YOLOv4 is out; I was going to update a Dockerfile I had to build it. Maybe I will add it in an example directory once this is done.

OkenKhuman commented 4 years ago

Yes, I use the docker --gpus=all option. OpenCV's DNN (GPU) and other GPU-backed packages like CuPy work well. Only TensorFlow is unable to detect/use my GPU.
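For illustration, a minimal sketch of this kind of cross-check (assuming cv2, cupy, and tensorflow are all importable inside the running container; the exact package set depends on the image variant):

```python
# Cross-check which GPU backends can see the device. In the situation
# described above, OpenCV and CuPy report the GPU while TensorFlow does not.
import cv2
import cupy
import tensorflow as tf

# OpenCV's CUDA module: number of CUDA-capable devices it can use.
print("OpenCV CUDA devices:", cv2.cuda.getCudaEnabledDeviceCount())

# CuPy: device count reported by the CUDA runtime.
print("CuPy CUDA devices  :", cupy.cuda.runtime.getDeviceCount())

# TensorFlow: GPUs visible to TF (0 is the symptom reported here).
print("TensorFlow GPUs    :", len(tf.config.experimental.list_physical_devices('GPU')))
```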

mmartial commented 4 years ago

tl;dr: still looking into it

long: I am still investigating the main issue, but TF requires CuDNN to work, so the "cuda" (non-cudnn) variant will have to be CPU-bound. While looking into it, it appears the pip-installed version is bound to an older version of CUDA (10.0) and is hard-linked to those libraries, so I added some workarounds to the develop-linux branch, as well as some tests (in test/, to run some simple TF code on CPU and GPU).
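For illustration, one quick way to see this library-version mismatch from inside the container is to try loading the versioned CUDA runtime libraries directly (a rough sketch; the sonames listed are assumptions about common CUDA/CuDNN versions, not taken from the repo):

```python
# Rough diagnostic for the hard-linking issue described above: check which
# versioned CUDA/CuDNN libraries the dynamic loader can actually resolve.
# The sonames below are assumptions about common versions.
import ctypes

for lib in ("libcudart.so.10.0", "libcudart.so.10.1",
            "libcudart.so.10.2", "libcudnn.so.7"):
    try:
        ctypes.CDLL(lib)
        print(lib, ": found")
    except OSError:
        print(lib, ": NOT found")
```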

mmartial commented 4 years ago

I have some preliminary content in the develop-linux branch that now builds TF from source. TF needs the cudnn base to compile the GPU-dependent part.

OkenKhuman commented 4 years ago

Yesterday I was able to fully download and use your "datamachines/cudnn_tensorflow_opencv:10.1_2.1.0_4.3.0" build. Hopefully the TF there works with the GPU :-). Thanks again for this wonderful image; it is very helpful for an engineering student like me. Also, if you have a paper based on this work, I would like to cite it in the project I am working on, or is it okay if I just give a reference to this repository?

mmartial commented 4 years ago

Hi Oken,

I am currently building the "20200615" release, which will have TF built from source and will make use of the local CUDA and CuDNN. I would recommend waiting a couple more days before trying this version (I moved the compilation to a system with many more cores, and it is still taking a long time).

If you cannot wait for this release, I would encourage you to check out the develop-linux branch and compile the one version that will work best for you. On my gaming laptop (what I was using before for compiling TF as well), it takes 4-5 hours per build.

Another option is to run this script to load the CUDA 10.0 libraries for TF to use, but this is more of a workaround than a proper solution; see: https://github.com/datamachines/cuda_tensorflow_opencv/commit/e6d8d0c3fe4eacb57be943ebe6f2f27094d5ffaa

In the test directory, you will see a few Python scripts whose names start with tf_; I would run those in the running container to see what the system detects.

mmartial commented 4 years ago

Regarding a reference, feel free to cite the GitHub repository.

We also published an article that introduced this abstraction: "Enabling GPU-Enhanced Computer Vision and Machine Learning Research Using Containers" (Dec 2019), High Performance Computing - ISC High Performance 2019 International Workshops, Lecture Notes in Computer Science, Volume 11887. https://link.springer.com/chapter/10.1007/978-3-030-34356-9_8

mmartial commented 4 years ago

I have committed to the develop-linux branch a refactoring of the Dockerfile, which has so far successfully built all the cudnn- variants. I am waiting for all of them to compile before calling it a success and pushing the images as well.

mmartial commented 4 years ago

Confirming that the 20200615 release will solve this (it is currently being pushed to DockerHub). Note that you will want to use a cudnn- variant to get GPU access. Run test/tf_hw.py to obtain the list of functional hardware; in the verbose load of the CUDA components, you will see details on your GPU hardware, confirming it is present.
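For reference, a minimal sketch of the kind of check such a script performs (this is not the actual test/tf_hw.py from the repo):

```python
# Sketch of a hardware check along the lines of test/tf_hw.py (not the
# actual script from the repo): report how TF was built, list devices,
# and run a tiny op on the GPU to confirm it is usable, not just listed.
import tensorflow as tf

print("TF version     :", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
for dev in tf.config.experimental.list_physical_devices():
    print("Device         :", dev)

if tf.config.experimental.list_physical_devices('GPU'):
    with tf.device('/GPU:0'):
        print(tf.reduce_sum(tf.random.normal([1000, 1000])))
```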

Closing this issue at this point.

mmartial commented 4 years ago

20200615 is now released, and pre-built images are available on DockerHub.

mmartial commented 4 years ago

Following your question, I added this https://github.com/datamachines/cuda_tensorflow_opencv#TestingYolov4onyourwebcamLinuxandGPUonly

Might extend it with instructions for https://pypi.org/project/darknetpy/
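For illustration (not the method used in the linked README section), a hedged sketch of one way to run a Darknet YOLO model on the GPU from inside the container, using OpenCV's DNN module, which the thread above already confirms has CUDA support; the cfg/weights/image paths are placeholders:

```python
# Sketch: run a Darknet YOLO model through OpenCV's DNN CUDA backend.
# Paths are placeholders; parsing newer YOLO configs (e.g. YOLOv4) may
# require a newer OpenCV build than the one in a given image tag.
import cv2

net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

img = cv2.imread("test.jpg")
classes, confidences, boxes = model.detect(img, confThreshold=0.5, nmsThreshold=0.4)
print(len(boxes), "detections")
```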

ghost commented 4 years ago

Off-topic, but I just wanted to thank you for your hard work on what is contained in this repo. I've wasted way too much time over the last few years getting TF, OpenCV, and CUDA to play nicely together, and this repo means I and others will hopefully need to spend far less time doing so. So thank you!

mmartial commented 4 years ago

You are quite welcome, I use this container very often for the same reason: I need a ready set of tools just to get some OpenCV code working, and hopefully I will soon extend the Jetson Nano one for doing analytics at the edge :)

mmartial commented 4 years ago

darknetpy would unfortunately not be a good solution for use with CTO; it tries to compile Yolo itself.

But PyYOLO (https://github.com/goktug97/PyYOLO) uses the already-installed OpenCV and libdarknet.so, and I have confirmed that it works by using their sample.py code; see https://github.com/datamachines/cuda_tensorflow_opencv#641-using-pyyolo