NVIDIA / tensorflow

An Open Source Machine Learning Framework for Everyone
https://developer.nvidia.com/deep-learning-frameworks
Apache License 2.0

ERROR: No supported GPU(s) detected to run this container #22

Closed mosty-gim closed 3 years ago

mosty-gim commented 3 years ago

System information

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:

  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior

I did the following:

  1. Created an AWS EC2 instance (AMI: ami-0ef85cf6e604e5650, instance type: p4d.24xlarge).
  2. Installed the NVIDIA driver (NVIDIA-SMI 450.119.03, Driver Version: 450.119.03, CUDA Version: 11.0).
  3. Installed docker.
  4. Installed nvidia-docker.
  5. Ran the following command (I did not use MIG):

sudo docker run -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 nvcr.io/nvidia/tensorflow:20.12-tf1-py3

I got the logs below. I am a beginner with TensorFlow, so I may have made a mistake somewhere, but I don't know why TensorFlow cannot detect the GPU. [image]

The nvidia-smi command works fine, though. [image]
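A working nvidia-smi shows that the driver can enumerate the devices, but it goes through NVML and does not by itself prove that a CUDA application such as TensorFlow can initialize them. To compare the two views, it can help to record exactly which GPUs nvidia-smi reports. The following is a minimal sketch that parses the output of `nvidia-smi -L`; the helper names `parse_gpu_list` and `visible_gpus` are made up for this illustration, and the line format is assumed to be the usual `GPU <index>: <name> (UUID: <uuid>)`.

```python
import re
import shutil
import subprocess

# Assumed line format of `nvidia-smi -L`: "GPU 0: <name> (UUID: GPU-...)"
_GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ([^)]+)\)\s*$")


def parse_gpu_list(text):
    """Parse `nvidia-smi -L` output into a list of dicts (hypothetical helper)."""
    gpus = []
    for line in text.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match:  # lines that do not describe a full GPU are skipped
            gpus.append({
                "index": int(match.group(1)),
                "name": match.group(2),
                "uuid": match.group(3),
            })
    return gpus


def visible_gpus():
    """Run `nvidia-smi -L` if it is available; return [] otherwise."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"{len(gpus)} GPU(s) reported by nvidia-smi")
    for gpu in gpus:
        print(f"  index {gpu['index']}: {gpu['name']}")
```

Running this once on the host and once inside the container would show whether the container sees the same devices as the host.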

Describe the expected behavior

TensorFlow detects the GPU properly.

nluehr commented 3 years ago

What version of docker and nvidia-docker are you using? If you are using nvidia-docker2, please try starting the container with the following command.

docker run -it --gpus all nvcr.io/nvidia/tensorflow:20.12-tf1-py3 nvidia-smi
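The command above differs from the one in the original report in how GPUs are requested: the report uses `--runtime=nvidia` with `-e NVIDIA_VISIBLE_DEVICES=1`, which exposes only the GPU at index 1, while this one uses `--gpus all`. As a rough sketch, the two invocations can be built side by side for comparison; the function names `legacy_runtime_cmd` and `gpus_flag_cmd` are hypothetical helpers, and nothing here actually starts a container.

```python
import shlex

IMAGE = "nvcr.io/nvidia/tensorflow:20.12-tf1-py3"


def legacy_runtime_cmd(visible_devices, image=IMAGE):
    """Style from the original report: GPUs are selected via an environment variable."""
    return [
        "docker", "run", "-it", "--runtime=nvidia",
        "-e", f"NVIDIA_VISIBLE_DEVICES={visible_devices}",
        image,
    ]


def gpus_flag_cmd(gpus="all", image=IMAGE, container_cmd=()):
    """Style suggested in the reply: GPUs are requested with docker's --gpus flag."""
    return ["docker", "run", "-it", "--gpus", gpus, image, *container_cmd]


if __name__ == "__main__":
    # Print both command lines so they can be copied and tried in turn.
    print(shlex.join(legacy_runtime_cmd("1")))
    print(shlex.join(gpus_flag_cmd(container_cmd=["nvidia-smi"])))
```

Trying both forms on the same host helps separate a problem with the runtime configuration from a problem with the driver installation.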

mosty-gim commented 3 years ago

Hello @nluehr,

> What version of docker and nvidia-docker are you using? If you are using nvidia-docker2, please try starting the container with the following command.
>
> docker run -it --gpus all nvcr.io/nvidia/tensorflow:20.12-tf1-py3 nvidia-smi

  1. docker and nvidia-docker version

These are the docker and nvidia-docker versions I am using:

NVIDIA Docker: 2.6.0
Client:
 Version:           20.10.2
 API version:       1.41
 Go version:        go1.13.8
 Git commit:        20.10.2-0ubuntu1~18.04.2
 Built:             Tue Mar 30 21:24:16 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.2
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.8
  Git commit:       20.10.2-0ubuntu1~18.04.2
  Built:            Mon Mar 29 19:27:41 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.3.3-0ubuntu1~18.04.4
  GitCommit:
 runc:
  Version:          spec: 1.0.2-dev
  GitCommit:
 docker-init:
  Version:          0.19.0
  GitCommit:

  2. Result of starting the container with the suggested command (I upgraded the NVIDIA driver last night, so the driver version differs from the one I reported in this issue): [image]

I still get the error "No supported GPU(s) detected to run this container".

mosty-gim commented 3 years ago

I switched the AMI for my A100 EC2 instance to "ami-06e551da0d461d8e2". This AMI includes all the packages for CUDA development, such as the NVIDIA driver, cuDNN, and TensorFlow. With it, the nvidia/tensorflow container finally detects the GPU. I think I made a mistake while installing the packages, or missed a package that I needed to install.
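To confirm from inside the container that TensorFlow itself sees the GPU, a short check along the following lines could be run. This is only a sketch: it assumes `device_lib.list_local_devices()` from `tensorflow.python.client` is available, as it is in TF 1.x builds, and `gpu_names` and `tensorflow_gpus` are hypothetical helper names.

```python
def gpu_names(devices):
    """Return the names of the devices whose device_type is exactly 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def tensorflow_gpus():
    """List the GPUs TensorFlow can use, or None if TensorFlow is not installed."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return gpu_names(device_lib.list_local_devices())


if __name__ == "__main__":
    names = tensorflow_gpus()
    if names is None:
        print("TensorFlow is not installed in this environment")
    elif names:
        print("TensorFlow detected GPU(s):", ", ".join(names))
    else:
        print("TensorFlow did not detect any GPU")
```

An empty result here while nvidia-smi lists devices would point at the software stack between the driver and TensorFlow, which is consistent with the fix coming from a different AMI.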