NVIDIA / semantic-segmentation

Nvidia Semantic Segmentation monorepo
BSD 3-Clause "New" or "Revised" License
1.76k stars · 388 forks

RTX 3090 Not Supported? #160

Open chrisman1015 opened 2 years ago

chrisman1015 commented 2 years ago

When I run the docker container I'm getting the error:

NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
WARNING: Detected NVIDIA NVIDIA GeForce RTX 3090 GPU, which is not yet supported in this version of the container
ERROR: No supported GPU(s) detected to run this container

Is there a workaround for this? Thanks.

chrisman1015 commented 2 years ago

Full error message:

=============
== PyTorch ==
=============

NVIDIA Release 19.10 (build 8472689)
PyTorch Version 1.3.0a0+24ae9b5

Container image Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.

Copyright (c) 2014-2019 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
WARNING: Detected NVIDIA NVIDIA GeForce RTX 3090 GPU, which is not yet supported in this version of the container
ERROR: No supported GPU(s) detected to run this container

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for PyTorch.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --ipc=host ...

/usr/local/bin/nvidia_entrypoint.sh: line 109: exec: --: invalid option
exec: usage: exec [-cl] [-a name] [command [arguments ...]] [redirection ...]
 sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Sun Aug 15 00:44:44 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:0A:00.0  On |                  N/A |
| 30%   44C    P0    99W / 350W |   1333MiB / 24259MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
tjyuyao commented 2 years ago

Changing the base image of the Dockerfile to a newer version might help. 20.12-py3 ships CUDA 11.1, which must be less than or equal to the host's CUDA version (as reported by nvidia-smi) while still supporting the 3090. See https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags for the available tags.
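The compatibility rule above can be sketched as a small check (a hypothetical helper, not part of the repo; the version numbers are the ones from this thread, and `host_cuda` is whatever nvidia-smi reports):

```python
def container_is_compatible(container_cuda: str, host_cuda: str) -> bool:
    """An NGC container runs only if its CUDA toolkit version does not
    exceed the CUDA version supported by the host driver (nvidia-smi)."""
    parse = lambda v: tuple(int(x) for x in v.split("."))
    return parse(container_cuda) <= parse(host_cuda)

# The host driver above reports CUDA 11.4:
print(container_is_compatible("11.1", "11.4"))  # 20.12-py3 -> True
print(container_is_compatible("11.3", "11.4"))  # 21.05-py3 -> True
print(container_is_compatible("11.6", "11.4"))  # too new for this host -> False
```

Note that this is only half the constraint: the container's CUDA toolkit must also be new enough to target the GPU's architecture, which is what the "not yet supported" warning is about.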

I ran into some other issues as well, so I personally chose 21.05-py3 with CUDA 11.3 (my host CUDA version is 11.4), set DEBIAN_FRONTEND in the apt install step, installed opencv-python-headless, and checked out a specific commit of the apex repo. This is my final modified Dockerfile:

FROM nvcr.io/nvidia/pytorch:21.05-py3

# Note: the "sklearn" PyPI package is deprecated; install scikit-learn directly.
RUN pip install --no-cache-dir \
        runx==0.0.6 \
        numpy \
        scikit-learn \
        h5py \
        jupyter \
        scikit-image \
        pillow \
        piexif \
        cffi \
        tqdm \
        dominate \
        opencv-python-headless \
        nose \
        ninja

RUN apt-get update && DEBIAN_FRONTEND="noninteractive" apt-get install libgtk2.0-dev -y && rm -rf /var/lib/apt/lists/*

# Install Apex
WORKDIR /home/runner
RUN git clone https://gitee.com/hyuyao/apex.git apex && \
    cd apex && \
    git checkout f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0 && \
    python setup.py install --cuda_ext --cpp_ext

After rebuilding with this Dockerfile, the "RTX 3090 not supported" message disappeared.

Jhaprince commented 2 years ago

Hi @tjyuyao, @chrisman1015, I am facing a similar issue here. It would be a great help if you could help me out with this.

This is the exact error message:

=============
== PyTorch ==
=============

NVIDIA Release 19.05 (build 6411784)
PyTorch Version 1.1.0a0+828a6a3

Container image Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.


Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
WARNING: Detected NVIDIA A100-PCIE-40GB GPU, which is not yet supported in this version of the container
(the warning above is printed once per GPU, five times in total)

Here are the details of my NVIDIA driver and CUDA versions:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:01:00.0 Off |                    0 |
| N/A   42C    P0    42W / 250W |  29452MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-PCIE-40GB      Off  | 00000000:41:00.0 Off |                    0 |
| N/A   58C    P0    94W / 250W |  36248MiB / 40536MiB |     28%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-PCIE-40GB      Off  | 00000000:81:00.0 Off |                    0 |
| N/A   53C    P0    81W / 250W |  31684MiB / 40536MiB |     42%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-PCIE-40GB      Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   39C    P0    36W / 250W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-PCIE-40GB      Off  | 00000000:E1:00.0 Off |                    0 |
| N/A   41C    P0    42W / 250W |   6251MiB / 40536MiB |     10%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
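The root cause here is the same as the 3090 case: the container's CUDA toolkit is too old for the GPU architecture. The A100 (compute capability sm_80) first became a supported target in CUDA 11.0, and the RTX 3090 (sm_86) in CUDA 11.1, while Release 19.05 ships CUDA 10.1. A hypothetical sketch of that check (the lookup table is an illustrative subset, with values from NVIDIA's CUDA release notes):

```python
# Minimum CUDA toolkit version needed to target each architecture
# (illustrative subset covering only the GPUs in this thread).
MIN_CUDA = {
    "A100 (sm_80)": (11, 0),
    "RTX 3090 (sm_86)": (11, 1),
}

def release_supports(gpu: str, container_cuda: tuple) -> bool:
    """True if a container shipping this CUDA toolkit can target the GPU."""
    return container_cuda >= MIN_CUDA[gpu]

print(release_supports("A100 (sm_80)", (10, 1)))  # Release 19.05 -> False
print(release_supports("A100 (sm_80)", (11, 3)))  # 21.05-py3 -> True
```

So the fix suggested above applies here too: rebuild on an NGC tag whose CUDA toolkit is at least 11.0 but no newer than the host driver's CUDA 11.2, e.g. 20.12-py3.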

My Docker version is 20.10.7, build f0df350.

I can provide any other details if required. Hoping for a quick response.

Thank you in advance.