chrisman1015 opened this issue 2 years ago
Full error message:
== PyTorch ==
=============
NVIDIA Release 19.10 (build 8472689)
PyTorch Version 1.3.0a0+24ae9b5
Container image Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Copyright (c) 2014-2019 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
WARNING: Detected NVIDIA NVIDIA GeForce RTX 3090 GPU, which is not yet supported in this version of the container
ERROR: No supported GPU(s) detected to run this container
NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
nvidia-docker run --ipc=host ...
/usr/local/bin/nvidia_entrypoint.sh: line 109: exec: --: invalid option
exec: usage: exec [-cl] [-a name] [command [arguments ...]] [redirection ...]
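The NOTE in the log above recommends raising the shared-memory limit when launching the container. A minimal launch sketch (the image tag and mount path are placeholders, not from the original report):

```shell
# Launch an NGC PyTorch container with the host IPC namespace so PyTorch
# DataLoader workers are not limited by the default 64MB /dev/shm.
# Alternatively, --shm-size=1g can be passed instead of --ipc=host.
docker run --rm -it --gpus all --ipc=host \
    -v "$PWD":/workspace/project \
    nvcr.io/nvidia/pytorch:21.05-py3 bash
```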
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Sun Aug 15 00:44:44 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:0A:00.0 On | N/A |
| 30% 44C P0 99W / 350W | 1333MiB / 24259MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Changing the base image of the Dockerfile to a newer version might help. 20.12-py3
corresponds to CUDA 11.1, which should be less than or equal to the host CUDA version while still supporting the 3090. More tags are listed at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags .
There were some other issues I ran into, so I personally chose 21.05-py3
with CUDA 11.3 (my host CUDA version is 11.4), added DEBIAN_FRONTEND
to the apt install step, installed opencv-python-headless, and checked out a specific commit of the apex repo. This is the final modified version of my Dockerfile:
FROM nvcr.io/nvidia/pytorch:21.05-py3
# 'sklearn' is a deprecated PyPI alias; install scikit-learn directly
RUN pip install --no-cache-dir \
        runx==0.0.6 \
        numpy \
        scikit-learn \
        h5py \
        jupyter \
        scikit-image \
        pillow \
        piexif \
        cffi \
        tqdm \
        dominate \
        opencv-python-headless \
        nose \
        ninja
RUN apt-get update && DEBIAN_FRONTEND="noninteractive" apt-get install libgtk2.0-dev -y && rm -rf /var/lib/apt/lists/*
# Install Apex
WORKDIR /home/runner
RUN cd /home/runner && \
    git clone https://gitee.com/hyuyao/apex.git apex && \
    cd apex && \
    git checkout f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0 && \
    python setup.py install --cuda_ext --cpp_ext
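For completeness, a sketch of building and smoke-testing the modified image; the image name `seg-pytorch` is a placeholder and not part of the original Dockerfile:

```shell
# Build the image from the modified Dockerfile, then confirm the
# container can actually see the GPU from inside PyTorch.
docker build -t seg-pytorch .
docker run --rm --gpus all --ipc=host seg-pytorch \
    python -c "import torch; print(torch.cuda.is_available())"
```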
After installation, the unsupported-GPU message for the 3090 disappeared.
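The rule of thumb above (the container's CUDA version should be less than or equal to what the host driver supports) can be sketched as a small check. The tag-to-CUDA mapping below covers only the two tags mentioned in this thread; anything beyond that would be an assumption:

```python
# Sketch: check that a container's CUDA version does not exceed the
# CUDA version reported by the host driver (11.4 here, from nvidia-smi).

def cuda_tuple(version: str) -> tuple:
    """Parse a 'major.minor' CUDA version string into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

# CUDA versions shipped in the NGC PyTorch tags discussed above.
CONTAINER_CUDA = {
    "20.12-py3": "11.1",
    "21.05-py3": "11.3",
}

def is_compatible(tag: str, host_cuda: str) -> bool:
    """A container is usable when its CUDA version <= the host driver's."""
    return cuda_tuple(CONTAINER_CUDA[tag]) <= cuda_tuple(host_cuda)

print(is_compatible("21.05-py3", "11.4"))  # container CUDA 11.3 <= host 11.4
```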
Hi @tjyuyao, @chrisman1015, I am also facing a similar issue here. It would be a great help if you could help me out with this.
This is the exact error message:
NVIDIA Release 19.05 (build 6411784)
PyTorch Version 1.1.0a0+828a6a3
Container image Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
WARNING: Detected NVIDIA A100-PCIE-40GB GPU, which is not yet supported in this version of the container
(the warning above is repeated once for each of the five A100 GPUs)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:01:00.0 Off |                    0 |
| N/A   42C    P0    42W / 250W |  29452MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-PCIE-40GB      Off  | 00000000:41:00.0 Off |                    0 |
| N/A   58C    P0    94W / 250W |  36248MiB / 40536MiB |     28%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-PCIE-40GB      Off  | 00000000:81:00.0 Off |                    0 |
| N/A   53C    P0    81W / 250W |  31684MiB / 40536MiB |     42%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-PCIE-40GB      Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   39C    P0    36W / 250W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-PCIE-40GB      Off  | 00000000:E1:00.0 Off |                    0 |
| N/A   41C    P0    42W / 250W |   6251MiB / 40536MiB |     10%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
I can also provide other details if required. Hoping to receive a quick response from your side.
Thank you in advance.
When I run the Docker container, I get the error above.
Is there a workaround for this? Thanks.