pfcouto opened this issue 1 year ago
Hi @pfcouto
There seems to be an issue with your PyTorch installation. You can follow these steps to install detectron2 within your environment constraints (Python 3.8.10, CUDA 11.3, and a Linux platform). I hope this resolves your issue; if you face any further issues even after following these steps, let me know.
pyenv install 3.8.10
pyenv shell 3.8.10
python -m venv env && source ./env/bin/activate
python --version && which python
python -m pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2
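Once the install finishes, a quick import check confirms the build succeeded (just a sanity check, not part of the steps above):
python -c "import detectron2; print(detectron2.__version__)"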
Hi @satishjasthi, if possible I would like to keep using conda. I am running Linux (Fedora), and the nvidia-smi command shows CUDA Version: 12.1, so I can use a higher CUDA version if the detectron2 installation allows it. I tried to replicate those steps with the following commands:
conda create -n detectron2 python=3.8.10
conda activate detectron2
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
And I still got the error building detectron2.
Hi @pfcouto
I see that the root cause is the mismatch between your CUDA version and the torch version you are trying to install. In your case you are trying to install torch built for CUDA 11.3 on a system with CUDA 12.1, which may not work, and you might end up getting the same error. Instead, if your environment supports it, try installing CUDA 11.7, as the latest PyTorch version supports it.
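As a quick sanity check, you can print the CUDA version your installed torch was built against and compare it with what nvidia-smi reports for the driver (the exact versions will differ on your machine):
python -c "import torch; print(torch.__version__, torch.version.cuda)"
nvidia-smi | head -n 4   # the header shows the driver-side CUDA version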
One way to get CUDA 11.7 is with Docker. If you have Docker installed, you can use a Docker image that includes CUDA 11.7. NVIDIA provides Docker images with different CUDA versions through the NVIDIA GPU Cloud. You can pull the CUDA 11.7 image with:
docker pull nvcr.io/nvidia/cuda:11.7.0-base-ubuntu20.04
Then you can run your program inside a Docker container using this image. This has the advantage of not affecting your system's CUDA installation, but it requires you to have Docker installed and to be familiar with Docker usage.
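For example, you could start an interactive container from that image with GPU access (just a sketch; the base image ships the CUDA libraries but not Python or PyTorch, which you would still install inside the container):
docker run --gpus all -it --rm nvcr.io/nvidia/cuda:11.7.0-base-ubuntu20.04 bash
# inside the container, nvidia-smi should list the host GPU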
Hi @satishjasthi, following your idea I installed the current stable version of PyTorch (2.0.1), since it supports CUDA 11.7, and it worked. However, can you explain a bit more about the Docker option? Does it require NVIDIA-Docker (I have had problems with it before)? Would I have to copy my whole directory into it every time I want to run something, or copy it in once and then always edit it directly inside Docker? Also, I am working with an external camera; would Docker be able to access it?
conda create -n detectron2 python=3.8.10
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
Hi @pfcouto Yes, you need to use the NVIDIA Docker image because it comes with the desired CUDA version and can communicate with the underlying hardware. Moreover, it makes the whole development process simpler when dealing with projects that require different CUDA versions. You need not copy the entire project into Docker every time; that would be cumbersome. Instead you can use the Docker mount option, which mounts a directory on the host machine to a desired directory inside the container. You can read more about mounting here. And yes, using Docker you can access any camera connected to the host machine.
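A minimal sketch of what that could look like (~/myproject and /dev/video0 are placeholders, not paths from your setup):
docker run --gpus all -it --rm \
  -v ~/myproject:/workspace -w /workspace \
  --device /dev/video0:/dev/video0 \
  nvcr.io/nvidia/cuda:11.7.0-base-ubuntu20.04 bash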
Hi @satishjasthi. I tried to install NVIDIA-Docker. However, I am facing an error.
Upon running the command docker run --privileged --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi, I get the error:
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
I installed the NVIDIA drivers through the Fedora docs, not from NVIDIA, so for example nvcc --version outputs an error saying the nvcc command is not recognized, but on my host machine I can run nvidia-smi.
The commands I used to install the drivers are the following:
sudo dnf install akmod-nvidia
sudo dnf install xorg-x11-drv-nvidia-cuda
And as visible in the following image, I am able to run nvidia-smi on my host machine.
I followed this guide on how to install nvidia-docker and did the following:
curl -s -L https://nvidia.github.io/libnvidia-container/centos8/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
##############################
sudo dnf install nvidia-docker2
# Edit /etc/nvidia-container-runtime/config.toml and disable cgroups:
no-cgroups = true
sudo reboot
##############################
sudo systemctl start docker.service
##############################
docker run --privileged --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
and upon running this docker command I get the error shown above.
The thing is, I have the file it says is missing (see the following image), so maybe it is looking for it in a different directory?
uname -a:
Linux fedora 6.2.10-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 6 23:30:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
docker version
Client: Docker Engine - Community
Cloud integration: v1.0.31
Version: 23.0.3
API version: 1.41 (downgraded from 1.42)
Go version: go1.19.7
Git commit: 3e7cbfd
Built: Tue Apr 4 22:10:33 2023
OS/Arch: linux/amd64
Context: desktop-linux
Server: Docker Desktop 4.18.0 (104112)
Engine:
Version: 20.10.24
API version: 1.41 (minimum version 1.12)
Go version: go1.19.7
Git commit: 5d6db84
Built: Tue Apr 4 18:18:42 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.18
GitCommit: 2456e983eb9e37e47538f59ea18f2043c9a73640
runc:
Version: 1.1.4
GitCommit: v1.1.4-0-g5fd4c4d
docker-init:
Version: 0.19.0
GitCommit: de40ad0
rpm -qa '*nvidia*'
nvidia-gpu-firmware-20230310-148.fc37.noarch
xorg-x11-drv-nvidia-kmodsrc-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-cuda-libs-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-libs-530.41.03-1.fc37.x86_64
nvidia-settings-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-power-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-530.41.03-1.fc37.x86_64
akmod-nvidia-530.41.03-1.fc37.x86_64
kmod-nvidia-6.2.9-200.fc37.x86_64-530.41.03-1.fc37.x86_64
nvidia-persistenced-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-cuda-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-libs-530.41.03-1.fc37.i686
xorg-x11-drv-nvidia-cuda-libs-530.41.03-1.fc37.i686
kmod-nvidia-6.2.10-200.fc37.x86_64-530.41.03-1.fc37.x86_64
nvidia-container-toolkit-base-1.13.0-1.x86_64
libnvidia-container1-1.13.0-1.x86_64
libnvidia-container-tools-1.13.0-1.x86_64
nvidia-container-toolkit-1.13.0-1.x86_64
nvidia-docker2-2.13.0-1.noarch
nvidia-container-cli -V
cli-version: 1.13.0
lib-version: 1.13.0
build date: 2023-03-31T13:12+00:00
build revision: 20823911e978a50b33823a5783f92b6e345b241a
build compiler: gcc 8.5.0 20210514 (Red Hat 8.5.0-18)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
Thanks for your help!
The error message suggests that the NVIDIA Management Library (libnvidia-ml.so.1) cannot be found. This library is part of the NVIDIA driver and is required for the NVIDIA Docker runtime to function correctly.
Here are a few things you can try to resolve this issue:
Check Your NVIDIA Driver Installation: Make sure you have the NVIDIA drivers installed correctly on your host system. You can check this by running nvidia-smi on your host system (outside of Docker). If this command fails or if it doesn't show your GPU(s), you may need to reinstall your NVIDIA drivers.
Update Your NVIDIA Docker Runtime: The NVIDIA Docker runtime has gone through several versions, and older versions may not be compatible with newer NVIDIA drivers or Docker versions. You can update the NVIDIA Docker runtime by following the instructions on the NVIDIA Docker GitHub page.
Reinstall the NVIDIA Docker Runtime: If updating doesn't solve the problem, you might try uninstalling and then reinstalling the NVIDIA Docker runtime. This can help if the runtime was installed incorrectly or if its configuration has become corrupted.
Check the Docker Command: Make sure you're using the correct Docker command to run your container. The --gpus all option is only available in Docker 19.03 and later, and it requires the NVIDIA Docker runtime to be installed as the default runtime or to be specified with the --runtime nvidia option. If you're using an older version of Docker, you might need to use nvidia-docker run instead of docker run.
Remember to restart your Docker service after making changes to the NVIDIA Docker runtime or its configuration. You can do this with sudo systemctl restart docker or sudo service docker restart, depending on your system.
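For reference, on recent versions of the NVIDIA Container Toolkit you can let nvidia-ctk write Docker's runtime configuration and then restart Docker; the resulting /etc/docker/daemon.json should contain roughly the runtime entry shown in the comments below (a generic sketch, not a fix for the specific libnvidia-ml.so.1 error above):
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# /etc/docker/daemon.json should then contain roughly:
# {
#   "runtimes": {
#     "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] }
#   }
# }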
FYI, I tried many of the methods here and nothing worked. The only method that worked for me is here.
conda install -c conda-forge pycocotools
python -c "import torch; print(torch.version.cuda)"
python -c "import pycocotools; print('pycocotools is installed!')"
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2
I am having the same issue on Mac. Regarding the step from @satishjasthi: 'Now install torch and torchvision using pip. I suggest using torch 1.12.0 and the respective torchvision unless you have a very specific requirement for torch 1.10.0, because you can still run detectron2 with torch 1.12.0 on CUDA 11.3'.
I get this error:
ERROR: Could not find a version that satisfies the requirement torch==1.12.0+cu113 (from versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0)
ERROR: No matching distribution found for torch==1.12.0+cu113
I found it extremely tricky to get all the dependencies right, so I am leaving the instructions to set up the environment here.
# Create conda env
conda create --name detectron2 python==3.9 -y
conda activate detectron2
# Install torch
pip install torch torchvision
# Install gcc and g++ with conda
conda install -c conda-forge pybind11
conda install -c conda-forge gxx
conda install -c anaconda gcc_linux-64
conda upgrade -c conda-forge --all
# Install detectron2 (specific version)
pip install 'git+https://github.com/facebookresearch/detectron2.git@v0.6'
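If the detectron2 build still complains about the compiler after this, it can help to check which compiler the activated environment actually exposes; the conda compiler packages usually export CC and CXX on activation (a quick check, and the behaviour may vary with package versions):
echo "CC=$CC CXX=$CXX"
"${CC:-gcc}" --version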
Following the instructions above, I had to add a version to the gcc install, and used conda-forge:
conda install -c conda-forge gcc_linux-64=13.2.0
Instructions To Reproduce the 🐛 Bug:
Full logs or other relevant observations:
Environment:
Provide your environment information using the following command:
Hope someone can help me out. Thanks!