Sense-X / Co-DETR

[ICCV 2023] DETRs with Collaborative Hybrid Assignments Training
MIT License

mmcv error: libGL.so.1 Not Found; No module named 'fairscale' #108

keeper-jie closed this issue 7 months ago

keeper-jie commented 7 months ago
  1. I pulled the 11.3.1-cudnn8-devel-ubuntu20.04 image from https://hub.docker.com/r/nvidia/cuda/tags?page=1&name=11.3
  2. Run the Docker container, where /home/liujie/Co-DETR-main is the path to Co-DETR and /home/liujie/data is the path to the data:
    docker run --gpus all -it --name co_detr_cuda11.3_dev -v /home/liujie/Co-DETR-main:/code -v /home/liujie/data:/code/data nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04 
  3. install miniconda
    apt update
    apt install wget
    mkdir -p ~/miniconda3
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
    bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
    rm -rf ~/miniconda3/miniconda.sh
    ~/miniconda3/bin/conda init bash
    ~/miniconda3/bin/conda init zsh
  4. create python 3.7.11 env
    conda create -n co_detr_python3.7.11 python=3.7.11
    conda activate co_detr_python3.7.11
  5. add conda channel
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
  6. install pytorch
    conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3
  7. add pip channel
    mkdir -p ~/.pip
    echo "[global]" >> ~/.pip/pip.conf
    echo "index-url = https://mirrors.aliyun.com/pypi/simple/" >> ~/.pip/pip.conf
    echo "" >> ~/.pip/pip.conf
    echo "[install]" >> ~/.pip/pip.conf
    echo "trusted-host = mirrors.aliyun.com" >> ~/.pip/pip.conf
  8. install mmcv
    pip install mmcv-full==1.5.0 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.11.0/index.html
  9. install co-detr
    cd /code
    pip install -v -e .
  10. run
    bash tools/dist_train.sh projects/configs/co_deformable_detr/co_deformable_detr_r50_1x_coco.py 2 /code/workdir/2gpu

  11. error: ImportError: libGL.so.1: cannot open shared object file: No such file or directory
    File "tools/train.py", line 9, in <module>
      import mmcv
    File "/root/miniconda3/envs/co_detr_python3.7.11/lib/python3.7/site-packages/mmcv/__init__.py", line 4, in <module>
      from .fileio import *
    File "/root/miniconda3/envs/co_detr_python3.7.11/lib/python3.7/site-packages/mmcv/fileio/__init__.py", line 2, in <module>
      from .file_client import BaseStorageBackend, FileClient
    File "/root/miniconda3/envs/co_detr_python3.7.11/lib/python3.7/site-packages/mmcv/fileio/file_client.py", line 15, in <module>
      from mmcv.utils.misc import has_method
    File "/root/miniconda3/envs/co_detr_python3.7.11/lib/python3.7/site-packages/mmcv/utils/__init__.py", line 40, in <module>
      from .env import collect_env
    File "/root/miniconda3/envs/co_detr_python3.7.11/lib/python3.7/site-packages/mmcv/utils/env.py", line 9, in <module>
      import cv2
    File "/root/miniconda3/envs/co_detr_python3.7.11/lib/python3.7/site-packages/cv2/__init__.py", line 181, in <module>
      bootstrap()
    File "/root/miniconda3/envs/co_detr_python3.7.11/lib/python3.7/site-packages/cv2/__init__.py", line 153, in bootstrap
      native_module = importlib.import_module("cv2")
    File "/root/miniconda3/envs/co_detr_python3.7.11/lib/python3.7/importlib/__init__.py", line 127, in import_module
      return _bootstrap._gcd_import(name[level:], package, level)
    ImportError: libGL.so.1: cannot open shared object file: No such file or directory

  12. Solved it with this Stack Overflow answer (https://stackoverflow.com/questions/55313610/importerror-libgl-so-1-cannot-open-shared-object-file-no-such-file-or-directo). Inside the running container the command is as follows (in a Dockerfile, prefix it with RUN):
    apt-get update && apt-get install ffmpeg libsm6 libxext6 -y

  13. A new error occurred: No module named 'fairscale'. Error:
    File "tools/train.py", line 17, in <module>
      from mmdet.apis import init_random_seed, set_random_seed, train_detector
    File "/code/mmdet/apis/__init__.py", line 2, in <module>
      from .inference import (async_inference_detector, inference_detector,
    File "/code/mmdet/apis/inference.py", line 13, in <module>
      from mmdet.datasets import replace_ImageToTensor
    File "/code/mmdet/datasets/__init__.py", line 13, in <module>
      from .utils import (NumClassCheckHook, get_loading_pipeline,
    File "/code/mmdet/datasets/utils.py", line 11, in <module>
      from mmdet.models.dense_heads import GARPNHead, RPNHead
    File "/code/mmdet/models/__init__.py", line 2, in <module>
      from .backbones import *  # noqa: F401,F403
    File "/code/mmdet/models/backbones/__init__.py", line 2, in <module>
      from .csp_darknet import CSPDarknet
    File "/code/mmdet/models/backbones/csp_darknet.py", line 11, in <module>
      from ..utils import CSPLayer
    File "/code/mmdet/models/utils/__init__.py", line 19, in <module>
      from .transformer import (DetrTransformerDecoder, DetrTransformerDecoderLayer,
    File "/code/mmdet/models/utils/transformer.py", line 31, in <module>
      import fairscale
    ModuleNotFoundError: No module named 'fairscale'

  14. It seems to be an error caused by mmcv.
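The two failures in steps 11 and 13 are different kinds of import error, and telling them apart can save some searching. The following is a minimal sketch, not part of Co-DETR: `diagnose_import` is a hypothetical helper name, and it assumes only the Python standard library. A `ModuleNotFoundError` means a Python package is absent (as with fairscale, which the traceback shows being imported from the repository's own mmdet/models/utils/transformer.py), while a plain `ImportError` usually means the package is installed but something it loads, such as libGL.so.1 for cv2, is missing.

```python
import importlib


def diagnose_import(name):
    """Try to import `name` and classify the failure, if any.

    Hypothetical helper for illustration; not part of Co-DETR or mmcv.
    """
    try:
        importlib.import_module(name)
    except ModuleNotFoundError as exc:
        # A Python package is absent; exc.name is the module that was not found.
        return f"missing Python module: {exc.name}"
    except ImportError as exc:
        # The package exists but failed while loading, e.g. a missing
        # shared library such as libGL.so.1.
        return f"import failed: {exc}"
    return "ok"


if __name__ == "__main__":
    for name in ("cv2", "fairscale"):
        print(name, "->", diagnose_import(name))
```

Run inside the conda environment from step 4, a "missing Python module" result points to pip install, whereas an "import failed" result points to a system package installed with apt-get.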

# Env
## host:
A6000
cuda 11.8: 

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2022 NVIDIA Corporation
    Built on Wed_Sep_21_10:33:58_PDT_2022
    Cuda compilation tools, release 11.8, V11.8.89
    Build cuda_11.8.r11.8/compiler.31833905_0

CUDA driver: NVIDIA-SMI 535.146.02 Driver Version: 535.146.02 CUDA Version: 12.2
## nvcc in docker:

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2021 NVIDIA Corporation
    Built on Mon_May__3_19:15:13_PDT_2021
    Cuda compilation tools, release 11.3, V11.3.109
    Build cuda_11.3.r11.3/compiler.29920130_0

TempleX98 commented 7 months ago

Please install fairscale: pip install fairscale

keeper-jie commented 7 months ago

> Please install fairscale: pip install fairscale

Thanks for your reply.

  1. Installed fairscale:
    pip install fairscale
  2. Ran bash tools/dist_train.sh projects/configs/co_deformable_detr/co_deformable_detr_r50_1x_coco.py 2 /code/workdir/2gpu and got a new error:
    ModuleNotFoundError: No module named 'timm'
  3. Resolved it with pip install timm. Next error:
    File "/root/miniconda3/envs/co_detr_python3.7.11/lib/python3.7/site-packages/mmcv/utils/config.py", line 502, in pretty_text
    text, _ = FormatCode(text, style_config=yapf_style, verify=True)
    TypeError: FormatCode() got an unexpected keyword argument 'verify'
  4. Solved it, after searching the issues, with pip install yapf==0.40.1 (the version installed by default was yapf==0.40.2). Next error:
    File "/root/miniconda3/envs/co_detr_python3.7.11/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
    RuntimeError: DataLoader worker (pid 1295394) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory.
  5. Edited the config file /code/projects/configs/co_deformable_detr/co_deformable_detr_r50_1x_coco.py:
    data = dict(
    samples_per_gpu=1,   # was 2
  6. Next error: ImportError: Please run "pip install scipy" to install scipy first.
  7. Solved it with pip install scipy.
  8. Training then ran successfully with bash tools/dist_train.sh projects/configs/co_deformable_detr/co_deformable_detr_r50_1x_coco.py 2 /code/workdir/2gpu
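The yapf failure in steps 3 and 4 comes from mmcv passing `verify=True` to a `FormatCode` that no longer takes that argument. A quick way to check which situation an environment is in, before starting a training run, might look like the sketch below. `accepts_keyword` is a hypothetical helper name; the import path `yapf.yapflib.yapf_api.FormatCode` is the function named in the traceback, and the suggested pin is simply the version reported to work in step 4.

```python
import inspect


def accepts_keyword(func, keyword):
    """Return True if `func` declares a parameter named `keyword`
    or accepts arbitrary keyword arguments (**kwargs).

    Hypothetical helper for illustration only.
    """
    params = inspect.signature(func).parameters
    if keyword in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


if __name__ == "__main__":
    try:
        from yapf.yapflib.yapf_api import FormatCode
    except ImportError:
        print("yapf is not installed in this environment")
    else:
        if accepts_keyword(FormatCode, "verify"):
            print("FormatCode accepts 'verify'; mmcv's pretty_text call should work")
        else:
            print("FormatCode has no 'verify' argument; try yapf==0.40.1 as in step 4")
```

Inspecting the signature avoids hard-coding version numbers, so the same check still applies if a different yapf release is installed.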

Conclusions and takeaways:

1) It is hard to reproduce the environment, so we should use Docker for development.

2) GPU memory requirements are very high for SOTA models; even on my A6000 (49 GB) I had to modify the config file to get training to run.

3) Transformer architectures still have a long way to go before they are practical in production.
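A note on the Bus error from step 4 of the list above: the message refers to shared memory (/dev/shm) used by the DataLoader workers, which is separate from GPU memory. Docker containers get a 64 MB /dev/shm by default unless docker run is given --shm-size or --ipc=host, and the docker run command in this thread sets neither. Whether that was the actual cause here is an assumption; lowering samples_per_gpu is what resolved it in this thread. A minimal sketch for checking the available shared memory from inside the container, using only the standard library (`shared_memory_gib` is a hypothetical helper name):

```python
import shutil


def shared_memory_gib(path="/dev/shm"):
    """Return (total, free) size in GiB of the filesystem mounted at `path`.

    Hypothetical helper for illustration only.
    """
    usage = shutil.disk_usage(path)
    return usage.total / 2**30, usage.free / 2**30


if __name__ == "__main__":
    try:
        total, free = shared_memory_gib()
    except OSError as exc:
        # /dev/shm may not exist on non-Linux systems.
        print(f"could not read /dev/shm: {exc}")
    else:
        print(f"/dev/shm: {total:.2f} GiB total, {free:.2f} GiB free")
```

If the reported total is only a few tens of megabytes, recreating the container with a larger --shm-size may be worth trying before reducing the batch size.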