Open rentainhe opened 2 years ago
This should be added to the FAQ in the installation docs.
Thanks for your advice~ we will update the document later~
Hello, I'm trying to install detrex on an HPC cluster with NVIDIA V100 GPUs. I managed to set `CUDA_HOME` to path/CUDA/11.8.0.
When I run `pip install -e .` again, I get the following warning and error:
warning: nvcc warning : incompatible redefinition for option 'std', the last value of this option was used (I think this relates to one argument -std=c++17)
error: /.../miniconda3/envs/fps-bm/lib/python3.10/site-packages/torch/include/c10/util/Half.h(73): error: identifier "_castu32_f32" is undefined
/.../miniconda3/envs/fps-bm/lib/python3.10/site-packages/torch/include/c10/util/Half.h(89): error: identifier "_castf32_u32" is undefined
2 errors detected in the compilation of "/.../detrex/detrex/layers/csrc/DCNv3/dcnv3_cuda.cu".
error: command '.../software/CUDA/11.8.0/bin/nvcc' failed with exit code 2
Did you ever encounter this, and do you know a fix? My gcc is 11.3 and supports C++17. Thanks in advance.
Hello @hg6185
It seems the dcn_v3 operator is not suitable for this environment. You can try one of these two options:
- InternImage's official repo: https://github.com/OpenGVLab/InternImage
- They also seem to provide a prebuilt Python package for this operator: https://github.com/OpenGVLab/InternImage/releases/tag/whl_files
We will update detrex soon to remove the compilation step for this operator.
Thanks for the quick reply @rentainhe! Unfortunately, that's not the thing. I removed and reinstalled everything including detectron2 which now cannot be installed due to the same issue. It seems to be a problem with c++ imports in PyTorch.
I'm sorry to hear that. I suggest you could try lowering the PyTorch version to see if it helps to bypass this issue. @hg6185
Hi again @rentainhe, I found the problem: the GCC version was incompatible with CUDA. Note that you need a GCC version below 10. In my case, everything works fine with CUDA 11.3.1 and GCC 9.4.0. Thanks again for the quick support!
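For anyone who wants to check this before building, here is a minimal sketch that reads the compiler version from `gcc -dumpversion`. The threshold of 10 is only what this thread reports for a CUDA 11.3 setup, not an official compatibility table, and the helper names are made up for illustration.

```python
import shutil
import subprocess

# Assumption taken from this thread: GCC 9.4.0 worked and GCC 11.3 failed,
# so any major version >= 10 is treated as suspect for this setup.
MAX_GCC_MAJOR_EXCLUSIVE = 10


def gcc_major(dumpversion_output: str) -> int:
    """Extract the major version from `gcc -dumpversion` output such as '9.4.0' or '9'."""
    return int(dumpversion_output.strip().split(".")[0])


def gcc_is_ok(dumpversion_output: str, limit: int = MAX_GCC_MAJOR_EXCLUSIVE) -> bool:
    """Return True if the GCC major version is below the given limit."""
    return gcc_major(dumpversion_output) < limit


if __name__ == "__main__":
    gcc = shutil.which("gcc")
    if gcc is None:
        print("gcc not found on PATH")
    else:
        out = subprocess.run([gcc, "-dumpversion"], capture_output=True, text=True).stdout
        if out.strip():
            verdict = "ok" if gcc_is_ok(out) else "probably too new for this setup"
            print(f"gcc {out.strip()}: {verdict}")
```

On an HPC with environment modules, run it after loading the GCC module you plan to build with.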
Would you like to add this case to our FAQs?
here: https://github.com/IDEA-Research/detrex/issues/109#issue-1414444469
Hi @rentainhe ,
I can add this, but what do you mean? :D Do you want me to write a comment that makes a little summary, so you can delete the rest?
Yes, I was wondering whether it's better to add it somewhere or just keep our conversation here to help others who run into the same problem.
Hi @rentainhe, here is a summary of what fixed issue 1 for me: the 'latest' Detectron2 release requires a GCC version lower than 10.0.0. I am working on an HPC cluster where I can load different CUDA and GCC versions, which is practical in this case.
To build Detectron2 and detrex, I used a miniconda env with CUDA 11.3.1 and GCC 9.4.0. I use Python 3.8 and an older PyTorch build, which can be installed with this command (I post it here because you would otherwise have to search for it, since it's older): `conda install pytorch torchvision torchaudio pytorch-cuda=11.3 -c pytorch -c nvidia`
Don't forget the NVIDIA CUDA Toolkit version that matches your setup. Note that some libraries, such as matplotlib, needed to be downgraded to match the older GCC and Python versions. In general, you will probably encounter some issues along the way, but I managed to find a solution to all of them.
For instance, if you get an error with pycocotools, uninstall it with pip and install it with conda from conda-forge.
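On the point about the toolkit matching your version: a rough way to compare the CUDA release that `nvcc` reports with the one PyTorch was built against (`torch.version.cuda`). This is only a sketch; the function names are hypothetical, and it assumes the usual `release X.Y` line in the `nvcc --version` output.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from `nvcc --version` output, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def same_cuda_release(torch_cuda: str, nvcc_output: str) -> bool:
    """Compare a version string like '11.3' with the release reported by nvcc."""
    release = parse_nvcc_release(nvcc_output)
    if release is None:
        return False
    major, minor = (int(part) for part in torch_cuda.split(".")[:2])
    return release == (major, minor)


if __name__ == "__main__":
    try:
        import torch  # optional: only needed for the live comparison
    except ImportError:
        torch = None
    nvcc = shutil.which("nvcc")
    if torch is None or torch.version.cuda is None or nvcc is None:
        print("need both a CUDA build of PyTorch and nvcc on PATH to compare")
    else:
        out = subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout
        print("match" if same_cuda_release(torch.version.cuda, out) else "mismatch")
```

A mismatch does not always break the build, but it is a cheap thing to rule out first.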
Thank you so much for summarizing this! It's really useful!
We keep this issue open to collect frequently asked questions and their solutions from the users.
Feel free to leave your comment here if you find any frequent issues and have ways to help others to solve them.
Notes
`dataloader.train.total_batch_size` for training, as mentioned in this issue: https://github.com/IDEA-Research/detrex/issues/219

FAQs
1. ImportError: Cannot import 'detrex._C', therefore 'MultiScaleDeformableAttention' is not available.
detrex needs the **CUDA runtime** to build the `MultiScaleDeformableAttention` operator. In most cases, you do not need to set the `CUDA_HOME` environment variable if CUDA is installed correctly; the default path of the CUDA runtime is `/usr/local/cuda`. If your `CUDA_HOME` is `None`, you can solve it as follows:

- If you've already installed the **CUDA runtime** in your environment, specify the environment variable (here we take cuda-11.3 as an example):

```bash
export CUDA_HOME=/path/to/cuda-11.3/
```

- If you cannot find the CUDA runtime in your environment, install it following the [CUDA Toolkit Installation](https://developer.nvidia.com/cuda-toolkit), then specify the environment variable `CUDA_HOME`.
- After setting `CUDA_HOME`, rebuild detrex by running `pip install -e .`

You can also refer to these issues for more details: https://github.com/IDEA-Research/detrex/issues/98, https://github.com/IDEA-Research/detrex/issues/85

2. How to not filter empty annotations during training.
There are three ways to avoid filtering empty annotations during training.

1. Modify the config in [configs/common/data/coco_detr.py](https://github.com/IDEA-Research/detrex/blob/5d866bd115b6e0e6a0eac253761855196615e5c4/configs/common/data/coco_detr.py#L17) as follows:

```python
dataloader.train = L(build_detection_train_loader)(
    dataset=L(get_detection_dataset_dicts)(names="coco_2017_train", filter_empty=False),
    ...,
)
```

2. Modify the config in your project, as in [dino_r50_4scale_24ep.py](https://github.com/IDEA-Research/detrex/blob/5d866bd115b6e0e6a0eac253761855196615e5c4/projects/dino/configs/dino_r50_4scale_24ep.py#L48):

```python
# your config.py
dataloader = get_config("common/data/coco_detr.py").dataloader

# modify dataloader config
# not filter empty annotations during training
dataloader.train.dataset.filter_empty = False
```

3. Override the config on the command line when launching training:

```bash
cd detrex
python tools/train_net.py --config-file projects/dino/configs/path/to/config.py --num-gpus 8 dataloader.train.dataset.filter_empty=False
```

You can also refer to this issue for more details: https://github.com/IDEA-Research/detrex/issues/78#issuecomment-1284054108

3. RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:54980 (errno: 98 - Address already in use).
This means that a process you started earlier did not exit correctly. There are two solutions:

1. Kill the process you started earlier completely.
2. Change the port by setting `--dist-url`:

```bash
python tools/train_net.py \
    --config-file path/to/config.py \
    --num-gpus 8 \
    --dist-url tcp://127.0.0.1:12345
```

4. DINO CPU inference
Please refer to PR #157 for more details.

5. Training coco-like custom dataset
Please refer to this PR #186 for more details.
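
A side note on FAQ 3 (address already in use): instead of guessing a port for `--dist-url`, you can ask the OS for a free one. This is only a sketch with hypothetical helper names. It binds to port 0 so the kernel picks an unused port, and there is a small window in which another process could grab that port before training starts.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def dist_url(host: str = "127.0.0.1") -> str:
    """Build a value suitable for --dist-url, e.g. tcp://127.0.0.1:<port>."""
    return f"tcp://{host}:{find_free_port(host)}"


if __name__ == "__main__":
    # Print the URL so it can be pasted after --dist-url on the command line.
    print(dist_url())
```

The printed value can be passed as `--dist-url` to `tools/train_net.py`.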