IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
https://arxiv.org/abs/2303.05499
Apache License 2.0

Unable to run on GPU #68

Open yujianll opened 1 year ago

yujianll commented 1 year ago

Hi, thanks for releasing the code.

I followed the instructions to set the CUDA_HOME variable and successfully installed groundingdino. However, I still get the following warning and error when I run the demo script.

/gpfs/u/home/DFLM/DFLMshcg/yujian/rl_scheduler/detector/groundingdino/models/GroundingDINO/ms_deform_attn.py:31: UserWarning: Failed to load custom C++ ops. Running on CPU mode Only!
Traceback (most recent call last):
  File "/gpfs/u/home/DFLM/DFLMshcg/yujian/rl_scheduler/detector/inference.py", line 160, in <module>
    boxes_filt, pred_phrases = get_grounding_output(
  File "/gpfs/u/home/DFLM/DFLMshcg/yujian/rl_scheduler/detector/inference.py", line 91, in get_grounding_output
    outputs = model(image[None], captions=[caption])
  File "/gpfs/u/home/DFLM/DFLMshcg/scratch/miniconda3-x86/envs/cleanrl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/u/home/DFLM/DFLMshcg/yujian/rl_scheduler/detector/groundingdino/models/GroundingDINO/groundingdino.py", line 313, in forward
    hs, reference, hs_enc, ref_enc, init_box_proposal = self.transformer(
  File "/gpfs/u/home/DFLM/DFLMshcg/scratch/miniconda3-x86/envs/cleanrl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/u/home/DFLM/DFLMshcg/yujian/rl_scheduler/detector/groundingdino/models/GroundingDINO/transformer.py", line 258, in forward
    memory, memory_text = self.encoder(
  File "/gpfs/u/home/DFLM/DFLMshcg/scratch/miniconda3-x86/envs/cleanrl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/u/home/DFLM/DFLMshcg/yujian/rl_scheduler/detector/groundingdino/models/GroundingDINO/transformer.py", line 576, in forward
    output = checkpoint.checkpoint(
  File "/gpfs/u/home/DFLM/DFLMshcg/scratch/miniconda3-x86/envs/cleanrl/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/gpfs/u/home/DFLM/DFLMshcg/scratch/miniconda3-x86/envs/cleanrl/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/gpfs/u/home/DFLM/DFLMshcg/scratch/miniconda3-x86/envs/cleanrl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/u/home/DFLM/DFLMshcg/yujian/rl_scheduler/detector/groundingdino/models/GroundingDINO/transformer.py", line 785, in forward
    src2 = self.self_attn(
  File "/gpfs/u/home/DFLM/DFLMshcg/scratch/miniconda3-x86/envs/cleanrl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/u/home/DFLM/DFLMshcg/yujian/rl_scheduler/detector/groundingdino/models/GroundingDINO/ms_deform_attn.py", line 338, in forward
    output = MultiScaleDeformableAttnFunction.apply(
  File "/gpfs/u/home/DFLM/DFLMshcg/yujian/rl_scheduler/detector/groundingdino/models/GroundingDINO/ms_deform_attn.py", line 53, in forward
    output = _C.ms_deform_attn_forward(
NameError: name '_C' is not defined
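From the warning, it looks like the compiled extension (`_C`) failed to load at import time, which would explain the `NameError` later. Here is a quick check I ran to confirm my CUDA setup before building (a sketch, assuming `nvcc` sits under `$CUDA_HOME/bin`):

```shell
# Sanity check before building: the _C extension is only compiled when
# setup.py can find the CUDA toolkit, so verify CUDA_HOME first.
if [ -z "${CUDA_HOME}" ]; then
    echo "CUDA_HOME is not set; setup.py will skip the CUDA extension"
elif [ ! -x "${CUDA_HOME}/bin/nvcc" ]; then
    echo "nvcc not found under ${CUDA_HOME}/bin"
else
    "${CUDA_HOME}/bin/nvcc" --version
fi
```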

Here is my environment info:

python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.13.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.9.2009 (Core) (x86_64)
GCC version: (Anaconda gcc) 11.2.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.17

Python version: 3.9.16 (main, Mar  8 2023, 14:00:05)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
GPU 4: Tesla V100-SXM2-32GB

Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.13.0+cu117
[pip3] torchaudio==0.13.0+cu117
[pip3] torchvision==0.14.0+cu117
[conda] numpy                     1.21.6                   pypi_0    pypi
[conda] torch                     1.13.0+cu117             pypi_0    pypi
[conda] torchaudio                0.13.0+cu117             pypi_0    pypi
[conda] torchvision               0.14.0+cu117             pypi_0    pypi

I wonder what could be causing this error. Many thanks in advance!

delima87 commented 1 year ago

I solved it by running the following:

python setup.py build develop --user
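After rebuilding, you can sanity-check that the compiled extension now imports; `groundingdino._C` is the module behind the `NameError` above (a sketch, assuming the package is on your Python path):

```shell
# Succeeds only if setup.py actually compiled the CUDA ops; otherwise
# prints a hint instead of failing outright.
python -c "import groundingdino._C; print('custom ops loaded')" \
    || echo "extension still missing; check CUDA_HOME and rebuild"
```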

yujianll commented 1 year ago

@delima87 Thanks! That fixes the error.

However, I find that on GPU the model detects nothing. The same code works fine on CPU, but on GPU the output logits become very small and no objects are detected.

I wonder if you have encountered the same issue.

asrlhhh commented 1 year ago

Following up on this thread. I found this is specifically a V100-dependent problem, as I cannot replicate the error on other GPU types. Is there a fix that would also let it run on GPU?
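One workaround I am experimenting with (an assumption that an arch mismatch in the compiled extension is the cause, not a confirmed fix): pin the build to the V100's compute capability (7.0) via `TORCH_CUDA_ARCH_LIST`, which PyTorch's extension builder honors, then rebuild from the repo root:

```shell
# V100s are compute capability 7.0; pinning TORCH_CUDA_ARCH_LIST makes
# torch.utils.cpp_extension target that arch explicitly on the next build.
export TORCH_CUDA_ARCH_LIST="7.0"
echo "now rerun: python setup.py build develop --user"
```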

AzulYang commented 9 months ago

Same problem here. Have you solved it yet? @yujianll