TuSimple / centerformer

Implementation for CenterFormer: Center-based Transformer for 3D Object Detection (ECCV 2022)
MIT License

`ValueError: /workplace/spconv/src/spconv/spconv_ops.cc 87 unknown device type` error #18

Open junhocho opened 1 year ago

junhocho commented 1 year ago

Hi, thanks for sharing the code. I am opening this issue because I have trouble running it. I run training without DDP: `python ./tools/train.py ./configs/nusc/nuscenes_centerformer_separate_detection_head.py`. `sh setup.sh` works fine, but I get the following error when running train.py.

Traceback (most recent call last):
  File "./tools/train.py", line 137, in <module>
    main()
  File "./tools/train.py", line 132, in main
    logger=logger,
  File "/workspace/det3d/torchie/apis/train.py", line 335, in train_detector
    trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
  File "/workspace/det3d/torchie/trainer/trainer.py", line 546, in run
    epoch_runner(data_loaders[i], self.epoch, **kwargs)
  File "/workspace/det3d/torchie/trainer/trainer.py", line 413, in train
    self.model, data_batch, train_mode=True, **kwargs
  File "/workspace/det3d/torchie/trainer/trainer.py", line 371, in batch_processor_inline
    losses = model(example, return_loss=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/det3d/models/detectors/voxelnet_dynamic.py", line 52, in forward
    x, _ = self.extract_feat(example)
  File "/workspace/det3d/models/detectors/voxelnet_dynamic.py", line 38, in extract_feat
    data['voxels'], data["coors"], data["batch_size"], data["input_shape"]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/det3d/models/backbones/scn.py", line 156, in forward
    x = self.conv_input(ret)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/spconv/modules.py", line 134, in forward
    input = module(input)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/spconv/conv.py", line 181, in forward
    use_hash=self.use_hash)
  File "/opt/conda/lib/python3.7/site-packages/spconv/ops.py", line 95, in get_indice_pairs
    int(use_hash))
ValueError: /workplace/spconv/src/spconv/spconv_ops.cc 87
unknown device type

I have tried hard to run your code on the nuScenes dataset. We also have an 8-GPU A100 setup, as you do. One difference is that I use a Docker image. Here is the Dockerfile.

FROM pytorch/pytorch:1.9.1-cuda11.1-cudnn8-devel
MAINTAINER Junho Cho <junh0.cho@samsung.com>

RUN rm /etc/apt/sources.list.d/cuda.list
RUN rm /etc/apt/sources.list.d/nvidia-ml.list
RUN apt-get update

RUN apt-get install git -y
RUN git clone https://github.com/TuSimple/centerformer.git

RUN cd centerformer && pip install -r requirements.txt

RUN apt-get install wget libboost-all-dev libgl1 -y

# Install cmake v3.13.2
RUN apt-get purge -y cmake && \
    mkdir /root/temp && \
    cd /root/temp && \
    wget https://github.com/Kitware/CMake/releases/download/v3.13.2/cmake-3.13.2.tar.gz && \
    tar -xzvf cmake-3.13.2.tar.gz && \
    cd cmake-3.13.2 && \
    bash ./bootstrap && \
    make && \
    make install && \
    cmake --version && \
    rm -rf /root/temp

RUN git clone --branch v1.2.1  https://github.com/traveller59/spconv.git --recursive
RUN cd spconv && python setup.py bdist_wheel && cd ./dist && pip install *whl

WORKDIR /workspace
ENV PYTHONPATH="${PYTHONPATH}:/workspace"

With this Dockerfile, we build spconv v1.2.1 in a CUDA 11.1 and PyTorch 1.9.1 environment, which matches the PyTorch and CUDA versions of your setup exactly. The only difference is the Python version, which I do not think matters much (I also tried Python 3.9.12, but no luck).
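When comparing environments like this, it can help to print the versions that are actually active inside the container rather than relying on the image tag. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `environment_report` are hypothetical, and the package names `torch` and `spconv` are assumed to be the names the wheels were installed under. The traceback paths show Python 3.7, so the sketch falls back to the `importlib_metadata` backport when the standard-library module is missing.

```python
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # backport needed on Python 3.7


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("torch", "spconv")):
    """Collect version facts worth pasting into a bug report."""
    report = {name: installed_version(name) for name in packages}
    try:
        import torch
    except ImportError:
        # torch is not importable here, so the CUDA facts are unknown.
        report["torch_cuda_build"] = None
        report["cuda_available"] = None
    else:
        # CUDA version torch was compiled against (None for CPU-only builds).
        report["torch_cuda_build"] = torch.version.cuda
        # Whether a usable GPU is visible to this process right now.
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running it once inside the container that executes train.py gives a concrete list of versions to compare against the authors' setup.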

sh setup.sh always works nicely.

It seems the following error

ValueError: /root/spconv/src/spconv/spconv_ops.cc 87
unknown device type

might be solved by using a different spconv version (according to https://github.com/traveller59/spconv/issues/58), but I have not tried that because you specified that only spconv 1.2.1 works.

Do you have any idea how to resolve this issue?

spconv 1.2.1 probably does not work in Docker according to this, but I confirmed that spconv 2.2 works in Docker.

If so, is there any chance this repo could support spconv 2.2? (I already tried spconv 2.2 with centerformer and failed many times.)
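One visible difference between the two major versions is the import path: spconv 2.x exposes its PyTorch layers under `spconv.pytorch`, while spconv 1.x exposes them at the top level of `spconv`. A minimal sketch of a version-tolerant import follows. The helper name `load_spconv` is hypothetical and not part of this repo, and the import path is only one of the differences, so this alone should not be expected to make centerformer run on spconv 2.2.

```python
import importlib


def load_spconv():
    """Return (module, major_version) for whichever spconv layout imports.

    Tries the spconv 2.x location first, then falls back to the 1.x one.
    Returns (None, None) when spconv is not installed at all.
    """
    for module_name, major in (("spconv.pytorch", 2), ("spconv", 1)):
        try:
            return importlib.import_module(module_name), major
        except ImportError:
            # This layout is not available; try the next candidate.
            continue
    return None, None


if __name__ == "__main__":
    module, major = load_spconv()
    if module is None:
        print("spconv is not installed")
    else:
        print(f"using spconv {major}.x layout from {module.__name__}")
```

Code that branches on the returned major version can at least report which layout it found before failing deeper in the backbone.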

Machine-NO-Learning commented 10 months ago

Actually, conda works fine for this. Presumably you use Docker for a reason, such as deploying projects, or because the CUDA version on your server is shared with colleagues and cannot be changed.

This project is based on CenterPoint, so you can set up the environment following that project. For spconv v1, I suggest that you:

  1. git clone -b v1.2.1 https://github.com/traveller59/spconv.git --recursive
  2. cd spconv && python setup.py bdist_wheel (you may hit errors such as no CUDA compiler found or no CMakeLists.txt in pybind11; in that case run `export CUDA_HOME=/usr/local/cuda`, `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64`, and `export PATH=$PATH:$CUDA_HOME/bin`, then run `git clone --recurse-submodules https://github.com/pybind/pybind11.git` in spconv/thirdparty/)
  3. cd ./dist
  4. pip install xxx.whl
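
The environment variables and commands from steps 1 and 2 can be sketched as a small script. This is an illustration under assumptions, not a tested build recipe: the helper names `cuda_build_env`, `build_steps`, and `run_build` are hypothetical, and `/usr/local/cuda` is simply the path used in the exports above. Nothing is executed unless `run_build()` is called explicitly.

```python
import os
import subprocess


def cuda_build_env(base_env, cuda_home="/usr/local/cuda"):
    """Return a copy of `base_env` with the CUDA variables from step 2 applied."""
    env = dict(base_env)
    env["CUDA_HOME"] = cuda_home
    lib_dirs = [f"{cuda_home}/lib64", f"{cuda_home}/extras/CUPTI/lib64"]
    # Append to the existing values instead of overwriting them.
    existing_libs = [p for p in [env.get("LD_LIBRARY_PATH", "")] if p]
    env["LD_LIBRARY_PATH"] = os.pathsep.join(existing_libs + lib_dirs)
    existing_path = [p for p in [env.get("PATH", "")] if p]
    env["PATH"] = os.pathsep.join(existing_path + [f"{cuda_home}/bin"])
    return env


def build_steps():
    """List the (command, working directory) pairs for steps 1 and 2."""
    return [
        (["git", "clone", "-b", "v1.2.1", "--recursive",
          "https://github.com/traveller59/spconv.git"], "."),
        (["python", "setup.py", "bdist_wheel"], "spconv"),
    ]


def run_build():
    """Run the steps in order, stopping at the first failing command."""
    env = cuda_build_env(os.environ)
    for command, cwd in build_steps():
        subprocess.run(command, cwd=cwd, env=env, check=True)
```

After `run_build()` succeeds, the wheel in spconv/dist still has to be installed with pip, as in steps 3 and 4.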

Then you may also need to build multi_scale_deformable and iou_nms_cuda: go to det3d/ops/ and set them up.

Hope this helps.