For Virtual Environment Configuration

zschanghai commented 1 year ago

Hello, author of DQS3D:

I have switched to PyTorch 1.10.2, torchvision 0.11.3 and CUDA 11.3 as you advised, but the project still does not run successfully. Could the cause be an incorrect version of MMCV or of another dependency?

The Dockerfile I referred to for the configuration is as follows:

FROM pytorch/pytorch:1.8.1-cuda10.2-cudnn7-devel

ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0+PTX"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"

RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC && \
    apt-get update && \
    apt-get install -y ffmpeg libsm6 libxext6 git ninja-build libglib2.0-0 libsm6 libxrender-dev libxext6

# Install MMCV, MMDetection and MMSegmentation
RUN pip install mmcv-full==1.3.8 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10.2/index.html
RUN pip install mmdet==2.14.0
RUN pip install mmsegmentation==0.14.1

# Install MMDetection3D
RUN conda clean --all
RUN git clone https://github.com/samsunglabs/fcaf3d.git /mmdetection3d
WORKDIR /mmdetection3d
ENV FORCE_CUDA="1"
RUN pip install -r requirements/build.txt
RUN pip install --no-cache-dir -e .

# Install Minkowski Engine
RUN apt-get install -y python3-dev libopenblas-dev
RUN pip install ninja==1.10.2.3
RUN pip install \
    -U git+https://github.com/NVIDIA/MinkowskiEngine@v0.5.4 \
    --install-option="--blas=openblas" \
    --install-option="--force_cuda" \
    -v \
    --no-deps

# Install differentiable IoU
RUN git clone https://github.com/lilanxiao/Rotated_IoU /rotated_iou
WORKDIR /rotated_iou
RUN git checkout 3bdca6b20d981dffd773507e97f1b53641e98d0a
RUN cp -r /rotated_iou/cuda_op /mmdetection3d/mmdet3d/ops/rotated_iou
WORKDIR /mmdetection3d/mmdet3d/ops/rotated_iou/cuda_op
RUN python setup.py install
WORKDIR /mmdetection3d

Yours sincerely!

zschanghai commented 1 year ago

We hope that a new installation guide or a new Dockerfile can be provided.

zschanghai commented 1 year ago

The problem is as follows:

CUDA_VISIBLE_DEVICES=2 bash tools/dist_train.sh configs/fcaf3d/fcaf3d_sunrgbd-3d-10class-r0.05-aug.py 1

/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

  warnings.warn(
Traceback (most recent call last):
  File "tools/train.py", line 16, in <module>
    from mmdet3d.apis import train_model
  File "/data1/zsch/project/DQS3D/mmdet3d/apis/__init__.py", line 1, in <module>
    from .inference import (convert_SyncBN, inference_detector,
  File "/data1/zsch/project/DQS3D/mmdet3d/apis/inference.py", line 10, in <module>
    from mmdet3d.core import (Box3DMode, DepthInstance3DBoxes,
  File "/data1/zsch/project/DQS3D/mmdet3d/core/__init__.py", line 1, in <module>
    from .anchor import *  # noqa: F401, F403
  File "/data1/zsch/project/DQS3D/mmdet3d/core/anchor/__init__.py", line 1, in <module>
    from mmdet.core.anchor import build_anchor_generator
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmdet/core/__init__.py", line 2, in <module>
    from .bbox import *  # noqa: F401, F403
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmdet/core/bbox/__init__.py", line 7, in <module>
    from .samplers import (BaseSampler, CombinedSampler,
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmdet/core/bbox/samplers/__init__.py", line 9, in <module>
    from .score_hlr_sampler import ScoreHLRSampler
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmdet/core/bbox/samplers/score_hlr_sampler.py", line 2, in <module>
    from mmcv.ops import nms_match
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmcv/ops/__init__.py", line 1, in <module>
    from .bbox import bbox_overlaps
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmcv/ops/bbox.py", line 3, in <module>
    ext_module = ext_loader.load_ext('_ext', ['bbox_overlaps'])
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmcv/utils/ext_loader.py", line 12, in load_ext
    ext = importlib.import_module('mmcv.' + name)
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: /data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK2at6Tensor6deviceEv
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 944954) of binary: /data1/zsch/software/anaconda3/envs/dqs3d/bin/python3
Traceback (most recent call last):
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tools/train.py FAILED

Failures:

Root Cause (first observed failure):
[0]:
  time      : 2023-05-14_20:58:32
  host      : hmc37
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 944954)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

c7w commented 1 year ago

I've replied to your email. Is it resolved now?

Maybe you can also try this:

After installing pytorch + cudatoolkit, build the mmcv==1.3.8 library from source and install it.

I haven't run into that problem myself, but I think it occurs because your mmcv library was not installed successfully. See:

ImportError: /data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK2at6Tensor6deviceEv
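To read the missing symbol, it helps to demangle it. The name follows the Itanium C++ ABI mangling scheme, and in practice a tool such as `c++filt` would normally be used. The sketch below is only a minimal, hand-written decoder for this one shape of name (nested, optionally `const`, no parameters); `demangle_simple` is a hypothetical helper and not part of PyTorch or mmcv.

```python
def demangle_simple(symbol: str) -> str:
    """Demangle a nested Itanium C++ ABI name that takes no parameters.

    Supports only the shape _ZN[K]<len><name>...Ev; anything else raises
    ValueError. This is an illustrative subset, not a full demangler.
    """
    if not symbol.startswith("_ZN"):
        raise ValueError("not a nested Itanium-mangled name")
    pos = 3

    # An optional 'K' marks a const-qualified member function.
    is_const = symbol[pos:pos + 1] == "K"
    if is_const:
        pos += 1

    # Each name component is encoded as <decimal length><identifier>.
    parts = []
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        name = symbol[end:end + length]
        if len(name) != length:
            raise ValueError("truncated name component")
        parts.append(name)
        pos = end + length

    # 'E' closes the nested name; a lone 'v' means an empty parameter list.
    if not parts or symbol[pos:pos + 1] != "E":
        raise ValueError("unsupported name encoding")
    if symbol[pos + 1:] != "v":
        raise ValueError("only functions without parameters are supported")

    return "::".join(parts) + "()" + (" const" if is_const else "")


demangled = demangle_simple("_ZNK2at6Tensor6deviceEv")
print(demangled)
```

The symbol decodes to a member function of `at::Tensor`, so the compiled `mmcv/_ext` extension expects a function that the PyTorch libraries loaded in this environment do not export. That usually points to an extension built against a different PyTorch version than the one installed.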

After a bit of searching online, I found this link: https://github.com/open-mmlab/mmdetection/issues/4291#issuecomment-946909608

Hope it helps.
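One thing worth checking in the Dockerfile quoted earlier in this thread: the base image tag and the mmcv-full wheel index name different PyTorch and CUDA versions. Whether that is the cause of the failure here is not confirmed in this thread. The sketch below only compares the two strings; the helper names are hypothetical, and the regular expressions assume the tag and URL layouts seen in that Dockerfile.

```python
import re


def torch_cuda_from_image(tag: str):
    """Return (torch_version, cuda_tag) from a tag like
    pytorch/pytorch:1.8.1-cuda10.2-cudnn7-devel (assumed layout)."""
    m = re.search(r":(\d+\.\d+\.\d+)-cuda(\d+)\.(\d+)", tag)
    if m is None:
        raise ValueError(f"unrecognised image tag: {tag}")
    return m.group(1), f"cu{m.group(2)}{m.group(3)}"


def torch_cuda_from_index(url: str):
    """Return (torch_version, cuda_tag) from a wheel index URL containing
    .../cu113/torch1.10.2/... (assumed layout)."""
    m = re.search(r"/(cu\d+)/torch(\d+\.\d+\.\d+)/", url)
    if m is None:
        raise ValueError(f"unrecognised wheel index URL: {url}")
    return m.group(2), m.group(1)


def find_mismatches(image_tag: str, index_url: str):
    """List human-readable differences between the image and the wheel index."""
    image_torch, image_cuda = torch_cuda_from_image(image_tag)
    index_torch, index_cuda = torch_cuda_from_index(index_url)
    problems = []
    if image_torch != index_torch:
        problems.append(f"PyTorch {image_torch} (image) vs {index_torch} (wheel index)")
    if image_cuda != index_cuda:
        problems.append(f"CUDA {image_cuda} (image) vs {index_cuda} (wheel index)")
    return problems


# Values copied from the Dockerfile in this thread.
problems = find_mismatches(
    "pytorch/pytorch:1.8.1-cuda10.2-cudnn7-devel",
    "https://download.openmmlab.com/mmcv/dist/cu113/torch1.10.2/index.html",
)
for problem in problems:
    print(problem)
```

For the values in this thread the check reports both a PyTorch and a CUDA difference. An empty list would mean the image tag and the wheel index agree.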
