MCG-NJU / MeMOTR

[ICCV 2023] MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking
https://arxiv.org/abs/2307.15700
MIT License

undefined symbol: _ZNK2at6Tensor7optionsEv #13

Closed mtmyyy closed 6 months ago

mtmyyy commented 6 months ago

Hello, when I use an environment with torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 and CUDA 11.3, I get the following error:

ImportError: /cver/tcying/lib/python3.8/site-packages/MultiScaleDeformableAttention-1.0-py3.8-linux-x86_64.egg/MultiScaleDeformableAttention.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK2at6Tensor7optionsEv

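This looks like a problem with compiling the DETR operators, because I get the same error when running test.py. Switching to the latest PyTorch version does not fix it either.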
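For reference, the missing symbol `_ZNK2at6Tensor7optionsEv` demangles to `at::Tensor::options() const`, which commonly indicates that the compiled extension and the installed PyTorch build do not match. The following is a minimal sketch, not part of MeMOTR, of one way to check that the CUDA tag of a PyTorch wheel agrees with the local toolkit version. The helper names `cuda_tag` and `toolkit_matches` are hypothetical, and the sketch assumes version strings of the form `1.12.1+cu113`.

```python
import re
from typing import Optional


def cuda_tag(torch_version: str) -> Optional[str]:
    """Return the CUDA version encoded in a wheel tag such as '1.12.1+cu113'."""
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None  # CPU-only build or a version string without a CUDA tag
    digits = match.group(1)
    # The last digit is the minor version: 'cu113' -> '11.3', 'cu102' -> '10.2'.
    return f"{digits[:-1]}.{digits[-1]}"


def toolkit_matches(torch_version: str, toolkit_version: str) -> bool:
    """Compare major.minor of the local toolkit with the wheel's CUDA tag."""
    tag = cuda_tag(torch_version)
    major_minor = ".".join(toolkit_version.split(".")[:2])
    return tag is not None and tag == major_minor
```

In a real environment the inputs would come from `torch.__version__` and the release reported by `nvcc --version`. If PyTorch is reinstalled or changed, the extension most likely needs to be rebuilt against the new install as well.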

However, when I switch to torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 with CUDA 11.1, the DETR operators compile successfully, but a different error appears at runtime:

Traceback (most recent call last):
  File "main.py", line 120, in <module>
    main(config=merged_config)
  File "main.py", line 99, in main
    from train_engine import train
  File "/cver/tcying/ytc/MeMOTR/train_engine.py", line 12, in <module>
    from models import build_model
  File "/cver/tcying/ytc/MeMOTR/models/__init__.py", line 6, in <module>
    from .memotr import build as build_memotr
  File "/cver/tcying/ytc/MeMOTR/models/memotr.py", line 13, in <module>
    from .backbone import BackboneWithPE
  File "/cver/tcying/ytc/MeMOTR/models/backbone.py", line 8, in <module>
    from torchvision.models import resnet50, ResNet50_Weights
ImportError: cannot import name 'ResNet50_Weights' from 'torchvision.models' (/cver/tcying/lib/python3.8/site-packages/torchvision/models/__init__.py)

This looks similar to issue #6, but I don't know how to fix it.
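The `ResNet50_Weights` enum was added in torchvision 0.13, so the import fails on 0.10.1. As a hedged sketch (not the repository's code), one way to support both APIs is to choose the keyword argument from the installed torchvision version. The helper names `parse_major_minor` and `resnet50_weight_kwargs` are hypothetical.

```python
from typing import Dict, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Turn a version string such as '0.13.1+cu113' into (0, 13)."""
    core = version.split("+")[0]  # drop the local tag, e.g. '+cu113'
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def resnet50_weight_kwargs(torchvision_version: str) -> Dict[str, object]:
    """Choose the pretrained-weights keyword for torchvision.models.resnet50."""
    if parse_major_minor(torchvision_version) >= (0, 13):
        # Newer API: weights can be given as an enum member or its name.
        return {"weights": "DEFAULT"}
    # Older API: a boolean flag selects the ImageNet checkpoint.
    return {"pretrained": True}
```

The result would be passed as `resnet50(norm_layer=FrozenBatchNorm2d, **resnet50_weight_kwargs(torchvision.__version__))`, with the `ResNet50_Weights` import in backbone.py removed or guarded. Note that `pretrained=True` loads the `IMAGENET1K_V1` checkpoint while `DEFAULT` currently resolves to `IMAGENET1K_V2` for ResNet-50, so the two branches may not initialize the backbone identically.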

mtmyyy commented 6 months ago

I changed line 71 of MeMOTR/models/backbone.py from backbone = resnet50(weights=ResNet50_Weights.DEFAULT, norm_layer=FrozenBatchNorm2d) to backbone = models.resnet50(pretrained=True, norm_layer=FrozenBatchNorm2d), and then hit a new error:

Traceback (most recent call last):
  File "main.py", line 120, in <module>
    main(config=merged_config)
  File "main.py", line 99, in main
    from train_engine import train
  File "/cver/tcying/ytc/MeMOTR/train_engine.py", line 12, in <module>
    from models import build_model
  File "/cver/tcying/ytc/MeMOTR/models/__init__.py", line 6, in <module>
    from .memotr import build as build_memotr
  File "/cver/tcying/ytc/MeMOTR/models/memotr.py", line 15, in <module>
    from .query_updater import build as build_query_updater
  File "/cver/tcying/ytc/MeMOTR/models/query_updater.py", line 13, in <module>
    from structures.track_instances import TrackInstances
  File "/cver/tcying/ytc/MeMOTR/structures/track_instances.py", line 7, in <module>
    class TrackInstances:
  File "/cver/tcying/ytc/MeMOTR/structures/track_instances.py", line 52, in TrackInstances
    def __getitem__(self, item: int | slice | torch.BoolTensor) -> "TrackInstances":
TypeError: unsupported operand type(s) for |: 'type' and 'type'

HELLORPG commented 6 months ago

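For the first problem, you may need to check the CUDA version installed on your machine, because compiling the Deformable-DETR operators requires the local CUDA version to match the one your environment was built for. The second problem arises because my code uses some features introduced in Python 3.10 (mainly in function annotations), such as the def __getitem__(self, item: int | slice | torch.BoolTensor) -> "TrackInstances" you mentioned. If upgrading Python is inconvenient, you can simply change that line to: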

def __getitem__(self, item) -> "TrackInstances":

and it will run. This does not change the actual behavior in any way.
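If keeping the type information is preferred over dropping it, `typing.Union` is the pre-3.10 spelling of the `X | Y` annotation syntax. The sketch below illustrates this on a hypothetical stand-in class, `TrackBuffer`, which stores plain Python lists instead of tensors; it is not the repository's `TrackInstances`.

```python
from typing import List, Union


class TrackBuffer:
    """Hypothetical stand-in for TrackInstances; it stores plain Python lists."""

    def __init__(self, ids: List[int]) -> None:
        self.ids = ids

    # Union[...] is the pre-3.10 spelling of `int | slice | List[bool]`.
    def __getitem__(self, item: Union[int, slice, List[bool]]) -> "TrackBuffer":
        if isinstance(item, int):
            return TrackBuffer([self.ids[item]])
        if isinstance(item, slice):
            return TrackBuffer(self.ids[item])
        # Boolean mask: keep entries whose flag is True.
        return TrackBuffer([i for i, keep in zip(self.ids, item) if keep])
```

Another option on Python 3.7 and later is to put `from __future__ import annotations` at the top of the module, which defers evaluation of annotations so that `int | slice` in a signature no longer raises at import time.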

mtmyyy commented 6 months ago

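Thank you very much, the code runs now. How do I train on multiple GPUs, though? I have eight A5000 cards, but whether I modify AVAILABLE_GPUS, use distributed training, or prepend something like CUDA_VISIBLE_DEVICES=0,1,2,3, it only runs on the first card.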

HELLORPG commented 6 months ago

What command are you running? In principle, the following command:

python -m torch.distributed.run --nproc_per_node=8 main.py --use-distributed --config-path ./configs/train_dancetrack.yaml --outputs-dir ./outputs/memotr_dancetrack/ --batch-size 1 --data-root <your data dir path>

should run distributed training. Make sure the command is launched through torch.distributed.run.
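As background on why the launcher matters: `torch.distributed.run` starts one process per GPU and sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables for each of them, whereas a plain `python main.py` starts a single process without them. The sketch below shows how a script could read those variables; the helper name `read_launch_env` is hypothetical, and how MeMOTR itself consumes them is not shown in this thread.

```python
import os
from typing import Mapping, Tuple


def read_launch_env(env: Mapping[str, str] = os.environ) -> Tuple[int, int, int]:
    """Return (rank, local_rank, world_size); defaults describe a single-process run."""
    rank = int(env.get("RANK", 0))              # global index of this process
    local_rank = int(env.get("LOCAL_RANK", 0))  # index of this process on its node
    world_size = int(env.get("WORLD_SIZE", 1))  # total number of processes
    return rank, local_rank, world_size
```

Printing the result at startup is a quick diagnostic: if it reports `(0, 0, 1)` on an eight-GPU machine, the script was most likely not started through the launcher.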

mtmyyy commented 6 months ago

Solved, thank you very much.