dragonfly606 / MonoCD

[CVPR 2024] MonoCD: Monocular 3D Object Detection with Complementary Depths
MIT License

Problems when training with multiple GPUs #1

Closed: WYFDUT closed this issue 4 months ago

WYFDUT commented 5 months ago

MonoCD is really nice work, but I am running into a problem during training. Whenever I try to train with multiple GPUs, I keep getting:

    RuntimeError: cannot reshape tensor of 0 elements into shape [0, 3, -1] because the unspecified dimension size -1 can be any value and is ambiguous

Strangely, when I train with a single GPU I do not get this error.

I use the following command for training:

    CUDA_VISIBLE_DEVICES=0,1 python tools/plain_train_net.py --num_gpus 2 --batch_size 8 --config runs/monocd.yaml --output output/exp1
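For reference, this is the generic PyTorch error for reshaping an empty tensor with an inferred dimension; a minimal standalone sketch (not from the MonoCD code) that raises the same message:

    import torch

    # A batch slice with no ground-truth objects yields a tensor with 0 elements;
    # reshaping it with an inferred -1 dimension is ambiguous and raises the error above.
    empty_targets = torch.zeros(0, 9)
    empty_targets.reshape(0, 3, -1)  # RuntimeError: cannot reshape tensor of 0 elements ...

This suggests the reshape is being fed an empty set of targets on one of the GPUs.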

YuJiXYZ commented 5 months ago

Hello, I am also trying to use this project, but the install commands provided by the author do not work on my machine. Could you share the install commands you used? Thank you!

dragonfly606 commented 5 months ago

@WYFDUT thank you for your attention. I haven't tested the training code on multiple GPUs yet. If possible, please run it on a single GPU for now. I will add multi-GPU training code in the future.

WYFDUT commented 5 months ago

> @WYFDUT thank you for your attention. I haven't tested the training code on multiple GPUs yet. If possible, please run it on a single GPU for now. I will add multi-GPU training code in the future.

Thank you very much. Hopefully we will be able to train your code on multiple GPUs later!

WYFDUT commented 5 months ago

> Hello, I am also trying to use this project, but the install commands provided by the author do not work on my machine. Could you share the install commands you used? Thank you!

Perhaps you could describe the problem you encountered in more detail.

YuJiXYZ commented 5 months ago

OK, I have tried several versions of PyTorch, but when I run "pip install -r requirements.txt" I always hit the error shown in the attached screenshot. Could you please help with this? Thank you!

WYFDUT commented 5 months ago

> OK, I have tried several versions of PyTorch, but when I run "pip install -r requirements.txt" I always hit the error shown in the attached screenshot. Could you please help with this? Thank you!

It seems like your CUDA version does not match your PyTorch version. You may need to install the cudatoolkit version that corresponds to your PyTorch build. My GPU is a 2080 Ti, and my environment is PyTorch 1.11.0 with CUDA 11.3 and cuDNN 8.2.0.
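A quick way to check which CUDA and cuDNN builds the installed PyTorch was compiled against (so the cudatoolkit can be matched accordingly):

    import torch

    print(torch.__version__)               # e.g. 1.11.0
    print(torch.version.cuda)              # CUDA version PyTorch was built with, e.g. 11.3
    print(torch.backends.cudnn.version())  # cuDNN version, e.g. 8200 for 8.2.0
    print(torch.cuda.is_available())       # True if the installed driver/runtime work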

YuJiXYZ commented 5 months ago

> OK, I have tried several versions of PyTorch, but when I run "pip install -r requirements.txt" I always hit the error shown in the attached screenshot. Could you please help with this? Thank you!

> It seems like your CUDA version does not match your PyTorch version. You may need to install the cudatoolkit version that corresponds to your PyTorch build. My GPU is a 2080 Ti, and my environment is PyTorch 1.11.0 with CUDA 11.3 and cuDNN 8.2.0.

OK, I have tried this version, but I still hit the same problem when I run "sh make.sh". I think the issue may be in /MonoCD/model/backbone/DCNv2/setup.py, but I do not know how to change it. Thank you!

WYFDUT commented 5 months ago

> OK, I have tried several versions of PyTorch, but when I run "pip install -r requirements.txt" I always hit the error shown in the attached screenshot. Could you please help with this? Thank you!

> It seems like your CUDA version does not match your PyTorch version. You may need to install the cudatoolkit version that corresponds to your PyTorch build. My GPU is a 2080 Ti, and my environment is PyTorch 1.11.0 with CUDA 11.3 and cuDNN 8.2.0.

> OK, I have tried this version, but I still hit the same problem when I run "sh make.sh". I think the issue may be in /MonoCD/model/backbone/DCNv2/setup.py, but I do not know how to change it. Thank you!

You can try https://github.com/lucasjinreal/DCNv2_latest for the latest version of the deformable convolution (DCNv2) build.
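The usual way to build it in place of the bundled DCNv2 is roughly the following (a sketch only; check that repository's README, since its build script may differ):

    git clone https://github.com/lucasjinreal/DCNv2_latest.git
    cd DCNv2_latest
    ./make.sh    # or: python setup.py build develop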

dragonfly606 commented 5 months ago

Sorry @WYFDUT, I have been a little busy lately; thank you for waiting. According to the reminder in #3, this issue may be caused not by multi-GPU training itself but rather by the small batch size. I have updated the training code to avoid this situation.
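For anyone hitting this before pulling the update, the kind of guard that avoids the crash looks roughly like the sketch below (an illustration only, not necessarily the exact change in the repo): skip the ambiguous reshape when a per-GPU batch slice contains no ground-truth objects, or increase the per-GPU batch size.

    import torch

    def reshape_targets(targets):
        # Hypothetical helper: if this GPU's batch slice has no objects,
        # return an explicitly shaped empty tensor instead of letting
        # reshape(..., 3, -1) try to infer an ambiguous dimension.
        if targets.numel() == 0:
            return targets.new_zeros((0, 3, 0))
        return targets.reshape(targets.shape[0], 3, -1)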