Failure during Multi-GPU evaluation

abhi1kumar / DEVIANT

[ECCV 2022] Official PyTorch Code of DEVIANT: Depth Equivariant Network for Monocular 3D Object Detection

https://arxiv.org/abs/2207.10758

MIT License


Failure during Multi-GPU evaluation #27

Closed. danielvais closed this issue 9 months ago.

danielvais commented 9 months ago

Hi, I'm encountering an error in the first eval epoch. The error I get is: [Screenshot from 2024-01-17 16-32-04]. I am running the GUPNet model training:

CUDA_VISIBLE_DEVICES=0,1 python -u tools/train_val.py --config=experiments/run_221.yaml

I successfully trained the model on a subset of only 300 images. The error appears when I train on the full dataset. Any suggestions?

abhi1kumar commented 9 months ago

Hi @danielvais, thank you for your interest in DEVIANT again. Here are a couple of things I would try:

> The error appears when I train on the full dataset.

Try running the evaluation in a single-GPU setting:

CUDA_VISIBLE_DEVICES=0 python -u tools/train_val.py --config=experiments/run_221.yaml --resume_model=... -e

Also check whether the val batch size is too large for the available GPU memory. I see that you use a batch size of 6. You could try changing this line to set the batch size to 2.

danielvais commented 9 months ago

Hi @abhi1kumar As you advised in my previous issue, I can't train the full dataset on a single GPU due to lack of memory.

Reducing the batch size didn't help, but I saw that the validation fails on the last batch, which was of size 1 instead of 2. When I removed the last image from the validation dataset, so that the dataset contains an even number of images, the training was successful. I didn't dig into why an odd number of images in the validation set raises an error, but this solution is good enough for me for now. Thanks :)
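For readers hitting the same symptom, the arithmetic behind this workaround can be sketched in plain Python. This is a minimal illustration, not code from the DEVIANT repository: `batch_sizes`, `trim_to_multiple`, and the sample IDs are hypothetical names introduced here, and the thread does not confirm why a final batch of size 1 fails in the multi-GPU run.

```python
def batch_sizes(num_samples, batch_size, drop_last=False):
    """Return the size of every batch a sequential loader would yield."""
    full, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    # A non-zero remainder becomes one smaller final batch unless dropped.
    if remainder and not drop_last:
        sizes.append(remainder)
    return sizes


def trim_to_multiple(sample_ids, batch_size):
    """Drop trailing samples so the count is a multiple of batch_size."""
    usable = len(sample_ids) - len(sample_ids) % batch_size
    return sample_ids[:usable]


# Hypothetical validation split with an odd number of images.
val_ids = [f"{i:06d}" for i in range(7)]

print(batch_sizes(len(val_ids), 2))   # [2, 2, 2, 1] -> last batch has size 1

trimmed = trim_to_multiple(val_ids, 2)
print(batch_sizes(len(trimmed), 2))   # [2, 2, 2] -> every batch is full
```

PyTorch's `torch.utils.data.DataLoader` has a `drop_last=True` argument that has the same effect as trimming. Either way, the dropped images are excluded from the validation metrics, so the reported numbers are computed on a slightly smaller set.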

abhi1kumar commented 9 months ago

Glad that you were able to find a good enough solution.