exitudio / GaitMixer

Official repository for "GaitMixer: Skeleton-based Gait Representation Learning via Wide-spectrum Multi-axial Mixer"

Program interrupts during multi-GPU training #3

Open · hxi667 opened 1 year ago

hxi667 commented 1 year ago

Hi, this is great work! But I need some help. When I run train.py with multiple GPUs (for example, with the "--gpus" parameter set to "0,1,2,3,4,5,6,7"), my program is interrupted without any error being returned. I found that the interruption occurs at the "loss.backward()" line of code. Can you give me some advice? Thank you very much!

exitudio commented 1 year ago

It may be something in the GPU environment. Have you tried with only 1 GPU, and then with 2 GPUs? (export CUDA_VISIBLE_DEVICES=0)
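For anyone reading along, a minimal sketch of the single-GPU test suggested above (not code from the repo); it only works if the variable is set before PyTorch initializes CUDA:

```python
# Minimal sketch: expose only one GPU to the process.
# CUDA_VISIBLE_DEVICES takes effect only if set before CUDA is initialized,
# i.e. before the first call that touches the GPU.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only GPU 0

import torch
print(torch.cuda.device_count())           # expected to print 1
```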

hxi667 commented 1 year ago

Yes, there is no problem when I use just one GPU. I've also set os.environ["CUDA_VISIBLE_DEVICES"] = "0,1", but it still doesn't work.

exitudio commented 1 year ago

Are you using a cluster or multiprocessing? The code uses DataParallel, so it doesn't support multiprocessing.
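For context, single-process multi-GPU training with DataParallel typically looks roughly like the sketch below (standard PyTorch usage with placeholder shapes, not the exact code from train.py):

```python
import torch
import torch.nn as nn

# Sketch of single-process multi-GPU training with DataParallel: the input
# batch is split across the listed GPUs in the forward pass and the outputs
# are gathered back on the default device before the loss is computed.
model = nn.DataParallel(nn.Linear(128, 64), device_ids=[0, 1]).cuda()

x = torch.randn(32, 128).cuda()   # full batch on the default GPU
out = model(x)                    # scatter -> per-GPU forward -> gather
loss = out.mean()                 # placeholder loss
loss.backward()                   # the step where the reported hang occurs
```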

hxi667 commented 1 year ago

Yes, I know this code uses DataParallel, and I don't use multiprocessing. As a comparison, I can use 8 GPUs with GaitGraph.

exitudio commented 1 year ago

One difference from GaitGraph is that we use the triplet loss from pytorch_metric_learning. But it shouldn't be a problem; it also works on my 4-GPU server.
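For reference, a minimal sketch of how pytorch_metric_learning's triplet loss is usually called on a batch of embeddings and labels (placeholder shapes and default settings, not necessarily the exact call in this repo):

```python
import torch
from pytorch_metric_learning import losses

# Sketch of pytorch_metric_learning's triplet loss: triplets are mined from
# the (embedding, label) pairs within the batch.
loss_func = losses.TripletMarginLoss()

embeddings = torch.randn(32, 128, requires_grad=True)  # placeholder model output
labels = torch.randint(0, 8, (32,))                    # placeholder subject IDs
loss = loss_func(embeddings, labels)
loss.backward()
```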

exitudio commented 1 year ago

You can try --loss_func supcon to see whether or not the triplet loss is causing this problem.

hxi667 commented 1 year ago

I changed the conda environment to the one used by GaitGraph and the problem was solved! I guess a certain package version was causing the problem. Thank you again for your kind answers!