Jeff-sjtu / res-loglikelihood-regression

Code for "Human Pose Regression with Residual Log-likelihood Estimation", ICCV 2021 Oral

Timed out initializing process group in store based barrier on rank: 2 #48

Open yotaroshimose opened 2 years ago

yotaroshimose commented 2 years ago

Hi, thank you for sharing your great work!

I tried to run your training scripts, but my machine only has 4 GPUs, so I changed WORLD SIZE from 8 to 4 in the original yaml file.

Training then fails with "Timed out initializing process group in store based barrier on rank: 2", or sometimes it crashes suddenly in the middle of an epoch and my Docker container shuts down (possibly a memory leak?).

Any advice on how to run your training code successfully?

Thank you for your cooperation.

WjzZwd commented 2 years ago

@yotaroshimose Bro, my English is not very good, so I may have misunderstood your question. Here is my advice: in the file named validate.sh or train.sh, add the statement CUDA_VISIBLE_DEVICES=0,1 before the statement python ./scripts/validate.py \ (I have 4 GPUs but only use the first and the second). Hope this helps.
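
To make the device restriction above easier to check before launching, here is a minimal sketch in Python. It only reads the CUDA_VISIBLE_DEVICES environment variable and compares the number of listed GPUs with the world size from the config. The helper names count_visible_gpus and check_world_size are hypothetical and not part of this repository, and the check assumes one training process per GPU, which may not match how the scripts actually launch workers.

```python
import os


def count_visible_gpus(env=None):
    """Return how many GPU ids CUDA_VISIBLE_DEVICES lists, or None if it is unset.

    Hypothetical helper for illustration; not part of the repository.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Variable not set: no restriction is applied, so we cannot count here.
        return None
    # Count the non-empty, comma-separated entries, e.g. "0,1" -> 2.
    return len([d for d in raw.split(",") if d.strip()])


def check_world_size(world_size, env=None):
    """Raise ValueError if world_size exceeds the number of listed GPUs.

    Assumes one process per GPU (an assumption, not confirmed by the repo).
    """
    visible = count_visible_gpus(env)
    if visible is not None and world_size > visible:
        raise ValueError(
            f"world size {world_size} is larger than the {visible} GPU(s) "
            "listed in CUDA_VISIBLE_DEVICES"
        )
    return visible


if __name__ == "__main__":
    # Example: two GPUs exposed, world size 2 passes the check.
    check_world_size(2, {"CUDA_VISIBLE_DEVICES": "0,1"})
```

With CUDA_VISIBLE_DEVICES=0,1 set as suggested, a world size of 2 passes this check, while a world size of 4 raises an error before any process group is created.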

yotaroshimose commented 2 years ago

Thank you for your kind reply. I will try restricting the devices the way you suggest. Thank you.