Closed fatemehazimi990 closed 1 year ago
It seems it runs well without CUDA_VISIBLE_DEVICES=3,4 :D
@fatemehazimi990 Just a thought on debugging: could you try setting CUDA_VISIBLE_DEVICES=3,4 inside the dist_train.sh, as a sanity check? I hope this will work.
@ziqipang I believe it gets stuck before the training starts, maybe in the data loading process. Please see the attached screenshot.
Setting CUDA_VISIBLE_DEVICES=3,4 inside the dist_train.sh also showed similar behavior. As a workaround, it might be better to specify the GPUs when starting the Docker container ...
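For readers who want to try the Docker workaround mentioned above, here is a minimal sketch. It only builds and prints the command rather than running it. The helper name build_docker_cmd and the image name my-training-image are hypothetical placeholders, not names from this thread; the --gpus option with a quoted device list is standard docker run syntax.

```shell
#!/bin/sh
# Hypothetical helper: print a docker command that exposes only the given
# host GPUs to the container. Nothing is executed here.
build_docker_cmd() {
  gpus="$1"   # comma-separated host GPU indices, e.g. "3,4"
  image="$2"  # placeholder image name (assumption, not from the thread)
  # The inner double quotes keep the comma-separated list as one device spec.
  printf 'docker run --gpus '\''"device=%s"'\'' -it %s\n' "$gpus" "$image"
}

build_docker_cmd "3,4" "my-training-image"
```

Inside a container started this way, only the selected GPUs should be visible, so the training script can presumably be launched without CUDA_VISIBLE_DEVICES, in line with the observation that the run works without it.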
@fatemehazimi990 Yeah, it looks so. I also don't have a good solution to this.
Thanks :)
Hi @ziqipang ,
I have one more question about distributed training :) I could run the code on a single GPU, but when trying on multiple GPUs the code seems to get stuck at some point ... I am using the following run command:
CUDA_VISIBLE_DEVICES=3,4 bash tools/dist_train.sh projects/configs/tracking/petr/f1_q500_800x320.py 2 --work-dir work_dirs/f1_pf_track/
Would you have a suggestion as to what the underlying reason could be, or how to approach it?
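One quick check before digging deeper is whether the number of visible devices matches the number of processes requested. The sketch below assumes the 2 in the command above is the process count passed to dist_train.sh, as in OpenMMLab-style launch scripts. The function names count_visible and check_gpu_count are hypothetical. This only rules out a count mismatch; it does not diagnose the hang itself.

```shell
#!/bin/sh
# Hypothetical helper: count entries in a comma-separated device list like "3,4".
count_visible() {
  # An empty list means no restriction was given; report 0 entries.
  if [ -z "$1" ]; then echo 0; return 0; fi
  old_ifs="$IFS"; IFS=','
  set -- $1          # split the list on commas into positional parameters
  IFS="$old_ifs"
  echo "$#"
}

# Hypothetical helper: compare the requested process count with the device list.
check_gpu_count() {
  requested="$1"
  visible="$(count_visible "$2")"
  if [ "$visible" -ne 0 ] && [ "$visible" -ne "$requested" ]; then
    echo "mismatch: $requested processes requested, $visible devices visible"
    return 1
  fi
  echo "ok: $requested processes requested, $visible devices listed"
}

check_gpu_count 2 "3,4"
```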