TRI-ML / PF-Track

Implementation of PF-Track

question about distributed training #16

Closed. fatemehazimi990 closed this issue 1 year ago.

fatemehazimi990 commented 1 year ago

Hi @ziqipang ,

I have one more question about distributed training :) I can run the code on a single GPU, but on multiple GPUs the code seems to get stuck at some point. I am using the following command:

CUDA_VISIBLE_DEVICES=3,4 bash tools/dist_train.sh projects/configs/tracking/petr/f1_q500_800x320.py 2 --work-dir work_dirs/f1_pf_track/
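For context, tools/dist_train.sh in mmdetection-style codebases is usually a thin wrapper around torch.distributed.launch; below is a rough sketch of its typical contents (the actual script in this repo may differ):

```bash
#!/usr/bin/env bash
# Typical mmdetection-style launcher: one worker process per GPU.
CONFIG=$1
GPUS=$2
PORT=${PORT:-29500}

python -m torch.distributed.launch \
    --nproc_per_node=$GPUS \
    --master_port=$PORT \
    $(dirname "$0")/train.py $CONFIG --launcher pytorch ${@:3}
```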

Would you have a suggestion about what the underlying reason could be, or how to approach it?

fatemehazimi990 commented 1 year ago

It seems to run well without CUDA_VISIBLE_DEVICES=3,4 :D

ziqipang commented 1 year ago

@fatemehazimi990 Just a thought on debugging.

  1. Where does it get stuck? Is it before or after the actual training loop has started?
  2. Could you try adding CUDA_VISIBLE_DEVICES=3,4 inside dist_train.sh as a sanity check (see the sketch below)? I hope this will work.
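For point 2, a minimal sketch of what that change could look like, assuming dist_train.sh is a standard torch.distributed.launch wrapper; the NCCL_DEBUG line is an optional extra that can help locate where multi-GPU initialization hangs:

```bash
# At the top of tools/dist_train.sh (sketch only, not the actual file):
export CUDA_VISIBLE_DEVICES=3,4   # restrict the launched workers to GPUs 3 and 4
export NCCL_DEBUG=INFO            # optional: NCCL init logs help locate a hang
```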
fatemehazimi990 commented 1 year ago

@ziqipang I believe it gets stuck before the training starts, maybe in the data loading process. Please see the attached screenshot.

fatemehazimi990 commented 1 year ago

Setting CUDA_VISIBLE_DEVICES=3,4 inside dist_train.sh also showed similar behavior. As a workaround, it might be better to specify the GPUs when starting the Docker container instead.
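A sketch of that workaround, assuming Docker 19.03+ with the NVIDIA container toolkit (the image name is a placeholder):

```bash
# Start the container with only GPUs 3 and 4 visible; inside it they show up as devices 0 and 1.
docker run --gpus '"device=3,4"' -it --shm-size=16g <pf-track-image> bash

# Inside the container, the plain launch command then uses both visible GPUs:
bash tools/dist_train.sh projects/configs/tracking/petr/f1_q500_800x320.py 2 --work-dir work_dirs/f1_pf_track/
```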

ziqipang commented 1 year ago

@fatemehazimi990 Yeah, it looks like it. I don't have a good solution for this either.

fatemehazimi990 commented 1 year ago

Thanks :)