TengdaHan / CoCLR

[NeurIPS'20] Self-supervised Co-Training for Video Representation Learning. Tengda Han, Weidi Xie, Andrew Zisserman.
Apache License 2.0

Issue about DistributedDataParallel(DDP) #42

Closed by ZihuaEvan 2 years ago

ZihuaEvan commented 3 years ago

Hi, thank you for your work! When I was training the weights in model.pretrain.py, it showed that I have to use DistributedDataParallel (DDP). Does that mean it has to be trained across different servers with multi-card training? My training failed on one machine with 2 GPUs (1080 Ti). Thank you for your attention!

TengdaHan commented 3 years ago

Hi, the command I provided in the pretrain instructions or here should work on a single server with 2 GPUs. What error message did you get?
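
For context, a single-machine, 2-GPU launch typically looks something like the sketch below; torch.distributed.launch spawns one process per GPU and passes the --local_rank argument that main_coclr.py reads. The flags other than --nproc_per_node are placeholders, so check the pretrain instructions for the actual CoCLR arguments:

```bash
# Sketch of a single-machine, 2-GPU DDP launch (dataset/model flags omitted;
# see the pretrain instructions for the actual CoCLR arguments).
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
    --nproc_per_node=2 main_coclr.py
```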

ZihuaEvan commented 3 years ago

Could it be that the GPUs do not have enough memory? Here is the error message: subprocess.CalledProcessError: Command '['/seu_share/apps/anaconda3/bin/python3', '-u', '/seu_share/home/220205287/COCLR/main_coclr.py', '--local_rank=1']' died with <Signals.SIGBUS: 7>.

TengdaHan commented 3 years ago

Hi, from my experience, there are many possible causes of "Signals.SIGBUS: 7", so sorry I cannot help much with this. I am also not sure whether OOM is the reason.

But you can quickly debug with the basic DDP example here: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html, to check whether the problem comes from your local machine.
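
A minimal sanity check along the lines of that tutorial might look like the sketch below. It is independent of the CoCLR code, and the script layout and the 2-GPU world size are assumptions; if this already fails, the issue is with the local DDP setup rather than with this repository.

```python
# Minimal DDP sanity check on a single machine, adapted from the PyTorch DDP
# tutorial linked above. Runs one forward/backward pass per GPU process.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank, world_size):
    # One process per GPU; nccl is the usual backend for multi-GPU training.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Wrap a tiny model and run one forward/backward pass.
    model = DDP(nn.Linear(10, 10).cuda(rank), device_ids=[rank])
    loss = model(torch.randn(4, 10).cuda(rank)).sum()
    loss.backward()  # triggers the gradient all-reduce across the GPUs
    print(f"rank {rank}: DDP forward/backward OK")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # assuming the 2x 1080 Ti setup from this issue
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```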