train with DDP - Githubissues

datvuthanh / HybridNets

HybridNets: End-to-End Perception Network

MIT License

train with DDP #90

Closed happyday-lkj closed 1 year ago

happyday-lkj commented 1 year ago

Hi, I tried to train the model with DDP using the script python train_ddp.py --num_gpus 2, but I hit this error:

File "HybridNets/hybridnets/loss.py", line 500, in soft_tversky_score
assert output.size() == target.size()
AssertionError

I then modified the class ModelWithLoss in train_ddp.py to use "multilabel" instead of model.seg_mode, but now the program seems to hang.

happyday-lkj commented 1 year ago

When --num_gpus is set to 1, the program works well, but when --num_gpus is set greater than 1, the program seems to get stuck.

happyday-lkj commented 1 year ago

solved

wanyunfeiAlex commented 1 year ago

@happyday-lkj hey, I've encountered the same problem. How did you fix that?

happyday-lkj commented 1 year ago

First, modify the class ModelWithLoss in train_ddp.py to use "multilabel" instead of model.seg_mode. Then refactor train_ddp.py following this repo: https://github.com/rentainhe/pytorch-distributed-training/tree/master

BTW, val_ddp.py also has some problems.
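
For readers who have not seen the pattern before, here is a rough sketch of the kind of DDP training loop that torchrun-style launchers expect. It is not the HybridNets code and not code from the linked repo: the function name train, the toy linear model, and the random dataset are all made up for illustration. The thread does not say what caused the hang, so treat this only as a reference layout to compare a refactored train_ddp.py against.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train(epochs: int = 2) -> float:
    """Hypothetical minimal DDP loop; returns the last batch loss."""
    # torchrun sets these variables. The defaults let the sketch also run
    # as a single plain process (world size 1) for a quick smoke test.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    use_cuda = torch.cuda.is_available()
    if use_cuda:
        # Pin each process to its own GPU before creating the group.
        torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank) if use_cuda else torch.device("cpu")

    # Every process must reach this call, otherwise the others wait forever.
    dist.init_process_group(
        backend="nccl" if use_cuda else "gloo",
        rank=rank,
        world_size=world_size,
    )

    # Toy data and model (placeholders for the real dataset and network).
    torch.manual_seed(0)
    dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=True
    )
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)

    model = nn.Linear(8, 1).to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    criterion = nn.MSELoss()

    last_loss = float("nan")
    for epoch in range(epochs):
        # Reshuffle differently each epoch while keeping ranks in sync.
        sampler.set_epoch(epoch)
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(ddp_model(inputs), targets)
            loss.backward()  # gradients are all-reduced across ranks here
            optimizer.step()
            last_loss = loss.item()

    dist.destroy_process_group()
    return last_loss
```

To try it on two GPUs, save it to a file (say ddp_sketch.py, a made-up name), add a call to train() at the bottom, and launch with torchrun --nproc_per_node=2 ddp_sketch.py. The DistributedSampler gives every rank the same number of batches, so all processes take part in the same number of gradient synchronizations.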

wanyunfeiAlex commented 1 year ago

Thanks for your advice, I can train with DDP now.