facebookresearch / Mask2Former

Code release for "Masked-attention Mask Transformer for Universal Image Segmentation"

Much more training time is needed when training the model with multi-machine & single-GPU #152

Open BJUT-AIVBD opened 2 years ago

BJUT-AIVBD commented 2 years ago

I tried to run distributed training on multiple machines with a single GPU each, and found that it takes much more time than training on a single machine with a single GPU, so I ran the following test:

Config: configs/coco/instance-segmentation/maskformer2_R50_bs16_50ep.yaml (with batch size modified to 4). Setup: one machine with a single RTX 3090 vs. two machines, each with a single RTX 3090.
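For reference, the two-machine run is presumably started through detectron2's launcher (Mask2Former's train_net.py exposes the same flags, i.e. `--num-machines`, `--machine-rank` and `--dist-url`). The sketch below shows the equivalent Python call; the exact invocation, IP address, and port are placeholders, not the reporter's actual setup.

```python
# Sketch of a two-machine, one-GPU-per-machine launch via detectron2's launcher.
# The IP/port are placeholders; the trainer construction is omitted.
from detectron2.engine import launch


def main():
    # Build the trainer from the config and call trainer.train(); omitted here.
    ...


if __name__ == "__main__":
    # Run this on machine 0; on machine 1 use machine_rank=1 with the same dist_url.
    launch(
        main,
        num_gpus_per_machine=1,
        num_machines=2,
        machine_rank=0,
        dist_url="tcp://192.168.1.10:29500",  # placeholder: machine 0's LAN address
    )
```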

The results show that it takes about 7 days to train on a single RTX 3090, but about 70 days for the distributed run. The two machines are on the same local area network, connected through an Ethernet switch with CAT-6 cables. Can you give me some advice on this issue?
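A likely bottleneck is the interconnect: DistributedDataParallel all-reduces every gradient each iteration, and a switched CAT-6 link (typically 1 Gbps, roughly 125 MB/s) is far slower than the intra-machine bandwidth a single-node run enjoys, so each step can stall on gradient synchronization. One way to confirm this is to time an all-reduce of a gradient-sized tensor between the two nodes. The sketch below is only a rough check under stated assumptions: PyTorch's torch.distributed with the NCCL (or Gloo) backend, MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE exported on each machine, and ~44M fp32 elements (roughly the R50 model's parameter count) as a stand-in for the per-step gradient volume.

```python
# Rough inter-node all-reduce timing (sketch). Run once on each machine with
# WORLD_SIZE=2, RANK=0/1, and MASTER_ADDR/MASTER_PORT pointing at one of them.
import time

import torch
import torch.distributed as dist


def main():
    # "nccl" works over plain TCP between nodes; fall back to "gloo" if needed.
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(0)  # single GPU per machine in this setup

    # ~44M fp32 elements ≈ 176 MB, a stand-in for one iteration's gradients.
    grads = torch.randn(44_000_000, device="cuda")

    for _ in range(3):  # warm-up
        dist.all_reduce(grads)
    torch.cuda.synchronize()

    iters = 10
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(grads)
    torch.cuda.synchronize()
    per_iter = (time.time() - start) / iters

    if dist.get_rank() == 0:
        size_gb = grads.numel() * 4 / 1e9
        # On a 1 Gbps link (~0.125 GB/s) an all-reduce of this size takes on
        # the order of seconds, which can easily dominate the per-step time.
        print(f"all_reduce of {size_gb:.2f} GB took {per_iter:.2f} s per call")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If the measured time per all-reduce is comparable to (or larger than) a single-machine training step, the slow network path, rather than compute, explains the 7-day vs. 70-day gap.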

yahooo-m commented 1 year ago

Have you resolved this issue?