I tried to run distributed training across multiple machines with a single GPU each, and found that it takes much more time than training on a single machine with a single GPU, so I ran the following test:
Config: configs/coco/instance-segmentation/maskformer2_R50_bs16_50ep.yaml (with batch size modified to 4)
num_gpus: one machine with a single RTX 3090 vs. two machines with a single RTX 3090 each
The results show that training takes 7 days on the single RTX 3090, but 70 days with distributed training.
The two machines are on the same local area network, connected through an Ethernet switch with CAT-6 cables.
Can you give me some advice on this issue?
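In case it helps to diagnose the setup above, below is a minimal sketch of how the raw all-reduce throughput between the two nodes could be measured with torch.distributed, separately from the training code. The script name `bench_allreduce.py`, the gloo backend, and the 256 MB payload are illustrative assumptions on my part, not details of the actual training run.

```python
# bench_allreduce.py -- minimal two-node all-reduce throughput check (illustrative
# sketch, not part of Mask2Former). Launch on both machines, e.g. with:
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> \
#            --master_addr=<ip-of-node-0> --master_port=29500 bench_allreduce.py
import time
import torch
import torch.distributed as dist

def main():
    # "gloo" works over plain TCP on CPU tensors; GPU training normally uses "nccl".
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()

    # 256 MB of float32, roughly the order of magnitude of a per-step gradient sync.
    tensor = torch.randn(64 * 1024 * 1024)

    # Warm-up iteration so connection setup is not included in the timing.
    dist.all_reduce(tensor)

    iters = 10
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(tensor)
    elapsed = time.time() - start

    size_gb = tensor.numel() * tensor.element_size() / 1e9
    if rank == 0:
        print(f"avg all-reduce time: {elapsed / iters:.3f} s "
              f"for a {size_gb:.2f} GB payload")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If the per-step all-reduce time from this kind of check is comparable to (or larger than) the single-GPU step time, that would point to the Ethernet link rather than the GPUs as the limiting factor.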