Use Multiple machines - Githubissues

hq03 commented 2 years ago

Excuse me, is training across multiple machines supported? If so, how should it be set up? Thank you!

akashAD98 commented 2 years ago

You can use these commands for multi-GPU training:

train p5 models

python -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train.py --workers 8 --device 0,1,2,3 --sync-bn --batch-size 128 --data data/coco.yaml --img 640 640 --cfg cfg/training/yolov7.yaml --weights '' --name yolov7 --hyp data/hyp.scratch.p5.yaml

train p6 models

python -m torch.distributed.launch --nproc_per_node 8 --master_port 9527 train_aux.py --workers 8 --device 0,1,2,3,4,5,6,7 --sync-bn --batch-size 128 --data data/coco.yaml --img 1280 1280 --cfg cfg/training/yolov7-w6.yaml --weights '' --name yolov7-w6 --hyp data/hyp.scratch.p6.yaml

hq03 commented 2 years ago

I can run this on multiple GPUs, but I can't run it across multiple machines, for example on two networked servers (8 GPUs per machine, 16 GPUs in total).
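
Note on the two-server case: `torch.distributed.launch` itself accepts `--nnodes`, `--node_rank` and `--master_addr` in addition to the `--nproc_per_node` and `--master_port` flags used above, and the same command is started once on each machine with a different `--node_rank`. The sketch below only builds those two command lines and shows how global ranks would be numbered. It is not confirmed anywhere in this thread that `train.py` works across machines without changes. The address `192.168.1.1`, the helper names `launch_command` and `global_rank`, and the shortened training arguments are assumptions for illustration.

```python
import shlex

# Assumed values for illustration only: two machines, 8 GPUs each,
# and a placeholder address for the machine that acts as master.
NNODES = 2
NPROC_PER_NODE = 8
MASTER_ADDR = "192.168.1.1"  # hypothetical; use the real IP of node 0
MASTER_PORT = 9527

# Training arguments copied from the single-machine P5 command,
# shortened and with --device widened to 8 GPUs (an assumption).
TRAIN_ARGS = [
    "--workers", "8", "--device", "0,1,2,3,4,5,6,7", "--sync-bn",
    "--batch-size", "128", "--data", "data/coco.yaml",
    "--cfg", "cfg/training/yolov7.yaml", "--weights", "",
    "--name", "yolov7", "--hyp", "data/hyp.scratch.p5.yaml",
]


def launch_command(node_rank):
    """Build the launcher command line to run on the machine `node_rank`."""
    cmd = [
        "python", "-m", "torch.distributed.launch",
        "--nnodes", str(NNODES),
        "--node_rank", str(node_rank),
        "--nproc_per_node", str(NPROC_PER_NODE),
        "--master_addr", MASTER_ADDR,
        "--master_port", str(MASTER_PORT),
        "train.py", *TRAIN_ARGS,
    ]
    # shlex.quote keeps the empty --weights value as '' on the shell line.
    return " ".join(shlex.quote(part) for part in cmd)


def global_rank(node_rank, local_rank):
    """Global rank the launcher assigns to a process on a given node."""
    return node_rank * NPROC_PER_NODE + local_rank


if __name__ == "__main__":
    for node in range(NNODES):
        print(f"# run on machine {node}:")
        print(launch_command(node))
    print("world size:", NNODES * NPROC_PER_NODE)
```

Both machines would need to reach the master address on the chosen port, and each would need its own copy of the code and dataset at the same paths.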

sadimoodi commented 1 year ago

@hq03 could you run this on the same machine with multiple GPUs? When I run this command, my whole server restarts, and no error is displayed.

WongKinYiu / yolov7

Use Multiple machines #558