Hello @WongKinYiu
I am running this command to train on multiple GPUs on a Red Hat 8.7 server:
python -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train_aux.py --workers 5 --device 0,1 --sync-bn --batch-size 2 --data data/custom_data.yaml --img 1280 1280 --cfg cfg/training/yolov7-e6e-custom.yaml --weights yolov7-e6e_training.pt --name yolov7-w6 --hyp data/hyp.scratch.p6.yaml --epochs 1
My server reboots on its own after reaching the output below. There is no specific error raised, and the Red Hat logs do not show any errors. What could be the reason for this?
Note: training on a single GPU does not cause any issues.
Output:
/home/ali/.local/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
YOLOR 🚀 v0.1-121-g2fdc7f1 torch 1.13.1+cu117 CUDA:0 (NVIDIA A16, 14938.5MB)
                                              CUDA:1 (NVIDIA A16, 14938.5MB)
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
batch size = 2 world_size = 2
Namespace(weights='yolov7-e6e_training.pt', cfg='cfg/training/yolov7-e6e-custom.yaml', data='data/custom_data.yaml', hyp='data/hyp.scratch.p6.yaml', epochs=1, batch_size=1, img_size=[1280, 1280], rect=False, resume=False, nosave=False, notest=False, noautoanchor=False, evolve=False, bucket='', cache_images=False, image_weights=False, device='0,1', multi_scale=False, single_cls=False, adam=False, sync_bn=True, local_rank=0, workers=5, project='runs/train', entity=None, name='yolov7-w6', exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias='latest', v5_metric=False, world_size=2, global_rank=0, save_dir='runs/train/yolov7-w6', total_batch_size=2)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.3, cls_pw=1.0, obj=0.7, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.2, scale=0.9, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.15, copy_paste=0.0, paste_in=0.15, loss_ota=1
batch size = 2 world_size = 2
wandb: Install Weights & Biases for YOLOR logging with 'pip install wandb' (recommended)
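P.S. The FutureWarning above suggests torchrun instead of torch.distributed.launch. Carrying over the same arguments, the equivalent invocation would look like this (a sketch only, not something I have verified changes the reboot behavior; torchrun exports LOCAL_RANK as an environment variable rather than passing --local_rank):

torchrun --nproc_per_node 2 --master_port 9527 train_aux.py --workers 5 --device 0,1 --sync-bn --batch-size 2 --data data/custom_data.yaml --img 1280 1280 --cfg cfg/training/yolov7-e6e-custom.yaml --weights yolov7-e6e_training.pt --name yolov7-w6 --hyp data/hyp.scratch.p6.yaml --epochs 1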
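Per the warning's note, a script that currently reads --local_rank via argparse would instead pick it up from os.environ['LOCAL_RANK'], roughly as below (a minimal sketch; the opt.local_rank attribute name is my assumption based on the stock YOLOv7 training scripts):

import os
import argparse

parser = argparse.ArgumentParser()
# ... the other training arguments go here ...
opt = parser.parse_args()

# torchrun sets LOCAL_RANK for each worker process instead of passing
# --local_rank on the command line; fall back to -1, the usual
# single-process default, when launched without torchrun.
opt.local_rank = int(os.environ.get('LOCAL_RANK', '-1'))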