WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
GNU General Public License v3.0

Server reboots when training on Multi GPUs #1570

Open sadimoodi opened 1 year ago

sadimoodi commented 1 year ago

Hello @WongKinYiu, I am running this command to train on multiple GPUs on a Red Hat 8.7 server: python -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train_aux.py --workers 5 --device 0,1 --sync-bn --batch-size 2 --data data/custom_data.yaml --img 1280 1280 --cfg cfg/training/yolov7-e6e-custom.yaml --weights yolov7-e6e_training.pt --name yolov7-w6 --hyp data/hyp.scratch.p6.yaml --epochs 1

My server reboots on its own after reaching the output below. No specific error is raised, and the Red Hat logs do not show any errors either. What could be the reason for this? Note: training on a single GPU does not cause any issue.

Output: /home/ali/.local/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


YOLOR 🚀 v0.1-121-g2fdc7f1 torch 1.13.1+cu117 CUDA:0 (NVIDIA A16, 14938.5MB) CUDA:1 (NVIDIA A16, 14938.5MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
batch size = 2 world_size = 2
Namespace(weights='yolov7-e6e_training.pt', cfg='cfg/training/yolov7-e6e-custom.yaml', data='data/custom_data.yaml', hyp='data/hyp.scratch.p6.yaml', epochs=1, batch_size=1, img_size=[1280, 1280], rect=False, resume=False, nosave=False, notest=False, noautoanchor=False, evolve=False, bucket='', cache_images=False, image_weights=False, device='0,1', multi_scale=False, single_cls=False, adam=False, sync_bn=True, local_rank=0, workers=5, project='runs/train', entity=None, name='yolov7-w6', exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias='latest', v5_metric=False, world_size=2, global_rank=0, save_dir='runs/train/yolov7-w6', total_batch_size=2)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.3, cls_pw=1.0, obj=0.7, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.2, scale=0.9, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.15, copy_paste=0.0, paste_in=0.15, loss_ota=1
batch size = 2 world_size = 2
wandb: Install Weights & Biases for YOLOR logging with 'pip install wandb' (recommended)
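A side note on the launch command rather than the reboot itself: the FutureWarning at the top of this output says torch.distributed.launch is deprecated in favor of torchrun. A rough, untested sketch of the equivalent launch, with the same arguments as above, would be:

torchrun --nproc_per_node 2 --master_port 9527 train_aux.py --workers 5 --device 0,1 --sync-bn --batch-size 2 --data data/custom_data.yaml --img 1280 1280 --cfg cfg/training/yolov7-e6e-custom.yaml --weights yolov7-e6e_training.pt --name yolov7-w6 --hyp data/hyp.scratch.p6.yaml --epochs 1

Because torchrun sets --use_env by default, the script would also need to read the local rank from the environment instead of from a --local_rank argument, as the warning notes. Something along these lines (illustrative only, not the actual yolov7 code):

import os
local_rank = int(os.environ.get('LOCAL_RANK', -1))  # -1 when not launched by torchrun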

sadimoodi commented 1 year ago

Can anyone help with this?

sadimoodi commented 1 year ago

@AlexeyAB @WongKinYiu can you help?

foolLain commented 10 months ago

Hi! Has this problem been solved? I am facing the same issue.