Hi,
I am encountering an error when training YOLOv10n on 2 GPUs. Here is the command I ran and its full output:
(yolov10) C:\Users\muh\yolov10>yolo detect train data=coco.yaml model=yolov10n.yaml epochs=500 batch=32 imgsz=640 device=0,1
Ultralytics YOLOv8.2.55 🚀 Python-3.9.19 torch-2.0.1+cu117 CUDA:0 (NVIDIA GeForce RTX 3060, 12287MiB)
CUDA:1 (NVIDIA GeForce RTX 3060, 12288MiB)
engine\trainer: task=detect, mode=train, model=yolov10n.yaml, data=coco.yaml, epochs=500, time=None, patience=100, batch=32, imgsz=640, save=True, save_period=-1, cache=False, device=(0, 1), workers=8, project=None, name=train18, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs\detect\train18
from n params module arguments
0 -1 1 464 ultralytics.nn.modules.conv.Conv [3, 16, 3, 2]
1 -1 1 4672 ultralytics.nn.modules.conv.Conv [16, 32, 3, 2]
2 -1 1 7360 ultralytics.nn.modules.block.C2f [32, 32, 1, True]
3 -1 1 18560 ultralytics.nn.modules.conv.Conv [32, 64, 3, 2]
4 -1 2 49664 ultralytics.nn.modules.block.C2f [64, 64, 2, True]
5 -1 1 9856 ultralytics.nn.modules.block.SCDown [64, 128, 3, 2]
6 -1 2 197632 ultralytics.nn.modules.block.C2f [128, 128, 2, True]
7 -1 1 36096 ultralytics.nn.modules.block.SCDown [128, 256, 3, 2]
8 -1 1 460288 ultralytics.nn.modules.block.C2f [256, 256, 1, True]
9 -1 1 164608 ultralytics.nn.modules.block.SPPF [256, 256, 5]
10 -1 1 249728 ultralytics.nn.modules.block.PSA [256, 256]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 ultralytics.nn.modules.conv.Concat [1]
13 -1 1 148224 ultralytics.nn.modules.block.C2f [384, 128, 1]
14 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
15 [-1, 4] 1 0 ultralytics.nn.modules.conv.Concat [1]
16 -1 1 37248 ultralytics.nn.modules.block.C2f [192, 64, 1]
17 -1 1 36992 ultralytics.nn.modules.conv.Conv [64, 64, 3, 2]
18 [-1, 13] 1 0 ultralytics.nn.modules.conv.Concat [1]
19 -1 1 123648 ultralytics.nn.modules.block.C2f [192, 128, 1]
20 -1 1 18048 ultralytics.nn.modules.block.SCDown [128, 128, 3, 2]
21 [-1, 10] 1 0 ultralytics.nn.modules.conv.Concat [1]
22 -1 1 282624 ultralytics.nn.modules.block.C2fCIB [384, 256, 1, True, True]
23 [16, 19, 22] 1 929808 ultralytics.nn.modules.head.v10Detect [80, [64, 128, 256]]
YOLOv10n summary: 385 layers, 2,775,520 parameters, 2,775,504 gradients, 8.7 GFLOPs
DDP: debug command C:\Users\muh\anaconda3\envs\yolov10\python.exe -m torch.distributed.run --nproc_per_node 2 --master_port 61706 C:\Users\muh\AppData\Roaming\Ultralytics\DDP\_temp_gx6i3x3s1975427096336.py
NOTE: Redirects are currently not supported in Windows or MacOs.
Ultralytics YOLOv8.2.55 🚀 Python-3.9.19 torch-2.0.1+cu117 CUDA:0 (NVIDIA GeForce RTX 3060, 12287MiB)
CUDA:1 (NVIDIA GeForce RTX 3060, 12288MiB)
Freezing layer 'model.23.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
Traceback (most recent call last):
File "C:\Users\muh\AppData\Roaming\Ultralytics\DDP\_temp_gx6i3x3s1975427096336.py", line 13, in <module>
results = trainer.train()
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\ultralytics\engine\trainer.py", line 204, in train
self._do_train(world_size)
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\ultralytics\engine\trainer.py", line 323, in _do_train
self._setup_train(world_size)
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\ultralytics\engine\trainer.py", line 265, in _setup_train
dist.broadcast(self.amp, src=0) # broadcast the tensor from rank 0 to all other ranks (returns None)
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\torch\distributed\distributed_c10d.py", line 1451, in wrapper
return func(*args, **kwargs)
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\torch\distributed\distributed_c10d.py", line 1574, in broadcast
work.wait()
RuntimeError: Invalid scalar type
AMP: checks passed ✅
Traceback (most recent call last):
File "C:\Users\muh\AppData\Roaming\Ultralytics\DDP\_temp_gx6i3x3s1975427096336.py", line 13, in <module>
results = trainer.train()
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\ultralytics\engine\trainer.py", line 204, in train
self._do_train(world_size)
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\ultralytics\engine\trainer.py", line 323, in _do_train
self._setup_train(world_size)
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\ultralytics\engine\trainer.py", line 265, in _setup_train
dist.broadcast(self.amp, src=0) # broadcast the tensor from rank 0 to all other ranks (returns None)
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\torch\distributed\distributed_c10d.py", line 1451, in wrapper
return func(*args, **kwargs)
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\torch\distributed\distributed_c10d.py", line 1574, in broadcast
work.wait()
RuntimeError: Invalid scalar type
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5684) of binary: C:\Users\muh\anaconda3\envs\yolov10\python.exe
Traceback (most recent call last):
File "C:\Users\muh\anaconda3\envs\yolov10\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\muh\anaconda3\envs\yolov10\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\torch\distributed\run.py", line 798, in <module>
main()
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\torch\distributed\run.py", line 794, in main
run(args)
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\torch\distributed\run.py", line 785, in run
elastic_launch(
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
C:\Users\muh\AppData\Roaming\Ultralytics\DDP\_temp_gx6i3x3s1975427096336.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-07-13_22:45:14
host : DESKTOP-0DPMMLD
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 14388)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-13_22:45:14
host : DESKTOP-0DPMMLD
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5684)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
File "C:\Users\muh\anaconda3\envs\yolov10\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\muh\anaconda3\envs\yolov10\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\muh\anaconda3\envs\yolov10\Scripts\yolo.exe\__main__.py", line 7, in <module>
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\ultralytics\cfg\__init__.py", line 708, in entrypoint
getattr(model, mode)(**overrides) # default args from model
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\ultralytics\engine\model.py", line 650, in train
self.trainer.train()
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\ultralytics\engine\trainer.py", line 199, in train
raise e
File "C:\Users\muh\anaconda3\envs\yolov10\lib\site-packages\ultralytics\engine\trainer.py", line 197, in train
subprocess.run(cmd, check=True)
File "C:\Users\muh\anaconda3\envs\yolov10\lib\subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['C:\\Users\\muh\\anaconda3\\envs\\yolov10\\python.exe', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '61706', 'C:\\Users\\muh\\AppData\\Roaming\\Ultralytics\\DDP\\_temp_gx6i3x3s1975427096336.py']' returned non-zero exit status 1.
The error occurs only when I train on 2 GPUs with device=0,1. Training with device=0 works without any problems. I am using CUDA 11.7 and PyTorch 2.0.1.
Could you help me solve this problem?