WongKinYiu / yolov9

Implementation of paper - YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
GNU General Public License v3.0

Error when training: RuntimeError: Caught RuntimeError in pin memory thread for device 0 #533

Open farbgeist opened 1 month ago

farbgeist commented 1 month ago

I get the following error when using the recommended Docker image and following the recommended installation steps from the README:

python train.py --batch 16 --epochs 25 --img 640 --device 0 --min-items 0 --close-mosaic 15 --data ../generated_training_images_root_yoloV9/data.yaml --weights /workspace/weights/gelan-c.pt --cfg models/detect/gelan-c.yaml --hyp hyp.scratch-high.yaml

train: weights=/workspace/weights/gelan-c.pt, cfg=models/detect/gelan-c.yaml, data=../generated_training_images_root_yoloV9/data.yaml, hyp=hyp.scratch-high.yaml, epochs=25, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, flat_cos_lr=False, fixed_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, min_items=0, close_mosaic=15, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
YOLOv5 🚀 1e33dbb Python-3.8.12 torch-1.11.0a0+b6df043 CUDA:0 (NVIDIA GeForce RTX 3090, 24575MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, cls_pw=1.0, dfl=1.5, obj_pw=1.0, iou_t=0.2, anchor_t=5.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.9, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.15, copy_paste=0.3
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLO 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLO 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=5

             from  n    params  module                                  arguments

  0                -1  1      1856  models.common.Conv                    [3, 64, 3, 2]
  1                -1  1     73984  models.common.Conv                    [64, 128, 3, 2]
  2                -1  1    212864  models.common.RepNCSPELAN4            [128, 256, 128, 64, 1]
  3                -1  1    164352  models.common.ADown                   [256, 256]
  4                -1  1    847616  models.common.RepNCSPELAN4            [256, 512, 256, 128, 1]
  5                -1  1    656384  models.common.ADown                   [512, 512]
  6                -1  1   2857472  models.common.RepNCSPELAN4            [512, 512, 512, 256, 1]
  7                -1  1    656384  models.common.ADown                   [512, 512]
  8                -1  1   2857472  models.common.RepNCSPELAN4            [512, 512, 512, 256, 1]
  9                -1  1    656896  models.common.SPPELAN                 [512, 512, 256]
 10                -1  1         0  torch.nn.modules.upsampling.Upsample  [None, 2, 'nearest']
 11           [-1, 6]  1         0  models.common.Concat                  [1]
 12                -1  1   3119616  models.common.RepNCSPELAN4            [1024, 512, 512, 256, 1]
 13                -1  1         0  torch.nn.modules.upsampling.Upsample  [None, 2, 'nearest']
 14           [-1, 4]  1         0  models.common.Concat                  [1]
 15                -1  1    912640  models.common.RepNCSPELAN4            [1024, 256, 256, 128, 1]
 16                -1  1    164352  models.common.ADown                   [256, 256]
 17          [-1, 12]  1         0  models.common.Concat                  [1]
 18                -1  1   2988544  models.common.RepNCSPELAN4            [768, 512, 512, 256, 1]
 19                -1  1    656384  models.common.ADown                   [512, 512]
 20           [-1, 9]  1         0  models.common.Concat                  [1]
 21                -1  1   3119616  models.common.RepNCSPELAN4            [1024, 512, 512, 256, 1]
 22      [15, 18, 21]  1   5494495  models.yolo.DDetect                   [5, [256, 512, 512]]
gelan-c summary: 621 layers, 25440927 parameters, 25440911 gradients, 103.2 GFLOPs

Transferred 931/937 items from /workspace/weights/gelan-c.pt
AMP: checks passed ✅
optimizer: SGD(lr=0.01) with parameter groups 154 weight(decay=0.0), 161 weight(decay=0.0005), 160 bias
train: Scanning /workspace/generated_training_images_root_yoloV9/train/labels.cache... 221 images, 0 backgrounds, 0 corrupt: 100%|██████████| 221/221 00:00
val: Scanning /workspace/generated_training_images_root_yoloV9/valid/labels.cache... 221 images, 0 backgrounds, 0 corrupt: 100%|██████████| 221/221 00:00
Plotting labels to runs/train/exp17/labels.jpg...
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp17
Starting training for 25 epochs...

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size

  0%|          | 0/14 00:00
Traceback (most recent call last):
  File "train.py", line 634, in <module>
    main(opt)
  File "train.py", line 528, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 277, in train
    for i, (imgs, targets, paths, _) in pbar:  # batch -------------------------------------------------------------
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 438, in reraise
    raise exception
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 34, in _pin_memory_loop
    data = pin_memory(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in pin_memory
    return [pin_memory(sample) for sample in data]
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in <listcomp>
    return [pin_memory(sample) for sample in data]
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 50, in pin_memory
    return data.pin_memory()
RuntimeError: CUDA error: out of memory

It is NOT working with a smaller batch size either...
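For context on where this fails: the traceback ends in data.pin_memory(), i.e. the DataLoader's pin-memory thread copying a CPU batch into page-locked (pinned) host memory, not an allocation on the GPU itself. The yolov9 dataloader (inherited from YOLOv5) appears to enable pin_memory by default when training on CUDA. The sketch below is my own minimal illustration of that generic PyTorch mechanism under an assumed dummy dataset, not the project's actual code; it only shows which flag spawns the thread that raised the error and that turning it off sidesteps the failing Tensor.pin_memory() call.

```python
# Minimal, hypothetical illustration of the mechanism behind the error above.
# It is not yolov9 code; the dataset and shapes are made up for the example.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(32, 3, 640, 640))  # dummy "images"

# pin_memory=True is what starts the pin-memory thread seen in the traceback;
# every batch is copied into page-locked host memory via Tensor.pin_memory().
loader = DataLoader(dataset, batch_size=16, num_workers=2, pin_memory=True)

if torch.cuda.is_available():
    for (imgs,) in loader:
        # non_blocking=True only gives an overlap benefit when the source
        # tensor is pinned; with pin_memory=False the copy is synchronous.
        imgs = imgs.cuda(non_blocking=True)
        break

# Setting pin_memory=False avoids the page-locked allocation entirely
# (at the cost of slower host-to-device copies), which is a common
# workaround when pinned-memory allocation itself is what fails.
```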

farbgeist commented 1 month ago

The command used to start the docker container:

docker run --gpus=all --name yolov9 -it -v ./data/training/generated/generated_training_images_root_yoloV9/:/workspace/generated_training_images_root_yoloV9/ -v ./data/jupyter/yoloV9/:/workspace/ --shm-size=64g nvcr.io/nvidia/pytorch:21.11-py3

After that, I ran the recommended steps:

apt update
apt install -y zip htop screen libgl1-mesa-glx
pip install seaborn thop
cd /yolov9

So for me the standard installation is broken. I am using an RTX 3090 on Ubuntu 22.04 inside WSL 2 on Windows 11 with the latest NVIDIA drivers.
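Since the failing operation is pinned-memory allocation and the setup runs inside WSL 2 (where page-locked host memory is reportedly more constrained than on native Linux), it may be worth checking whether a pinned allocation succeeds at all inside the container before experimenting with batch size. Below is a hypothetical diagnostic snippet of my own, not from the original report; it only uses standard PyTorch calls and an arbitrary 256 MiB test size.

```python
# Hypothetical diagnostic: confirm the GPU is visible and that a
# page-locked (pinned) host allocation succeeds, since Tensor.pin_memory()
# is the call failing in the traceback above.
import torch

print(torch.__version__, torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 1024**3:.1f} GiB")

# Try to pin roughly 256 MiB of host memory (same code path as the
# DataLoader's pin-memory thread).
try:
    t = torch.empty(256, 1024, 1024, dtype=torch.uint8, pin_memory=True)
    print("pinned allocation OK:", t.numel() / 1024**2, "MiB")
except RuntimeError as e:
    print("pinned allocation failed:", e)
```

If the pinned allocation itself fails, reducing --workers or disabling pinned memory in the dataloader would be the things to try next; upstream YOLOv5's dataloader code respects a PIN_MEMORY environment variable for this, and I believe (but have not verified) that the copy in this repo does too.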

farbgeist commented 3 weeks ago

No one else seeing this error?