WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
GNU General Public License v3.0
13.02k stars · 4.12k forks

Failed to create data loader when training with multiple GPUs #328

Open · haimat opened this issue 1 year ago

haimat commented 1 year ago

We want to train a YOLOv7 network on multiple GPUs (one machine) using transfer learning. However, that leads to the following error:

$ python -m torch.distributed.launch --nproc_per_node 4 train_aux.py --workers 4 --device 0,1,2,3 --batch-size 64 --data /data/scratch/Elektro-YOLO/yolo-data.yaml --img 640 --cfg cfg/training/yolov7-e6e.yaml --weights 'yolov7-e6e_training.pt' --name pl-elektro-640 --hyp data/hyp.scratch.custom.yaml
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
YOLOR 🚀 v0.1-66-g7a7cba7 torch 1.11.0+cu113 CUDA:0 (NVIDIA RTX A6000, 48685.0625MB)
                                            CUDA:1 (NVIDIA RTX A6000, 48685.0625MB)
                                            CUDA:2 (NVIDIA RTX A6000, 48685.0625MB)
                                            CUDA:3 (NVIDIA RTX A6000, 48685.0625MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Namespace(adam=False, artifact_alias='latest', batch_size=16, bbox_interval=-1, bucket='', cache_images=False, cfg='cfg/training/yolov7-e6e.yaml', data='/data/scratch/Pipelife-Elektro-YOLO/yolo-data.yaml', device='0,1,2,3', entity=None, epochs=300, evolve=False, exist_ok=False, global_rank=0, hyp='data/hyp.scratch.custom.yaml', image_weights=False, img_size=[640, 640], label_smoothing=0.0, linear_lr=False, local_rank=0, multi_scale=False, name='exp', noautoanchor=False, nosave=False, notest=False, project='runs/train', quad=False, rect=False, resume=False, save_dir='runs/train/exp4', save_period=-1, single_cls=False, sync_bn=False, total_batch_size=64, upload_dataset=False, weights='yolov7-e6e_training.pt', workers=4, world_size=4)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.3, cls_pw=1.0, obj=0.7, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.2, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, paste_in=0.0
wandb: Install Weights & Biases for YOLOR logging with 'pip install wandb' (recommended)
Overriding model.yaml nc=80 with nc=4

                 from  n    params  module                                  arguments
  0                -1  1         0  models.common.ReOrg                     []
  1                -1  1      8800  models.common.Conv                      [12, 80, 3, 1]
  2                -1  1     70880  models.common.DownC                     [80, 160, 1]
[... model summary here ...]
262               161  1   1844480  models.common.Conv                      [320, 640, 3, 1]
263               136  1   4149120  models.common.Conv                      [480, 960, 3, 1]
264               112  1   7375360  models.common.Conv                      [640, 1280, 3, 1]
265[257, 258, 259, 260, 261, 262, 263, 264]  1    176324  models.yolo.IAuxDetect                  [4, [[19, 27, 44, 40, 38, 94], [96, 68, 86, 152, 180, 137], [140, 301, 303, 264, 238, 542], [436, 615, 739, 380, 925, 792]], [320, 640, 960, 1280, 320, 640, 960, 1280]]
/usr/local/lib/python3.8/dist-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Model Summary: 1063 layers, 164946948 parameters, 164946948 gradients, 226.2 GFLOPS

Transferred 1468/1490 items from yolov7-e6e_training.pt
/usr/local/lib/python3.8/dist-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/usr/local/lib/python3.8/dist-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/usr/local/lib/python3.8/dist-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Scaled weight_decay = 0.0005
Optimizer groups: 252 .bias, 252 conv.weight, 252 other
Traceback (most recent call last):
  File "train_aux.py", line 609, in <module>
    train(hyp, opt, device, tb_writer)
  File "train_aux.py", line 245, in train
    dataloader, dataset = create_dataloader(train_path, imgsz, batch_size, gs, opt,
  File "/data/python/yolov7/utils/datasets.py", line 69, in create_dataloader
    dataset = LoadImagesAndLabels(path, imgsz, batch_size,
  File "/data/python/yolov7/utils/datasets.py", line 392, in __init__
    cache, exists = torch.load(cache_path), True  # load
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 920, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: STACK_GLOBAL requires str
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 887123 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 887124 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 887125 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 887122) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_aux.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-07-27_11:50:27
  host      : host
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 887122)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Any ideas?
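
For reference, the traceback points at torch.load(cache_path) in LoadImagesAndLabels.__init__ (utils/datasets.py, line 392), which reads the labels cache that YOLOv7 writes next to the label files. As a hedged sketch only (not an official fix, and assuming the surrounding code still looks like the current utils/datasets.py), one defensive workaround is to fall back to rebuilding the cache whenever unpickling fails, for instance because the .cache file is stale, truncated, or was written by a different YOLO version in another format:

    # sketch of a more defensive variant of utils/datasets.py around line 392
    # (the exact surrounding code may differ between yolov7 revisions)
    if cache_path.is_file():
        try:
            cache, exists = torch.load(cache_path), True  # load existing labels cache
        except Exception:
            # e.g. _pickle.UnpicklingError: STACK_GLOBAL requires str from a corrupt/incompatible cache
            cache, exists = self.cache_labels(cache_path, prefix), False  # rebuild the cache from the labels
    else:
        cache, exists = self.cache_labels(cache_path, prefix), False  # cache labels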

dungdo123 commented 1 year ago

Same problem here. Is there any solution?

Synergyst commented 1 year ago

I am also experiencing the same issue.

Has anyone else had this problem and figured out how to fix it?

YaoQ commented 1 year ago

Try deleting the label cache files (*.cache) in your dataset directory.
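
For example, a minimal sketch (assuming the caches use the usual *.cache suffix that LoadImagesAndLabels writes next to the image/label folders; adjust the dataset path to your own):

    # remove_label_caches.py - delete stale YOLO label caches before relaunching training
    from pathlib import Path

    dataset_root = Path("/data/scratch/Elektro-YOLO")  # adjust to your dataset location
    for cache_file in dataset_root.rglob("*.cache"):
        print(f"removing {cache_file}")
        cache_file.unlink()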

haimat commented 1 year ago

@YaoQ We started with a fresh dataset without any cache, so this cannot be the issue.

Synergyst commented 1 year ago

Agreed, this cannot be the issue, since I also started from a fresh dataset.

pullmyleg commented 1 year ago

@haimat did you find a solution?

haimat commented 1 year ago

@pullmyleg Unfortunately not, so I went back to YOLOv5. Frankly, in the end I don't care much about a 2-3% AP increase if the framework is not well supported. I would rather accept slightly lower AP (but still very good results) with a fully and professionally maintained library that offers proper support ... (just my 2 cents)

ZhangLe-fighting commented 1 year ago

I also ran into similar problems and solved them as follows (not necessarily applicable to you):

  1. pip printed warnings in my environment while installing requirements.txt.
  2. So I created a pip virtual environment on top of conda; see this CSDN tutorial: https://blog.csdn.net/m0_63127854/article/details/126452152
  3. Training then failed with RuntimeError: result type Float can't be cast to the desired output type long int, which I fixed by editing loss.py (see the sketch below).
  4. Trained again and it succeeded. I hope this helps!
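
The loss.py change commonly suggested for that RuntimeError (a sketch only, not necessarily the exact edit made here; the line numbers and which ComputeLoss* classes need it vary between yolov7 revisions) is to make the gain tensor in build_targets integer-typed:

    # in utils/loss.py, inside build_targets() of the ComputeLoss* class being used
    # original line (a float gain triggers the Float -> long int cast error on newer torch versions):
    #     gain = torch.ones(7, device=targets.device)
    # commonly suggested change:
    gain = torch.ones(7, device=targets.device).long()  # normalized to gridspace gain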

ZhangLe-fighting commented 1 year ago

Sorry, the Baidu translation reads a bit like Chinglish, hahaha.