AlibabaResearch / efficientteacher

A Supervised and Semi-Supervised Object Detection Library for YOLO Series
GNU General Public License v3.0
805 stars · 147 forks

Cannot validate the converted efficient.pt (RuntimeError: DataLoader worker (pid 408) is killed by signal: Bus error.) #97

Closed Shuixin-Li closed 1 year ago

Shuixin-Li commented 1 year ago

Hello, I also ran into this problem. I ran the following command inside the Docker environment provided in the manual: `python val.py --cfg configs/sup/custom/custom_1.yaml --weights efficient-yolov5s-exp2-best.pt`

I hit this error: `RuntimeError: DataLoader worker (pid 408) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.`

  1. efficient-yolov5s-exp2-best.pt was converted from the best.pt that I obtained by training from yolov5s.pt
  2. The contents of custom_1.yaml are as follows; everything was modified with yolov5s.yaml as a reference (a quick sanity-check sketch follows the config):
    
```yaml
# EfficientTeacher by Alibaba Cloud

# Parameters
project: '/runs_yolov5'
adam: False
epochs: 100
weights: ''
prune_finetune: False
linear_lr: True
hyp:
  lr0: 0.01
  hsv_h: 0.015
  hsv_s: 0.7
  hsv_v: 0.4
  lrf: 0.1
  scale: 0.5
  no_aug_epochs: 0
  mixup: 0.0
  warmup_epochs: 3

Model:
  depth_multiple: 0.33  # model depth multiple
  width_multiple: 0.50  # layer channel multiple
  Backbone:
    name: 'YoloV5'
    activation: 'SiLU'
  Neck:
    name: 'YoloV5'
    in_channels: [256, 512, 1024]
    out_channels: [256, 512, 1024]
    activation: 'SiLU'
  Head:
    name: 'YoloV5'
    activation: 'SiLU'
    anchors: [[10,13, 16,30, 33,23], [30,61, 62,45, 59,119], [116,90, 156,198, 373,326]]  # P3/8, P4/16, P5/32
  Loss:
    type: 'ComputeLoss'
    cls: 0.5
    obj: 1.0
    anchor_t: 4.0

Dataset:
  data_name: 'coco'
  train: data/custom_train.txt
  val: data/custom_val.txt
  test: data/custom_test.txt
  nc: 16  # number of classes
  np: 0   # number of keypoints
  names: ['-', 'Amblyospiza', 'Anaplectes', 'Bubalornis', 'Dinemellia', 'Euplectes', 'Foudia', 'Histurgops', 'Malimbus', 'Pachyphantes', 'Philetairus', 'Plocepasser', 'Ploceus', 'Pseudonigrita', 'Quelea', 'Sporopipes']
  img_size: 640
  batch_size: 16
```
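As a quick sanity check (a minimal sketch, not part of the repository, using only the paths and keys from the config above), one can confirm that nc matches the length of names and that the three list files exist:

```python
# Hypothetical sanity check for the config above; not part of EfficientTeacher.
import os
import yaml  # pip install pyyaml

with open('configs/sup/custom/custom_1.yaml') as f:
    cfg = yaml.safe_load(f)

ds = cfg['Dataset']
# Number of classes must match the number of listed names.
assert ds['nc'] == len(ds['names']), f"nc={ds['nc']} but {len(ds['names'])} names listed"
# Each split list file referenced by the config should exist.
for split in ('train', 'val', 'test'):
    list_file = ds[split]
    assert os.path.isfile(list_file), f'missing list file: {list_file}'
    with open(list_file) as lf:
        print(split, sum(1 for _ in lf), 'image paths listed')
```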


My dataset is not large either:
```bash
# image location
/Documents/datasets# du -hs
277M    .

# data.txt file location
efficientteacher/data# ls
custom_test.txt   get_coco.sh   unlabelled
custom_train.txt  custom_val.txt    get_label.sh
```

The data directory structure is as follows:

```
datasets
  - train
     - images
     - labels
  - valid
     - images
     - labels
  - test
     - images
     - labels
```
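For what it's worth, assuming EfficientTeacher keeps the YOLOv5 dataset convention, the label file for each image is located by swapping the last /images/ path segment for /labels/ and the image extension for .txt, so the layout above should resolve correctly. A minimal sketch of that mapping (illustrative, not the repository's exact code):

```python
import os

def img2label_paths(img_paths):
    # YOLOv5-style convention (assumed here): .../images/xxx.jpg -> .../labels/xxx.txt
    sa, sb = f'{os.sep}images{os.sep}', f'{os.sep}labels{os.sep}'
    return [sb.join(p.rsplit(sa, 1)).rsplit('.', 1)[0] + '.txt' for p in img_paths]

print(img2label_paths(['/Documents/datasets/train/images/0001.jpg']))
# -> ['/Documents/datasets/train/labels/0001.txt']
```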

Here is the full picture of the problem:

```
efficientteacher# python val.py --cfg configs/sup/custom/custom_1.yaml --weights efficient-yolov5s-exp2-best.pt
val: data=data/coco128.yaml, weights=['efficient-yolov5s-exp2-best.pt'], batch_size=32, imgsz=640, conf_thres=0.001, iou_thres=0.6, task=val, device=, single_cls=False, augment=False, verbose=False, save_txt=False, save_hybrid=False, save_conf=False, save_json=False, project=runs/val, name=exp, exist_ok=False, half=False, val_ssod=False, num_points=0, cfg=configs/sup/custom/custom_1.yaml, val_dp1000=False
EfficientTeacher  2023-6-4 torch 1.11.0+cu113 CPU

parse model_type:  efficient-yolov5s-exp2-best.pt
Fusing layers... 
Model summary: 211 layers, 7053277 parameters, 56637 gradients
/opt/conda/lib/python3.7/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Flops 15.93G Params 7.05M
val: Scanning 'data/custom_val' images and labels...204 found, 0 missing, 12 emp
val: New cache created: data/custom_val.cache
cls gt ratio(positive): (0.00-0) (32.00-1) (109.00-2) (98.00-3) (4.00-4) (25.00-5) (19.00-6) (2.00-7) (3.00-8) (0.00-9) (50.00-10) (164.00-11) (206.00-12) (158.00-13) (14.00-14) (40.00-15)
cls gt total number: 924.0 label number per image: 4.529411764705882
               Class     Images     Labels          P          R     mAP@.5 mAP@ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
               Class     Images     Labels          P          R     mAP@.5 mAP@
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.7/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 921, in wait
    ready = selector.select(timeout)
  File "/opt/conda/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 408) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "val.py", line 512, in <module>
    main(opt)
  File "val.py", line 507, in main
    run(**vars(opt))
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "val.py", line 275, in run
    for batch_i, (img, targets, paths, shapes) in enumerate(tqdm(dataloader, desc=s)):
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/shuixin/Documents/efficientteacher/utils/datasets.py", line 382, in __iter__
    yield next(self.iterator)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1024, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 408) exited unexpectedly
```

I noticed val: data=data/coco128.yaml in the output above. Is that the problem here? How should I fix it?

Shuixin-Li commented 1 year ago

Even adding --data data/data.yaml did not make the error go away, but I found that this is actually a Docker problem. Inside the Docker container:

```
/efficientteacher# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay         234G  113G  109G   51% /
tmpfs            64M     0   64M    0% /dev
shm              64M     0   64M    0% /dev/shm
/dev/sda2       234G  113G  109G   51% /etc/hosts
tmpfs           7.8G     0  7.8G    0% /proc/asound
tmpfs           7.8G     0  7.8G    0% /proc/acpi
tmpfs           7.8G     0  7.8G    0% /proc/scsi
tmpfs           7.8G     0  7.8G    0% /sys/firmware
```

As you can see, shm is only 64 MB.
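For reference, the same check can be made from Python inside the container using only the standard library (an illustrative snippet, not part of EfficientTeacher):

```python
# Report the size of the shared-memory mount that DataLoader workers rely on.
import shutil

total, used, free = shutil.disk_usage('/dev/shm')
print(f'/dev/shm: {total / 2**20:.0f} MiB total, {free / 2**20:.0f} MiB free')
# Docker's default is 64 MiB, which multi-worker DataLoaders can easily exhaust.
```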

Solution: add the --shm-size 8G flag when running docker run, e.g. `docker run -dit --shm-size 8G registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.3.0-py37-torch1.11.0-tf1.15.5-1.3.0`. In the new Docker container:

```
# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay         234G  113G  109G   51% /
tmpfs            64M     0   64M    0% /dev
shm             8.0G     0  8.0G    0% /dev/shm
/dev/sda2       234G  113G  109G   51% /etc/hosts
tmpfs           7.8G     0  7.8G    0% /proc/asound
tmpfs           7.8G     0  7.8G    0% /proc/acpi
tmpfs           7.8G     0  7.8G    0% /proc/scsi
tmpfs           7.8G     0  7.8G    0% /sys/firmware
```

Running `python val.py --cfg configs/sup/custom/custom_1.yaml --weights efficient-yolov5s-exp2-best.pt --data data/data.yaml --batch-size 16` in the new container no longer produces this error.
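As an aside, if resizing /dev/shm is not an option, a generic PyTorch-level workaround (not something EfficientTeacher documents, just a commonly used tweak) is to switch the worker sharing strategy to the filesystem, at some speed cost:

```python
# Generic PyTorch workaround when /dev/shm cannot be enlarged (not specific to EfficientTeacher):
# have DataLoader workers exchange tensors through the filesystem instead of shared memory.
import torch.multiprocessing as mp

mp.set_sharing_strategy('file_system')  # call this before any DataLoader is created
```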