PaddlePaddle / PaddleYOLO

🚀🚀🚀 YOLO series of PaddlePaddle implementation, PP-YOLOE+, RT-DETR, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv10, YOLOX, YOLOv5u, YOLOv7u, YOLOv6Lite, RTMDet and so on. 🚀🚀🚀
https://github.com/PaddlePaddle/PaddleYOLO
GNU General Public License v3.0
547 stars 133 forks

Error "RuntimeError: fileds number not same among samples in a batch" when loading a custom COCO dataset #120

Closed ge35tay closed 6 months ago

ge35tay commented 1 year ago

Search before asking

Bug Component

No response

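Describe the Bug

We built our own COCO dataset. Under the paddle/PaddleDetection framework, loading it requires setting load_crowd=true at line 59 of ppdet/data/source/coco.py; after that change, training runs without problems (without the change, it reports the error "'not found any coco record in"). Making the same change in this framework, however, produces the error "RuntimeError: fileds number not same among samples in a batch". The full error output is as follows: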

det) yinghan.huang@ar-gpu01:~/Projects/VisionDetect/PaddleYOLO$ sh run.sh
LAUNCH INFO 2023-04-05 14:31:37,642 ----------- Configuration ----------------------
LAUNCH INFO 2023-04-05 14:31:37,643 devices: 1
LAUNCH INFO 2023-04-05 14:31:37,643 elastic_level: -1
LAUNCH INFO 2023-04-05 14:31:37,643 elastic_timeout: 30
LAUNCH INFO 2023-04-05 14:31:37,643 gloo_port: 6767
LAUNCH INFO 2023-04-05 14:31:37,643 host: None
LAUNCH INFO 2023-04-05 14:31:37,643 ips: None
LAUNCH INFO 2023-04-05 14:31:37,643 job_id: default
LAUNCH INFO 2023-04-05 14:31:37,643 legacy: False
LAUNCH INFO 2023-04-05 14:31:37,643 log_dir: log_dir/yolov7_l_300e_bottle1_coco
LAUNCH INFO 2023-04-05 14:31:37,644 log_level: INFO
LAUNCH INFO 2023-04-05 14:31:37,644 master: None
LAUNCH INFO 2023-04-05 14:31:37,644 max_restart: 3
LAUNCH INFO 2023-04-05 14:31:37,644 nnodes: 1
LAUNCH INFO 2023-04-05 14:31:37,644 nproc_per_node: None
LAUNCH INFO 2023-04-05 14:31:37,644 rank: -1
LAUNCH INFO 2023-04-05 14:31:37,644 run_mode: collective
LAUNCH INFO 2023-04-05 14:31:37,644 server_num: None
LAUNCH INFO 2023-04-05 14:31:37,644 servers:
LAUNCH INFO 2023-04-05 14:31:37,644 start_port: 6070
LAUNCH INFO 2023-04-05 14:31:37,644 trainer_num: None
LAUNCH INFO 2023-04-05 14:31:37,644 trainers:
LAUNCH INFO 2023-04-05 14:31:37,644 training_script: tools/train.py
LAUNCH INFO 2023-04-05 14:31:37,644 training_script_args: ['-c', 'configs/yolov7/yolov7_l_300e_bottle1_coco.yml', '--eval', '--amp']
LAUNCH INFO 2023-04-05 14:31:37,644 with_gloo: 1
LAUNCH INFO 2023-04-05 14:31:37,645 --------------------------------------------------
LAUNCH INFO 2023-04-05 14:31:37,645 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2023-04-05 14:31:37,654 Run Pod: ycqvdk, replicas 1, status ready
LAUNCH INFO 2023-04-05 14:31:37,669 Watching Pod: ycqvdk, replicas 1, status running
Warning: import ppdet from source directory without installing, run 'python setup.py install' to install ppdet firstly
loading annotations into memory...
Done (t=1.65s)
creating index...
index created!
[04/05 14:31:41] ppdet.data.source.coco INFO: Load [1800 samples valid, 0 samples invalid] in file /home/yinghan.huang/Projects/VisionDetect/datasets/yolo/train.json.
W0405 14:31:41.656965 16264 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.2
W0405 14:31:41.663031 16264 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 536, in _thread_loop
    batch = self._get_data()
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 674, in _get_data
    batch.reraise()
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dataloader/worker.py", line 172, in reraise
    raise self.exc_type(msg)
RuntimeError: DataLoader worker(2) caught RuntimeError with message:
Traceback (most recent call last):
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dataloader/worker.py", line 339, in _worker_loop
    batch = fetcher.fetch(indices)
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dataloader/fetcher.py", line 138, in fetch
    data = self.collate_fn(data)
  File "/home/yinghan.huang/Projects/VisionDetect/PaddleYOLO/ppdet/data/reader.py", line 95, in __call__
    batch_data = default_collate_fn(data)
  File "/home/yinghan.huang/Projects/VisionDetect/PaddleYOLO/ppdet/data/utils.py", line 60, in default_collate_fn
    return {
  File "/home/yinghan.huang/Projects/VisionDetect/PaddleYOLO/ppdet/data/utils.py", line 61, in <dictcomp>
    key: default_collate_fn([d[key] for d in batch])
  File "/home/yinghan.huang/Projects/VisionDetect/PaddleYOLO/ppdet/data/utils.py", line 67, in default_collate_fn
    raise RuntimeError(
RuntimeError: fileds number not same among samples in a batch

Traceback (most recent call last):
  File "tools/train.py", line 188, in <module>
    main()
  File "tools/train.py", line 184, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 137, in run
    trainer.train(FLAGS.eval)
  File "/home/yinghan.huang/Projects/VisionDetect/PaddleYOLO/ppdet/engine/trainer.py", line 384, in train
    for step_id, data in enumerate(self.loader):
  File "/home/yinghan.huang/Projects/VisionDetect/PaddleYOLO/ppdet/data/reader.py", line 213, in __next__
    return next(self.loader)
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 745, in __next__
    self._reader.read_next_list()[0])
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
  [Hint: Expected killed != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:175)

LAUNCH INFO 2023-04-05 14:31:45,680 Pod failed
LAUNCH ERROR 2023-04-05 14:31:45,680 Container failed !!!
Container rank 0 status failed cmd ['/home/yinghan.huang/anaconda3/envs/ppdet/bin/python', '-u', 'tools/train.py', '-c', 'configs/yolov7/yolov7_l_300e_bottle1_coco.yml', '--eval', '--amp'] code 1 log log_dir/yolov7_l_300e_bottle1_coco/workerlog.0 env {'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(ppdet) ', 'MAIL': '/var/mail/yinghan.huang', 'USER': 'yinghan.huang', 'SSH_CLIENT': '10.3.130.22 39540 22', 'LC_TIME': 'en_US.UTF-8', 'LD_LIBRARY_PATH': '/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/site-packages/cv2/../../lib64:/home/yinghan.huang/anaconda3/pkgs/cudatoolkit-11.2.2-hbe64b41_11/lib:/home/yinghan.huang/anaconda3/pkgs/cudnn-8.2.1.32-h86fa8c9_0/lib', 'SHLVL': '1', 'CONDA_SHLVL': '2', 'OLDPWD': '/home/yinghan.huang/Projects/VisionDetect', 'HOME': '/home/yinghan.huang', 'SSH_TTY': '/dev/pts/9', 'LC_MONETARY': 'C.UTF-8', 'LC_CTYPE': 'en_US.UTF-8', 'DBUS_SESSION_BUS_ADDRESS': 'unix:path=/run/user/1223/bus', '_CE_M': '', 'CUDA_VISIBLEDEVICES': '1', 'LOGNAME': 'yinghan.huang', '': '/bin/sh', 'XDG_SESSION_ID': '680', 'TERM': 'xterm-256color', '_CE_CONDA': '', 'LC_COLLATE': 'en_US.UTF-8', 'CUDADIR': '/usr/local/cuda', 'PATH': '/home/yinghan.huang/anaconda3/envs/ppdet/bin:/home/yinghan.huang/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin', 'LC_ADDRESS': 'C.UTF-8', 'XDG_RUNTIME_DIR': '/run/user/1223', 'LANG': 'C.UTF-8', 'CONDA_PREFIX_1': '/home/yinghan.huang/anaconda3', 'LC_TELEPHONE': 'C.UTF-8', 'LS_COLORS':
'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.Z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/yinghan.huang/anaconda3/bin/python', 'LC_MESSAGES': 'en_US.UTF-8', 'LC_NAME': 'C.UTF-8', 'SHELL': '/bin/bash', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'CONDA_DEFAULT_ENV': 'ppdet', 'LC_MEASUREMENT': 'C.UTF-8', 'LC_IDENTIFICATION': 'C.UTF-8', 'PWD': '/home/yinghan.huang/Projects/VisionDetect/PaddleYOLO', 'CONDA_EXE': '/home/yinghan.huang/anaconda3/bin/conda', 'SSH_CONNECTION': '10.3.130.22 39540 10.3.15.202 22', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'LC_NUMERIC': 'en_US.UTF-8', 'LC_PAPER': 'C.UTF-8', 'CONDA_PREFIX': 
'/home/yinghan.huang/anaconda3/envs/ppdet', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'QT_QPA_PLATFORM_PLUGIN_PATH': '/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/site-packages/cv2/qt/plugins', 'QT_QPA_FONTDIR': '/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/site-packages/cv2/qt/fonts', 'POD_NAME': 'ycqvdk', 'PADDLE_MASTER': '10.3.15.202:54366', 'PADDLE_GLOBAL_SIZE': '1', 'PADDLE_LOCAL_SIZE': '1', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1', 'PADDLE_TRAINER_ENDPOINTS': '10.3.15.202:54367', 'PADDLE_CURRENT_ENDPOINT': '10.3.15.202:54367', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '1', 'PADDLE_RANK_IN_NODE': '0', 'FLAGS_selected_gpus': '0'}
LAUNCH INFO 2023-04-05 14:31:45,681 ------------------------- ERROR LOG DETAIL -------------------------
GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.2
W0405 14:31:41.663031 16264 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 536, in _thread_loop
    batch = self._get_data()
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 674, in _get_data
    batch.reraise()
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dataloader/worker.py", line 172, in reraise
    raise self.exc_type(msg)
RuntimeError: DataLoader worker(2) caught RuntimeError with message:
Traceback (most recent call last):
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dataloader/worker.py", line 339, in _worker_loop
    batch = fetcher.fetch(indices)
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dataloader/fetcher.py", line 138, in fetch
    data = self.collate_fn(data)
  File "/home/yinghan.huang/Projects/VisionDetect/PaddleYOLO/ppdet/data/reader.py", line 95, in __call__
    batch_data = default_collate_fn(data)
  File "/home/yinghan.huang/Projects/VisionDetect/PaddleYOLO/ppdet/data/utils.py", line 60, in default_collate_fn
    return {
  File "/home/yinghan.huang/Projects/VisionDetect/PaddleYOLO/ppdet/data/utils.py", line 61, in <dictcomp>
    key: default_collate_fn([d[key] for d in batch])
  File "/home/yinghan.huang/Projects/VisionDetect/PaddleYOLO/ppdet/data/utils.py", line 67, in default_collate_fn
    raise RuntimeError(
RuntimeError: fileds number not same among samples in a batch

Traceback (most recent call last):
  File "tools/train.py", line 188, in <module>
    main()
  File "tools/train.py", line 184, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 137, in run
    trainer.train(FLAGS.eval)
  File "/home/yinghan.huang/Projects/VisionDetect/PaddleYOLO/ppdet/engine/trainer.py", line 384, in train
    for step_id, data in enumerate(self.loader):
  File "/home/yinghan.huang/Projects/VisionDetect/PaddleYOLO/ppdet/data/reader.py", line 213, in __next__
    return next(self.loader)
  File "/home/yinghan.huang/anaconda3/envs/ppdet/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 745, in __next__
    self._reader.read_next_list()[0])
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
  [Hint: Expected killed != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:175)

LAUNCH INFO 2023-04-05 14:31:45,681 Exit code 1
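
Judging from the traceback, the exception is raised while default_collate_fn combines one field across the samples of a batch, which suggests that some list-valued field has a different length from sample to sample. The following is a minimal diagnostic sketch, not PaddleYOLO code: find_ragged_fields is a hypothetical helper, and the im_id and gt_poly keys in the toy batch are assumptions chosen only for illustration. It reports which keys hold lists of unequal length.

```python
def find_ragged_fields(batch):
    """Return {key: [length per sample]} for list/tuple fields of unequal length.

    `batch` is a list of per-sample dicts, as a collate function would receive.
    Hypothetical helper for debugging; not part of PaddleYOLO.
    """
    ragged = {}
    for key in batch[0]:
        values = [sample.get(key) for sample in batch]
        # Only plain Python sequences are compared by length here.
        if all(isinstance(v, (list, tuple)) for v in values):
            lengths = [len(v) for v in values]
            if len(set(lengths)) > 1:
                ragged[key] = lengths
    return ragged


# Toy batch (assumed field names): the second sample has two polygons.
batch = [
    {"im_id": 1, "gt_poly": [[0, 0, 1, 1]]},
    {"im_id": 2, "gt_poly": [[0, 0, 1, 1], [2, 2, 3, 3]]},
]
print(find_ragged_fields(batch))  # {'gt_poly': [1, 2]}
```

Running a check like this on a few raw samples before batching may help narrow down which field triggers the error.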

Environment

Bug description confirmation

Are you willing to submit a PR?

nemonameless commented 1 year ago

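load_crowd normally defaults to false, and custom datasets essentially never need it. Are you sure that training ran correctly after you changed it to True, and that it reached a reasonable accuracy? The message "not found any coco record in" means that no valid gt boxes were read at all. Even if setting load_crowd to True lets some boxes be read, they are probably only a small fraction, and the boxes in your dataset that are actually valid may still not be read. I suggest checking how the dataset was built first.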
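
One way to start the dataset check suggested above is to count how the annotations are flagged and encoded. The sketch below assumes the standard COCO JSON layout, in which segmentation is a list for polygons and a dict (with counts and size) for RLE; summarize_annotations and the toy data are hypothetical and only illustrate the idea.

```python
from collections import Counter


def summarize_annotations(coco):
    """Count iscrowd flags and segmentation encodings in a COCO-style dict."""
    stats = Counter()
    for ann in coco.get("annotations", []):
        stats["iscrowd=1" if ann.get("iscrowd", 0) else "iscrowd=0"] += 1
        seg = ann.get("segmentation")
        if isinstance(seg, dict):      # RLE: {"counts": ..., "size": [h, w]}
            stats["rle"] += 1
        elif isinstance(seg, list):    # polygon: [[x1, y1, x2, y2, ...], ...]
            stats["polygon"] += 1
        else:
            stats["no_segmentation"] += 1
    return dict(stats)


# Toy annotation file content; a real file would be read with json.load.
toy = {
    "annotations": [
        {"id": 1, "iscrowd": 1, "segmentation": {"counts": "abc", "size": [4, 4]}},
        {"id": 2, "iscrowd": 0, "segmentation": [[0, 0, 1, 0, 1, 1]]},
    ]
}
print(summarize_annotations(toy))
```

If nearly every annotation is counted under iscrowd=1, that alone could explain why no records are found while crowd annotations are being filtered out.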

ge35tay commented 1 year ago

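> load_crowd normally defaults to false, and custom datasets essentially never need it. Are you sure that training ran correctly after you changed it to True, and that it reached a reasonable accuracy? The message "not found any coco record in" means that no valid gt boxes were read at all. Even if setting load_crowd to True lets some boxes be read, they are probably only a small fraction, and the boxes in your dataset that are actually valid may still not be read. I suggest checking how the dataset was built first.

Our dataset uses RLE encoding, and RLE encoding requires load_crowd to be True. There is nothing wrong with the dataset: it trains successfully for SOLOv2-based instance segmentation on PaddleDetection.

Besides, load_crowd should make no difference for YOLO, because the PaddleYOLO code only checks load_crowd when ["segmentation"]=True. The error above also occurs before execution reaches the part that involves load_crowd.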

kaixin-bai commented 1 year ago

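In PaddleYOLO's coco.py, the iscrowd argument passed to coco.getAnnIds is also always None or False; it is never True. With RLE encoding there are many similar hidden pitfalls. Will PaddlePaddle not support RLE encoding in the future?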
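
The effect of the iscrowd argument can be illustrated without pycocotools. The sketch below mimics, in plain Python, the filtering semantics of getAnnIds as understood here (None keeps every annotation, False keeps only those with iscrowd equal to 0). filter_ann_ids is a hypothetical stand-in rather than the library function, and the mapping from load_crowd to iscrowd is an assumption based on the comment above.

```python
def filter_ann_ids(anns, iscrowd=None):
    """Hypothetical stand-in for the iscrowd filter of getAnnIds."""
    if iscrowd is None:
        # No filtering: crowd and non-crowd annotations are both returned.
        return [a["id"] for a in anns]
    # False == 0 in Python, so iscrowd=False keeps only iscrowd == 0.
    return [a["id"] for a in anns if a["iscrowd"] == iscrowd]


anns = [{"id": 1, "iscrowd": 0}, {"id": 2, "iscrowd": 1}]

for load_crowd in (False, True):
    # Assumed mapping: None when load_crowd is enabled, otherwise False.
    ids = filter_ann_ids(anns, iscrowd=None if load_crowd else False)
    print(load_crowd, ids)
```

If every annotation in an RLE dataset is marked iscrowd=1, a filter value of False would return nothing, which would be consistent with the "not found any coco record" message reported earlier.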

kaixin-bai commented 1 year ago

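> load_crowd normally defaults to false, and custom datasets essentially never need it. Are you sure that training ran correctly after you changed it to True, and that it reached a reasonable accuracy? The message "not found any coco record in" means that no valid gt boxes were read at all. Even if setting load_crowd to True lets some boxes be read, they are probably only a small fraction, and the boxes in your dataset that are actually valid may still not be read. I suggest checking how the dataset was built first.

Also, shouldn't load_crowd be False only when the segmentation is polygon-encoded? Many custom datasets use RLE encoding. With load_crowd set to True, SOLOv2 and YOLOv3 in PaddleDetection can be trained, provided that the source code of pycocotools and PaddleDetection is modified, as we did. In our custom dataset, the 2D bboxes are identical whether or not iscrowd is True; only the segmentation part is affected. Below are inference results after training with these modifications.

0026_IMG_Texture_8Bit_yolo_texinfer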

0979_texture

nemonameless commented 6 months ago

Thanks for the suggestion. We will investigate and fix this later.