AttributeError: '_DataLoaderIterMultiProcess' object has no attribute '_shutdown'

ZhangLe-fighting commented 1 year ago

Search before asking

[X] I have searched the question and found no related answer.

Please ask your question

Traceback (most recent call last):
  File "tools/train.py", line 172, in <module>
    main()
  File "tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 123, in run
    trainer = Trainer(cfg, mode='train')
  File "/home/hik/zhangle16/PaddleDetection_YOLOSeries/ppdet/engine/trainer.py", line 94, in __init__
    self.dataset, cfg.worker_num)
  File "/home/hik/zhangle16/PaddleDetection_YOLOSeries/ppdet/data/reader.py", line 197, in __call__
    self.loader = iter(self.dataloader)
  File "/home/hik/anaconda3/envs/paddle_env/lib/python3.6/site-packages/paddle/fluid/reader.py", line 566, in __iter__
    return _DataLoaderIterMultiProcess(self)
  File "/home/hik/anaconda3/envs/paddle_env/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 381, in __init__
    self._try_put_indices()
  File "/home/hik/anaconda3/envs/paddle_env/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 695, in _try_put_indices
    indices = next(self._sampler_iter)
  File "/home/hik/anaconda3/envs/paddle_env/lib/python3.6/site-packages/paddle/fluid/dataloader/batch_sampler.py", line 262, in __iter__
    assert len(indices) == self.total_size
AssertionError
Exception ignored in: <bound method _DataLoaderIterMultiProcess.__del__ of <paddle.fluid.dataloader.dataloader_iter._DataLoaderIterMultiProcess object at 0x7fd205cfbbe0>>
Traceback (most recent call last):
  File "/home/hik/anaconda3/envs/paddle_env/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 712, in __del__
    self._try_shutdown_all()
  File "/home/hik/anaconda3/envs/paddle_env/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 503, in _try_shutdown_all
    if not self._shutdown:
AttributeError: '_DataLoaderIterMultiProcess' object has no attribute '_shutdown'
INFO 2022-09-05 16:27:07,857 launch_utils.py:343] terminate all the procs

As a beginner, I can't solve this problem alone. Please help me. Thank you!
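
A note on reading this traceback: the final AttributeError looks like a secondary error. The first traceback shows `__init__` stopping on an assertion, and the second shows `__del__` then reading `self._shutdown` on the half-built object. The sketch below reproduces that pattern with hypothetical classes (`FragileIter`, `DefensiveIter`, and the helper `close_half_built` are made up for illustration and are not Paddle's code).

```python
class FragileIter:
    """Hypothetical stand-in for an iterator whose __init__ can fail early."""

    def __init__(self, indices, total_size):
        # If this assertion fails, the assignment below never runs.
        assert len(indices) == total_size
        self._shutdown = False

    def close(self):
        # Same kind of access as in the traceback: the attribute may not exist.
        if not self._shutdown:
            self._shutdown = True


class DefensiveIter(FragileIter):
    def close(self):
        # getattr with a default tolerates a half-constructed object.
        if not getattr(self, "_shutdown", True):
            self._shutdown = True


def close_half_built(cls):
    """Create an object without running __init__, then try to close it."""
    obj = cls.__new__(cls)  # simulates __init__ aborting before _shutdown is set
    try:
        obj.close()
        return "ok"
    except AttributeError as exc:
        return f"AttributeError: {exc}"


print(close_half_built(FragileIter))
print(close_half_built(DefensiveIter))
```

If this reading is right, the AssertionError raised in `batch_sampler.py` is the error worth investigating; the `_shutdown` message is only a side effect of the failed construction.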

ZhangLe-fighting commented 1 year ago

nemonameless commented 1 year ago

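When asking a question, please follow the issue template: describe what you changed and which versions and environment you are using, so that the problem can be diagnosed.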

ZhangLe-fighting commented 1 year ago

Changes: I only modified the configuration files.
In coco_detection.yml: num_classes: 4
In yolov7p6_elannet.yml: YOLOv7Head: anchors: [[29, 35], [41, 62], [79, 37], [96, 101], [185, 58], [155, 174], [298, 125], [267, 290], [621, 176], [414, 402], [880, 414], [945, 949]] (generated by tools/anchor_cluster.py)
In yolov7p6_reader.yml: worker_num: 1; under TrainReader: batch_size: 4; under EvalReader: batch_size: 4

Versions:
paddlepaddle-gpu 2.3.2
paddledet 2.4.0
paddle-bfloat 0.1.7
python 3.6.2
cudatoolkit 11.2.2
cudnn 8.1.0.77

Server environment: Tesla P40 server, Linux gpu 4.15.0-142-generic system. I created a Python 3.6 virtual environment with conda under Anaconda, then installed paddle 2.3.2 and paddledetection_yoloseries.

ZhangLe-fighting commented 1 year ago

Training command:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m paddle.distributed.launch --log_dir=./yolo7_dygraph/ --gpus 4,5,6,7 tools/train.py -c configs/yolov7/yolov7p6_e6e_300e_coco.yml --eval -o use_gpu=true

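If I switch the command above to single-GPU training, I get an out-of-memory error: the GPU's memory usage keeps climbing until the process crashes.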

nemonameless commented 1 year ago

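If GPU memory keeps rising from the very start of training until it crashes, the batch size is too large; just reduce it. You can also add --amp to the training command to enable mixed-precision training, which lowers GPU memory usage at the same batch size.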

ZhangLe-fighting commented 1 year ago

I'm using four GPUs with worker_num=1 and bs=4, and the same thing happens. Each GPU has 22919MiB of memory. The error is:
  File "/home/hik/anaconda3/envs/paddle_env/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 503, in _try_shutdown_all
    if not self._shutdown:
AttributeError: '_DataLoaderIterMultiProcess' object has no attribute '_shutdown'

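This error comes from multi-GPU training, so I don't think the batch size can be reduced any further. The problem still occurs with mixed-precision training enabled.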

nemonameless commented 1 year ago

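The bs=4 in TrainReader is the per-GPU batch size. Reducing it further would make the total batch size too small and hurt accuracy. I'm also on paddle 2.3.2, but with Python 3.7, and I don't see this problem, so you could try upgrading Python.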
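
To make the per-GPU versus total batch size distinction concrete, here is a small arithmetic sketch. The helper name `total_batch_size` is hypothetical; it only multiplies the per-GPU batch size by the number of GPU ids in a `--gpus`-style string, assuming each GPU processes its own batch.

```python
def total_batch_size(per_gpu_batch_size, gpus):
    """Effective batch size when each listed GPU processes its own batch."""
    gpu_ids = [g for g in gpus.split(",") if g.strip()]
    return per_gpu_batch_size * len(gpu_ids)


# batch_size: 4 in TrainReader with --gpus 4,5,6,7 (values from this thread)
print(total_batch_size(4, "4,5,6,7"))  # prints 16
```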

ZhangLe-fighting commented 1 year ago

OK, thanks for the reply. I'll give it a try.

ZhangLe-fighting commented 1 year ago

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m paddle.distributed.launch --log_dir=./ppyoloe_dygraph/ --gpus 2,3,4,7 tools/train.py -c configs/ppyoloe/ppyoloe_crn_l_300e_coco.yml --eval --amp

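After upgrading to Python 3.7 (in a newly created test environment), the problem still occurs. I looked at the .py file that raises the error, but I'm not experienced enough to see what is wrong, so I hope the author can take a look. On my side, single-GPU training with COCO-format data and multi-GPU training with VOC-format data both run without this problem; only multi-GPU training with COCO-format data fails.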

Error log attached:
loading annotations into memory...
Done (t=0.25s)
creating index...
index created!
Traceback (most recent call last):
  File "tools/train.py", line 172, in <module>
    main()
  File "tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 123, in run
    trainer = Trainer(cfg, mode='train')
  File "/home/hik/zhangle16/paddledet/PaddleDetection_YOLOSeries/ppdet/engine/trainer.py", line 97, in __init__
    self.dataset, cfg.worker_num)
  File "/home/hik/zhangle16/paddledet/PaddleDetection_YOLOSeries/ppdet/data/reader.py", line 197, in __call__
    self.loader = iter(self.dataloader)
  File "/home/hik/anaconda3/envs/paddledet_env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 566, in __iter__
    return _DataLoaderIterMultiProcess(self)
  File "/home/hik/anaconda3/envs/paddledet_env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 381, in __init__
    self._try_put_indices()
  File "/home/hik/anaconda3/envs/paddledet_env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 695, in _try_put_indices
    indices = next(self._sampler_iter)
  File "/home/hik/anaconda3/envs/paddledet_env/lib/python3.7/site-packages/paddle/fluid/dataloader/batch_sampler.py", line 262, in __iter__
    assert len(indices) == self.total_size
AssertionError

C++ Traceback (most recent call last):

No stack trace in paddle, may be caused by external reasons.

Error Message Summary:

FatalError: Termination signal is detected by the operating system.
  [TimeInfo: Aborted at 1662449374 (unix time) try "date -d @1662449374" if you are using GNU date ]
  [SignalInfo: SIGTERM (@0x61d7) received by PID 25589 (TID 0x7efda056c700) from PID 25047 ]

Exception ignored in: <function _DataLoaderIterMultiProcess.__del__ at 0x7efd344aec20>
Traceback (most recent call last):
  File "/home/hik/anaconda3/envs/paddledet_env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 712, in __del__
    self._try_shutdown_all()
  File "/home/hik/anaconda3/envs/paddledet_env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 503, in _try_shutdown_all
    if not self._shutdown:
AttributeError: '_DataLoaderIterMultiProcess' object has no attribute '_shutdown'
INFO 2022-09-06 15:29:41,695 launch_utils.py:343] terminate all the procs

nemonameless commented 1 year ago

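So multi-GPU training works with VOC format but fails with COCO format? Does multi-GPU training directly on the COCO dataset fail as well? As suggested in a related issue, run rm -rf /dev/shm/* to clean up and then try training again. For other possible fixes, see https://github.com/PaddlePaddle/PaddleDetection/issues/3279 , https://github.com/PaddlePaddle/PaddleDetection/issues/2555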
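
Before clearing /dev/shm, it may help to see how full it actually is. The following is a minimal sketch using the standard-library `shutil.disk_usage`; the helper name `free_fraction` is hypothetical, and the thread does not say how full is "too full", so no threshold is implied.

```python
import os
import shutil


def free_fraction(path="/dev/shm"):
    """Return the fraction of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


# /dev/shm normally exists on Linux; guard the call so the sketch runs elsewhere too.
if os.path.isdir("/dev/shm"):
    print(f"/dev/shm free: {free_fraction():.1%}")
```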

ZhangLe-fighting commented 1 year ago

AssertionError

C++ Traceback (most recent call last):

No stack trace in paddle, may be caused by external reasons.

Error Message Summary:

FatalError: Termination signal is detected by the operating system.
  [TimeInfo: Aborted at 1662457696 (unix time) try "date -d @1662457696" if you are using GNU date ]
  [SignalInfo: SIGTERM (@0xacff) received by PID 44993 (TID 0x7f45d0865700) from PID 44287 ]

Exception ignored in: <function _DataLoaderIterMultiProcess.__del__ at 0x7f45647a7c20>
Traceback (most recent call last):
  File "/home/hik/anaconda3/envs/paddledet_env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 712, in __del__
    self._try_shutdown_all()
  File "/home/hik/anaconda3/envs/paddledet_env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 503, in _try_shutdown_all
    if not self._shutdown:
AttributeError: '_DataLoaderIterMultiProcess' object has no attribute '_shutdown'

This is the error I get now. In addition, with single-GPU training, AP is 0 across the board.

nemonameless commented 1 year ago

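Have you tried all the fixes mentioned above? Do you still get this error? How large is your dataset, and how many images does it contain in total? An eval AP of 0 on a single GPU may mean the dataset's ground-truth annotations were built incorrectly. You can visualize them to check, or post screenshots of the loss at the start of training and after 1 epoch to show how it is converging.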
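
One reason the number of usable samples can matter is the assertion `assert len(indices) == self.total_size` shown in the traceback. The sketch below is an assumption about how a distributed sampler might pad its index list so that it divides evenly across ranks; it is not copied from Paddle's `batch_sampler.py`, and `padded_indices` is a hypothetical name. Under this assumed logic, padding by slicing cannot add more entries than already exist, so a dataset that ends up with very few samples would fail the same kind of check.

```python
import math


def padded_indices(num_samples, num_ranks):
    """Pad the index list so it splits evenly across ranks (assumed logic)."""
    per_rank = math.ceil(num_samples / num_ranks)
    total_size = per_rank * num_ranks
    indices = list(range(num_samples))
    # Slicing can only repeat entries that already exist in the list.
    indices += indices[: total_size - len(indices)]
    return indices, total_size


for n in (4000, 1):
    indices, total_size = padded_indices(n, num_ranks=4)
    print(n, len(indices), total_size, len(indices) == total_size)
```

With 4000 samples on 4 ranks the lengths match; with a single sample they do not. Whether this is what happened in this thread is not established here.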

ZhangLe-fighting commented 1 year ago

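OK, I'll try. The dataset has 4000 images, roughly 1080p or 2K in resolution. I've tried all the methods above and still get the error. When single-GPU AP is 0, training runs extremely fast, which it shouldn't... I'll look into it and report back.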

ZhangLe-fighting commented 1 year ago

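So far, multi-GPU training with VOC format has been tested and works, but multi-GPU training with COCO format reports the errors listed above and is still unresolved. For single-GPU training with COCO format, the problem may lie in the dataset format conversion. I'm new to this, so troubleshooting is slow on my side.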

ZhangLe-fighting commented 1 year ago

index created!
<class 'ppdet.modeling.architectures.yolo.YOLOv3'>
Traceback (most recent call last):
  File "tools/train.py", line 172, in <module>
    main()
  File "tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 132, in run
    trainer.train(FLAGS.eval)
  File "/home/hik/zhangle16/PaddleDetection_YOLOSeries/ppdet/engine/trainer.py", line 443, in train
    self._flops(flops_loader)
  File "/home/hik/zhangle16/PaddleDetection_YOLOSeries/ppdet/engine/trainer.py", line 933, in _flops
    flops = flops(self.model, input_spec) / (1000**3)
  File "/home/hik/anaconda3/envs/paddle_env/lib/python3.6/site-packages/paddleslim/analysis/flops.py", line 133, in dygraph_flops
    program = dygraph2program(model, inputs)
  File "/home/hik/anaconda3/envs/paddle_env/lib/python3.6/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/hik/anaconda3/envs/paddle_env/lib/python3.6/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/home/hik/anaconda3/envs/paddle_env/lib/python3.6/site-packages/paddle/fluid/framework.py", line 434, in __impl__
    return func(*args, **kwargs)
  File "/home/hik/anaconda3/envs/paddle_env/lib/python3.6/site-packages/paddleslim/core/dygraph.py", line 162, in dygraph2program
    original_outputs = layer(inputs)
  File "/home/hik/anaconda3/envs/paddle_env/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/hik/anaconda3/envs/paddle_env/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/hik/zhangle16/PaddleDetection_YOLOSeries/ppdet/modeling/architectures/meta_arch.py", line 75, in forward
    outs.append(self.get_pred())
  File "/home/hik/zhangle16/PaddleDetection_YOLOSeries/ppdet/modeling/architectures/yolo.py", line 127, in get_pred
    return self._forward()
  File "/home/hik/zhangle16/PaddleDetection_YOLOSeries/ppdet/modeling/architectures/yolo.py", line 118, in _forward
    yolo_head_outs, self.inputs['scale_factor'])
  File "/home/hik/zhangle16/PaddleDetection_YOLOSeries/ppdet/modeling/heads/ppyoloe_head.py", line 370, in post_process
    pred_dist.transpose([0, 2, 1]))
  File "/home/hik/zhangle16/PaddleDetection_YOLOSeries/ppdet/modeling/bbox_utils.py", line 780, in batch_distance2bbox
    x1y1 = -lt + points
  File "/home/hik/anaconda3/envs/paddle_env/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py", line 299, in __impl__
    return math_op(self, other_var, 'axis', axis)
ValueError: (InvalidArgument) Broadcast dimension mismatch. Operands could not be broadcast together with the shape of X = [1, 4116, 2] and the shape of Y = [8400, 2]. Received [4116] in X is not equal to [8400] in Y at i:1.
  [Hint: Expected x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1 == true, but received x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1:0 != true:1.] (at /paddle/paddle/phi/kernels/funcs/common_shape.h:84)
  [operator < elementwise_add > error]
INFO 2022-09-07 11:50:42,890 launch_utils.py:343] terminate all the procs
INFO 2022-09-07 11:50:42,890 launch_utils.py:343] terminate all the procs
ERROR 2022-09-07 11:50:42,890 launch_utils.py:642] ABORT!!! Out of all 2 trainers, the trainer process with rank=[0, 1] was aborted. Please check its log.
ERROR 2022-09-07 11:50:42,890 launch_utils.py:642] ABORT!!! Out of all 2 trainers, the trainer process with rank=[0, 1] was aborted. Please check its log.
INFO 2022-09-07 11:50:46,895 launch_utils.py:343] terminate all the procs
INFO 2022-09-07 11:50:46,895 launch_utils.py:343] terminate all the procs
INFO 2022-09-07 11:50:46,895 launch.py:402] Local processes completed.
INFO 2022-09-07 11:50:46,895 launch.py:402] Local processes completed.

After modifying the COCO data, both single-GPU and multi-GPU training hit the problem above.
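
The two sizes in the broadcast error, 4116 and 8400, can be reproduced by counting feature-map grid cells. The sketch below assumes a square input and detection strides of 8, 16 and 32; both are assumptions for illustration rather than values read from the config in this thread, and `num_anchor_points` is a hypothetical helper.

```python
def num_anchor_points(input_size, strides=(8, 16, 32)):
    """Count grid cells over all feature levels for a square input."""
    return sum((input_size // stride) ** 2 for stride in strides)


print(num_anchor_points(640))  # 80*80 + 40*40 + 20*20 = 8400
print(num_anchor_points(448))  # 56*56 + 28*28 + 14*14 = 4116
```

Under those assumptions the numbers are consistent with anchor points built for a 640x640 input being added to predictions computed on a 448x448 input. This is only an observation about the arithmetic, not a confirmed diagnosis.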

ZhangLe-fighting commented 1 year ago

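Thanks to the developers for answering my questions. After further testing, both VOC-format and COCO-format data now train normally. The causes I have identified so far are:

Training with yolov7p6_e6e_300e_coco.yml failed because the hardware is not sufficient, which leads to running out of memory.
With ppyoloe_crn_l_300e_coco.yml on COCO-format data, the single-GPU AP of 0 and the various multi-GPU errors were caused by a faulty COCO dataset conversion.

Thanks again!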
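
For anyone hitting similar symptoms, a quick sanity check of the converted annotation file may save time. The sketch below is a hypothetical checker (`find_coco_problems` is not part of PaddleDetection) that assumes the standard COCO layout: `images` with `id`, `width`, `height`; `categories` with `id`; and `annotations` with `image_id`, `category_id` and `bbox` as `[x, y, width, height]`. The thread does not say what was wrong with the conversion, so the checks are generic.

```python
def find_coco_problems(coco):
    """Return a list of human-readable problems found in a COCO-style dict."""
    problems = []
    images = {img["id"]: img for img in coco.get("images", [])}
    category_ids = {cat["id"] for cat in coco.get("categories", [])}
    annotated = set()
    for ann in coco.get("annotations", []):
        ann_id = ann.get("id")
        img = images.get(ann.get("image_id"))
        if img is None:
            problems.append(f"annotation {ann_id}: unknown image_id")
            continue
        annotated.add(img["id"])
        if ann.get("category_id") not in category_ids:
            problems.append(f"annotation {ann_id}: unknown category_id")
        bbox = ann.get("bbox", [])
        if len(bbox) != 4:
            problems.append(f"annotation {ann_id}: bbox must have 4 values")
            continue
        x, y, w, h = bbox
        if w <= 0 or h <= 0:
            problems.append(f"annotation {ann_id}: non-positive box size")
        elif x < 0 or y < 0 or x + w > img["width"] or y + h > img["height"]:
            # A box stored as [x1, y1, x2, y2] often trips this check.
            problems.append(f"annotation {ann_id}: box extends outside image")
    for image_id in sorted(images.keys() - annotated):
        problems.append(f"image {image_id}: no annotations")
    return problems


# Tiny made-up example: one valid box, one bad box, one unannotated image.
sample = {
    "images": [
        {"id": 1, "width": 100, "height": 100},
        {"id": 2, "width": 100, "height": 100},
    ],
    "categories": [{"id": 1, "name": "a"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [10, 10, 20, 20]},
        {"id": 2, "image_id": 1, "category_id": 9, "bbox": [90, 90, 20, 20]},
    ],
}
for problem in find_coco_problems(sample):
    print(problem)
```

To check a real file, load it with `json.load` and pass the resulting dict to the function. An image without annotations is not necessarily an error, so treat that message as a warning.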

nemonameless commented 1 year ago

Congratulations. In the end, you mostly solved it yourself through hands-on work.

PaddlePaddle / PaddleYOLO