Gaojun211 opened this issue 3 years ago
How large did you set the input size? And since this is multi-GPU training, you have to look at the card that uses the most memory; it is already at 100%.
It is multi-GPU training, but I am only using the six cards 0, 1, 6, 7, 8, 9. Where is the input size set?
Your GPU memory usage looks very unbalanced. Stop your program and check the GPU usage; if the memory is not released, fully release it first (with kill -9).
1. I am using GPUs on a shared server, so other people use them too, but before training I kill all of my processes with killall -u xx and pick cards according to the current usage. As shown above I use cards 0, 1, 6, 7, 8, 9, with export CUDA_VISIBLE_DEVICES=0,1,6,7,8,9. Training command: python -m paddle.distributed.launch --gpus 0,1,6,7,8,9 tools/train.py -c configs/faster_rcnn/faster_rcnn_r50_vd_fpn_ssld_2x_coco.yml --use_vdl=True --vdl_log_dir=vdl_dir/scalar_fasterrcnn_1560/
2. The input size is the default; I did not change it. But the images in my training set vary a lot in size: large ones are around (2448, 364) and small ones around (307, 440). Do I need to change target_size under TrainReader to match my image sizes? The reader fragments from my config are below (see the sketch after them for where target_size lives).

worker_num: 2
TrainReader: sample_transforms:
EvalReader: sample_transforms:
TestReader: sample_transforms:
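(For context on where the input size is set: with this model config the training size comes from RandomResize in the TrainReader, which lives in the _base_ reader yml that the model config inherits. Below is a rough sketch of what that section typically looks like in PaddleDetection 2.1; the exact transform list and values are indicative only and may differ from the file shipped with your version.)

```yaml
# Rough sketch of a Faster R-CNN FPN reader in PaddleDetection 2.1
# (values are indicative, not copied from this repo's file)
worker_num: 2
TrainReader:
  sample_transforms:
  - Decode: {}
  # target_size is where the training input size is controlled: one entry is
  # picked at random per image, and keep_ratio preserves the aspect ratio
  - RandomResize: {target_size: [[640, 1333], [672, 1333], [704, 1333], [736, 1333], [768, 1333], [800, 1333]], interp: 2, keep_ratio: True}
  - RandomFlip: {prob: 0.5}
  - NormalizeImage: {is_scale: true, mean: [0.485, 0.456, 0.406], std: [0.229, 0.224, 0.225]}
  - Permute: {}
  batch_transforms:
  # pads every image in the batch so H and W are multiples of the given stride
  - PadBatch: {pad_to_stride: -1}
  batch_size: 1
  shuffle: true
  drop_last: true
```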
I tried batch_size=2 on two cards and it reported an error:
INFO 2021-07-27 01:24:46,358 launch_utils.py:327] terminate all the procs
ERROR 2021-07-27 01:24:46,359 launch_utils.py:584] ABORT!!! Out of all 2 trainers, the trainer process with rank=[1] was aborted. Please check its log.
INFO 2021-07-27 01:24:49,362 launch_utils.py:327] terminate all the procs
The log in workerlog.1 is as follows:
Traceback (most recent call last):
File "tools/train.py", line 139, in <module>
main()
File "tools/train.py", line 135, in main
run(FLAGS, cfg)
File "tools/train.py", line 110, in run
trainer.train(FLAGS.eval)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddledet-2.1.0-py3.6.egg/ppdet/engine/trainer.py", line 306, in train
outputs = model(data)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/parallel.py", line 578, in forward
outputs = self._layers(*inputs, **kwargs)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddledet-2.1.0-py3.6.egg/ppdet/modeling/architectures/meta_arch.py", line 26, in forward
out = self.get_loss()
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddledet-2.1.0-py3.6.egg/ppdet/modeling/architectures/faster_rcnn.py", line 95, in get_loss
rpn_loss, bbox_loss = self._forward()
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddledet-2.1.0-py3.6.egg/ppdet/modeling/architectures/faster_rcnn.py", line 78, in _forward
self.inputs)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddledet-2.1.0-py3.6.egg/ppdet/modeling/heads/bbox_head.py", line 253, in forward
self.bbox_weight)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddledet-2.1.0-py3.6.egg/ppdet/modeling/heads/bbox_head.py", line 317, in get_loss
reg_target = bbox2delta(rois, tgt_bboxes, bbox_weight)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddledet-2.1.0-py3.6.egg/ppdet/modeling/bbox_utils.py", line 32, in bbox2delta
dx = wx * (tgt_ctr_x - src_ctr_x) / src_w
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py", line 250, in __impl__
return math_op(self, other_var, 'axis', axis)
ValueError: (InvalidArgument) Broadcast dimension mismatch. Operands could not be broadcast together with the shape of X = [512] and the shape of Y = [1024]. Received [512] in X is not equal to [1024] in Y at i:0.
[Hint: Expected x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1 == true, but received x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1:0 != true:1.] (at /paddle/paddle/fluid/operators/elementwise/elementwise_op_function.h:169)
[operator < elementwise_sub > error]
The error in workerlog.0:
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::imperative::BasicEngine::Execute()
1 paddle::imperative::PreparedOp::Run(paddle::imperative::NameVariableWrapperMap const&, paddle::imperative::NameVariableWrapperMap const&, paddle::framework::AttributeMap const&)
2 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvGradOpKernel<float>, paddle::operators::CUDNNConvGradOpKernel<double>, paddle::operators::CUDNNConvGradOpKernel<paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
3 paddle::operators::CUDNNConvGradOpKernel<float>::Compute(paddle::framework::ExecutionContext const&) const
4 cudnnConvolutionBwdFilterAlgo_t paddle::operators::SearchAlgorithm<cudnnConvolutionBwdFilterAlgoPerf_t>::Find<float>(paddle::operators::ConvArgs const&, bool, bool, paddle::framework::ExecutionContext const&)
5 paddle::framework::SignalHandle(char const*, int)
6 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()
----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1627349083 (unix time) try "date -d @1627349083" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0x3f5000023ae) received by PID 9217 (TID 0x7fc01378f740) from PID 9134 ***]
Does it also error with a single card and bs = 1 image?
Try changing PadBatch: {pad_to_stride: -1} to PadBatch: {pad_to_stride: 32}.
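That option sits under batch_transforms in each reader section of the reader yml, roughly like this (a sketch, not copied from the actual file):

```yaml
TrainReader:
  batch_transforms:
  # -1 means no extra padding; 32 pads each image in the batch so its height and
  # width become multiples of 32, which matches the FPN feature-map strides
  - PadBatch: {pad_to_stride: 32}
```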
As long as bs is set to 1, training runs fine on both single and multiple GPUs. With bs set to 2, one of the cards errors out in both the single-GPU and multi-GPU cases, and whether PadBatch: {pad_to_stride: 32} is set to 32 or kept at -1, I still get the "Operands could not be broadcast together" kind of error.
[07/27 05:36:27] ppdet.utils.checkpoint INFO: Finish loading model weights: /data/gaojun/.cache/paddle/weights/ResNet50_vd_ssld_v2_pretrained.pdparams
[07/27 05:36:28] ppdet.engine INFO: Epoch: [0] [ 0/30017] learning_rate: 0.000250 loss_rpn_cls: 0.694198 loss_rpn_reg: 0.000000 loss_bbox_cls: 1.128859 loss_bbox_reg: 0.000000 loss: 1.823058 eta: 5 days, 1:55:28 batch_cost: 1.2186 data_cost: 0.0013 ips: 1.6413 images/s
Traceback (most recent call last):
File "tools/train.py", line 139, in <module>
main()
File "tools/train.py", line 135, in main
run(FLAGS, cfg)
File "tools/train.py", line 110, in run
trainer.train(FLAGS.eval)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddledet-2.1.0-py3.6.egg/ppdet/engine/trainer.py", line 306, in train
outputs = model(data)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddledet-2.1.0-py3.6.egg/ppdet/modeling/architectures/meta_arch.py", line 26, in forward
out = self.get_loss()
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddledet-2.1.0-py3.6.egg/ppdet/modeling/architectures/faster_rcnn.py", line 95, in get_loss
rpn_loss, bbox_loss = self._forward()
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddledet-2.1.0-py3.6.egg/ppdet/modeling/architectures/faster_rcnn.py", line 78, in _forward
self.inputs)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddledet-2.1.0-py3.6.egg/ppdet/modeling/heads/bbox_head.py", line 253, in forward
self.bbox_weight)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddledet-2.1.0-py3.6.egg/ppdet/modeling/heads/bbox_head.py", line 317, in get_loss
reg_target = bbox2delta(rois, tgt_bboxes, bbox_weight)
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddledet-2.1.0-py3.6.egg/ppdet/modeling/bbox_utils.py", line 32, in bbox2delta
dx = wx * (tgt_ctr_x - src_ctr_x) / src_w
File "/data/gaojun/.conda/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py", line 250, in __impl__
return math_op(self, other_var, 'axis', axis)
ValueError: (InvalidArgument) Broadcast dimension mismatch. Operands could not be broadcast together with the shape of X = [512] and the shape of Y = [1024]. Received [512] in X is not equal to [1024] in Y at i:0.
[Hint: Expected x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1 == true, but received x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1:0 != true:1.] (at /paddle/paddle/fluid/operators/elementwise/elementwise_op_function.h:169)
[operator < elementwise_sub > error]
Could it be because some images in my training set have odd dimensions? Some of the images are (307, 440), (674, 1023), and so on.
After deleting every image whose width or height is odd from the training set, I still get the same dimension-mismatch error.
Which version of ppdet are you using? I have no problem with the develop branch.
I am using ppdet 2.1. Yesterday I also tried the develop branch and got the same problem :(
After changing allow_empty to True in ppdet/data/source/coco.py I can train with a larger bs; only adding allow_empty=True in the dataset config file does not work.
I looked at it wrong earlier. After changing allow_empty to True in ppdet/data/source/coco.py (without adding allow_empty=True in the /configs/dataset/xx.yml config file), I can train with a larger bs, but during training the network does not actually read the unlabeled data. After adding allow_empty=True in /configs/dataset/xx.yml, the bs problem is still there (only with TrainReader: batch_size: 1 does it read all of the data), that is, I still cannot train with a larger bs and it still reports the same kind of error as above.
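For reference, allow_empty is a dataset-level option, so in the yml it goes on the COCODataSet entry in configs/datasets/xx.yml rather than in the reader, roughly like this (a sketch; the dataset paths and data_fields are placeholders, not my actual ones):

```yaml
# configs/datasets/xx.yml (sketch; dataset_dir / image_dir / anno_path are placeholders)
TrainDataset:
  !COCODataSet
    image_dir: images/train
    anno_path: annotations/train.json
    dataset_dir: dataset/my_data
    data_fields: ['image', 'gt_bbox', 'gt_class', 'is_crowd']
    allow_empty: true   # keep images that have no annotations instead of dropping them
```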
@QingshuChen
faster_rcnn_swin has the same problem. A single card with bs=1 works, but 4 cards with bs=4 gives the error below. As you said, multi-GPU with bs=1 may also run (I have not tried that yet), but that bs feels rather small. It is strange; how did you eventually solve this?
OSError: (External) CUDA error(700), an illegal memory access was encountered. [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at /paddle/paddle/fluid/platform/gpu_info.cc:429)
[operator < where_index > error]
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what(): (External) CUDA error(700), an illegal memory access was encountered. [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at /paddle/paddle/fluid/memory/allocation/cuda_device_context_allocator.h:98)
0 paddle::memory::allocation::CUDADeviceContextAllocatorPool::~CUDADeviceContextAllocatorPool()
1 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
FatalError: `Process abort signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1644813617 (unix time) try "date -d @1644813617" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x3dc00352fab) received by PID 3485611 (TID 0x7f164b5594c0) from PID 3485611 ***]
It was never solved. It only runs successfully with bs=1, which wastes a lot of GPU capacity.
I set allow_empty: True under configs/datasets to train the faster_rcnn model with unlabeled images. batch_size can only be set to 1; with a larger batch_size it reports insufficient GPU memory, yet with batch_size 1 there is plenty of free memory. Where could the problem be?