PP-YOLOE+ training reports out of memory and asks for EB-scale GPU memory.

dium6i commented 1 year ago

Search before asking

[X] I have searched the question and found no related answer.

Please ask your question

Environment: Docker image based on paddlepaddle/paddle:2.4.2-gpu-cuda10.2-cudnn7.6-trt7.0, with the libraries from requirements.txt installed.
OS: Ubuntu 22.04
Hardware: 3080Ti 12G
Driver version: 525.105.17
CUDA version: 12.0
Config: bs=4, lr=0.000025
Command: python tools/train.py -c configs/ppyoloe/ppyoloe_plus_crn_x_80e_coco.yml --eval

The error is as follows:

grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
Warning: import ppdet from source directory without installing, run 'python setup.py install' to install ppdet firstly
W0418 05:10:45.465602   518 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.0, Runtime API Version: 10.2
W0418 05:10:45.470810   518 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.
[04/18 05:10:47] ppdet.utils.checkpoint INFO: The shape [365] in pretrained weight yolo_head.pred_cls.0.bias is unmatched with the shape [7] in model yolo_head.pred_cls.0.bias. And the weight yolo_head.pred_cls.0.bias will not be loaded
[04/18 05:10:47] ppdet.utils.checkpoint INFO: The shape [365, 960, 3, 3] in pretrained weight yolo_head.pred_cls.0.weight is unmatched with the shape [7, 960, 3, 3] in model yolo_head.pred_cls.0.weight. And the weight yolo_head.pred_cls.0.weight will not be loaded
[04/18 05:10:47] ppdet.utils.checkpoint INFO: The shape [365] in pretrained weight yolo_head.pred_cls.1.bias is unmatched with the shape [7] in model yolo_head.pred_cls.1.bias. And the weight yolo_head.pred_cls.1.bias will not be loaded
[04/18 05:10:47] ppdet.utils.checkpoint INFO: The shape [365, 480, 3, 3] in pretrained weight yolo_head.pred_cls.1.weight is unmatched with the shape [7, 480, 3, 3] in model yolo_head.pred_cls.1.weight. And the weight yolo_head.pred_cls.1.weight will not be loaded
[04/18 05:10:47] ppdet.utils.checkpoint INFO: The shape [365] in pretrained weight yolo_head.pred_cls.2.bias is unmatched with the shape [7] in model yolo_head.pred_cls.2.bias. And the weight yolo_head.pred_cls.2.bias will not be loaded
[04/18 05:10:47] ppdet.utils.checkpoint INFO: The shape [365, 240, 3, 3] in pretrained weight yolo_head.pred_cls.2.weight is unmatched with the shape [7, 240, 3, 3] in model yolo_head.pred_cls.2.weight. And the weight yolo_head.pred_cls.2.weight will not be loaded
[04/18 05:10:47] ppdet.utils.checkpoint INFO: Finish loading model weights: ./ppyoloe_crn_x_obj365_pretrained.pdparams
terminate called after throwing an instance of 'paddle::memory::allocation::BadAlloc'
  what():  

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   masked_select_ad_func(paddle::experimental::Tensor const&, paddle::experimental::Tensor const&)
1   paddle::experimental::masked_select(paddle::experimental::Tensor const&, paddle::experimental::Tensor const&)
2   void phi::MaskedSelectKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, phi::DenseTensor*)
3   phi::DenseTensor::mutable_data(phi::Place const&, paddle::experimental::DataType, unsigned long)
4   paddle::memory::AllocShared(phi::Place const&, unsigned long)
5   paddle::memory::allocation::AllocatorFacade::AllocShared(phi::Place const&, unsigned long)
6   paddle::memory::allocation::AllocatorFacade::Alloc(phi::Place const&, unsigned long)
7   paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
8   paddle::memory::allocation::Allocator::Allocate(unsigned long)
9   paddle::memory::allocation::Allocator::Allocate(unsigned long)
10  paddle::memory::allocation::Allocator::Allocate(unsigned long)
11  paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
12  std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
13  phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError: 

Out of memory error on GPU 0. Cannot allocate 15.877086EB memory on GPU 0, 6.362915GB memory has been allocated and available memory is only 5.399963GB.

Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model. 
If the above ways do not solve the out of memory problem, you can try to use CUDA managed memory. The command is `export FLAGS_use_cuda_managed_memory=false`.
 (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:95)

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::pybind::ThrowExceptionToPython(std::__exception_ptr::exception_ptr)

----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1681794658 (unix time) try "date -d @1681794658" if you are using GNU date ***]
  [SignalInfo: *** SIGABRT (@0x206) received by PID 518 (TID 0x7fd2dbb2f700) from PID 518 ***]

Aborted (core dumped)
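The requested size in the message above is worth a quick sanity check. The following is a minimal sketch, assuming the log uses binary units (1 EB = 2^60 bytes) and that the allocator receives the size as an unsigned 64-bit integer, as the `unsigned long` parameters in the traceback suggest. The helper `as_signed_64` is hypothetical and only illustrates the arithmetic.

```python
EB = 2 ** 60  # assumption: the log prints binary units
GB = 2 ** 30


def as_signed_64(n):
    """Reinterpret an unsigned 64-bit value as a two's-complement signed one."""
    n &= 2 ** 64 - 1
    return n - 2 ** 64 if n >= 2 ** 63 else n


# Size taken from "Cannot allocate 15.877086EB memory on GPU 0".
requested_bytes = round(15.877086 * EB)
card_bytes = 12 * GB  # the 3080Ti 12G from the report

print("requested / card capacity:", requested_bytes / card_bytes)
print("same bits read as signed :", as_signed_64(requested_bytes))
```

The request sits just below 2^64 bytes, so read as a signed 64-bit value it is negative. That pattern points to an invalid size reaching the allocator rather than to a model that is merely too large for the card, which may be why lowering the batch size did not help.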

After adding --amp, the error is as follows:

grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
Warning: import ppdet from source directory without installing, run 'python setup.py install' to install ppdet firstly
W0418 05:15:57.127051   620 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.0, Runtime API Version: 10.2
W0418 05:15:57.131618   620 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.
[04/18 05:15:58] ppdet.utils.checkpoint INFO: The shape [365] in pretrained weight yolo_head.pred_cls.0.bias is unmatched with the shape [7] in model yolo_head.pred_cls.0.bias. And the weight yolo_head.pred_cls.0.bias will not be loaded
[04/18 05:15:58] ppdet.utils.checkpoint INFO: The shape [365, 960, 3, 3] in pretrained weight yolo_head.pred_cls.0.weight is unmatched with the shape [7, 960, 3, 3] in model yolo_head.pred_cls.0.weight. And the weight yolo_head.pred_cls.0.weight will not be loaded
[04/18 05:15:58] ppdet.utils.checkpoint INFO: The shape [365] in pretrained weight yolo_head.pred_cls.1.bias is unmatched with the shape [7] in model yolo_head.pred_cls.1.bias. And the weight yolo_head.pred_cls.1.bias will not be loaded
[04/18 05:15:58] ppdet.utils.checkpoint INFO: The shape [365, 480, 3, 3] in pretrained weight yolo_head.pred_cls.1.weight is unmatched with the shape [7, 480, 3, 3] in model yolo_head.pred_cls.1.weight. And the weight yolo_head.pred_cls.1.weight will not be loaded
[04/18 05:15:58] ppdet.utils.checkpoint INFO: The shape [365] in pretrained weight yolo_head.pred_cls.2.bias is unmatched with the shape [7] in model yolo_head.pred_cls.2.bias. And the weight yolo_head.pred_cls.2.bias will not be loaded
[04/18 05:15:58] ppdet.utils.checkpoint INFO: The shape [365, 240, 3, 3] in pretrained weight yolo_head.pred_cls.2.weight is unmatched with the shape [7, 240, 3, 3] in model yolo_head.pred_cls.2.weight. And the weight yolo_head.pred_cls.2.weight will not be loaded
[04/18 05:15:58] ppdet.utils.checkpoint INFO: Finish loading model weights: ./ppyoloe_crn_x_obj365_pretrained.pdparams
Traceback (most recent call last):
  File "tools/train.py", line 188, in <module>
    main()
  File "tools/train.py", line 184, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 137, in run
    trainer.train(FLAGS.eval)
  File "/work/PaddleYOLO-release-2.6/ppdet/engine/trainer.py", line 414, in train
    outputs = model(data)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
    return self.forward(*inputs, **kwargs)
  File "/work/PaddleYOLO-release-2.6/ppdet/modeling/architectures/meta_arch.py", line 59, in forward
    out = self.get_loss()
  File "/work/PaddleYOLO-release-2.6/ppdet/modeling/architectures/yolo.py", line 108, in get_loss
    return self._forward()
  File "/work/PaddleYOLO-release-2.6/ppdet/modeling/architectures/yolo.py", line 83, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
    return self.forward(*inputs, **kwargs)
  File "/work/PaddleYOLO-release-2.6/ppdet/modeling/heads/ppyoloe_head.py", line 239, in forward
    return self.forward_train(feats, targets)
  File "/work/PaddleYOLO-release-2.6/ppdet/modeling/heads/ppyoloe_head.py", line 176, in forward_train
    ], targets)
  File "/work/PaddleYOLO-release-2.6/ppdet/modeling/heads/ppyoloe_head.py", line 391, in get_loss
    assigned_scores_sum)
  File "/work/PaddleYOLO-release-2.6/ppdet/modeling/heads/ppyoloe_head.py", line 296, in _bbox_loss
    bbox_mask).reshape([-1, 4])
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/tensor/search.py", line 801, in masked_select
    return _C_ops.masked_select(x, mask)
RuntimeError: (PreconditionNotMet) The Tensor's element number must be equal or greater than zero. The Tensor's shape is [-4656394580360012950] now
  [Hint: Expected numel() >= 0, but received numel():-4656394580360012950 < 0:0.] (at /paddle/paddle/phi/core/dense_tensor_impl.cc:110)
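The traceback ends in `masked_select`, called from `_bbox_loss` with `bbox_mask`. Below is a minimal pure-Python sketch of that kind of selection on flat, same-length inputs. It is not Paddle's implementation, and `is_plausible_numel` and the toy data are hypothetical. It illustrates that the output size equals the number of true mask entries, so a valid result always has between 0 and `len(mask)` elements.

```python
def masked_select(values, mask):
    """Pure-Python stand-in for masked_select on flat, same-length inputs."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    return [v for v, keep in zip(values, mask) if keep]


def is_plausible_numel(numel, mask_size):
    """A selected-element count must lie between 0 and the mask size."""
    return 0 <= numel <= mask_size


# Hypothetical toy data: two boxes, only the first one is a positive sample.
boxes = [10.0, 20.0, 30.0, 40.0, 11.0, 21.0, 31.0, 41.0]
bbox_mask = [True, True, True, True, False, False, False, False]

selected = masked_select(boxes, bbox_mask)
print(selected, is_plausible_numel(len(selected), len(bbox_mask)))

reported_numel = -4656394580360012950  # copied from the RuntimeError above
print(is_plausible_numel(reported_numel, len(bbox_mask)))
```

No mask can produce the element count in the error message, which suggests the count computed on the GPU was already corrupted before the output tensor was allocated.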

nemonameless commented 1 year ago

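If GPU memory is insufficient, we suggest switching to a smaller model. Training on a single card with bs=4 is not worthwhile, since the accuracy will certainly not be high. For ppyoloe+, we recommend using https://github.com/PaddlePaddle/PaddleDetection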

dium6i commented 1 year ago

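> If GPU memory is insufficient, we suggest switching to a smaller model. Training on a single card with bs=4 is not worthwhile, since the accuracy will certainly not be high.
>
> For ppyoloe+, we recommend using https://github.com/PaddlePaddle/PaddleDetection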

I just tried ppyoloe+ under PaddleDetection and the error is the same; switching to picodet produces the same error as well. Is this related to the CUDA version?
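One way to probe the CUDA question is to compare the versions Paddle itself printed at startup. The sketch below parses that log line and checks it against a small table of the first CUDA toolkit release that supports each compute capability. The table values are assumptions taken from NVIDIA's release history, not from this thread, and the helper names are hypothetical.

```python
import re

# Assumption: first CUDA toolkit release supporting each compute capability.
MIN_RUNTIME = {(7, 5): (10, 0), (8, 0): (11, 0), (8, 6): (11, 1)}

# Copied from the training log in this issue.
LOG_LINE = (
    "W0418 05:10:45.465602   518 gpu_resources.cc:61] Please NOTE: device: 0, "
    "GPU Compute Capability: 8.6, Driver API Version: 12.0, "
    "Runtime API Version: 10.2"
)


def parse_versions(line):
    """Extract (capability, driver, runtime) as integer tuples from the notice."""
    pattern = (
        r"GPU Compute Capability: (\d+)\.(\d+), "
        r"Driver API Version: (\d+)\.(\d+), "
        r"Runtime API Version: (\d+)\.(\d+)"
    )
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("line does not contain the expected version fields")
    nums = [int(g) for g in match.groups()]
    return tuple(nums[0:2]), tuple(nums[2:4]), tuple(nums[4:6])


def runtime_supports(capability, runtime):
    """Return True/False if the table knows the capability, else None."""
    required = MIN_RUNTIME.get(capability)
    if required is None:
        return None
    return runtime >= required


capability, driver, runtime = parse_versions(LOG_LINE)
print("capability", capability, "runtime", runtime)
print("runtime supports this GPU:", runtime_supports(capability, runtime))
```

Under these assumptions, the CUDA 10.2 runtime in the Docker image predates the compute capability 8.6 GPU, regardless of the 12.0 driver on the host. An image built against CUDA 11.x would be the natural thing to try, though the thread does not confirm that this resolves the error.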

nemonameless commented 8 months ago

Please update to the latest code and a newer paddle version. Thanks.

PaddlePaddle / PaddleYOLO

PP-YOLOE+ training reports out of memory and asks for EB-scale GPU memory. #128
