PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkit based on PaddlePaddle (a practical, ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and deployment across server, mobile, embedded, and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

Fine-tuning ch_PP-OCRv4_det_server_train: evaluation during training reports out of memory #13759

Open ly03240921 opened 2 weeks ago

ly03240921 commented 2 weeks ago

🔎 Search before asking

🐛 Bug

[2024/08/27 19:14:23] ppocr INFO: epoch: [5/500], global_step: 10, lr: 0.001000, loss: 2.168079, loss_shrink_maps: 1.022120, loss_threshold_maps: 0.760488, loss_binary_maps: 0.204714, loss_cbn: 0.204714, avg_reader_cost: 0.03694 s, avg_batch_cost: 0.04500 s, avg_samples: 0.12, ips: 2.66682 samples/s, eta: 0:41:51, max_mem_reserved: 13909 MB, max_mem_allocated: 11894 MB
eval model::   0%| | 0/4 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/app/ocr/PaddleOCR-release-2.8/tools/train.py", line 257, in <module>
    main(config, device, logger, vdl_writer, seed)
  File "/app/ocr/PaddleOCR-release-2.8/tools/train.py", line 209, in main
    program.train(
  File "/app/ocr/PaddleOCR-release-2.8/tools/program.py", line 452, in train
    cur_metric = eval(
  File "/app/ocr/PaddleOCR-release-2.8/tools/program.py", line 622, in eval
    preds = model(images)
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/architectures/base_model.py", line 99, in forward
    x = self.head(x, targets=data)
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/heads/det_db_head.py", line 145, in forward
    cbn_maps = self.cbn_layer(self.up_conv(f), shrink_maps, None)
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/heads/det_db_head.py", line 127, in forward
    out = self.last_1(self.last_3(outf))
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/backbones/det_mobilenet_v3.py", line 186, in forward
    x = self.conv(x)
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/conv.py", line 715, in forward
    out = F.conv._conv_nd(
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/functional/conv.py", line 128, in _conv_nd
    pre_bias = _C_ops.conv2d(
MemoryError:


C++ Traceback (most recent call last):

0   paddle::pybind::eager_api_conv2d(_object*, _object*, _object*)
1   conv2d_ad_func(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int>>, std::vector<int, std::allocator<int>>, std::string, std::vector<int, std::allocator<int>>, int, std::string)
2   paddle::experimental::conv2d(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int>> const&, std::vector<int, std::allocator<int>> const&, std::string const&, std::vector<int, std::allocator<int>> const&, int, std::string const&)
3   void phi::ConvCudnnKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, std::vector<int, std::allocator<int>> const&, std::vector<int, std::allocator<int>> const&, std::string const&, std::vector<int, std::allocator<int>> const&, int, std::string const&, phi::DenseTensor*)
4   float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
5   phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, phi::DataType, unsigned long, bool, bool) const
6   phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
7   paddle::memory::allocation::Allocator::Allocate(unsigned long)
8   paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
9   paddle::memory::allocation::Allocator::Allocate(unsigned long)
10  paddle::memory::allocation::Allocator::Allocate(unsigned long)
11  paddle::memory::allocation::Allocator::Allocate(unsigned long)
12  paddle::memory::allocation::Allocator::Allocate(unsigned long)
13  paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
14  std::string phi::enforce::GetCompleteTraceBackString(std::string&&, char const*, int)
15  phi::enforce::GetCurrentTraceBackString[abi:cxx11]()


Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 1. Cannot allocate 3.158203GB memory on GPU 1, 13.315369GB memory has been allocated and available memory is only 2.386902GB.

Please check whether there is any other process using GPU 1.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model. (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:86)
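For scale, the 3.158203GB request above is consistent with a single convolution activation on one full-resolution evaluation image. A back-of-envelope sketch (pure illustration, not PaddleOCR code; the channel count and image size below are hypothetical):

```python
# Memory for one dense float32 activation in NCHW layout: N*C*H*W*4 bytes.
# Hypothetical numbers: a 96-channel feature map at the full resolution
# of a ~3000x3000 px document scan.

def tensor_bytes(n: int, c: int, h: int, w: int, dtype_bytes: int = 4) -> int:
    """Bytes needed for one dense N x C x H x W tensor."""
    return n * c * h * w * dtype_bytes

gib = tensor_bytes(1, 96, 3000, 3000) / 1024**3
print(f"{gib:.2f} GiB")  # → 3.22 GiB — one activation is already GiB-scale
```

This is why eval can fail even at batch size 1 while training at batch size 8 succeeds: training crops/resizes images to a small fixed size, while eval feeds much larger images.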

🏃‍♂️ Environment

PaddlePaddle-gpu: 2.6, PaddleOCR: 2.8, RAM: 16 GB

🌰 Minimal Reproducible Example

python tools/train.py -c configs/det/ch_PP-OCRv4/ch_PP-OCRv4_det_teacher.yml
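The evaluation-time knobs live in the same YAML the command loads. A sketch of the relevant keys as they appear in typical PaddleOCR det configs (verify against your own ch_PP-OCRv4_det_teacher.yml before copying; the numeric values are illustrative):

```yaml
Global:
  # [start_iter, interval]: evaluate every 2000 iterations starting at iter 0.
  # Setting the interval larger than the total iteration count effectively
  # disables in-training evaluation.
  eval_batch_step: [0, 2000]

Eval:
  loader:
    shuffle: false
    batch_size_per_card: 1   # det eval conventionally runs at batch size 1
    num_workers: 2
```

Individual keys can also be overridden from the command line with `-o`, e.g. `python tools/train.py -c <config> -o Eval.loader.num_workers=2`, which PaddleOCR's training entry point supports.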

alanxinn commented 2 weeks ago

Not enough GPU memory; reduce the batch size.

ly03240921 commented 2 weeks ago

Not enough GPU memory; reduce the batch size.

The training batch size is 8, and training itself runs fine. The eval batch_size is already 1, yet eval still fails. Training evaluates once every 1000 steps, and that is exactly where it throws "out of memory"; the first 1000 training steps complete normally.

alanxinn commented 2 weeks ago

Have you tried changing the step interval between evaluations? Try reducing it.

ly03240921 commented 2 weeks ago

Have you tried changing the step interval between evaluations? Try reducing it.

I already changed it to 10, so it now evaluates every 10 steps, but it still fails.

alanxinn commented 2 weeks ago

Check whether it is host RAM or GPU memory being exhausted, and try setting the batch size to 4. I'm not sure it will help, though; I've never run into this problem.

ly03240921 commented 2 weeks ago

Check whether it is host RAM or GPU memory being exhausted, and try setting the batch size to 4. I'm not sure it will help, though; I've never run into this problem.

It is GPU memory. Lowering the training batch size didn't help either, and after training, running detection on images with tools/infer_det.py also reports out of GPU memory. I really don't get it...
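When eval already runs one image at a time, the remaining lever is the eval/inference image size: the det pipeline resizes test images with the DetResizeForTest operator (ppocr/data/imaug/operators.py), and an uncapped high-resolution page can exhaust GPU memory on its own. A hedged sketch of capping it in the Eval transforms (the parameter values below are illustrative, not the config's defaults):

```yaml
Eval:
  dataset:
    transforms:
      # ...DecodeImage / label ops / normalization as in the original config...
      - DetResizeForTest:
          limit_side_len: 960   # cap the longer side of eval images
          limit_type: max       # 'max' caps the side length; 'min' enforces a floor
```

Since tools/infer_det.py builds its preprocessing from the same Eval transforms, a cap here should also help the post-training inference run that hit the same error.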

alanxinn commented 2 weeks ago

Check whether it is host RAM or GPU memory being exhausted, and try setting the batch size to 4. I'm not sure it will help, though; I've never run into this problem.

It is GPU memory. Lowering the training batch size didn't help either, and after training, running detection on images with tools/infer_det.py also reports out of GPU memory. I really don't get it...

Paddle sometimes has odd bugs; maybe try reinstalling the training environment and see if that helps. (doge)