Training on a dataset with ch_PP-OCRv4_rec fails: Out of memory error on GPU 0. Cannot allocate 129.394531MB memory on GPU 0, 23.611938GB memory has been allocated and available memory is only 31.687500MB.

lili-changjiang commented 3 weeks ago

System Environment: Linux
Version: Paddle: 2.4.2.post112
Command Code: `python tools/train.py -c configs/rec/PP-OCRv4/ch_PP-OCRv4_rec.yml`
Complete Error Message: Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 129.394531MB memory on GPU 0, 23.611938GB memory has been allocated and available memory is only 31.687500MB.

Please check whether there is any other process using GPU 0.

If yes, please stop them, or start PaddlePaddle on another GPU.
If no, please decrease the batch size of your model. If the above ways do not solve the out of memory problem, you can try to use CUDA managed memory. The command is export FLAGS_use_cuda_managed_memory=false. (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:95)

My ch_PP-OCRv4_rec.yml:

```
Global:
  debug: false
  use_gpu: true
  epoch_num: 20
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/rec_ppocr_v4
  save_epoch_step: 3
  eval_batch_step: [0, 100]
  cal_metric_during_train: true
  pretrained_model: ./pretrained_models/ch_PP-OCRv4_rec_train/student
  checkpoints:
  save_inference_dir:
  use_visualdl: false
  infer_img: doc/imgs_words/ch/word_1.jpg
  character_dict_path: ppocr/utils/ppocr_keys_v1.txt
  max_text_length: &max_text_length 25
  infer_mode: false
  use_space_char: true
  distributed: true
  save_res_path: ./output/rec/predicts_ppocrv3.txt

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.0001
    warmup_epoch: 2
  regularizer:
    name: L2
    factor: 3.0e-05

Architecture:
  model_type: rec
  algorithm: SVTR_LCNet
  Transform:
  Backbone:
    name: PPLCNetV3
    scale: 0.95
  Head:
    name: MultiHead
    head_list:
      - CTCHead:
          Neck:
            name: svtr
            dims: 120
            depth: 2
            hidden_dims: 120
            kernel_size: [1, 3]
            use_guide: True
          Head:
            fc_decay: 0.00001
      - NRTRHead:
          nrtr_dim: 384
          max_text_length: *max_text_length

Loss:
  name: MultiLoss
  loss_config_list:
    - CTCLoss:
    - NRTRLoss:

PostProcess:
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc

Train:
  dataset:
    name: MultiScaleDataSet
    ds_width: false
    data_dir: ./train_data/train
    ext_op_transform_idx: 1
    label_file_list:
      - ./train_data/rec/train.txt
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      - RecConAug:
          prob: 0.5
          ext_data_num: 2
          image_shape: [48, 320, 3]
          max_text_length: *max_text_length
      - RecAug:
      - MultiLabelEncode:
          gtc_encode: NRTRLabelEncode
      - KeepKeys:
          keep_keys:
            - image
            - label_ctc
            - label_gtc
            - length
            - valid_ratio
  sampler:
    name: MultiScaleSampler
    scales: [[320, 32], [320, 48], [320, 64]]
    first_bs: &bs 192
    fix_bs: false
    divided_factor: [8, 16] # w, h
    is_training: True
  loader:
    shuffle: true
    batch_size_per_card: 2
    drop_last: true
    num_workers: 8

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/val
    label_file_list:
      - ./train_data/rec/val.txt
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      - MultiLabelEncode:
          gtc_encode: NRTRLabelEncode
      - RecResizeImg:
          image_shape: [3, 48, 320]
      - KeepKeys:
          keep_keys:
            - image
            - label_ctc
            - label_gtc
            - length
            - valid_ratio
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 1
    num_workers: 4
```

Why does my 24 GB of GPU memory fill up immediately? Training cannot run at all.
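One possible factor, not confirmed anywhere in this thread: the config sets `batch_size_per_card: 2` in the loader but leaves `first_bs: 192` in the `MultiScaleSampler` section. The sketch below assumes the sampler derives its batch size from `first_bs` and keeps the number of pixels per batch roughly constant across scales when `fix_bs` is false. The function name `multiscale_batch_sizes` is hypothetical and this is only an approximation of that behavior, not PaddleOCR's actual code.

```python
def multiscale_batch_sizes(scales, first_bs, fix_bs=False, divided_factor=(8, 16)):
    """Estimate (width, height, batch_size) per scale.

    Assumption: the first scale and first_bs define a pixel budget per batch,
    and every other scale gets the largest batch that fits that budget.
    """
    w_div, h_div = divided_factor
    base_w, base_h = scales[0]
    base_elements = base_w * base_h * first_bs  # pixel budget per batch

    result = []
    for w, h in scales:
        # Round each dimension down to a multiple of its divisor.
        w = (w // w_div) * w_div
        h = (h // h_div) * h_div
        if fix_bs:
            bs = first_bs
        else:
            bs = max(1, int(base_elements / (w * h)))
        result.append((w, h, bs))
    return result


# Values taken from the sampler section of the config above.
sizes = multiscale_batch_sizes([[320, 32], [320, 48], [320, 64]], first_bs=192)
for w, h, bs in sizes:
    print(f"{w}x{h}: batch size {bs}")
```

If that assumption holds, lowering `first_bs` would be the setting to try first, since the loader's `batch_size_per_card` would not shrink these batches.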

UserWangZz commented 3 weeks ago

Were any other tasks running on the GPU before you started training?
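A quick way to answer this before launching training is to ask `nvidia-smi` which compute processes are holding memory on the card. The sketch below is a minimal helper and assumes `nvidia-smi` is on the `PATH`; the function names `parse_compute_apps` and `list_gpu_processes` are hypothetical and only for illustration.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV rows into (pid, MiB or None) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed rows
        pid, mem = parts
        # used_memory can be reported as "[N/A]" on some setups
        rows.append((int(pid), int(mem) if mem.isdigit() else None))
    return rows


def list_gpu_processes(gpu_index=0):
    """Return compute processes currently using the given GPU."""
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-compute-apps=pid,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_compute_apps(out)
```

Calling `list_gpu_processes(0)` on the training machine should return an empty list if nothing else is using GPU 0.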

lili-changjiang commented 3 weeks ago

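> Were any other tasks running on the GPU before you started training?

No, nothing else was running. I have tried many times and the result is always the same.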

UserWangZz commented 3 weeks ago

Try Paddle 2.5.2.

zhengmeng commented 2 days ago

Hi, has this been resolved? I am running into the same problem, and I have two 24 GB cards.

PaddlePaddle / PaddleOCR, issue #11989