PaddlePaddle / PaddleX

All-in-One Development Tool based on PaddlePaddle (PaddlePaddle low-code development tool)
Apache License 2.0
4.92k stars 962 forks

Training on Ubuntu 24 with CUDA 11.8: GPU memory keeps growing until an out-of-memory error occurs #2220

Open alanOO7 opened 1 month ago

alanOO7 commented 1 month ago

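Problem description

Reproduction

  1. Have you successfully run the tutorial we provide? Yes
  2. Did you modify the code from the tutorial? If so, please provide the code you ran. No
  3. Which dataset are you using? Self-annotated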
  4. Please provide the error message and the relevant log: [2024/10/10 23:41:39] ppocr WARNING: You are using VisualDL, the VisualDL is deprecated and removed in ppocr! [2024/10/10 23:41:39] ppocr INFO: Architecture : [2024/10/10 23:41:39] ppocr INFO: Backbone : [2024/10/10 23:41:39] ppocr INFO: name : PPHGNet_small [2024/10/10 23:41:39] ppocr INFO: Head : [2024/10/10 23:41:39] ppocr INFO: head_list : [2024/10/10 23:41:39] ppocr INFO: CTCHead : [2024/10/10 23:41:39] ppocr INFO: Head : [2024/10/10 23:41:39] ppocr INFO: fc_decay : 1e-05 [2024/10/10 23:41:39] ppocr INFO: Neck : [2024/10/10 23:41:39] ppocr INFO: depth : 2 [2024/10/10 23:41:39] ppocr INFO: dims : 120 [2024/10/10 23:41:39] ppocr INFO: hidden_dims : 120 [2024/10/10 23:41:39] ppocr INFO: kernel_size : [1, 3] [2024/10/10 23:41:39] ppocr INFO: name : svtr [2024/10/10 23:41:39] ppocr INFO: use_guide : True [2024/10/10 23:41:39] ppocr INFO: NRTRHead : [2024/10/10 23:41:39] ppocr INFO: max_text_length : 25 [2024/10/10 23:41:39] ppocr INFO: nrtr_dim : 384 [2024/10/10 23:41:39] ppocr INFO: name : MultiHead [2024/10/10 23:41:39] ppocr INFO: Transform : None [2024/10/10 23:41:39] ppocr INFO: algorithm : SVTR_HGNet [2024/10/10 23:41:39] ppocr INFO: model_type : rec [2024/10/10 23:41:39] ppocr INFO: Eval : [2024/10/10 23:41:39] ppocr INFO: dataset : [2024/10/10 23:41:39] ppocr INFO: data_dir : /home/mcn/PaddleX/dataset/cme_rec [2024/10/10 23:41:39] ppocr INFO: label_file_list : ['/home/mcn/PaddleX/dataset/cme_rec/val.txt'] [2024/10/10 23:41:39] ppocr INFO: name : TextRecDataset [2024/10/10 23:41:39] ppocr INFO: transforms : [2024/10/10 23:41:39] ppocr INFO: DecodeImage : [2024/10/10 23:41:39] ppocr INFO: channel_first : False [2024/10/10 23:41:39] ppocr INFO: img_mode : BGR [2024/10/10 23:41:39] ppocr INFO: MultiLabelEncode : [2024/10/10 23:41:39] ppocr INFO: gtc_encode : NRTRLabelEncode [2024/10/10 23:41:39] ppocr INFO: RecResizeImg : [2024/10/10 23:41:39] ppocr INFO: image_shape : [3, 48, 320] [2024/10/10 23:41:39] ppocr INFO: KeepKeys : [2024/10/10 
23:41:39] ppocr INFO: keep_keys : ['image', 'label_ctc', 'label_gtc', 'length', 'valid_ratio'] [2024/10/10 23:41:39] ppocr INFO: loader : [2024/10/10 23:41:39] ppocr INFO: batch_size_per_card : 4 [2024/10/10 23:41:39] ppocr INFO: drop_last : False [2024/10/10 23:41:39] ppocr INFO: num_workers : 4 [2024/10/10 23:41:39] ppocr INFO: shuffle : False [2024/10/10 23:41:39] ppocr INFO: Global : [2024/10/10 23:41:39] ppocr INFO: amp_level : OFF [2024/10/10 23:41:39] ppocr INFO: cal_metric_during_train : True [2024/10/10 23:41:39] ppocr INFO: character_dict_path : /home/mcn/PaddleX/dataset/cme_rec/dict.txt [2024/10/10 23:41:39] ppocr INFO: checkpoints : None [2024/10/10 23:41:39] ppocr INFO: debug : False [2024/10/10 23:41:39] ppocr INFO: distributed : False [2024/10/10 23:41:39] ppocr INFO: epoch_num : 200 [2024/10/10 23:41:39] ppocr INFO: eval_batch_epoch : 1 [2024/10/10 23:41:39] ppocr INFO: eval_batch_step : [0, 2000] [2024/10/10 23:41:39] ppocr INFO: hpi_config_path : /home/mcn/PaddleX/paddlex/utils/hpi_configs/PP-OCRv4_server_rec.yaml [2024/10/10 23:41:39] ppocr INFO: infer_img : doc/imgs_words/ch/word_1.jpg [2024/10/10 23:41:39] ppocr INFO: infer_mode : False [2024/10/10 23:41:39] ppocr INFO: log_smooth_window : 20 [2024/10/10 23:41:39] ppocr INFO: max_text_length : 25 [2024/10/10 23:41:39] ppocr INFO: pdx_model_name : PP-OCRv4_server_rec [2024/10/10 23:41:39] ppocr INFO: pretrained_model : https://paddleocr.bj.bcebos.com/pretrained/ch_PP-OCRv4_rec_server_trained.pdparams [2024/10/10 23:41:39] ppocr INFO: print_batch_step : 20 [2024/10/10 23:41:39] ppocr INFO: save_epoch_step : 1 [2024/10/10 23:41:39] ppocr INFO: save_inference_dir : None [2024/10/10 23:41:39] ppocr INFO: save_model_dir : /home/mcn/PaddleX/output/cme [2024/10/10 23:41:39] ppocr INFO: save_res_path : ./output/rec/predicts_ppocrv3.txt [2024/10/10 23:41:39] ppocr INFO: to_static : False [2024/10/10 23:41:39] ppocr INFO: uniform_output_enabled : True [2024/10/10 23:41:39] ppocr INFO: use_amp : False 
[2024/10/10 23:41:39] ppocr INFO: use_gpu : True [2024/10/10 23:41:39] ppocr INFO: use_mlu : False [2024/10/10 23:41:39] ppocr INFO: use_npu : False [2024/10/10 23:41:39] ppocr INFO: use_space_char : True [2024/10/10 23:41:39] ppocr INFO: use_visualdl : True [2024/10/10 23:41:39] ppocr INFO: use_xpu : False [2024/10/10 23:41:39] ppocr INFO: Loss : [2024/10/10 23:41:39] ppocr INFO: loss_config_list : [2024/10/10 23:41:39] ppocr INFO: CTCLoss : None [2024/10/10 23:41:39] ppocr INFO: NRTRLoss : None [2024/10/10 23:41:39] ppocr INFO: name : MultiLoss [2024/10/10 23:41:39] ppocr INFO: Metric : [2024/10/10 23:41:39] ppocr INFO: main_indicator : acc [2024/10/10 23:41:39] ppocr INFO: name : RecMetric [2024/10/10 23:41:39] ppocr INFO: Optimizer : [2024/10/10 23:41:39] ppocr INFO: beta1 : 0.9 [2024/10/10 23:41:39] ppocr INFO: beta2 : 0.999 [2024/10/10 23:41:39] ppocr INFO: lr : [2024/10/10 23:41:39] ppocr INFO: learning_rate : 0.001 [2024/10/10 23:41:39] ppocr INFO: name : Cosine [2024/10/10 23:41:39] ppocr INFO: warmup_epoch : 5 [2024/10/10 23:41:39] ppocr INFO: name : Adam [2024/10/10 23:41:39] ppocr INFO: regularizer : [2024/10/10 23:41:39] ppocr INFO: factor : 3e-05 [2024/10/10 23:41:39] ppocr INFO: name : L2 [2024/10/10 23:41:39] ppocr INFO: PostProcess : [2024/10/10 23:41:39] ppocr INFO: name : CTCLabelDecode [2024/10/10 23:41:39] ppocr INFO: Train : [2024/10/10 23:41:39] ppocr INFO: dataset : [2024/10/10 23:41:39] ppocr INFO: data_dir : /home/mcn/PaddleX/dataset/cme_rec [2024/10/10 23:41:39] ppocr INFO: ds_width : False [2024/10/10 23:41:39] ppocr INFO: ext_op_transform_idx : 1 [2024/10/10 23:41:39] ppocr INFO: label_file_list : ['/home/mcn/PaddleX/dataset/cme_rec/train.txt'] [2024/10/10 23:41:39] ppocr INFO: name : MSTextRecDataset [2024/10/10 23:41:39] ppocr INFO: transforms : [2024/10/10 23:41:39] ppocr INFO: DecodeImage : [2024/10/10 23:41:39] ppocr INFO: channel_first : False [2024/10/10 23:41:39] ppocr INFO: img_mode : BGR [2024/10/10 23:41:39] ppocr INFO: 
RecConAug : [2024/10/10 23:41:39] ppocr INFO: ext_data_num : 2 [2024/10/10 23:41:39] ppocr INFO: image_shape : [48, 320, 3] [2024/10/10 23:41:39] ppocr INFO: max_text_length : 25 [2024/10/10 23:41:39] ppocr INFO: prob : 0.5 [2024/10/10 23:41:39] ppocr INFO: RecAug : None [2024/10/10 23:41:39] ppocr INFO: MultiLabelEncode : [2024/10/10 23:41:39] ppocr INFO: gtc_encode : NRTRLabelEncode [2024/10/10 23:41:39] ppocr INFO: KeepKeys : [2024/10/10 23:41:39] ppocr INFO: keep_keys : ['image', 'label_ctc', 'label_gtc', 'length', 'valid_ratio'] [2024/10/10 23:41:39] ppocr INFO: loader : [2024/10/10 23:41:39] ppocr INFO: batch_size_per_card : 4 [2024/10/10 23:41:39] ppocr INFO: drop_last : True [2024/10/10 23:41:39] ppocr INFO: num_workers : 8 [2024/10/10 23:41:39] ppocr INFO: shuffle : True [2024/10/10 23:41:39] ppocr INFO: sampler : [2024/10/10 23:41:39] ppocr INFO: divided_factor : [8, 16] [2024/10/10 23:41:39] ppocr INFO: first_bs : 4 [2024/10/10 23:41:39] ppocr INFO: fix_bs : False [2024/10/10 23:41:39] ppocr INFO: is_training : True [2024/10/10 23:41:39] ppocr INFO: name : MultiScaleSampler [2024/10/10 23:41:39] ppocr INFO: scales : [[320, 32], [320, 48], [320, 64]] [2024/10/10 23:41:39] ppocr INFO: profiler_options : None [2024/10/10 23:41:39] ppocr INFO: train with paddle 3.0.0-beta1 and device Place(gpu:0) [2024/10/10 23:41:39] ppocr INFO: Initialize indexs of datasets:['/home/mcn/PaddleX/dataset/cme_rec/train.txt'] [2024/10/10 23:41:40] ppocr INFO: Initialize indexs of datasets:['/home/mcn/PaddleX/dataset/cme_rec/val.txt'] W1010 23:41:40.017611 2321794 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 11.8 W1010 23:41:40.018028 2321794 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9. 
[2024/10/10 23:41:40] ppocr INFO: train dataloader has 384 iters [2024/10/10 23:41:40] ppocr INFO: valid dataloader has 61 iters download https://paddleocr.bj.bcebos.com/pretrained/ch_PP-OCRv4_rec_server_trained.pdparams to /root/.paddleocr/models/ch_PP-OCRv4_rec_server_trained.pdparams [2024/10/10 23:41:40] ppocr INFO: Path /root/.paddleocr/models/ch_PP-OCRv4_rec_server_trained.pdparams already exists. Skipping... [2024/10/10 23:41:40] ppocr INFO: load pretrain successful from /root/.paddleocr/models/ch_PP-OCRv4_rec_server_trained [2024/10/10 23:41:40] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 384 iterations [2024/10/10 23:41:42] ppocr INFO: epoch: [1/200], global_step: 20, lr: 0.000005, acc: 0.499998, norm_edit_dis: 0.888670, CTCLoss: 3.973861, NRTRLoss: 1.620898, loss: 5.543493, avg_reader_cost: 0.01234 s, avg_batch_cost: 0.09483 s, avg_samples: 2.7, ips: 28.47145 samples/s, eta: 2:01:21, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:43] ppocr INFO: epoch: [1/200], global_step: 40, lr: 0.000015, acc: 0.499998, norm_edit_dis: 0.821718, CTCLoss: 2.243618, NRTRLoss: 1.535668, loss: 3.877820, avg_reader_cost: 0.00129 s, avg_batch_cost: 0.05488 s, avg_samples: 2.2, ips: 40.08590 samples/s, eta: 1:35:46, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:44] ppocr INFO: epoch: [1/200], global_step: 60, lr: 0.000026, acc: 0.749998, norm_edit_dis: 0.916667, CTCLoss: 1.239613, NRTRLoss: 1.301198, loss: 2.547727, avg_reader_cost: 0.00089 s, avg_batch_cost: 0.04908 s, avg_samples: 2.9, ips: 59.08634 samples/s, eta: 1:24:45, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:45] ppocr INFO: epoch: [1/200], global_step: 80, lr: 0.000036, acc: 0.624998, norm_edit_dis: 0.965909, CTCLoss: 0.909002, NRTRLoss: 1.266591, loss: 2.158252, avg_reader_cost: 0.00092 s, avg_batch_cost: 0.05032 s, avg_samples: 2.5, ips: 49.67758 samples/s, eta: 1:19:38, max_mem_reserved: 
1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:46] ppocr INFO: epoch: [1/200], global_step: 100, lr: 0.000047, acc: 0.749998, norm_edit_dis: 0.905845, CTCLoss: 2.561544, NRTRLoss: 1.556941, loss: 4.135740, avg_reader_cost: 0.00071 s, avg_batch_cost: 0.04691 s, avg_samples: 2.6, ips: 55.42642 samples/s, eta: 1:15:41, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:47] ppocr INFO: epoch: [1/200], global_step: 120, lr: 0.000057, acc: 0.749998, norm_edit_dis: 0.948661, CTCLoss: 1.464629, NRTRLoss: 1.344990, loss: 2.788716, avg_reader_cost: 0.00070 s, avg_batch_cost: 0.04712 s, avg_samples: 3.1, ips: 65.78909 samples/s, eta: 1:13:05, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:48] ppocr INFO: epoch: [1/200], global_step: 140, lr: 0.000067, acc: 0.874997, norm_edit_dis: 0.989583, CTCLoss: 0.811754, NRTRLoss: 1.265736, loss: 2.059149, avg_reader_cost: 0.00071 s, avg_batch_cost: 0.04618 s, avg_samples: 2.6, ips: 56.30415 samples/s, eta: 1:11:03, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:49] ppocr INFO: epoch: [1/200], global_step: 160, lr: 0.000078, acc: 0.624998, norm_edit_dis: 0.879688, CTCLoss: 9.007205, NRTRLoss: 1.582661, loss: 10.955378, avg_reader_cost: 0.00070 s, avg_batch_cost: 0.04718 s, avg_samples: 2.9, ips: 61.46235 samples/s, eta: 1:09:41, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:50] ppocr INFO: epoch: [1/200], global_step: 180, lr: 0.000088, acc: 0.749998, norm_edit_dis: 0.922476, CTCLoss: 4.346100, NRTRLoss: 1.431711, loss: 5.846954, avg_reader_cost: 0.00070 s, avg_batch_cost: 0.04575 s, avg_samples: 2.6, ips: 56.82825 samples/s, eta: 1:08:25, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:50] ppocr INFO: epoch: [1/200], global_step: 200, lr: 0.000099, acc: 0.749998, norm_edit_dis: 0.927084, CTCLoss: 2.613269, NRTRLoss: 1.271903, loss: 3.859729, avg_reader_cost: 0.00072 s, avg_batch_cost: 0.04612 s, avg_samples: 2.5, 
ips: 54.21219 samples/s, eta: 1:07:27, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:51] ppocr INFO: epoch: [1/200], global_step: 220, lr: 0.000109, acc: 0.999995, norm_edit_dis: 1.000000, CTCLoss: 2.278689, NRTRLoss: 1.324647, loss: 3.603336, avg_reader_cost: 0.00070 s, avg_batch_cost: 0.04675 s, avg_samples: 2.5, ips: 53.47548 samples/s, eta: 1:06:43, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:52] ppocr INFO: epoch: [1/200], global_step: 240, lr: 0.000120, acc: 0.999995, norm_edit_dis: 1.000000, CTCLoss: 0.612513, NRTRLoss: 1.245184, loss: 1.960803, avg_reader_cost: 0.00077 s, avg_batch_cost: 0.04860 s, avg_samples: 2.6, ips: 53.49899 samples/s, eta: 1:06:19, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:53] ppocr INFO: epoch: [1/200], global_step: 260, lr: 0.000130, acc: 0.999995, norm_edit_dis: 1.000000, CTCLoss: 2.181571, NRTRLoss: 1.307482, loss: 3.489053, avg_reader_cost: 0.00071 s, avg_batch_cost: 0.04810 s, avg_samples: 2.9, ips: 60.28703 samples/s, eta: 1:05:55, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:54] ppocr INFO: epoch: [1/200], global_step: 280, lr: 0.000140, acc: 0.999995, norm_edit_dis: 1.000000, CTCLoss: 2.799613, NRTRLoss: 1.321018, loss: 4.132875, avg_reader_cost: 0.00070 s, avg_batch_cost: 0.04636 s, avg_samples: 2.6, ips: 56.08676 samples/s, eta: 1:05:25, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:55] ppocr INFO: epoch: [1/200], global_step: 300, lr: 0.000151, acc: 0.624998, norm_edit_dis: 0.950521, CTCLoss: 3.988202, NRTRLoss: 1.429777, loss: 5.394050, avg_reader_cost: 0.00087 s, avg_batch_cost: 0.05060 s, avg_samples: 2.8, ips: 55.33606 samples/s, eta: 1:05:20, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:56] ppocr INFO: epoch: [1/200], global_step: 320, lr: 0.000161, acc: 0.749998, norm_edit_dis: 0.965385, CTCLoss: 3.374764, NRTRLoss: 1.368185, loss: 4.840322, avg_reader_cost: 
0.00135 s, avg_batch_cost: 0.05866 s, avg_samples: 2.7, ips: 46.02852 samples/s, eta: 1:05:55, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:58] ppocr INFO: epoch: [1/200], global_step: 340, lr: 0.000172, acc: 0.999995, norm_edit_dis: 1.000000, CTCLoss: 1.211475, NRTRLoss: 1.273884, loss: 2.468042, avg_reader_cost: 0.00115 s, avg_batch_cost: 0.05554 s, avg_samples: 2.9, ips: 52.21411 samples/s, eta: 1:06:11, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:59] ppocr INFO: epoch: [1/200], global_step: 360, lr: 0.000182, acc: 0.999995, norm_edit_dis: 1.000000, CTCLoss: 0.851067, NRTRLoss: 1.274575, loss: 2.115809, avg_reader_cost: 0.00076 s, avg_batch_cost: 0.04679 s, avg_samples: 2.6, ips: 55.56374 samples/s, eta: 1:05:48, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:41:59] ppocr INFO: epoch: [1/200], global_step: 380, lr: 0.000192, acc: 0.874997, norm_edit_dis: 0.989583, CTCLoss: 1.271906, NRTRLoss: 1.321358, loss: 2.593264, avg_reader_cost: 0.00077 s, avg_batch_cost: 0.04677 s, avg_samples: 2.4, ips: 51.31086 samples/s, eta: 1:05:27, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB [2024/10/10 23:42:00] ppocr INFO: epoch: [1/200], global_step: 384, lr: 0.000195, acc: 0.749998, norm_edit_dis: 0.968750, CTCLoss: 1.603831, NRTRLoss: 1.345474, loss: 2.949305, avg_reader_cost: 0.00014 s, avg_batch_cost: 0.00954 s, avg_samples: 0.6, ips: 62.92321 samples/s, eta: 1:05:24, max_mem_reserved: 1469 MB, max_mem_allocated: 985 MB eval model:: 0%| | 0/61 [00:00<?, ?it/s] eval model:: 2%|▏ | 1/61 [00:00<00:06, 9.28it/s] eval model:: 15%|█▍ | 9/61 [00:00<00:01, 48.51it/s] eval model:: 28%|██▊ | 17/61 [00:00<00:00, 60.92it/s] eval model:: 41%|████ | 25/61 [00:00<00:00, 66.63it/s] eval model:: 54%|█████▍ | 33/61 [00:00<00:00, 69.72it/s] eval model:: 67%|██████▋ | 41/61 [00:00<00:00, 71.60it/s] eval model:: 80%|████████ | 49/61 [00:00<00:00, 72.77it/s] eval model:: 93%|█████████▎| 57/61 [00:00<00:00, 
73.71it/s] eval model:: 100%|██████████| 61/61 [00:01<00:00, 50.21it/s] [2024/10/10 23:54:49] ppocr INFO: cur metric, acc: 0.7704917717011569, norm_edit_dis: 0.9533083346803203, fps: 328.273861205113 [2024/10/10 23:54:49] ppocr INFO: best metric, acc: 0.9016393073098644, is_float16: False, norm_edit_dis: 0.9640742923909182, fps: 330.00967898380986, best_epoch: 1 [2024/10/10 23:54:50] ppocr INFO: inference model is saved to /home/mcn/PaddleX/output/cme/latest/inference/inference [2024/10/10 23:54:50] ppocr INFO: Export inference config file to /home/mcn/PaddleX/output/cme/latest/inference/inference.yml [2024/10/10 23:54:51] ppocr INFO: Already save model info in /home/mcn/PaddleX/output/cme/latest [2024/10/10 23:54:51] ppocr INFO: save model in /home/mcn/PaddleX/output/cme/latest/latest [2024/10/10 23:54:52] ppocr INFO: inference model is saved to /home/mcn/PaddleX/output/cme/iter_epoch_33/inference/inference [2024/10/10 23:54:52] ppocr INFO: Export inference config file to /home/mcn/PaddleX/output/cme/iter_epoch_33/inference/inference.yml [2024/10/10 23:54:53] ppocr INFO: Already save model info in /home/mcn/PaddleX/output/cme/iter_epoch_33 [2024/10/10 23:54:53] ppocr INFO: save model in /home/mcn/PaddleX/output/cme/iter_epoch_33/iter_epoch_33 [2024/10/10 23:54:54] ppocr INFO: epoch: [34/200], global_step: 12680, lr: 0.000952, acc: 0.624998, norm_edit_dis: 0.939773, CTCLoss: 3.621375, NRTRLoss: 1.496952, loss: 5.390821, avg_reader_cost: 0.22838 s, avg_batch_cost: 0.25649 s, avg_samples: 1.2, ips: 4.67851 samples/s, eta: 1:03:31, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:54:55] ppocr INFO: epoch: [34/200], global_step: 12700, lr: 0.000952, acc: 0.499998, norm_edit_dis: 0.962912, CTCLoss: 0.849674, NRTRLoss: 1.506886, loss: 2.530984, avg_reader_cost: 0.00088 s, avg_batch_cost: 0.04989 s, avg_samples: 2.4, ips: 48.10164 samples/s, eta: 1:03:29, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:54:56] ppocr INFO: epoch: 
[34/200], global_step: 12720, lr: 0.000952, acc: 0.749998, norm_edit_dis: 0.968254, CTCLoss: 0.780951, NRTRLoss: 1.398764, loss: 2.052150, avg_reader_cost: 0.00078 s, avg_batch_cost: 0.04778 s, avg_samples: 2.5, ips: 52.32611 samples/s, eta: 1:03:27, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:54:57] ppocr INFO: epoch: [34/200], global_step: 12740, lr: 0.000952, acc: 0.999995, norm_edit_dis: 1.000000, CTCLoss: 0.465765, NRTRLoss: 1.467725, loss: 2.036675, avg_reader_cost: 0.00078 s, avg_batch_cost: 0.04929 s, avg_samples: 2.6, ips: 52.74415 samples/s, eta: 1:03:25, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:54:58] ppocr INFO: epoch: [34/200], global_step: 12760, lr: 0.000952, acc: 0.749998, norm_edit_dis: 0.976366, CTCLoss: 1.076281, NRTRLoss: 1.329596, loss: 2.559453, avg_reader_cost: 0.00075 s, avg_batch_cost: 0.04861 s, avg_samples: 2.8, ips: 57.60094 samples/s, eta: 1:03:22, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:54:59] ppocr INFO: epoch: [34/200], global_step: 12780, lr: 0.000952, acc: 0.999995, norm_edit_dis: 1.000000, CTCLoss: 0.527968, NRTRLoss: 1.407272, loss: 1.958473, avg_reader_cost: 0.00075 s, avg_batch_cost: 0.04824 s, avg_samples: 2.8, ips: 58.04285 samples/s, eta: 1:03:20, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:55:00] ppocr INFO: epoch: [34/200], global_step: 12800, lr: 0.000951, acc: 0.999995, norm_edit_dis: 1.000000, CTCLoss: 0.316380, NRTRLoss: 1.307829, loss: 1.607515, avg_reader_cost: 0.00074 s, avg_batch_cost: 0.04754 s, avg_samples: 2.6, ips: 54.68925 samples/s, eta: 1:03:18, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:55:00] ppocr INFO: epoch: [34/200], global_step: 12820, lr: 0.000951, acc: 0.999995, norm_edit_dis: 1.000000, CTCLoss: 0.114079, NRTRLoss: 1.340032, loss: 1.617437, avg_reader_cost: 0.00074 s, avg_batch_cost: 0.04695 s, avg_samples: 2.5, ips: 53.24509 samples/s, eta: 1:03:15, 
max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:55:01] ppocr INFO: epoch: [34/200], global_step: 12840, lr: 0.000951, acc: 0.624998, norm_edit_dis: 0.950893, CTCLoss: 0.863305, NRTRLoss: 1.592515, loss: 2.312771, avg_reader_cost: 0.00074 s, avg_batch_cost: 0.04796 s, avg_samples: 2.6, ips: 54.20964 samples/s, eta: 1:03:13, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:55:02] ppocr INFO: epoch: [34/200], global_step: 12860, lr: 0.000951, acc: 0.999995, norm_edit_dis: 1.000000, CTCLoss: 0.187949, NRTRLoss: 1.359807, loss: 1.632990, avg_reader_cost: 0.00080 s, avg_batch_cost: 0.04911 s, avg_samples: 2.8, ips: 57.01206 samples/s, eta: 1:03:11, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:55:03] ppocr INFO: epoch: [34/200], global_step: 12880, lr: 0.000951, acc: 0.749998, norm_edit_dis: 0.970833, CTCLoss: 0.928666, NRTRLoss: 1.487560, loss: 2.599897, avg_reader_cost: 0.00076 s, avg_batch_cost: 0.04884 s, avg_samples: 2.8, ips: 57.32565 samples/s, eta: 1:03:09, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:55:04] ppocr INFO: epoch: [34/200], global_step: 12900, lr: 0.000951, acc: 0.749998, norm_edit_dis: 0.965278, CTCLoss: 1.179478, NRTRLoss: 1.373095, loss: 2.609725, avg_reader_cost: 0.00073 s, avg_batch_cost: 0.04804 s, avg_samples: 2.8, ips: 58.28647 samples/s, eta: 1:03:06, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:55:05] ppocr INFO: epoch: [34/200], global_step: 12920, lr: 0.000950, acc: 0.999995, norm_edit_dis: 1.000000, CTCLoss: 0.587089, NRTRLoss: 1.312174, loss: 1.912312, avg_reader_cost: 0.00077 s, avg_batch_cost: 0.04988 s, avg_samples: 2.6, ips: 52.12496 samples/s, eta: 1:03:04, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:55:06] ppocr INFO: epoch: [34/200], global_step: 12940, lr: 0.000950, acc: 0.999995, norm_edit_dis: 1.000000, CTCLoss: 0.244553, NRTRLoss: 1.343056, loss: 1.721657, avg_reader_cost: 
0.00084 s, avg_batch_cost: 0.04851 s, avg_samples: 2.5, ips: 51.53760 samples/s, eta: 1:03:02, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:55:07] ppocr INFO: epoch: [34/200], global_step: 12960, lr: 0.000950, acc: 0.999995, norm_edit_dis: 1.000000, CTCLoss: 0.371499, NRTRLoss: 1.447015, loss: 2.192312, avg_reader_cost: 0.00076 s, avg_batch_cost: 0.04794 s, avg_samples: 2.6, ips: 54.23617 samples/s, eta: 1:03:00, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:55:08] ppocr INFO: epoch: [34/200], global_step: 12980, lr: 0.000950, acc: 0.749998, norm_edit_dis: 0.936508, CTCLoss: 1.412941, NRTRLoss: 1.460027, loss: 3.009994, avg_reader_cost: 0.00074 s, avg_batch_cost: 0.04837 s, avg_samples: 3.0, ips: 62.02318 samples/s, eta: 1:02:57, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:55:09] ppocr INFO: epoch: [34/200], global_step: 13000, lr: 0.000950, acc: 0.499998, norm_edit_dis: 0.948661, CTCLoss: 1.198798, NRTRLoss: 1.608791, loss: 2.876916, avg_reader_cost: 0.00073 s, avg_batch_cost: 0.04824 s, avg_samples: 2.5, ips: 51.82027 samples/s, eta: 1:02:55, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:55:10] ppocr INFO: epoch: [34/200], global_step: 13020, lr: 0.000949, acc: 0.749998, norm_edit_dis: 0.947222, CTCLoss: 1.937520, NRTRLoss: 1.422909, loss: 3.408546, avg_reader_cost: 0.00077 s, avg_batch_cost: 0.04903 s, avg_samples: 2.7, ips: 55.06316 samples/s, eta: 1:02:53, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:55:11] ppocr INFO: epoch: [34/200], global_step: 13040, lr: 0.000949, acc: 0.999995, norm_edit_dis: 1.000000, CTCLoss: 0.252140, NRTRLoss: 1.419294, loss: 1.851779, avg_reader_cost: 0.00079 s, avg_batch_cost: 0.04983 s, avg_samples: 2.9, ips: 58.20165 samples/s, eta: 1:02:51, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB [2024/10/10 23:55:12] ppocr INFO: epoch: [34/200], global_step: 13056, lr: 0.000949, acc: 0.999995, 
norm_edit_dis: 1.000000, CTCLoss: 0.556433, NRTRLoss: 1.398072, loss: 1.967755, avg_reader_cost: 0.00060 s, avg_batch_cost: 0.03855 s, avg_samples: 2.0, ips: 51.88111 samples/s, eta: 1:02:49, max_mem_reserved: 11296 MB, max_mem_allocated: 11077 MB

eval model:: 0%| | 0/61 [00:00<?, ?it/s] eval model:: 2%|▏ | 1/61 [00:00<00:06, 9.31it/s] eval model:: 15%|█▍ | 9/61 [00:00<00:01, 48.09it/s] eval model:: 28%|██▊ | 17/61 [00:00<00:00, 60.59it/s] eval model:: 41%|████ | 25/61 [00:00<00:00, 66.46it/s] eval model:: 54%|█████▍ | 33/61 [00:00<00:00, 69.66it/s] eval model:: 67%|██████▋ | 41/61 [00:00<00:00, 71.53it/s] eval model:: 80%|████████ | 49/61 [00:00<00:00, 72.67it/s] eval model:: 93%|█████████▎| 57/61 [00:00<00:00, 73.25it/s] eval model:: 100%|██████████| 61/61 [00:01<00:00, 49.46it/s] [2024/10/10 23:55:13] ppocr INFO: cur metric, acc: 0.8647540629199154, norm_edit_dis: 0.9783078684382168, fps: 327.1838605393461 [2024/10/10 23:55:13] ppocr INFO: best metric, acc: 0.9016393073098644, is_float16: False, norm_edit_dis: 0.9640742923909182, fps: 330.00967898380986, best_epoch: 1 [2024/10/10 23:55:15] ppocr INFO: inference model is saved to /home/mcn/PaddleX/output/cme/latest/inference/inference [2024/10/10 23:55:15] ppocr INFO: Export inference config file to /home/mcn/PaddleX/output/cme/latest/inference/inference.yml [2024/10/10 23:55:16] ppocr INFO: Already save model info in /home/mcn/PaddleX/output/cme/latest [2024/10/10 23:55:16] ppocr INFO: save model in /home/mcn/PaddleX/output/cme/latest/latest [2024/10/10 23:55:17] ppocr INFO: inference model is saved to /home/mcn/PaddleX/output/cme/iter_epoch_34/inference/inference [2024/10/10 23:55:17] ppocr INFO: Export inference config file to /home/mcn/PaddleX/output/cme/iter_epoch_34/inference/inference.yml [2024/10/10 23:55:18] ppocr INFO: Already save model info in /home/mcn/PaddleX/output/cme/iter_epoch_34 [2024/10/10 23:55:18] ppocr INFO: save model in /home/mcn/PaddleX/output/cme/iter_epoch_34/iter_epoch_34

Traceback (most recent call last):
  File "/home/mcn/PaddleX/paddlex/repo_manager/repos/PaddleOCR/tools/train.py", line 264, in <module>
    main(config, device, logger, vdl_writer, seed)
  File "/home/mcn/PaddleX/paddlex/repo_manager/repos/PaddleOCR/tools/train.py", line 217, in main
    program.train(
  File "/home/mcn/PaddleX/paddlex/repo_manager/repos/PaddleOCR/tools/program.py", line 344, in train
    preds = model(images, data=batch[1:])
  File "/root/anaconda3/envs/pdx/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/mcn/PaddleX/paddlex/repo_manager/repos/PaddleOCR/ppocr/modeling/architectures/base_model.py", line 85, in forward
    x = self.backbone(x)
  File "/root/anaconda3/envs/pdx/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/mcn/PaddleX/paddlex/repo_manager/repos/PaddleOCR/ppocr/modeling/backbones/rec_hgnet.py", line 287, in forward
    x = stage(x)
  File "/root/anaconda3/envs/pdx/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/mcn/PaddleX/paddlex/repo_manager/repos/PaddleOCR/ppocr/modeling/backbones/rec_hgnet.py", line 191, in forward
    x = self.blocks(x)
  File "/root/anaconda3/envs/pdx/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/root/anaconda3/envs/pdx/lib/python3.10/site-packages/paddle/nn/layer/container.py", line 615, in forward
    input = layer(input)
  File "/root/anaconda3/envs/pdx/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/mcn/PaddleX/paddlex/repo_manager/repos/PaddleOCR/ppocr/modeling/backbones/rec_hgnet.py", line 147, in forward
    x = self.att(x)
  File "/root/anaconda3/envs/pdx/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/mcn/PaddleX/paddlex/repo_manager/repos/PaddleOCR/ppocr/modeling/backbones/rec_hgnet.py", line 92, in forward
    x = self.conv(x)
  File "/root/anaconda3/envs/pdx/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/root/anaconda3/envs/pdx/lib/python3.10/site-packages/paddle/nn/layer/conv.py", line 711, in forward
    out = F.conv._conv_nd(
  File "/root/anaconda3/envs/pdx/lib/python3.10/site-packages/paddle/nn/functional/conv.py", line 127, in _conv_nd
    pre_bias = _C_ops.conv2d(
MemoryError:


C++ Traceback (most recent call last):

0   paddle::pybind::eager_api_conv2d(_object*, _object*, _object*)
1   conv2d_ad_func(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int> >, std::vector<int, std::allocator<int> >, std::string, std::vector<int, std::allocator<int> >, int, std::string)
2   paddle::experimental::conv2d(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::string const&, std::vector<int, std::allocator<int> > const&, int, std::string const&)
3   void phi::ConvCudnnKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::string const&, std::vector<int, std::allocator<int> > const&, int, std::string const&, phi::DenseTensor*)
4   void phi::ConvCudnnKernelImplV7<float, phi::GPUContext>(phi::DenseTensor const*, phi::DenseTensor const*, phi::GPUContext const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, phi::backends::gpu::DataLayout, phi::backends::gpu::DataLayout, bool, bool, int, phi::DenseTensor*)
5   phi::DnnWorkspaceHandle::ReallocWorkspace(unsigned long)
6   paddle::memory::allocation::Allocator::Allocate(unsigned long)
7   paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
8   paddle::memory::allocation::Allocator::Allocate(unsigned long)
9   paddle::memory::allocation::Allocator::Allocate(unsigned long)
10  paddle::memory::allocation::Allocator::Allocate(unsigned long)
11  paddle::memory::allocation::Allocator::Allocate(unsigned long)
12  paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
13  std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
14  common::enforce::GetCurrentTraceBackString[abi:cxx11](bool)


Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 128.000000MB memory on GPU 0, 11.664124GB memory has been allocated and available memory is only 88.187500MB.

Please check whether there is any other process using GPU 0.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model. (at ../paddle/fluid/memory/allocation/cuda_allocator.cc:86)
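
The training entries in the log above report `max_mem_allocated` at every print step: 985 MB throughout epoch 1 and 11077 MB by epoch 34, with the batch size unchanged. A minimal sketch for quantifying that growth from a saved log follows. It is not part of the original report; the helper names `memory_by_step` and `growth_per_step` are hypothetical, and the parser assumes the field names appear exactly as in the entries above.

```python
import re

# One training entry: the step counter, then (lazily) the peak-allocated figure.
# re.DOTALL lets the pattern span entries that were wrapped across lines.
ENTRY = re.compile(
    r"global_step:\s*(\d+),.*?max_mem_allocated:\s*(\d+)\s*MB", re.DOTALL
)


def memory_by_step(log_text):
    """Return [(global_step, max_mem_allocated_mb), ...] in log order."""
    return [(int(step), int(mb)) for step, mb in ENTRY.findall(log_text)]


def growth_per_step(samples):
    """Average MB gained per training step between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    (first_step, first_mb), (last_step, last_mb) = samples[0], samples[-1]
    if last_step == first_step:
        return 0.0
    return (last_mb - first_mb) / (last_step - first_step)


# Two entries from the log above, shortened to the fields the parser reads.
EXCERPT = (
    "epoch: [1/200], global_step: 384, max_mem_reserved: 1469 MB, "
    "max_mem_allocated: 985 MB "
    "epoch: [34/200], global_step: 12680, max_mem_reserved: 11296 MB, "
    "max_mem_allocated: 11077 MB"
)

samples = memory_by_step(EXCERPT)
print(samples)
print(f"{growth_per_step(samples):.2f} MB per step")
```

A value that stays flat across steps would suggest the allocation is stable; steady growth at a fixed batch size, as in this log, points away from the batch-size advice in the error summary and toward a leak.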

Environment

  1. Which versions of PaddlePaddle and PaddleX are you using? paddlepaddle 3.0b
  2. Which operating system are you using (Linux/Windows/MacOS)? Linux
  3. Which Python version are you using? 3.10
  4. Which CUDA/cuDNN versions are you using? 11.8

cuicheng01 commented 1 month ago

Your issue has been received. How much GPU memory do you have, and which PaddleX branch are you using?

alanOO7 commented 1 month ago

An A2000 with 12 GB. The version is 3.0-beta1.

cuicheng01 commented 1 month ago

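Received. This issue has been confirmed and fixed. You can use the latest Paddle build; for example, on CUDA 11.8 the install command can be python -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu118/. For other installation methods, see the official Paddle documentation. A Paddle release containing the fix will be published soon.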

alanOO7 commented 2 weeks ago

[image] Has the fix not been released yet? GPU memory still overflows.

cuicheng01 commented 2 weeks ago

Hello, Paddle 3.0 beta2 has now been released. You can install it and try again: python -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
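
Before rerunning training, it may help to confirm which Paddle build is actually installed in the active environment, since an older wheel left in place would reproduce the same overflow. The sketch below uses only the Python standard library. The helper name `installed_version` is hypothetical, and checking the `paddlepaddle` distribution alongside `paddlepaddle-gpu` is an assumption about how the environment might be set up; nightly builds may also report a different version string than the release named above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a pip distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Version string taken from the install command in the comment above.
FIXED_RELEASE = "3.0.0b2"

for name in ("paddlepaddle-gpu", "paddlepaddle"):
    found = installed_version(name)
    if found is None:
        print(f"{name}: not installed")
    elif found == FIXED_RELEASE:
        print(f"{name}: {found} (matches the release named above)")
    else:
        print(f"{name}: {found}")
```

If both distributions turn out to be installed, the environment may be importing a different build than intended.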