V2 recognition model training fails with an error. Is it out of GPU memory?

Alanhzl commented 4 months ago

Please provide the following information to quickly locate the problem

System Environment: Docker Desktop, image registry.baidubce.com/paddlepaddle/paddle:2.6.0-gpu-cuda11.7-cudnn8.4-trt8.4
Version: Paddle: v2.12
PaddleOCR: v2.7

The config file used is rec_chinese_lite_train_v2.0.yml

```yaml
Global:
  use_gpu: true
  epoch_num: 500
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/rec_chinese_lite_v2.0-11
  save_epoch_step: 3
  eval_batch_step: [2000, 2000]
  cal_metric_during_train: True
  pretrained_model:
  checkpoints:
  save_inference_dir:
  use_visualdl: False
  infer_img: doc/imgs_words/ch/word_1.jpg
  character_dict_path: train_data/rec/new_dic.txt
  max_text_length: 25
  infer_mode: False
  use_space_char: True
  save_res_path: ./output/rec/predicts_chinese_lite_v2.0.txt
  use_wandb: True

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.001
    warmup_epoch: 3
  regularizer:
    name: 'L2'
    factor: 0.00001

Architecture:
  model_type: rec
  algorithm: CRNN
  Transform:
  Backbone:
    name: MobileNetV3
    scale: 0.5
    model_name: small
    small_stride: [1, 2, 2, 2]
  Neck:
    name: SequenceEncoder
    encoder_type: rnn
    hidden_size: 48
  Head:
    name: CTCHead
    fc_decay: 0.00001

Loss:
  name: CTCLoss

PostProcess:
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc

Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/new/
    label_file_list: ["./train_data/new/tr_train_10w.txt","./train_data/new/chinese_dataset.txt","./train_data/new/synthetic_chinese_string_dataset.txt"]
    ratio_list: [0.1,0.005,0.34]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - RecAug:
      - CTCLabelEncode: # Class handling label
      - RecResizeImg:
          image_shape: [3, 32, 320]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: True
    batch_size_per_card: 16
    drop_last: True
    num_workers: 8

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/new
    label_file_list: ["./train_data/new/tr_train_10w.txt","./train_data/new/chinese_dataset.txt","./train_data/new/synthetic_chinese_string_dataset.txt"]
    ratio_list: [0.01,0.0005,0.034]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - RecResizeImg:
          image_shape: [3, 32, 320]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 16
    num_workers: 8

wandb:
  project: PaddleOCR-tools
  entity: xzy-ocr
  name: test-train-6
```

The training set has a little over 20,000 samples. After training reached epoch 298 of 500, the following error was raised:

    Exception in thread Thread-11 (_thread_body):
    Traceback (most recent call last):
      File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
        self.run()
      File "/usr/lib/python3.10/threading.py", line 953, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/local/lib/python3.10/dist-packages/wandb/filesync/step_upload.py", line 172, in _thread_body
        self._handle_event(event)
      File "/usr/local/lib/python3.10/dist-packages/wandb/filesync/step_upload.py", line 236, in _handle_event
        self._start_upload_job(event)
      File "/usr/local/lib/python3.10/dist-packages/wandb/filesync/step_upload.py", line 257, in _start_upload_job
        self._spawn_upload_sync(event)
      File "/usr/local/lib/python3.10/dist-packages/wandb/filesync/step_upload.py", line 292, in _spawn_upload_sync
        self._pool.submit(run_and_notify)
      File "/usr/lib/python3.10/concurrent/futures/thread.py", line 169, in submit
        raise RuntimeError('cannot schedule new futures after '
    RuntimeError: cannot schedule new futures after interpreter shutdown

The GPU is a 1050 Ti with 4 GB of VRAM. Is this an out-of-memory error?
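One way to check whether GPU memory is actually the limit is to sample memory usage while training runs. The sketch below polls `nvidia-smi` from Python. The helper names `parse_memory_csv` and `gpu_memory_mib` are illustrative, and it assumes `nvidia-smi` is available on the PATH inside the container.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' lines (in MiB) from nvidia-smi's CSV output."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory_mib():
    """Return [(used_mib, total_mib), ...], one tuple per visible GPU."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_memory_csv(out)
```

Calling `gpu_memory_mib()` periodically from a second shell during training shows how close usage gets to the 4 GB limit. If usage stays well below the total right up to the failure, an out-of-memory condition seems unlikely, and a real OOM would normally be reported by Paddle itself rather than by a wandb upload thread.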

TingquanGao commented 4 months ago

Are you using Paddle v2.12? I suggest trying again with the latest 2.5.2 or 2.6.0 release.

Alanhzl commented 4 months ago

> Are you using Paddle v2.12? I suggest trying again with the latest 2.5.2 or 2.6.0 release.

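PaddleLite is v2.12 and PaddleOCR is v2.7. After this error appears, I can continue by resuming training from the checkpoint. With the dataset set to 10,000 samples the run completes. After setting it to 20,000 I tried 3 times, and each time training was interrupted by this error at around epoch 300.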
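For reference, resuming in PaddleOCR is done by pointing `Global.checkpoints` at the saved weights prefix. The sketch below builds that command in Python. The config path and the `latest` checkpoint name are assumptions derived from the `save_model_dir` in the config above, not values confirmed in this thread, and `resume_command` is an illustrative helper.

```python
import shlex


def resume_command(config_path, save_model_dir, checkpoint_name="latest"):
    """Build the tools/train.py invocation that resumes from a saved checkpoint."""
    checkpoint_prefix = f"{save_model_dir.rstrip('/')}/{checkpoint_name}"
    return [
        "python3",
        "tools/train.py",
        "-c",
        config_path,
        "-o",
        f"Global.checkpoints={checkpoint_prefix}",
    ]


cmd = resume_command(
    "configs/rec/ch_ppocr_v2.0/rec_chinese_lite_train_v2.0.yml",  # assumed location
    "./output/rec_chinese_lite_v2.0-11",  # save_model_dir from the config above
)
print(shlex.join(cmd))
```

Passing the override with `-o` leaves the YAML file unchanged, so the same config can be reused for fresh runs and resumed ones.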

TingquanGao commented 4 months ago

Which version of Paddle are you using?

Alanhzl commented 4 months ago

> Which version of Paddle are you using?

2.6.0

TingquanGao commented 4 months ago

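The error looks like a multithreading error inside Python. I suggest downgrading Python to 3.7 and Paddle to 2.2.2, and using the PaddleOCR 2.5.0 tag. May I ask why you are using this model? PP-OCRv4 is already available.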
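To illustrate where this message comes from: `ThreadPoolExecutor.submit` raises it when a thread tries to schedule work after the main thread has finished and interpreter shutdown has begun. The sketch below reproduces the same `RuntimeError` without wandb or Paddle by running a small child script in a subprocess. The child script is a hypothetical stand-in for wandb's upload thread, and the sketch assumes Python 3.9 or later, where the executor's shutdown hook runs before non-daemon threads are joined.

```python
import subprocess
import sys
import textwrap

CHILD_SCRIPT = textwrap.dedent("""
    import threading
    import time
    from concurrent.futures import ThreadPoolExecutor

    pool = ThreadPoolExecutor(max_workers=1)

    def late_submit():
        # Wake up after the main thread has finished, then try to queue work.
        time.sleep(0.5)
        pool.submit(print, "uploaded")

    threading.Thread(target=late_submit).start()
    # The main thread ends here, which starts interpreter shutdown.
""")


def reproduce():
    """Run the child script and return what it wrote to stderr."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stderr


stderr_text = reproduce()
print(stderr_text)
```

Seen this way, the traceback in this issue suggests the training process was already exiting when wandb tried to upload a file, so the underlying reason for the exit may be logged earlier in the output.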

Alanhzl commented 4 months ago

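> The error looks like a multithreading error inside Python. I suggest downgrading Python to 3.7 and Paddle to 2.2.2, and using the PaddleOCR 2.5.0 tag. May I ask why you are using this model? PP-OCRv4 is already available.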

python3 tools/export_model.py -c configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml -o Global.pretrained_model="./pretrain_models/ch_PP-OCRv3_rec_train/best_accuracy" Global.save_inference_dir="./pretrain_models/ch_PP-OCRv3_rec_train/inference"

./opt --model_file=./rec/rec_v3/Student/inference.pdmodel --param_file=./rec/rec_v3/Student/inference.pdiparams --optimize_out=./rec/rec_v3/Student/ch_ppocr_v3.0_rec_opt --valid_targets=arm --optimize_out_type=naive_buffer

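Because after downloading the V3 training model, I converted it to an inference model using the config file listed in the table, and then converting that to an nb model with the opt tool built from PaddleLite v2.12 fails with this error: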

Check failed: it != attrs().end(): No attributes called beta found for swish

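Even converting to the nb model with v2.13rc0 installed via pip install paddlelite gives the same error. I could not find a solution anywhere. It seems some people can convert without problems while others hit the same error, which is why I am using the V2 model.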

Please take a look at this issue.

PaddlePaddle / PaddleOCR

V2 recognition model training fails with an error. Is it out of GPU memory? #11716