PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
Apache License 2.0

Error during V2 recognition model training: is it running out of GPU memory? #11716

Closed Alanhzl closed 3 months ago

Alanhzl commented 4 months ago

Please provide the following information to quickly locate the problem

The config file used is rec_chinese_lite_train_v2.0.yml

`Global:
  use_gpu: true
  epoch_num: 500
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/rec_chinese_lite_v2.0-11
  save_epoch_step: 3
  eval_batch_step: [2000, 2000]
  cal_metric_during_train: True
  pretrained_model:
  checkpoints:
  save_inference_dir:
  use_visualdl: False
  infer_img: doc/imgs_words/ch/word_1.jpg
  character_dict_path: train_data/rec/new_dic.txt
  max_text_length: 25
  infer_mode: False
  use_space_char: True
  save_res_path: ./output/rec/predicts_chinese_lite_v2.0.txt
  use_wandb: True

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.001
    warmup_epoch: 3
  regularizer:
    name: 'L2'
    factor: 0.00001

Architecture:
  model_type: rec
  algorithm: CRNN
  Transform:
  Backbone:
    name: MobileNetV3
    scale: 0.5
    model_name: small
    small_stride: [1, 2, 2, 2]
  Neck:
    name: SequenceEncoder
    encoder_type: rnn
    hidden_size: 48
  Head:
    name: CTCHead
    fc_decay: 0.00001

Loss:
  name: CTCLoss

PostProcess:
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc

Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/new/
    label_file_list: ["./train_data/new/tr_train_10w.txt","./train_data/new/chinese_dataset.txt","./train_data/new/synthetic_chinese_string_dataset.txt"]
    ratio_list: [0.1,0.005,0.34]
    transforms:

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/new
    label_file_list: ["./train_data/new/tr_train_10w.txt","./train_data/new/chinese_dataset.txt","./train_data/new/synthetic_chinese_string_dataset.txt"]
    ratio_list: [0.01,0.0005,0.034]
    transforms:

wandb:
  project: PaddleOCR-tools
  entity: xzy-ocr
  name: test-train-6`

The training set has a little over 20,000 samples. After reaching epoch 298 of 500, training fails with:

Exception in thread Thread-11 (_thread_body):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/filesync/step_upload.py", line 172, in _thread_body
    self._handle_event(event)
  File "/usr/local/lib/python3.10/dist-packages/wandb/filesync/step_upload.py", line 236, in _handle_event
    self._start_upload_job(event)
  File "/usr/local/lib/python3.10/dist-packages/wandb/filesync/step_upload.py", line 257, in _start_upload_job
    self._spawn_upload_sync(event)
  File "/usr/local/lib/python3.10/dist-packages/wandb/filesync/step_upload.py", line 292, in _spawn_upload_sync
    self._pool.submit(run_and_notify)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 169, in submit
    raise RuntimeError('cannot schedule new futures after '
RuntimeError: cannot schedule new futures after interpreter shutdown

The GPU is a GTX 1050 Ti with 4 GB of memory. Is this running out of GPU memory?
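One rough way to approach this question is to check what the failure text actually mentions. The sketch below is only illustrative: the marker strings and the `classify_failure` helper are assumptions made for this example, not an official or exhaustive list of Paddle error messages. It also says nothing about why the interpreter was shutting down in the first place; the main process may have exited for a reason that appears elsewhere in the log.

```python
# Assumed marker strings (illustrative only, not an official Paddle list).
OOM_MARKERS = ("out of memory", "resourceexhaustederror")
THREAD_MARKERS = ("cannot schedule new futures after interpreter shutdown",)


def classify_failure(log_text: str) -> str:
    """Return a coarse label for a failure log based on substring matches."""
    text = log_text.lower()
    if any(marker in text for marker in OOM_MARKERS):
        return "gpu_oom"
    if any(marker in text for marker in THREAD_MARKERS):
        return "thread_pool_shutdown"
    return "unknown"


# Tail of the traceback reported in this thread.
reported = (
    'File "/usr/lib/python3.10/concurrent/futures/thread.py", line 169, in submit\n'
    "RuntimeError: cannot schedule new futures after interpreter shutdown"
)
print(classify_failure(reported))
```

On the traceback quoted above, the helper matches only the thread-pool marker, since the text contains no out-of-memory wording.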

TingquanGao commented 4 months ago

Are you using Paddle v2.12? I suggest trying again with the latest 2.5.2 or 2.6.0 release.

Alanhzl commented 4 months ago

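Paddle-Lite is v2.12 and PaddleOCR is v2.7. After this error occurs, I can still continue by resuming training from a checkpoint. With the dataset set to 10,000 samples, training runs to completion. After setting it to 20,000 samples I tried three times, and each time it was interrupted by this error at around epoch 300.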
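For reference, resuming from a checkpoint in PaddleOCR is done by overriding `Global.checkpoints` on the `tools/train.py` command line. The sketch below only assembles such a command. The `build_resume_command` helper is hypothetical, the config path is an assumption about where the file sits in the repository checkout, and the `latest` prefix assumes the default checkpoint naming inside `save_model_dir`.

```python
import shlex


def build_resume_command(config_path: str, save_model_dir: str) -> list[str]:
    """Build a tools/train.py invocation that resumes from the latest checkpoint."""
    # Assumption: the most recent checkpoint is saved under the prefix "latest".
    checkpoint_prefix = f"{save_model_dir.rstrip('/')}/latest"
    return [
        "python3", "tools/train.py",
        "-c", config_path,
        "-o", f"Global.checkpoints={checkpoint_prefix}",
    ]


cmd = build_resume_command(
    # Assumed location of the config file; adjust to the local checkout.
    "configs/rec/ch_ppocr_v2.0/rec_chinese_lite_train_v2.0.yml",
    # save_model_dir taken from the config posted above.
    "./output/rec_chinese_lite_v2.0-11",
)
print(shlex.join(cmd))
```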

TingquanGao commented 4 months ago

Which Paddle version are you using?

Alanhzl commented 4 months ago

2.6.0

TingquanGao commented 4 months ago

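The error looks like a multithreading error inside Python. I suggest downgrading Python to 3.7 and Paddle to 2.2.2, and using the PaddleOCR 2.5.0 tag. Also, why are you using this model? PP-OCRv4 is already available.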
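To illustrate where this kind of message comes from: `ThreadPoolExecutor.submit` in the standard library refuses new work once the pool, or the interpreter itself, has started shutting down. The minimal sketch below reproduces the closely related pool-shutdown case, which is easier to trigger in isolation. It is not the wandb code path from the traceback, and `submit_after_shutdown` is a name made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def submit_after_shutdown() -> str:
    """Submit work to an executor that was already shut down and return the error text."""
    pool = ThreadPoolExecutor(max_workers=1)
    pool.shutdown(wait=True)
    try:
        # The pool no longer accepts tasks, so this call raises RuntimeError.
        pool.submit(print, "never runs")
    except RuntimeError as exc:
        return str(exc)
    return "no error"


message = submit_after_shutdown()
print(message)
```

In the reported traceback the same check fires because the interpreter was already exiting when the background upload thread tried to schedule another job, so the message is a symptom of the process ending rather than an explanation of why it ended.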

Alanhzl commented 4 months ago

python3 tools/export_model.py -c configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml -o Global.pretrained_model="./pretrain_models/ch_PP-OCRv3_rec_train/best_accuracy" Global.save_inference_dir="./pretrain_models/ch_PP-OCRv3_rec_train/inference"

./opt --model_file=./rec/rec_v3/Student/inference.pdmodel --param_file=./rec/rec_v3/Student/inference.pdiparams --optimize_out=./rec/rec_v3/Student/ch_ppocr_v3.0_rec_opt --valid_targets=arm --optimize_out_type=naive_buffer

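Because after downloading the V3 training model, I converted it to an inference model using the config file listed in the table, then used the opt tool built from Paddle-Lite v2.12 to convert it to an nb model, and that step fails with: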

Check failed: it != attrs().end(): No attributes called beta found for swish

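Even when I install v2.13rc0 with pip install paddlelite and use it to convert to the nb model, the same error appears. I could not find a solution anywhere. Some people seem to be able to convert normally while others hit the same error, which is why I went with the V2 model.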

Could you please take a look at this problem?