PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

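Self-trained model raises an error, and training still defaults to a single card after configuring single-machine multi-GPU #3410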

Closed alittejoke closed 1 year ago

alittejoke commented 3 years ago

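Training machine: paddlepaddle-gpu-2.02post110, dual 3060. Test machine: paddlepaddle-gpu-2.02post110, dual 3060. Test machine: paddlepaddle-gpu-1.85post107, single 1650s.

1. [image] After training and converting to an inference model, how does it correspond to the models and params shown in [image]? Do I modify them directly?

2. An error is raised during loading: [image 16270106691295]

3. How do I set up single-machine multi-GPU training? Single-machine single-GPU training works normally. The system is Windows. [image 16270109291103] Since I am on Windows, I changed it to os.system(set CUDA_VISIBLE_DEVICES=0,1). After setting this I ran into the same problem: training still uses only one card.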

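I also tried changing the dist check in train.py so that it defaults to dist. After that I also tried:

Multi-GPU parallel computation

dist.spawn(train)

But the device is still 0; the other card is never used at all.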
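A side note on the attempt above: `os.system(set CUDA_VISIBLE_DEVICES=0,1)` runs `set` in a short-lived child shell, so the variable never reaches the Python process that starts training. Below is a minimal sketch of setting it in-process instead. `select_gpus` is a hypothetical helper name, and the assignment is assumed to happen before the framework initializes CUDA. It only controls which cards are visible; on its own it does not make training use more than one of them.

```python
import os


def select_gpus(gpu_ids):
    """Expose only the given GPU indices to this process and its children.

    Hypothetical helper for illustration; call it before importing paddle.
    """
    value = ",".join(str(i) for i in gpu_ids)
    # os.environ changes the current process environment, unlike
    # os.system("set ..."), which only affects a temporary child shell.
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


if __name__ == "__main__":
    print(select_gpus([0, 1]))
```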

The training parameters were set by modifying configs\rec\ch_ppocr_v2.0\rec_chinese_common_train_v2.0.yml. The relevant configuration is:

    Global:
      use_gpu: true
      epoch_num: 300
      log_smooth_window: 20
      print_batch_step: 10
      save_model_dir: ./output/rec_chinese_common_v2.0
      save_epoch_step: 3
      # evaluation is run every 5000 iterations after the 4000th iteration
      eval_batch_step: [0, 2000]
      cal_metric_during_train: True
      pretrained_model: D:\PaddleOCR\pretrain_models\rec_r34_vd_none_bilstm_ctc_v2.0_train/best_accuracy
      checkpoints:
      save_inference_dir: D:\PaddleOCR\pretrain_models\inference
      use_visualdl: False
      infer_img: doc/imgs_words/ch/word_1.jpg
      # for data or label process
      character_dict_path: D:\PaddleOCR\ppocr/utils/ppocr_keys_v1.txt
      character_type: ch
      max_text_length: 25
      infer_mode: False
      use_space_char: True
      save_res_path: ./output/rec/predicts_chinese_r34common_v2.0.txt

    Optimizer:
      name: Adam
      beta1: 0.9
      beta2: 0.999
      lr:
        name: Cosine
        learning_rate: 0.0005
      regularizer:
        name: 'L2'
        factor: 0.00002

    Architecture:
      model_type: rec
      algorithm: CRNN
      Transform:
      Backbone:
        name: ResNet
        layers: 34
      Neck:
        name: SequenceEncoder
        encoder_type: rnn
        hidden_size: 256
      Head:
        name: CTCHead
        fc_decay: 0.00002

    Loss:
      name: CTCLoss

    PostProcess:
      name: CTCLabelDecode

    Metric:
      name: RecMetric
      main_indicator: acc

    Train:
      dataset:
        name: PGDataSet
        data_dir: D:/PaddleOCR/train_data/train_img/
        label_file_list: ["D:/PaddleOCR/train_data/train_img/rec_gt.txt"]
        transforms:

    Eval:
      dataset:
        name: PGDataSet
        data_dir: D:/PaddleOCR/train_data/test_img/
        label_file_list: ["D:/PaddleOCR/train_data/test_img/rec_gt.txt"]
        transforms:

6. I look forward to a clear reply as soon as possible. Many thanks!

littletomatodonkey commented 3 years ago
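  1. Please install Paddle 2.0.2. Your Paddle looks like 1.8.5, which is a bit old.
  2. Windows probably does not support multi-GPU runs yet (multi-GPU depends on NCCL). Alternatively, you could try launching with distributed.launch; see the train.sh script for reference.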
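For reference, the launcher mentioned above is normally invoked as a Python module from the command line. The sketch below only assembles such a command as an argument list and does not run it. `build_launch_cmd` is a hypothetical helper, and the `--gpus` flag and the `tools/train.py -c <config>` form are assumptions based on the Paddle 2.x launcher and the repository's train.sh, so they should be checked against the installed version. Because the launcher relies on NCCL, it is not expected to work on Windows.

```python
import sys


def build_launch_cmd(config_path, gpu_ids, python=sys.executable):
    """Assemble a paddle.distributed.launch command line (assumed flag names)."""
    gpus = ",".join(str(i) for i in gpu_ids)
    return [
        python, "-m", "paddle.distributed.launch",
        "--gpus", gpus,                      # assumed flag; verify for your Paddle version
        "tools/train.py", "-c", config_path,
    ]


if __name__ == "__main__":
    cmd = build_launch_cmd(
        "configs/rec/ch_ppocr_v2.0/rec_chinese_common_train_v2.0.yml", [0, 1]
    )
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` from the repository root on a platform where multi-GPU training is supported.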
alittejoke commented 3 years ago

2.0.2 is already installed, and the problem is not solved. The test machine already has a 2.0.2 environment.

alittejoke commented 3 years ago

Launching with distributed.launch failed. The reason is that Windows does not support NCCL.

littletomatodonkey commented 3 years ago

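OK, then for now you can only run on a single card on Windows. Also, judging from your error message, you are still using Paddle 1.8.5. I suggest checking again whether your Python environment is the right one.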
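One way to check which environment is actually in use is to ask the running interpreter what it has installed. The following is a minimal sketch using only the standard library (Python 3.8+). `installed_versions` is a hypothetical helper, and the distribution names it queries are assumptions about how the packages were installed with pip.

```python
import sys
from importlib import metadata


def installed_versions(names=("paddlepaddle", "paddlepaddle-gpu", "paddleocr")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # The interpreter path shows which environment this script runs in.
    print("interpreter:", sys.executable)
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it with the same interpreter that runs the OCR script shows whether that environment holds the 1.8.5 or the 2.0.2 build.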

alittejoke commented 3 years ago

[image] 1. How do I convert an inference model into params and model files? 2. Should I directly replace paddleocr\tools\infer\utility.py under site_packages (installed from paddleocr 1.0.1.whl) with PaddleOCR\tools\infer\utility.py from the PaddleOCR training project?

alittejoke commented 3 years ago

I noticed that the model-loading parts of the two files are different.
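The difference between the two loaders appears to come down to which file names they look for in the model directory. Below is a minimal sketch that reports which layout a directory has. `detect_model_layout` is a hypothetical helper, and the file names (`model`/`params` for the older release, `inference.pdmodel`/`inference.pdiparams` for the 2.x release) are assumptions that should be confirmed against the two utility.py files.

```python
from pathlib import Path


def detect_model_layout(model_dir):
    """Return 'paddle2', 'legacy', or 'unknown' based on file names only."""
    d = Path(model_dir)
    # Assumed 2.x layout: inference.pdmodel + inference.pdiparams
    if (d / "inference.pdmodel").is_file() and (d / "inference.pdiparams").is_file():
        return "paddle2"
    # Assumed older layout: model + params
    if (d / "model").is_file() and (d / "params").is_file():
        return "legacy"
    return "unknown"


if __name__ == "__main__":
    import sys
    print(detect_model_layout(sys.argv[1] if len(sys.argv) > 1 else "."))
```

The check inspects names only; it does not validate the file contents or convert between the two formats.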

alittejoke commented 3 years ago

After switching the version from 1.85 to 2.02 the error disappeared, but there is no detection result. The result is shown as: [image 16270346354158]

alittejoke commented 3 years ago

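[image: WeCom screenshot_162713565170] paddlepaddle-gpu1.8.5post107, paddleocr version 1.0.1. The first screenshot shows two images running normally with the model/params model. [image: Untitled] paddlepaddle-gpu2.0.2post110, paddleocr version 2.0.2. The second screenshot shows the same two images raising an error with the rec-inference-server model.

What is the cause of the error in the second screenshot (paddlepaddle-gpu2.0.2post110)? Second question: how do I convert an inference model into model/params files?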

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.