Unable to resume model training

asif-ca commented 1 year ago

Please provide the following information to quickly locate the problem

System Environment: 20.04.6 LTS
Version: Paddle: 2.5.1, PaddleOCR: 2.5
Related components:
Command Code: python3 tools/train.py -c configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml -o Global.checkpoints=output/rec_new_train/latest
Complete Error Message:
[2023/10/16 11:30:45] ppocr INFO: resume from output/rec_new_train/latest
[2023/10/16 11:30:45] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 10000 iterations
[2023/10/16 11:30:45] ppocr INFO: best metric, acc: 0.9489788677900999, is_float16: False, norm_edit_dis: 0.9697556926937365, Teacher_acc: 0.9486444026809259, Teacher_norm_edit_dis: 0.9702948101480247, fps: 577.635715075861, best_epoch: 47, start_epoch: 51

After printing this message, the process ends automatically and training does not start. All paths, etc. are correct.

Am I doing something wrong?

@andyjpaddle @ZeyuChen @haobibo @bingooo @shiyutang @Evezerest please have a look

We provide AceIssueSolver to solve issues, do you want it? (Please write yes/no): Yes

shiyutang commented 1 year ago

You can use pdb to see where the program exits. As the log shows, the model has been loaded.

asif-ca commented 1 year ago

@shiyutang Thanks a lot for your reply.

I fixed the issue by setting epoch_num in the yml file to a value greater than the epoch_num for which the model had previously been trained. The training loop was quitting immediately because I was trying to resume training on changed data for only 10 more epochs. I changed the data because the fine-tuned model has a white space issue: it is unable to detect white spaces.
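
Why a too-small epoch_num makes a resumed run exit right away can be illustrated with a minimal sketch. It assumes the trainer iterates over epochs from start_epoch up to and including epoch_num; that loop shape and the function name epochs_to_run are assumptions for illustration, not PaddleOCR's actual code. The value 51 is the start_epoch printed in the log above.

```python
def epochs_to_run(start_epoch, epoch_num):
    """Return the epochs a resumed run would still execute.

    Assumption: the training loop covers start_epoch..epoch_num inclusive.
    """
    return list(range(start_epoch, epoch_num + 1))


# start_epoch restored from the checkpoint is 51 (see the log above).
print(epochs_to_run(51, 10))       # epoch_num below start_epoch -> []
print(len(epochs_to_run(51, 70)))  # epoch_num raised above start_epoch -> 20
```

Under that assumption, epoch_num is the total number of epochs rather than the number of additional ones, so resuming for 20 more epochs from start_epoch 51 would mean setting epoch_num to 70, not 20.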

@shiyutang, any direction on the white space issue would be really appreciated. Below is what I tried, based on the docs.

I added 50,000 white space images (multiple words per image, as suggested here) to the recognizer dataset of 90 thousand images and resumed training for 20 epochs, but I still see zero improvement in detecting white spaces during recognition.

As per my understanding of the white space issue from here, I need to add more images containing white space to the dataset, so I added images like these: 000000035 000000038 000000025

And labels file like this:

000000035.jpg   chipset natal
000000038.jpg   acdbline usable
000000025.jpg   csa offenses

But nothing improved. The pre-trained model is able to detect white spaces (it misses some spaces, but it still detects them), whereas the fine-tuned model is unable to detect white spaces in text at all.

Am I doing something wrong in the dataset compared with what is described here?

Please suggest!

danteblink commented 1 year ago

I'm facing the same issue. Did you find a way to overcome the problem with the white spaces after fine-tuning?

asif-ca commented 1 year ago

@danteblink, it is crucial to add more data with white spaces to improve recognition accuracy. Add as much data with white spaces as you can to get the best results. For further guidance, please refer to this article.

Initially, I fine-tuned the model on nearly half a million images for 50 epochs. However, most of the images contained only single words, and I added only 30,000 two-word images containing white spaces, which was not a good ratio. As a result, the model was unable to detect white spaces.

After that, I fine-tuned the same base model on almost 100 thousand synthesized images in total, of which roughly 60-70 thousand had white spaces. This time, I observed that the model was able to detect white spaces. Currently, I am working on collecting real data to train the model further.
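
The share of labels with white spaces can be checked before training with a small script. This is a rough sketch, assuming the label file has one sample per line with the image name and the text separated by a tab; the function name whitespace_ratio and the sample lines (two taken from the label file shown earlier, plus one made-up single-word entry) are only for illustration.

```python
def whitespace_ratio(label_lines):
    """Fraction of labels whose text contains at least one space.

    Assumption: each line is an image name, a tab, then the label text.
    """
    total = 0
    with_space = 0
    for line in label_lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        _, text = line.split("\t", 1)
        total += 1
        if " " in text.strip():
            with_space += 1
    return with_space / total if total else 0.0


sample = [
    "000000035.jpg\tchipset natal",
    "000000038.jpg\tacdbline usable",
    "000000099.jpg\tsingleword",  # hypothetical single-word entry
]
print(f"{whitespace_ratio(sample):.2f}")
```

For a real label file, pass the open file object instead of the sample list. By the numbers above, the first run (30,000 of nearly half a million images) would come out at roughly 0.06, and the second run at around 0.6 to 0.7.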

Also, try higher values for det_db_unclip_ratio, such as 2 or 3.

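from paddleocr import PaddleOCR

# Load the fine-tuned recognizer and expand the detected text boxes.
custom_ocr = PaddleOCR(
    use_angle_cls=True,
    rec_model_dir='/content/rec_trained_model',
    det_db_unclip_ratio=2.9,
)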

danteblink commented 1 year ago

@asif-ca Thank you for your response. I have improved the white space detection.

asif-ca commented 7 months ago

I was able to resume training; the explanation is given here.

PaddlePaddle / PaddleOCR

Unable to resume model training #11090