[Closed] leminhyen2 closed this issue 3 years ago
We have provided a pre-trained CRNN model. It is recommended to load the pre-trained model for your fine-tune training with the commands below:
wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_rec_pre.tar && tar xf ch_ppocr_server_v2.0_rec_pre.tar
python3 tools/train.py -c /config/file/rec_chinese_common_v2.0.yml -o Global.pretrained_model=./ch_ppocr_server_v2.0_rec_pre/best_accuracy
@LDOUBLEV After 2 failed attempts, I managed to train the common model to recognize the set of traditional Chinese characters I needed. However, it no longer recognizes the pre-trained characters in the old Chinese dict file. I'm still a newbie in machine learning, so pardon my inexperience.
Is there any way I can add the set of new Chinese characters to the current dict file and train the model to recognize only those new characters while still retaining its knowledge of the original characters?
Since I now have both your pre-trained common model and my own specialized model, is there a way to combine the two?
Thank you (謝謝)
First, you should add the new characters to the end of the old Chinese dict file.
Then, prepare a large amount of data containing the new characters for fine-tune training. Of course, training data containing the original characters is also required, in order to avoid a drop in recognition accuracy on those characters. The more data, the better.
Finally, start training as described here:
https://github.com/PaddlePaddle/PaddleOCR/issues/1583#issuecomment-751433641
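The first step above, extending the dict file, can be scripted. Below is a minimal sketch that appends characters to a dict file (one character per line) while skipping any that are already present; the file name and example characters are made up for illustration:

```python
import os
import tempfile

def append_new_chars(dict_path, new_chars):
    """Append characters to a dict file (one char per line),
    skipping any that are already present."""
    with open(dict_path, encoding="utf-8") as f:
        existing = {line.rstrip("\n") for line in f}
    to_add = [c for c in new_chars if c not in existing]
    with open(dict_path, "a", encoding="utf-8") as f:
        for c in to_add:
            f.write(c + "\n")
    return to_add

# demo with a throwaway dict file containing two characters
path = os.path.join(tempfile.mkdtemp(), "dict.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("一\n二\n")

added = append_new_chars(path, ["鴫", "二", "鯰"])
print(added)  # only the characters not already in the dict
```

Appending (rather than inserting) keeps the original characters at their original line positions, which matters when loading pre-trained weights.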
Also, if it is convenient, could you share which characters are missing?
@LDOUBLEV Can I also add the new characters at the beginning of the file? Your training guide mentioned that all the characters that appear will be reindexed/mapped to the provided dictionary, so technically the new characters can be anywhere in the dictionary as long as they show up in the training images?
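For context (this is general background, not from the thread): in CTC-style recognizers such as CRNN, each character's class index is typically its line position in the dict file, and the pre-trained output layer is tied to those indices. The sketch below illustrates why appending at the end is safer than prepending:

```python
# Each character's index is its line position in the dict file.
old_dict = ["一", "二", "三"]
new_appended = old_dict + ["鴫"]    # old indices unchanged
new_prepended = ["鴫"] + old_dict   # every old index shifts by one

old_idx = {c: i for i, c in enumerate(old_dict)}
app_idx = {c: i for i, c in enumerate(new_appended)}
pre_idx = {c: i for i, c in enumerate(new_prepended)}

# Appending preserves the old character-to-index mapping...
print(all(old_idx[c] == app_idx[c] for c in old_dict))  # True
# ...while prepending invalidates it, so pre-trained
# output-layer weights would point at the wrong characters.
print(all(old_idx[c] == pre_idx[c] for c in old_dict))  # False
```

So the new characters can appear anywhere only if you train the output layer from scratch; when fine-tuning from pre-trained weights, appending at the end keeps the old mapping intact.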
The traditional "Chinese characters" I mentioned are kokuji, a set of kanji characters that originated in Japan and are used in nouns, verbs, and names. Some examples are shigi (鴫, snipe), kochi (鯒, flathead), and namazu (鯰, catfish). Here is a more extensive list: https://www.sljfaq.org/afaq/kokuji-list.html
What ratio would you recommend for "reminding"/re-training the old characters in your ppocr_keys_v1.txt versus teaching the new set of kokuji? Is one image per old character enough? Or maybe 10 images per old character? How many times should a new character appear in the images (100 times? 1,000?)
Thank you in advance
The format of the training data is described in the linked documentation.
You can synthesize a large amount of training data with a text rendering tool or PPOCRLabel. The effect of the ratio of new to old characters on recognition accuracy can be verified by experiment.
@leminhyen2 Can you share your training? I'm fine-tuning the Japanese model with a small change to the dictionary (I just added the character ¥), but the results are very bad for other words already in the dictionary (my fine-tune data only contains numbers with the ¥ symbol).
@Sayaka91 did you make any progress on your custom training
I ran my custom training with more data, containing both existing characters and new characters. The result is better but does not meet my expectations; there are a lot of characters and font types for which I cannot create sample images.
@Sayaka91 Can you share your training notebook if you have one? If not, it would still be good to learn about the nuances I can expect if I try it on my own. I am basically trying to add symbols like £ and © into the eng_dict.
Hi, I want to add a set of traditional Chinese characters that are currently not in ppocr_keys_v1.txt.
I tried adding them at the end of the file, after line 6623. However, after my attempt to fine-tune the common model, my new model became corrupted: even testing on word_1.jpg yields a long gibberish line like the one below. How can I add new characters to the dictionary and fine-tune the existing model?
[2020/12/26 21:23:21] root INFO: load pretrained model from ['output/rec_chinese_common_v2.0/best_accuracy']
[2020/12/26 21:23:21] root INFO: infer_img: doc/imgs_words/ch/word_1.jpg
[2020/12/26 21:23:21] root INFO: result: ('酝殊腆乓厩汛租厩租浍租浍租', 0.00038282474)
[2020/12/26 21:23:21] root INFO: success!