How to add new words to dictionary and fine-tune existing Chinese model?

PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

https://paddlepaddle.github.io/PaddleOCR/

Apache License 2.0


How to add new words to dictionary and fine-tune existing Chinese model? #1583

Closed leminhyen2 closed 3 years ago

leminhyen2 commented 3 years ago

Hi, I want to add a set of traditional Chinese characters currently not in the ppocr_keys_v1.txt

I tried adding them at the end of the file, after line 6623. However, after my attempt to fine-tune the common model, my new model became corrupted. Even testing on word_1.jpg yields a long line of gibberish like the one below. How can I add new words to the dictionary and fine-tune the existing model?

[2020/12/26 21:23:21] root INFO: load pretrained model from ['output/rec_chinese_common_v2.0/best_accuracy']
[2020/12/26 21:23:21] root INFO: infer_img: doc/imgs_words/ch/word_1.jpg
[2020/12/26 21:23:21] root INFO: result: ('酝殊腆乓厩汛租厩租浍租浍租', 0.00038282474)
[2020/12/26 21:23:21] root INFO: success!

LDOUBLEV commented 3 years ago

We provide a pre-trained CRNN model. It is recommended that you load this pre-trained model for your fine-tuning, using the commands below:

wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_rec_pre.tar && tar xf ch_ppocr_server_v2.0_rec_pre.tar
python3 tools/train.py -c /config/file/rec_chinese_common_v2.0.yml -o Global.pretrained_model=./ch_ppocr_server_v2.0_rec_pre/best_accuracy

leminhyen2 commented 3 years ago

@LDOUBLEV after 2 failed attempts, I managed to train the common model to recognize the set of traditional Chinese characters that I needed. However, it no longer recognizes the pre-trained characters in the old Chinese dict file. I'm still a newbie in machine learning, so pardon my inexperience.

Is there any way I can add the set of new Chinese characters to the current dict file and train the model to recognize those new characters while still retaining its knowledge of the original characters?

Since I now have both your pre-trained common model and my own specialized model, is there a way to combine the two?

Thank you (謝謝)

LDOUBLEV commented 3 years ago

First, add the new characters to the end of the old Chinese dict file.

Then, prepare a large amount of data containing the new characters for fine-tuning. Training data containing the original characters is also required, to avoid a drop in recognition accuracy on the original characters. The more data, the better.

Finally, start training as described in this comment:
https://github.com/PaddlePaddle/PaddleOCR/issues/1583#issuecomment-751433641
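The first step above (appending new characters to the end of the dict file) can be sketched as follows. This is a minimal sketch, not a PaddleOCR utility: it assumes the dictionary is a UTF-8 text file with one character per line, and `append_new_chars` is a hypothetical helper name introduced only for illustration.

```python
from pathlib import Path


def append_new_chars(dict_path, new_chars):
    """Append characters missing from the dictionary to its end, one per line.

    Existing lines are never reordered, so every original character keeps
    the line position it had before. Returns the characters actually added.
    """
    path = Path(dict_path)
    text = path.read_text(encoding="utf-8")
    # Assumption: one dictionary entry per line; empty lines are ignored.
    existing = [line for line in text.split("\n") if line != ""]

    seen = set(existing)
    added = []
    for ch in new_chars:
        if ch not in seen:  # skip duplicates and characters already present
            seen.add(ch)
            added.append(ch)

    if added:
        # Make sure the first new entry starts on its own line.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        with path.open("a", encoding="utf-8") as f:
            f.write(prefix + "\n".join(added) + "\n")
    return added
```

For example, `append_new_chars("ppocr_keys_v1.txt", ["鴫", "鯒", "鯰"])` would add only those of the three characters that the file does not already contain. Working on a copy of the dictionary keeps the original available for comparison.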

LDOUBLEV commented 3 years ago

Also, could you share some of the missing characters?

leminhyen2 commented 3 years ago

@LDOUBLEV Can I also add the new characters at the beginning of the file? Your training guide mentioned that all the characters that appeared will be reindexed/mapped to the provided dictionary, so technically new characters can be anywhere in the dictionary as long as they show up in the training images?

The traditional "Chinese characters" I mentioned are kokuji, a set of kanji that appear only in Japan and are used in nouns, verbs, and names. Some examples are Shigi (鴫 – snipe), Kochi (鯒 – flathead), and Namazu (鯰 – catfish). Here is a more extensive list of them: https://www.sljfaq.org/afaq/kokuji-list.html

What would be your recommended ratio between data that "reminds"/re-trains the model on the old characters in your ppocr_keys_v1.txt and data for the new set of kokuji? Is one image per old character enough? Or maybe 10 images per old character? How many times should a new character appear in the images (100 times, 1000 times)?

Thank you in advance
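On the question of where new characters may go in the dictionary, the toy example below shows why appending and prepending are not equivalent once a pre-trained model is involved. It is a sketch under stated assumptions: class indices follow line order in the dict file, index 0 is reserved for the CTC blank, and the three-character `old_dict` is only a stand-in for ppocr_keys_v1.txt. `build_index` is a hypothetical helper, not a PaddleOCR function.

```python
def build_index(chars):
    """Map each character to a class index; index 0 is kept for the CTC blank."""
    return {ch: i + 1 for i, ch in enumerate(chars)}


old_dict = ["一", "二", "三"]   # stand-in for the original dictionary
new_chars = ["鴫", "鯒"]        # characters to be added

old_index = build_index(old_dict)
appended = build_index(old_dict + new_chars)    # new characters at the end
prepended = build_index(new_chars + old_dict)   # new characters at the start

# Do the original characters keep the indices the pre-trained model learned?
kept_when_appended = all(appended[ch] == old_index[ch] for ch in old_dict)
kept_when_prepended = all(prepended[ch] == old_index[ch] for ch in old_dict)

print(kept_when_appended, kept_when_prepended)  # True False
```

With the new characters appended, every original character keeps its index, so whatever the pre-trained output layer learned for that index still points at the same character. With the new characters prepended, every original index shifts, and fine-tuning has to relearn the whole mapping.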

LDOUBLEV commented 3 years ago

> What would be your recommended ratio between data that "reminds"/re-trains the model on the old characters in your ppocr_keys_v1.txt and data for the new set of kokuji? Is one image per old character enough? Or maybe 10 images per old character? How many times should a new character appear in the images (100 times, 1000 times)?

For the format of the training data, refer to the link.

You can synthesize a large amount of training data with text render or PaddleOCRLabel. How the ratio of new to old characters affects recognition accuracy has to be verified experimentally.
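Since the ratio has to be found by experiment, it helps to make it a single parameter. The sketch below builds a mixed training list from two sets of label lines, assuming each line has the form `<image path>\t<text>`. `mix_labels` and `old_per_new` are hypothetical names chosen for illustration, and the default ratio is not a recommendation.

```python
import random


def mix_labels(old_lines, new_lines, old_per_new=1.0, seed=0):
    """Return a shuffled list with about `old_per_new` old samples per new sample.

    All new-character samples are kept; old-character samples are drawn at
    random without replacement, capped at the number available.
    """
    rng = random.Random(seed)  # fixed seed so experiments are repeatable
    n_old = min(len(old_lines), int(round(len(new_lines) * old_per_new)))
    mixed = list(new_lines) + rng.sample(list(old_lines), n_old)
    rng.shuffle(mixed)
    return mixed


# Toy data standing in for real label files.
old = [f"old/{i}.jpg\t一二三" for i in range(100)]
new = [f"new/{i}.jpg\t鴫鯒鯰" for i in range(10)]

train = mix_labels(old, new, old_per_new=3.0)
print(len(train))  # 40
```

Training once per candidate ratio and evaluating each run on two held-out sets, one with only old characters and one with only new characters, shows directly where accuracy on the old characters starts to drop.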

Sayaka91 commented 1 year ago

@leminhyen2 Can you share your training setup? I'm fine-tuning the Japanese model with a small change to the dictionary (I only added the character ¥), but the results are very bad for the other existing words in the dictionary (my fine-tuning data contains only numeric data with the ¥ symbol).

shsagnik commented 11 months ago

@Sayaka91 did you make any progress on your custom training?

Sayaka91 commented 11 months ago

> @Sayaka91 did you make any progress on your custom training?

I re-ran my custom training with more data, containing both existing characters and new characters. The result is better but still does not meet my expectations. There are a lot of characters and font types for which I cannot create sample images.
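One way to see which characters the fine-tuning data leaves out is to count dictionary characters across the label texts. This is a minimal sketch, assuming label lines of the form `<image path>\t<text>`; `char_coverage` is a hypothetical helper, and the four-entry dictionary is toy data.

```python
from collections import Counter


def char_coverage(dict_chars, label_lines):
    """Count occurrences of each character in the label texts.

    Returns (counts, missing, unknown): `missing` lists dictionary characters
    that never occur in the labels, `unknown` lists label characters that are
    not in the dictionary.
    """
    counts = Counter()
    for line in label_lines:
        # Everything after the first tab is treated as the transcription.
        _, _, text = line.rstrip("\n").partition("\t")
        counts.update(text)
    known = set(dict_chars)
    missing = [ch for ch in dict_chars if counts[ch] == 0]
    unknown = sorted(ch for ch in counts if ch not in known)
    return counts, missing, unknown


dict_chars = ["1", "2", "¥", "円"]
labels = ["a.jpg\t¥12", "b.jpg\t¥1x"]

counts, missing, unknown = char_coverage(dict_chars, labels)
print(missing, unknown)  # ['円'] ['x']
```

Characters in `missing` get no training signal during fine-tuning, which is consistent with the accuracy loss on existing characters described in this thread when the data contains only digits and ¥.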

shsagnik commented 11 months ago

@Sayaka91 can you share your training notebook, if there is one? If not, it would be good to learn about the nuances I can expect if I try this on my own. I am basically trying to add symbols like £ and © to the eng_dict.