Shreeshrii / tessdata_shreetest

finetuned traineddata files for tesseract 4.0.0 for testing

How to fine tune your digits_comma file #4

Open carmonasl opened 5 years ago

carmonasl commented 5 years ago

Hello @Shreeshrii , many thanks for training new files for tesseract, the digits_comma one has improved the accuracy of my model a lot, almost 100% as you can see here https://stackoverflow.com/questions/53866109/pre-processing-image-tesseract-improvement

I'd like to fine tune it with additional fonts so tesseract doesn't mix up 3s, 5s and 9s. Could you please tell me the best procedure to follow for this or a link to a tutorial to know how you did it?

Many thanks again :)

safijari commented 5 years ago

This would be very helpful for me as well.

Shreeshrii commented 5 years ago

Please see https://github.com/Shreeshrii/tessdata_shreetest/issues/1#issuecomment-425947492 for details of the training done.

Please follow the "Fine Tuning for Impact" section of the training wiki page.

I can share the bash script I used, but it is highly dependent on the file paths.


carmonasl commented 5 years ago

@Shreeshrii many thanks for the advice! It'd be awesome if I could use the script as a starting point :)

Shreeshrii commented 5 years ago

Please see https://github.com/Shreeshrii/tessdata_shreetest/commit/b69b7e6ba6c7b0bd15f1b5541ac8fa5746383ad4 for the scripts used.

Shreeshrii commented 5 years ago

You can also see https://github.com/Shreeshrii/tessdata_ocrb for a sample finetune training with just one font.

safijari commented 5 years ago

This is awesome. Thank you so much.

I tend to always get confused in the wiki (maybe I'm not frequenting the right article?) so this would help out a lot.

carmonasl commented 5 years ago

Yes, thank you very much! @safijari, do you mind if we work together to learn faster?

Shreeshrii commented 5 years ago

Change --tessdata_dir ~/tesseract/tessdata to the directory that matches your TESSDATA_PREFIX and contains eng.traineddata.

On Thu, 20 Dec 2018, 18:48 carmonasl <notifications@github.com> wrote:

I get the error "Failed loading language 'eng'". It uses TESSDATA_PREFIX=/home/alberto/tesseract/tessdata, and I can see eng.traineddata in that folder. Is there anything I need to change?

I'm sorry for the question @Shreeshrii, I'm sure this is quite stupid.

#!/bin/bash

~/tesseract/src/training/tesstrain.sh \
  --fonts_dir ~/fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/langdata \
  --tessdata_dir ~/tesseract/tessdata \
  --training_text ./eng.digits.training_text \
  --workspace_dir ~/tmp \
  --output_dir ~/tesstutorial/digits \
  --fontlist "Abel Regular" "Montserrat Regular" "Roboto Medium"

rm -rf ~/tesstutorial/digits_from_full
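A quick way to check the advice at the top of this comment is to confirm that the directory passed as --tessdata_dir really contains eng.traineddata before running tesstrain.sh. The following is only a minimal sketch: has_traineddata is a hypothetical helper (it is not part of tesstrain.sh), and the default path is an assumption taken from the script above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Tesseract): succeeds only if the given
# directory contains <lang>.traineddata. The language defaults to eng.
has_traineddata() {
    local dir="$1" lang="${2:-eng}"
    [ -f "$dir/$lang.traineddata" ]
}

# Assumed default: the directory used in the script above. Replace it with
# the directory you intend to pass as --tessdata_dir.
TESSDATA_DIR="${TESSDATA_PREFIX:-$HOME/tesseract/tessdata}"

if has_traineddata "$TESSDATA_DIR" eng; then
    echo "OK: $TESSDATA_DIR/eng.traineddata found"
else
    echo "Missing: $TESSDATA_DIR/eng.traineddata"
fi
```

If the check reports the file as missing, point --tessdata_dir (and TESSDATA_PREFIX) at the directory that does hold it.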


carmonasl commented 5 years ago

Many thanks. It was due to a previous installation via tesseract-ocr; I deleted everything, built it from source, and was able to generate the lstmf files.

Now it says "/home/alberto/tesstutorial/digits_from_full/digits_plus_checkpoint is not a recognition model, trying training checkpoint... Failed to load model from /home/alberto/tesstutorial/digits_from_full/digits_plus_checkpoint"

I feel I'm quite close, thank you very much @Shreeshrii

carmonasl commented 5 years ago

I've tried with the OCRB example and I obtain the same result:

Edit: I've seen that another person had a similar problem (https://github.com/tesseract-ocr/tesseract/issues/1069), but I'm already using the latest version of eng.traineddata and I've checked that the file eng.lstm exists.

23:version:size=80, offset=23466570
Loaded file /home/alberto/tesstutorial/ocrb_from_full/eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 111 to 14!
Num (Extended) outputs,weights in Series:
1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys64:64, 20736
Lfx96:96, 61824
Lrx96:96, 74112
Lfx512:512, 1247232
Fc14:14, 0
Total weights = 1404064
Previous null char=110 mapped to 13
Continuing from /home/alberto/tesstutorial/ocrb_from_full/eng.lstm
Loaded 404/404 pages (1-404) of document /home/alberto/tesstutorial/ocrb/eng.Abel_Regular.exp0.lstmf
Iteration 0: ALIGNED TRUTH : 9902013313
Iteration 0: BEST OCR TEXT : . 64 -4 -4 - . . 4 4 . 4
File /tmp/eng-2018-12-21.lpQ/eng.Abel_Regular.exp0.lstmf page 34 :
!intmode:Error:Assert failed:in file ../../../src/lstm/weightmatrix.cpp, line 268
./finetune-ocrb.sh: line 30: 23761 Segmentation fault (core dumped) OMP_THREAD_LIMIT=1 lstmtraining --model_output ~/tesstutorial/ocrb_from_full/ocrb_plus --traineddata ~/tesstutorial/ocrb/eng/eng.traineddata --continue_from ~/tesstutorial/ocrb_from_full/eng.lstm --old_traineddata ~/tesseract/tessdata/eng.traineddata --train_listfile ~/tesstutorial/ocrb/eng.training_files.txt --debug_interval -1 --max_iterations 410
Failed to read continue from: /home/alberto/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint
/home/alberto/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint is not a recognition model, trying training checkpoint...
Failed to load model from: /home/alberto/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint
Loaded 404/404 pages (1-404) of document /home/alberto/tesstutorial/ocrb/eng.Abel_Regular.exp0.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 0, stage 0, Eval Char error rate=5.3387429, Word error rate=19.315182
Loaded 404/404 pages (1-404) of document /home/alberto/tesstutorial/ocrb/eng.Abel_Regular.exp0.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 0, stage 0, Eval Char error rate=3.4334789, Word error rate=10.338284
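For readability, the lstmtraining invocation buried in the log above can be laid out one flag per line. The sketch below copies the flags and values from that log; the run wrapper, the DRY_RUN switch, and the two directory parameters are hypothetical additions so that the command is only printed unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Hypothetical dry-run wrapper: prints the command unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi
}

# $1: tutorial base directory, $2: tessdata directory holding the
# eng.traineddata passed as --old_traineddata.
# All flags and values mirror the command shown in the log above.
finetune_ocrb() {
    local base="$1" tessdata="$2"
    run env OMP_THREAD_LIMIT=1 lstmtraining \
        --model_output "$base/ocrb_from_full/ocrb_plus" \
        --traineddata "$base/ocrb/eng/eng.traineddata" \
        --continue_from "$base/ocrb_from_full/eng.lstm" \
        --old_traineddata "$tessdata/eng.traineddata" \
        --train_listfile "$base/ocrb/eng.training_files.txt" \
        --debug_interval -1 \
        --max_iterations 410
}

finetune_ocrb "$HOME/tesstutorial" "$HOME/tesseract/tessdata"
```

Laid out this way, it is easier to see which eng.traineddata and which eng.lstm each flag points at.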

Shreeshrii commented 5 years ago

You need to use the eng.traineddata from the tessdata_best repository to extract the LSTM model, i.e. to extract /home/alberto/tesstutorial/ocrb_from_full/eng.lstm
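The extraction step described here is normally done with combine_tessdata -e, which writes out the component named by the output file's extension. A minimal sketch follows, assuming tessdata_best has been cloned to ~/tessdata_best (that location, the run wrapper, and the DRY_RUN switch are assumptions for illustration; the output path is the one from this thread). By default the command is only printed.

```shell
#!/bin/bash
# Hypothetical dry-run wrapper: prints the command unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi
}

# $1: directory holding the tessdata_best eng.traineddata
# $2: directory that should receive eng.lstm
extract_lstm() {
    run combine_tessdata -e "$1/eng.traineddata" "$2/eng.lstm"
}

# Assumed locations; adjust to your own checkout and output directory.
BEST_DIR="${BEST_DIR:-$HOME/tessdata_best}"
OUT_DIR="${OUT_DIR:-$HOME/tesstutorial/ocrb_from_full}"

extract_lstm "$BEST_DIR" "$OUT_DIR"
```

Run it with DRY_RUN=0 once the printed paths look right, and make sure the output directory exists first.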


carmonasl commented 5 years ago

I have rerun the makedata script with the eng.traineddata from tessdata_best repository and now I get the following error:

=== Phase UP: Generating unicharset and unichar properties files ===
[vie dic 21 04:21:09 CET 2018] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/eng-2018-12-21.cdc/eng.unicharset --norm_mode 1 /tmp/eng-2018-12-21.cdc/eng.Montserrat.exp0.box /tmp/eng-2018-12-21.cdc/eng.Roboto_Medium.exp0.box
Extracting unicharset from box file /tmp/eng-2018-12-21.cdc/eng.Montserrat.exp0.box
Extracting unicharset from box file /tmp/eng-2018-12-21.cdc/eng.Roboto_Medium.exp0.box
Wrote unicharset file /tmp/eng-2018-12-21.cdc/eng.unicharset
[vie dic 21 04:21:09 CET 2018] /usr/local/bin/set_unicharset_properties -U /tmp/eng-2018-12-21.cdc/eng.unicharset -O /tmp/eng-2018-12-21.cdc/eng.unicharset -X /tmp/eng-2018-12-21.cdc/eng.xheights --script_dir=/home/alberto/tesseract/tessdata/langdata
Loaded unicharset of size 15 from file /tmp/eng-2018-12-21.cdc/eng.unicharset
Setting unichar properties
Setting script properties
Writing unicharset to file /tmp/eng-2018-12-21.cdc/eng.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=/home/alberto/tesseract/tessdata_best
[vie dic 21 04:21:09 CET 2018] /usr/local/bin/tesseract /tmp/eng-2018-12-21.cdc/eng.Montserrat.exp0.tif /tmp/eng-2018-12-21.cdc/eng.Montserrat.exp0 --psm 6 lstm.train
[vie dic 21 04:21:09 CET 2018] /usr/local/bin/tesseract /tmp/eng-2018-12-21.cdc/eng.Roboto_Medium.exp0.tif /tmp/eng-2018-12-21.cdc/eng.Roboto_Medium.exp0 --psm 6 lstm.train
read_params_file: Can't open lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v4.0.0-115-ge3a3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.0.0-115-ge3a3 with Leptonica
Page 1
Page 2
Page 2
Page 3
Page 3
Page 4
Page 4
Page 5
Page 5
Page 6
Page 6
Page 7
Page 7
Page 8
Page 8
Page 9
Page 9
ERROR: /tmp/eng-2018-12-21.cdc/eng.Montserrat.exp0.lstmf does not exist or is not readable

Many thanks for your patience, much appreciated

carmonasl commented 5 years ago

When I run the makedata script with the original eng.traineddata it works. I've run the second script with the eng.traineddata from tessdata_best and it has generated my new traineddata!

Is there anything wrong with using one file for each script?

Shreeshrii commented 5 years ago

The error was:

read_params_file: Can't open lstm.train

The lstm.train config file is probably under tessdata/configs.
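To see which tessdata directory actually has the lstm.train config, something like the following can be used. It is only a sketch: find_lstm_train is a hypothetical helper, the two directories are the ones that appear earlier in this thread, and the suggested copy of the configs folder is an assumption about one possible fix, not something confirmed here.

```shell
#!/bin/bash
# Hypothetical helper: prints the path of configs/lstm.train inside a
# tessdata directory, or returns non-zero if it is not there.
find_lstm_train() {
    local dir="$1"
    if [ -f "$dir/configs/lstm.train" ]; then
        echo "$dir/configs/lstm.train"
    else
        return 1
    fi
}

# Directories taken from the thread; adjust to your own layout.
ORIG_DIR="${ORIG_DIR:-$HOME/tesseract/tessdata}"
BEST_DIR="${BEST_DIR:-$HOME/tesseract/tessdata_best}"

if find_lstm_train "$BEST_DIR" >/dev/null; then
    echo "lstm.train is present in $BEST_DIR/configs"
elif find_lstm_train "$ORIG_DIR" >/dev/null; then
    # Assumed fix: make the configs folder available next to the
    # tessdata_best eng.traineddata as well.
    echo "lstm.train found only in $ORIG_DIR/configs"
    echo "Consider: cp -r \"$ORIG_DIR/configs\" \"$BEST_DIR/\""
else
    echo "lstm.train not found in either directory"
fi
```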


carmonasl commented 5 years ago

You are awesome man thank you very much!

lamchun1110 commented 5 years ago

Hi, I'm new to Tesseract. Do you have any ideas on how to make a traineddata file with digits and slash?

Shreeshrii commented 5 years ago

@lamchun1110

I have uploaded 2 new traineddata files, which have digits and slash as well as some other punctuation.

See:
digits_layer.traineddata - replace layer for digits with various punctuation marks
digitsall_layer.traineddata - replace top layer - digits with punctuation - more fonts