Fine Tuning is leading to completely wrong transcription

jaddoughman commented 5 years ago

Environment:

tesseract 4.0.0
leptonica-1.76.0
libjpeg 9c : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE

Platform:

Darwin Kernel Version 18.2.0 ; RELEASE_X86_64 x86_64

Current Behavior:

Note: Your Arabic fine-tuned model is also decreasing the accuracy.

Tesseract 4.0 using the best ara.traineddata file recalls about 85% of my data, which is pretty good. I'm attempting to improve on that by fine-tuning Tesseract for impact. Since my training data consists of text line images, I used the GitHub project OCR-D Train to generate the .box and .lstmf files required for training. I then trained Tesseract on a couple of lines for 400 iterations, but the transcription generated by the fine-tuned model looks a lot like "ل.َ1ح*جُ ح( .َو!ة.اع5 ّة'عآة'ا ن'جة.!ع. ”.َئءؤئجآ| ن!.5ل". I tried the remaining options as well, training with max_iterations set to 0 and a low target_error_rate, but the results were similar.
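The exact commands used are not shown in this thread. As a point of comparison, here is a minimal sketch of the two-step fine-tuning invocation described in the Tesseract 4 training tutorial, written as a Python helper that only builds the command lines. The paths, the list file name, and the helper name `fine_tune_commands` are hypothetical; the output prefix `test1` is borrowed from the attachment name below.

```python
from pathlib import Path


def fine_tune_commands(base_traineddata, work_dir, list_file, max_iterations=400):
    """Build (but do not run) the commands for LSTM fine-tuning.

    All paths are illustrative assumptions, not the reporter's real layout.
    """
    work_dir = Path(work_dir)
    base_lstm = work_dir / (Path(base_traineddata).stem + ".lstm")

    # Step 1: extract the LSTM model from the existing "best" traineddata.
    extract = ["combine_tessdata", "-e", str(base_traineddata), str(base_lstm)]

    # Step 2: continue training from that model on the new .lstmf files.
    # --traineddata here is the ORIGINAL file, which assumes the character
    # set is unchanged by the new training data.
    train = [
        "lstmtraining",
        "--continue_from", str(base_lstm),
        "--traineddata", str(base_traineddata),
        "--train_listfile", str(list_file),
        "--model_output", str(work_dir / "test1"),
        "--max_iterations", str(max_iterations),
    ]
    return extract, train


if __name__ == "__main__":
    commands = fine_tune_commands(
        "tessdata_best/ara.traineddata", "output", "output/list.train"
    )
    for cmd in commands:
        print(" ".join(cmd))
```

If the training pipeline produced its own traineddata with a different character set, the tutorial's variant for adding characters passes that new file as `--traineddata` and the original as `--old_traineddata`. Whether that applies here cannot be determined from the thread.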

The transcription generated by the new model can be found below (Fine Tuned.txt): Fine Tuned.txt

The transcription generated by the original Arabic model can be found below (Arabic Trained Model.txt): Arabic Trained Model.txt

The fine tuned model can be found below (test1.traineddata): test1.traineddata.zip

I also attempted to train from scratch using 4000 text line images, but that was not enough data to make a difference. Training from scratch also doesn't seem logical when your trained model already recalls more than 80% of my data.

A sample of my training data which includes the .box and .lstmf is attached below: training data.zip

Shreeshrii commented 5 years ago

Thanks for the feedback. I did the fine-tuning as a series of tests. It seems some versions of the files were useful to some people. I haven't done any accuracy tests.
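For anyone who wants to put a number on the difference between two models, one common measure is the character error rate (CER): the edit distance between the OCR output and the ground truth, divided by the ground-truth length. A minimal sketch in plain Python follows; the function names are illustrative and not part of Tesseract.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ground_truth: str, ocr_output: str) -> float:
    """CER = edit distance / number of ground-truth characters.

    Characters are Unicode code points, so Arabic diacritics written as
    combining marks each count as one character.
    """
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)
```

Running `char_error_rate` on the same ground-truth line against the output of the original model and of the fine-tuned model gives two directly comparable figures; lower is better, and values above 1.0 are possible when the output is much longer than the ground truth.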

jaddoughman commented 5 years ago

What is the reason behind the accuracy drop? Doesn't fine-tuning continue training from the original traineddata provided by Tesseract?

Shreeshrii commented 5 years ago

It should provide improved accuracy, according to the tutorial notes by Ray Smith at Google, who trained the official traineddata files.

I suggest you keep this discussion in the Tesseract forum to get better responses from the Tesseract developers.

Shreeshrii / tessdata_shreetest