Closed jaddoughman closed 5 years ago
Thanks for the feedback. I did the fine-tuning as a series of tests. It seems some versions of the files were useful to some people. I haven't done any accuracy tests.
What is the reason behind the accuracy drop? Doesn't fine-tuning train on top of the original traineddata provided by Tesseract?
It should improve accuracy, according to the tutorial notes by Ray Smith at Google, who trained the official traineddata files.
I suggest you move this discussion to the Tesseract forum to get better responses from the Tesseract developers.
Environment:
tesseract 4.0.0
leptonica-1.76.0
libjpeg 9c : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE
Platform:
Darwin Kernel Version 18.2.0 ; RELEASE_X86_64 x86_64
Current Behavior:
Note: Your fine-tuned Arabic model also decreases the accuracy.
Tesseract 4.0 with the best ara.traineddata file recalls about 85% of my data, which is pretty good. I'm attempting to improve on that with the Fine Tuning for Impact approach. My training data consists of text line images, so I used the GitHub project OCR-D Train to generate the .box and .lstmf files required for training. I then trained Tesseract on a couple of lines for 400 iterations, but the transcription generated by the fine-tuned model looks a lot like "ل.َ1ح*جُ ح( .َو!ة.اع5 ّة'عآة'ا ن'جة.!ع. ”.َئءؤئجآ| ن!.5ل". I exhausted the other options by training with max_iterations set to 0 and a low target_error_rate, but the results were similar.
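For reference, the fine-tuning steps described above can be sketched as the three commands from the Tesseract 4 training tutorial. This is a minimal sketch, not the exact invocation used here: the function name `fine_tune_commands` and all file paths are hypothetical, and it assumes the character set is unchanged, so the same ara.traineddata that the LSTM model is extracted from is also passed to `--traineddata`. The code only builds the argument lists; it does not run anything.

```python
from pathlib import PurePosixPath


def fine_tune_commands(base_traineddata, train_listfile, out_dir, max_iterations=400):
    """Build (but do not run) the commands for fine-tuning an existing model.

    All paths are hypothetical placeholders. base_traineddata should be a
    float model from tessdata_best; the integer "fast" models cannot be
    used as a starting point for training.
    """
    base = PurePosixPath(base_traineddata)
    out = PurePosixPath(out_dir)
    lstm = out / (base.stem + ".lstm")   # extracted LSTM component
    prefix = out / "finetuned"           # checkpoint name prefix

    # 1. Extract the LSTM model from the existing traineddata.
    extract = ["combine_tessdata", "-e", str(base), str(lstm)]

    # 2. Continue training from the extracted model on the .lstmf files
    #    listed (one path per line) in train_listfile.
    train = [
        "lstmtraining",
        "--continue_from", str(lstm),
        "--traineddata", str(base),
        "--train_listfile", str(train_listfile),
        "--model_output", str(prefix),
        "--max_iterations", str(max_iterations),
    ]

    # 3. Convert the last checkpoint into a usable .traineddata file.
    finalize = [
        "lstmtraining", "--stop_training",
        "--continue_from", str(prefix) + "_checkpoint",
        "--traineddata", str(base),
        "--model_output", str(out / "finetuned.traineddata"),
    ]
    return [extract, train, finalize]


if __name__ == "__main__":
    for cmd in fine_tune_commands("tessdata_best/ara.traineddata",
                                  "data/list.train", "out"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files. If the training text introduces characters that are not in the original unicharset, the tutorial describes a different procedure, which this sketch does not cover.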
The transcription generated by the new model is attached: Fine Tuned.txt
The transcription generated by the original Arabic model is attached: Arabic Trained Model.txt
The fine-tuned model is attached: test1.traineddata.zip
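To put a number on the difference between the two transcriptions, one option is the character error rate (CER) against a ground-truth text: the Levenshtein edit distance divided by the length of the reference. The following is a minimal sketch of that metric; the function names are illustrative and it is not the evaluation tooling shipped with Tesseract.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Running `char_error_rate` with the same ground-truth line as `reference` and each model's output in turn as `hypothesis` gives two comparable scores; lower is better, and a value near or above 1.0 indicates output that is essentially unrelated to the reference.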
I also attempted to train from scratch using 4000 text line images, but that was not enough data to make a difference, and training from scratch doesn't seem logical when your trained model already recalls more than 80% of my data.
A sample of my training data, including the .box and .lstmf files, is attached: training data.zip