Shreeshrii / tessdata_shreetest

finetuned traineddata files for tesseract 4.0.0 for testing

Failed loading language #5

Open jandier opened 5 years ago

jandier commented 5 years ago

Hello Shreeshrii, first of all, many thanks for your support and help. I come across your name a lot... I'm new to Tesseract and trained data, and I'm exploring its possibilities. I want to recognize numbers. First I used eng.traineddata: no errors, but the accuracy is not good enough. Then I ran into your digits.traineddata. When I replace "eng" with "digits", using your traineddata, I get some errors.

My code:

var tesseract:G8Tesseract = G8Tesseract(language: "digits")

Errors:

Failed loading language 'digits'
Tesseract couldn't load any languages!
2019-04-14 10:32:30.388507+0200 tes4[1335:25368] ERROR! Can't init Tesseract engine.

I'm using Tesseract 4.0.0 on iOS. The digits.traineddata file is in my tessdata folder, next to my eng.traineddata file. Other traineddata files such as "digits1" and "digits_comma" do not work either. I hope you can help me by pointing out the steps to take. Thank you!

Shreeshrii commented 5 years ago

Have you tried them from command line?

You need to use --oem 1 since it only has the LSTM model not the legacy model.

Shreeshrii commented 5 years ago

What is the output of the following?

tesseract -v
tesseract --list-langs

Shreeshrii commented 5 years ago

ubuntu@tesseract-ocr:~$ cd tessdata_best

ubuntu@tesseract-ocr:~/tessdata_best$ wget https://github.com/Shreeshrii/tessdata_shreetest/raw/master/digits.traineddata
...
2019-04-14 10:29:40 (134 MB/s) - ‘digits.traineddata’ saved [11293175/11293175]

ubuntu@tesseract-ocr:~/tessdata_best$ cd ../TEST

ubuntu@tesseract-ocr:~/TEST$ tesseract NUM2.png - -l digits
33109
94027
33480
94301
10577
19035
90067
02493

jandier commented 5 years ago

Thank you for your quick response. I think I'm missing something. I don't understand what you mean by "You need to use --oem 1...".

I have installed Tesseract through the command line (Terminal) and put digits.traineddata in the tessdata folder (something like /usr/local/Cellar/tesseract/4.0.0_1/share/tessdata/ ...).

tesseract -v gives me: tesseract 4.0.0
tesseract --list-langs gives me: List of available languages (4): digits, eng, osd, snum

When running tesseract NUM2.png - -l digits (replacing NUM2 with the name of my file), it sometimes gives me "Empty page!", or just white space, or something like "Warning: Invalid resolution 0 dpi. Using 70 instead. Estimating resolution as 477". But no results corresponding to the number you see in the picture.

Actually, I use Tesseract in an Xcode project. I imported (drag and drop) the traineddata into the tessdata folder in my Xcode project, like I did with eng.traineddata. "eng" gives me no errors; "digits" gives me the error mentioned above. The version of Tesseract is 4.0.0 (output in my debug console). The digits.traineddata file is 11.3 MB when downloaded. My code in viewController.swift:

import UIKit
import TesseractOCR

class ViewController: UIViewController, G8TesseractDelegate {
    // Fails with "Failed loading language 'digits'" when no engine mode is given
    var tesseract: G8Tesseract = G8Tesseract(language: "digits")

    override func viewDidLoad() {
        super.viewDidLoad()
        // Do any additional setup after loading the view, typically from a nib.
        let version = G8Tesseract.version()
        print(version ?? "somenumber")
    }
}

So I am stuck at this point: "eng" works, but the results are not very accurate, depending on the picture. So I was looking for better trained data for numbers, and when using "digits" I get the error "Failed loading language ....". I even retried after installing Tesseract in Terminal. Any suggestions? Thank you!

Shreeshrii commented 5 years ago

Please share a couple of images for testing.

Shreeshrii commented 5 years ago

Also try the following, substituting your image name:

tesseract NUM.png - -l digits --oem 1 --psm 6

jandier commented 5 years ago

When using --oem 1 --psm 6, it works with these examples:

nr0 nr2 nr3 nr6

nr10bisc nr10c

nr13

jandier commented 5 years ago

These examples do not work:

nr1 nr4 nr5 nr6bis nr7 nr8 nr9 nr10 nr10bis nr11

nr11c2

nr16

nr16c

nrmessi

nrmessic

Shreeshrii commented 5 years ago

So, the traineddata file is being loaded, as is evident from the command-line usage.

I am not familiar with iOS/Xcode, so I cannot help with the error you get there (Failed loading language 'digits'). You may need to set OcrEngineMode to OEM_LSTM_ONLY.

Ref: https://github.com/tesseract-ocr/tesseract/blob/b1e305f38c7771f483cf314d6b5552cdf6222978/src/ccstruct/publictypes.h#L268

Regarding non-recognition of single digits, it seems to be a problem with tesseract engine rather than traineddata. I will open an issue there with link to your posts here.

I suggest that you post in the tesseract forum - https://groups.google.com/forum/#!forum/tesseract-ocr for others to provide input.

On Mon, Apr 15, 2019 at 1:36 AM jandier notifications@github.com wrote:

these examples do not work

[image: nr1] https://user-images.githubusercontent.com/32553400/56098508-71c73f00-5f01-11e9-9fcf-7fb8448f3a57.png
[image: nr4] https://user-images.githubusercontent.com/32553400/56098509-71c73f00-5f01-11e9-93d9-38c9c92b5dbf.png
[image: nr5] https://user-images.githubusercontent.com/32553400/56098510-71c73f00-5f01-11e9-85ba-ce056029542d.png
[image: nr6bis] https://user-images.githubusercontent.com/32553400/56098511-725fd580-5f01-11e9-81b5-dda513b2cff1.jpg
[image: nr7] https://user-images.githubusercontent.com/32553400/56098512-725fd580-5f01-11e9-9a87-2d754990e56d.png
[image: nr8] https://user-images.githubusercontent.com/32553400/56098513-725fd580-5f01-11e9-8527-77ed18fe3473.png
[image: nr9] https://user-images.githubusercontent.com/32553400/56098514-725fd580-5f01-11e9-850e-2e00c5b03df2.png
[image: nr10] https://user-images.githubusercontent.com/32553400/56098515-725fd580-5f01-11e9-8c31-246eedfc4e23.jpg
[image: nr10bis] https://user-images.githubusercontent.com/32553400/56098516-72f86c00-5f01-11e9-9d6a-b55616b31393.jpg
[image: nr11] https://user-images.githubusercontent.com/32553400/56098517-72f86c00-5f01-11e9-8650-c528d03739f5.jpg
[image: nr11c2] https://user-images.githubusercontent.com/32553400/56098518-72f86c00-5f01-11e9-9246-71d567f57aa7.png
[image: nr16] https://user-images.githubusercontent.com/32553400/56098519-72f86c00-5f01-11e9-8287-f10a17bb5ac1.jpg
[image: nr16c] https://user-images.githubusercontent.com/32553400/56098520-72f86c00-5f01-11e9-99a8-eb9bdef251fa.png
[image: nrmessi] https://user-images.githubusercontent.com/32553400/56098521-72f86c00-5f01-11e9-9582-e9972385ce9e.jpg
[image: nrmessic] https://user-images.githubusercontent.com/32553400/56098522-73910280-5f01-11e9-82bc-d7a3cea9f90b.png


jandier commented 5 years ago

Hi Shreeshrii, setting OcrEngineMode to OEM_LSTM_ONLY did the trick. In iOS/Xcode I did:

var tesseract:G8Tesseract = G8Tesseract(language: "digits", engineMode: G8OCREngineMode.lstmOnly)

No more 'Failed loading language' errors. Thanks!

What do you mean by 'seems to be a problem with tesseract engine rather than traineddata'? Doesn't the engine need the traineddata file to better recognise the images to be processed? Some extra information would be much appreciated (I'm a newbie and still learning).

Besides 'ImproveQuality' of the image, do you have any suggestions for steps I can take to get much better results in number recognition?

About the issue you opened (single digits not getting recognized #2389): I see that you have output "with digits config". How do you get there? Thank you.

Shreeshrii commented 5 years ago

> Doesn't the engine need the traineddata file to better recognise the images to be processed?

Both work in tandem. The problem of small images not being recognized in the default page segmentation mode (psm) occurs across languages, so my guess is that it may be related to a minimum image size in the code.

> better results in number recognition

Try resizing the image so that the height of the number to be recognized is about 36 pixels, and see if that improves recognition.
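As a rough sketch of the resizing suggestion above, the scale factor can be worked out from the current digit height. The `scale_percent` helper is hypothetical (not part of Tesseract), the digit height has to be measured by hand, and the commented-out commands assume ImageMagick's `convert` is installed and use placeholder file names.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): given the current height of the
# digits in pixels, print the resize percentage that brings them to a target
# height. The target defaults to 36 px, the value suggested in this thread.
# Integer arithmetic, rounded to the nearest whole percent.
scale_percent() {
    current=$1
    target=${2:-36}
    echo $(( (target * 100 + current / 2) / current ))
}

# Example: digits that are 12 px tall would be scaled to 300%.
# Assumes ImageMagick is installed; NUM.png is a placeholder file name.
# convert NUM.png -resize "$(scale_percent 12)%" NUM_resized.png
# tesseract NUM_resized.png - -l digits --oem 1 --psm 6
```

Whether 36 pixels is the best target for a given image is something to verify by experiment; the helper only does the arithmetic.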

> digits config

https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/configs/digits

It restricts recognition to digits plus the characters - and .

You can use these config variables either directly on the command line or via a config file (Xcode may have a way to use them). The whitelist had not been working in LSTM mode until recently, so you need to use the latest code from GitHub.

See the following for an example of how to use it:

https://stackoverflow.com/questions/4944830/how-to-make-tesseract-to-recognize-only-numbers-when-they-are-mixed-with-letter

Shreeshrii commented 5 years ago

tesseract NUM.png - -l eng --oem 1 --psm 6 digits

tesseract NUM.png - -l eng --oem 1 --psm 6 -c tessedit_char_whitelist=0123456789
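The same whitelist can also be kept in a config file of your own and passed as the last argument, just like the stock digits config. The sketch below assumes a hypothetical file name, `my_digits`, and a placeholder image name; as noted earlier in the thread, the whitelist only takes effect in LSTM mode with a recent build.

```shell
#!/bin/sh
# Write a one-line config file. Each line has the form "<variable> <value>",
# the same format as the stock tessdata/configs/digits file.
# "my_digits" is a hypothetical name chosen for this example.
config_file=${TMPDIR:-/tmp}/my_digits
printf 'tessedit_char_whitelist %s\n' '0123456789' > "$config_file"

# Pass the path to the config file as the last argument.
# NUM.png is a placeholder image name.
# tesseract NUM.png - -l eng --oem 1 --psm 6 "$config_file"
```

A config file is convenient once more than one variable is involved, since each extra variable is just another line in the file.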

Shreeshrii commented 5 years ago

Also see https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc

--psm 8 or --psm 10 may work better for only numbers.
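To compare page segmentation modes on one image, a small loop like the following may help. It is only a sketch: `try_psm_modes` and the `TESSERACT` override are hypothetical conveniences, and NUM.png is a placeholder image name.

```shell
#!/bin/sh
# Run the same image through several page segmentation modes and label each
# result: 6 = single uniform block of text, 8 = single word,
# 10 = single character.
# TESSERACT may be overridden (for example with "echo" for a dry run);
# it defaults to the tesseract binary on PATH.
# Warnings on stderr (such as the resolution estimate) are discarded.
try_psm_modes() {
    image=$1
    for psm in 6 8 10; do
        printf 'psm %s: ' "$psm"
        "${TESSERACT:-tesseract}" "$image" - -l digits --oem 1 --psm "$psm" 2>/dev/null
    done
}

# Usage:
# try_psm_modes NUM.png
```

Mode 10 only makes sense for images that contain a single digit, and mode 8 for a single number.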