Open remon-nashid opened 11 years ago
Hi, the first error seems to be "kidkosmic.kidkosmic.exp0.tif convert: no decode delegate for this image". It seems to me that this is an ImageMagick error. I'd try googling it to confirm.
One other thing: TesseractTrainer was initially written for v3.0.1. I do not know if the same training process still applies. If not, feel free to fork it and contribute if you feel up to it. I'll try to find some time to update the project, and to change the ridiculous file names.
Cheers
On 23 Dec 2012 at 12:12, "remon-georgy" notifications@github.com wrote:
Thanks for this handy tool, it's really helpful except that I couldn't get it to work :).
I'm trying to train Tesseract with a new English font called KidKosmic with the following command
`--training-text eng.kidkosmic.exp0 --font-path kidkosmic.ttf --font-name kidkosmic --font-properties font_properties --verbose`
And here is the output
```
Generating individual tif image page0.tif
Generating multipage-tif kidkosmic.kidkosmic.exp0.tif
convert: no decode delegate for this image format `page0.tif' @ error/constitute.c/ReadImage/550.
convert: no images defined `kidkosmic.kidkosmic.exp0.tif' @ error/convert.c/ConvertImageCommand/3078.
Removing all individual tif images
Generating boxfile kidkosmic.kidkosmic.exp0.box
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Cannot open input file: kidkosmic.kidkosmic.exp0.tif
Extracting unicharset from kidkosmic.kidkosmic.exp0.box
Wrote unicharset file ./unicharset.
Warning: No shape table file present: shapetable
Reading kidkosmic.kidkosmic.exp0.tr ...
Error: Unable to open kidkosmic.kidkosmic.exp0.tr!
signal_termination_handler:Error:Signal_termination_handler called:Code 3000
Reading kidkosmic.kidkosmic.exp0.tr ...
Error: Unable to open kidkosmic.kidkosmic.exp0.tr!
signal_termination_handler:Error:Signal_termination_handler called:Code 3000
Traceback (most recent call last):
  File "../TesseractTrainer/__main__.py", line 50, in <module>
    trainer.training() # generate a multipage tif from args.training_text, train on it and generate a traineddata file
  File "[home]bin/TesseractTrainer/lib/tesseract_training.py", line 155, in training
    self._rename_files()
  File "[home]bin/TesseractTrainer/lib/tesseract_training.py", line 131, in _rename_files
    os.rename('%s' % (generated_file), '%s.%s' % (self.dictionary_name, generated_file))
OSError: [Errno 2] No such file or directory
```
Any clues?
Fyi, I'm running the script on Mac OS 10.8 with the dependencies installed.
Thanks for your reply! Yes, the first error is an ImageMagick one; however, I do get plenty of additional errors after fixing it :) I agree with you that it is a compatibility issue with Tesseract 3.x where x > 0.
Tesseract 3.02 has introduced a new clustering command: shapeclustering (see https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Clustering).
It seems to be important, as the following message appears in your traceback:
Warning: No shape table file present: shapetable
I'll add an automatic version check, and if tesseract >= 3.02, then the shapeclustering command will be executed.
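For readers following along, a version check of this kind could look roughly like the sketch below. It is not the code that ended up in TesseractTrainer; the function names are hypothetical, and it assumes the version banner contains a dotted number such as the "v3.02.02" shown in the output above.

```python
import re
import subprocess


def parse_tesseract_version(text):
    """Return the first dotted version number found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.groups() if part is not None)


def needs_shapeclustering(version):
    """shapeclustering was introduced with tesseract 3.02."""
    return version >= (3, 2)


def installed_tesseract_version():
    """Ask the installed tesseract binary for its version (hypothetical helper)."""
    # Some builds print the banner on stderr, so both streams are searched.
    result = subprocess.run(["tesseract", "-v"], capture_output=True, text=True)
    return parse_tesseract_version(result.stdout + result.stderr)
```

Comparing tuples rather than strings avoids surprises such as "3.10" sorting before "3.2".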
Stay tuned :)
It seems that we'll have to wait a little longer for 3.02 support.
I've added the shapeclustering command and automatic checking of the tesseract version, but tesseract 3.02 fails to perform the blob ↔ coordinates match.
All I get is a super-long error log looking like this
APPLY_BOXES: boxfile line 28/a ((421,580),(446,551)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 29/w ((446,580),(471,551)): FAILURE! Couldn't find a matching blob
I've found reports of people experiencing the same behaviour with 3.02 and tried to contribute. See here .
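When the log is that long, it can help to summarise which box file lines fail. The following is only a sketch that assumes every failure line has exactly the shape of the two lines quoted above; the function name and the dictionary keys are made up for illustration.

```python
import re

# Matches lines such as:
# APPLY_BOXES: boxfile line 28/a ((421,580),(446,551)): FAILURE! Couldn't find a matching blob
FAILURE_RE = re.compile(
    r"APPLY_BOXES: boxfile line (\d+)/(\S+) "
    r"\(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE! (.*)"
)


def parse_apply_boxes_failures(log_text):
    """Collect the box file line, character, coordinates and reason of each failure."""
    failures = []
    for line in log_text.splitlines():
        match = FAILURE_RE.search(line)
        if match is None:
            continue  # e.g. the bare "FAIL!" lines
        line_no, char, x1, y1, x2, y2, reason = match.groups()
        failures.append({
            "line": int(line_no),
            "char": char,
            "coords": ((int(x1), int(y1)), (int(x2), int(y2))),
            "reason": reason.strip(),
        })
    return failures
```

If every line of the box file shows up in the result, the problem is more likely systematic (box geometry or image format) than tied to individual characters.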
As this bug is a pure tesseract one, I hope you understand I cannot guarantee when I'll be able to support tesseract 3.02.
As an alternative solution, I suggest you fall back on tesseract 3.01, which seems to work fairly well with TesseractTrainer.
Hi! Do you have any deadline for supporting tesseract 3.02? I did try to compile 3.01, but it doesn't compile anymore (out-of-the-box, at least), on latest dists (Ubuntu 12.10). What do you suggest: try hard to compile 3.01 or wait for 3.02 support (I can't be of any help with the last one, sorry... :-)
Well, I believe that the aforementioned errors (couldn't find matching blob...etc) originate from using training text with very, very long words (words that can't be wrapped in one line), and they have nothing to do with the Tesseract version.
I don't believe so... My text is 26 letters, double spaced... And the author himself suggests "Couldn't find a matching blob" errors are pure "tesseract ones"...
@marcolino: you are welcome to test your belief with evidence ;-): https://code.google.com/p/tesseract-ocr/issues/detail?id=698#c16
Which evidence are you talking about? Of course I did read that page, but I'm saying my problem is not the same as the one described on comment 16. This is the tif (transformed to jpeg for size limitations) from my text file, and, as you can see, there are no possible overlaps:
I made two statements: 1) "tesseract-ocr 3.01 doesn't compile on Ubuntu 12.10 (fresh install + build-essentials + tesseract-ocr + libleptonica-dev)" 2) the problem described on comment 16 does not apply to my situation, since my text is 26 single-letter words
TesseractTrainer author made one statement (in the comment of 2012-12-27 09:31:08 in this thread): "As this bug [...couldn't find a matching blob...] is a pure tesseract one, I hope you understand I cannot guarantee when I'll be able to support tesseract 3.02."
remon-georgy said: "aforementioned errors (couldn't find matching blob...etc) are originating from using training text with very very long words"
Please, be specific, or don't be... :-)
Hi, I sadly currently do not have any time to spend on TesseractTrainer, which explains my slow responses and bug fixes.
About v3.02: As you've both read https://code.google.com/p/tesseract-ocr/issues/detail?id=698#c16, you've seen that it was suggested to increase the resolution (>72 DPI) or increase the inter-character spacing. I tried to generate a 300 DPI tiff, by multiplying all metrics by 4.16 and setting "300 DPI" in the tif metadata, using ImageMagick, but it did not help.
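The factor of 4.16 is consistent with scaling from a 72 DPI baseline to 300 DPI (300 / 72 ≈ 4.17). Assuming that baseline, the arithmetic can be sketched as follows; the helper names are hypothetical and this is not the code used in the experiment described above.

```python
def dpi_scale(target_dpi, source_dpi=72):
    """Factor by which pixel metrics must grow to go from source_dpi to target_dpi."""
    return target_dpi / source_dpi


def scale_box(box, factor):
    """Scale a tuple of pixel coordinates, rounding to the nearest whole pixel."""
    return tuple(int(round(value * factor)) for value in box)


factor = dpi_scale(300)  # roughly 4.17, assuming a 72 DPI baseline
```

Font size, page dimensions and box coordinates all have to be scaled by the same factor, otherwise the boxes no longer line up with the rendered glyphs.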
@marcolino your example tif suggests that increasing the inter character spacing does not have any effect either.
The only solution I can offer now is trying to compile tesseract 3.01. I recall that you needed leptonica-lib to compile it. Maybe it is not shipped anymore (wild undocumented guess here)?
Thanks for your reports.
B
@marcolino: I wrote about evidence that "Couldn't find a matching blob" errors are pure "tesseract ones"... This is not true, at least for Latin-script inputs (the situation for hieroglyphic, Arabic and Asian scripts is different IMO). Whenever I tested reported issues, it always turned out that the problem was in: a) a wrong box file, or b) the input image (not following tesseract requirements).
If you post your files somewhere (image -> try to use a 2-color png ;-) & box file), I can analyse them and hopefully offer you some suggestions. Version 3.02 is not (so much ;-) ) sensitive to spacing. BTW: you are aware that 26 single letters do not meet the requirements, right?
@BaltoRouberol: The root of the problem in 698#c16 is not in the DPI, but in the boxes. DPI is just a minor issue IMO. You have to be aware that tesseract will convert (binarize) images to 2 colours, and then run training. Maybe if you visualize "your" box file and tesseract's, you can see what makes the difference.
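To compare two box files, it helps to parse them first. The sketch below assumes the usual Tesseract 3 box file layout of one character per line followed by left, bottom, right, top and page number, with the origin at the bottom-left corner of the image; the names `Box`, `parse_box_line` and `to_top_left_rect` are made up for illustration.

```python
from collections import namedtuple

Box = namedtuple("Box", "char left bottom right top page")


def parse_box_line(line):
    """Parse one box file line of the form '<char> <left> <bottom> <right> <top> <page>'."""
    parts = line.split()
    char = parts[0]
    left, bottom, right, top, page = (int(part) for part in parts[1:6])
    return Box(char, left, bottom, right, top, page)


def to_top_left_rect(box, image_height):
    """Convert to (left, upper, right, lower) with the origin at the top-left,
    which is the convention most image libraries use for drawing rectangles."""
    return (box.left, image_height - box.top, box.right, image_height - box.bottom)
```

Drawing the rectangles from both box files onto the same image, in two colours, makes any difference in height or offset visible at a glance.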
@BaltoRouberol: no problem for your slow response, of course... I'll go through the "3.01" solution. You are right, libleptonica-dev is not shipped with latest default ubuntu, but it's as far as an "apt-get install libleptonica-dev"... The problem is tesseract-ocr 3.01 doesn't compile anymore with latest system libraries (while 3.02 does); I didn't deeply investigate, but suspect some structure change in some system library... I hope I will be able to "port back" just the changed portions from 3.02 to 3.01 to build it successfully...
@zdenop: thanks for your support... I'm not an OCR expert... My goal is to digitize as well as I can a bunch of old books (really old and precious Italian books :-)... So I'm trying to automate the training process with TesseractTrainer... I did hope to be able not to "dirty my hands" with box files and input images, but just to:
1) somehow identify the fonts (most books use the same font) with the help - for example - of some online resource like "www.myfonts.com/WhatTheFont/"
2) scan the books with a professional book scanner
3) process the scanned images with a trained tesseract
4) enjoy... :-)
Now I see I have to dig into the internals of the training process... But I am just starting now, so please excuse my ignorance...
So, to answer your requests, my box file and input image are produced for me by TesseractTrainer (before it fails).
I just provide a text file (I'm aware that 26 single letters do not meet requirements, but changing to the suggested minimal text ("The (quick) brown {fox} jumps! over the $3,456.78
I try to post here all the data I use:
I hope it's enough... :-) Please let me know if I can help you some way while investigating this issue...
Thanks again for your interest, everybody!
@marcolino: the problem is (in) the box file. I posted a correct one on pastebin (it will expire in one month). I suggest you compare it with your version, e.g. in kdiff3. I created it with tesseract and I just needed to correct one "1" to "l" and an m-dash to a minus. Are you sure you need to run training if there is such a result? The experience of Tesseract users is that users are not able to create language data as good as what Google produced for the supported languages (e.g. training is reasonable only for an uncommon font like Fraktur). Instead of training, it makes sense to focus on input image quality and image preprocessing.
@BaltoRouberol: The problem with TesseractTrainer is that PIL.ImageFont always returns the same height for different chars ('T', 'g', '.', 'x'). This is not correct. Tesseract 3.02 requires that each box in the box file be the rectangle of the char only, without empty space. I think you are not able to create such a box file with PIL.
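To illustrate what "rectangle of the char only" means, here is a minimal, library-independent sketch that computes the tight bounding box of the inked pixels in a small binary grid. It says nothing about what PIL can or cannot do, and the function name is hypothetical.

```python
def tight_bbox(pixels):
    """Return (left, top, right, bottom) enclosing all non-zero pixels,
    with right and bottom exclusive, or None if the grid is blank."""
    rows = [y for y, row in enumerate(pixels) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(pixels[0])) if any(row[x] for row in pixels)]
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


# A tiny glyph drawn inside a larger cell: the box shrinks to the ink only.
glyph = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
```

With boxes computed this way, a '.' and a 'T' rendered in cells of the same size get very different heights, which is the behaviour described above as required by 3.02.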
Thanks, zdenop. You are right, I'm not sure I need training... What I miss is the understanding of the kind of work I have to do to perform OCR on many books with different fonts: should I build a box file for each font I have? And, you say, I should "focus on input image quality and image preprocessing": do you know of any (open source) tool to preprocess images for better ocr processing? Thanks again!
@marcolino: this is off-topic for this issue. I suggest you post an example image and ask on the tesseract user forum for suggestions. In my opinion, scantailor is the most comprehensive free software for this (with a simple user interface). You should not expect a 100% result (even commercial OCR will not provide it).
@zdenop That's very interesting, thanks for your input. I guess that would mean that the whole tif+boxfile generation would have to be re-written using another Image Processing tool (eg: ImageMagick).
See http://www.imagemagick.org/Usage/text/#font_info
At this point, I would be happy to assist anyone willing to fork TesseractTrainer and fix this issue, but I feel I currently do not have the time to fix this (and I'm really sorry about that).
Thanks again!
B
FYI: For anyone looking for further information on this (someone interested in forking the project, perhaps?), another post was made in the Tesseract bug listing related to this particular issue.
https://code.google.com/p/tesseract-ocr/issues/detail?id=698#c17