Open GoogleCodeExporter opened 9 years ago
Issue 1279 has been merged into this issue.
Original comment by zde...@gmail.com
on 22 Apr 2015 at 8:17
First: I don't think Tesseract is going to be particularly suitable for OMR;
for one thing, OMR systems usually have a staff line removal process that
Tesseract doesn't have. You might have better luck with OpenOMR
(https://sourceforge.net/projects/openomr/) or Audiveris
(https://audiveris.kenai.com/)
Second: I'm not sure what the significance of Joined and Broken are, but I
think they need to be there. I created a traineddata file last week, and
couldn't proceed without them.
Original comment by joregan
on 13 May 2015 at 4:26
1) It was not my intention to (mis)use Tesseract for OMR tasks. Our project -
Audiveris - uses Tesseract for recognizing textual items. Musical scores often
contain text strings with musical symbols inside. In the attached example there
is a quarter note in the middle of a string. Other text strings containing
musical symbols are often guitar chords, repeat indications etc.
Currently, running Tesseract on images containing musical symbols produces
wrong characters. I would like to fix it by adding recognition of musical
symbols to the OCR engine.
2) Could someone kindly explain to me what these "Joined" and "Broken"
indications mean? Is it an error or expected behaviour? I wasn't able to find
any documentation. It looks like I need to dig deeply into the (mostly
undocumented) source code.
My interpretation is that Tesseract OCR currently supports only a small subset
of the Unicode charset. The musical symbols range seems not to be supported,
hence these "Joined" and "Broken" words.
Thanks in advance for your clarification.
Max
Original comment by maximums...@googlemail.com
on 13 May 2015 at 8:37
Aaah, ok. Years of seeing the weird things people ask about on the mailing list
have made me a little skeptical, I guess :)
They are special characters for internal use. In ccutil/unicharset.cpp, there's:
// List of strings for the SpecialUnicharCodes. Keep in sync with the enum.
const char* UNICHARSET::kSpecialUnicharCodes[SPECIAL_UNICHAR_CODES_COUNT] = {
    " ",
    "Joined",
    "|Broken|0|1"
};
but I can't see anything in particular beyond that. I've asked Ray, hopefully
he'll get a chance to answer.
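In the meantime, if you want to confirm that a unicharset you generated still carries those reserved entries, here is a minimal sketch. It assumes the text layout I have seen in generated unicharset files (a count on the first line, then one entry per line with the unichar string as the first whitespace-delimited field). The function names are made up for illustration and are not part of Tesseract.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not a Tesseract API): returns the first field of every
// entry line, skipping the first line, which is assumed to hold the entry count.
std::vector<std::string> unichar_names(const std::string& unicharset_text) {
    std::istringstream in(unicharset_text);
    std::string line;
    std::vector<std::string> names;
    bool first_line = true;
    while (std::getline(in, line)) {
        if (first_line) {  // assumed count line, not an entry
            first_line = false;
            continue;
        }
        std::istringstream fields(line);
        std::string name;
        if (fields >> name) names.push_back(name);  // blank lines are ignored
    }
    return names;
}

// Hypothetical helper: true only if both reserved names from
// kSpecialUnicharCodes appear among the entries.
bool has_special_entries(const std::string& unicharset_text) {
    bool joined = false;
    bool broken = false;
    for (const std::string& name : unichar_names(unicharset_text)) {
        if (name == "Joined") joined = true;
        if (name == "|Broken|0|1") broken = true;
    }
    return joined && broken;
}
```

Read the file into a string and pass it to has_special_entries(); a false result would suggest the entries were stripped somewhere along the way.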
If I were to hazard a guess -- and please, bear in mind that it's just a guess
-- I would say that Joined is probably for the case of letters that are smudged
(to not have to have ligatures for every combination), and Broken|0|1 is maybe
to have a placeholder when most of a letter is faded.
Tesseract does only support a small subset of Unicode; the aim is to get good
coverage for a particular language, though bearing in mind that foreign words
(such as names) do appear. It helps to cut down on a lot of ambiguities to not
have Cyrillic characters for a language that uses Latin letters and vice versa,
for example.
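The same reasoning applies to music notation: those characters sit in their own Unicode ranges, so a unicharset built for an ordinary language won't include them unless you add them yourself. As a rough sketch (the helper name is my own invention, and it assumes the text has already been decoded to code points), this is one way to flag such characters in training text:

```cpp
// Hypothetical helper, not part of Tesseract: reports whether a code point is
// a music notation character in one of two Unicode ranges.
bool is_music_notation(char32_t cp) {
    // Musical Symbols block, U+1D100..U+1D1FF (e.g. U+1D15F, quarter note).
    const bool musical_symbols_block = cp >= 0x1D100 && cp <= 0x1D1FF;
    // Notes and accidentals in Miscellaneous Symbols, U+2669..U+266F
    // (quarter note through sharp sign).
    const bool misc_symbols_notes = cp >= 0x2669 && cp <= 0x266F;
    return musical_symbols_block || misc_symbols_notes;
}
```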
Aside from the special characters mentioned here, there's very little
hard-coded character treatment, and it's growing smaller all the time.
The thing that will really have an impact on the results is that your
unicharset uses the default kerning information. If you can locate a good set
of fonts that feature these characters, we can extract better information from
them, and that will give better results.
Original comment by joregan
on 13 May 2015 at 10:36
Original issue reported on code.google.com by
maximums...@googlemail.com
on 22 Feb 2015 at 11:28