improving OCR to mathematical fonts

GoogleCodeExporter commented 9 years ago

Hi,

What about plugging the Detexify feature (see the website below) into
Tesseract, in order to enlarge the set of recognized symbols and to
offer the capability of creating LaTeX-based source documents:

http://detexify.kirelabs.org/classify.html

Thank you

Original issue reported on code.google.com by plutones...@gmail.com on 22 Dec 2009 at 6:33

GoogleCodeExporter commented 9 years ago

Or maybe this application:
http://www.inftyproject.org/en/software.html#InftyReader

Original comment by plutones...@gmail.com on 4 Feb 2010 at 2:35

GoogleCodeExporter commented 9 years ago

Detexify looks like it needs online (drawing) input and therefore won't work with images.
Infty doesn't seem to be open source, so that won't be of use.
I am open to hosting an intern to work on this topic.

Original comment by theraysm...@gmail.com on 20 May 2010 at 4:32

Changed state: Accepted
Added labels: Priority-Low, Type-Enhancement

GoogleCodeExporter commented 9 years ago

Yes, you are correct: Detexify works on splines taken from the drawing. Still, a
first step would be to approximate images with splines and then use the
Detexify engine.

Concerning the current Google strategy of scanning documents and putting them online,
it could be much more efficient to have a truly lightweight electronic version (with
vector-format fonts) instead of a heavy, poorly scanned document (with raster objects).
In this regard, developing a strategy capable of reconstructing LaTeX sources from a
scanned scientific document could be very powerful.

Original comment by plutones...@gmail.com on 27 May 2010 at 2:13

GoogleCodeExporter commented 9 years ago

I'm trying to train Tesseract to recognize PDF images as LaTeX code.
I think the line-based interpretation will make things a bit complicated for
formulas that span more than one line. For example, a \frac{a}{b} could also be
read as underlined text.
I hope I can find some patterns in the recognized text documents, so that I can
post-process them somehow.
It shouldn't be too difficult, because I work with PDFs that are LaTeX-generated.
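
The fraction-versus-underline ambiguity described above could be approached in post-processing roughly as follows. This is only a sketch under stated assumptions: the `Box` type, the `classify_rule` helper and the `max_gap` threshold are hypothetical names invented for illustration and are not part of Tesseract. It assumes bounding boxes for a horizontal rule and for nearby symbols are already available, with y growing upward.

```python
from typing import List, NamedTuple, Optional


class Box(NamedTuple):
    """Hypothetical bounding box; y grows upward (bottom < top)."""
    text: str
    left: int
    bottom: int
    right: int
    top: int


def overlaps_horizontally(a: Box, b: Box) -> bool:
    """True if the two boxes share some horizontal extent."""
    return a.left < b.right and b.left < a.right


def classify_rule(rule: Box, symbols: List[Box], max_gap: int = 20) -> Optional[str]:
    """Guess whether a horizontal rule is a fraction bar or an underline.

    Symbols directly above AND below the rule (within max_gap pixels)
    suggest a fraction; symbols only above suggest underlined text.
    """
    near = [s for s in symbols if overlaps_horizontally(s, rule)]
    above = sorted(
        (s for s in near if 0 <= s.bottom - rule.top <= max_gap),
        key=lambda s: s.left,
    )
    below = sorted(
        (s for s in near if 0 <= rule.bottom - s.top <= max_gap),
        key=lambda s: s.left,
    )
    upper = "".join(s.text for s in above)
    lower = "".join(s.text for s in below)
    if above and below:
        return r"\frac{" + upper + "}{" + lower + "}"
    if above:
        return r"\underline{" + upper + "}"
    return None  # The rule is something else, e.g. a minus sign.
```

The `max_gap` value would have to be tuned to the font size and resolution; too large a value would mistake the next text line for a denominator.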

If you have any suggestions for my plans, please share. Thanks.

Original comment by jammi.e...@gmail.com on 20 Apr 2012 at 1:40

GoogleCodeExporter commented 9 years ago

Hi,

My goal: recognizing PDF images as LaTeX code.
My input is clean and not rotated, so it seems to be an easy task. Given
that, I want to tell Tesseract that every black dot is part of a symbol or
letter, since there is no noise.
Is there an easy way to do that, or do I have to dig into the code?
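
One way to act on the "no noise" assumption outside of Tesseract is to treat every connected group of black pixels as a candidate glyph part. The sketch below is an illustration only, not how Tesseract segments pages: `component_boxes` is a hypothetical helper, and the input is assumed to be an already binarized image given as rows of 0/1 values, with row 0 at the top.

```python
from collections import deque
from typing import List, Sequence, Tuple


def component_boxes(bitmap: Sequence[Sequence[int]]) -> List[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) for each 8-connected blob of 1-pixels.

    Coordinates are column/row indices; row 0 is the top of the image.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited black pixel.
            seen[y][x] = True
            queue = deque([(x, y)])
            left = right = x
            top = bottom = y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes
```

Note that symbols such as "i" or "=" consist of several components, so a later step would still have to group the resulting boxes into symbols.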

It would also be interesting to know whether Tesseract recognises overlapping
boxes (in the box file), so that a mathematical root sign would be recognised
while the content under the root line is recognised independently.
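
To check where overlapping boxes actually occur in training data, the box file can be scanned for them. The sketch below assumes the usual box-file layout of one symbol per line, `symbol left bottom right top page`, with the origin at the bottom-left of the image; `parse_box_line` and `overlapping_pairs` are hypothetical helper names. It only detects overlaps and says nothing about how Tesseract's training treats them.

```python
from typing import Iterable, List, Tuple

BoxEntry = Tuple[str, int, int, int, int, int]


def parse_box_line(line: str) -> BoxEntry:
    """Parse one box-file line: symbol left bottom right top page."""
    symbol, left, bottom, right, top, page = line.strip().rsplit(" ", 5)
    return symbol, int(left), int(bottom), int(right), int(top), int(page)


def boxes_overlap(a: BoxEntry, b: BoxEntry) -> bool:
    """True if the two rectangles share some area on the same page."""
    return (a[5] == b[5]
            and a[1] < b[3] and b[1] < a[3]    # horizontal extents intersect
            and a[2] < b[4] and b[2] < a[4])   # vertical extents intersect


def overlapping_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return the symbol pairs whose boxes overlap, in file order."""
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    return [
        (a[0], b[0])
        for i, a in enumerate(boxes)
        for b in boxes[i + 1:]
        if boxes_overlap(a, b)
    ]
```

For example, a root sign whose box encloses the box of the radicand "x" would be reported as one overlapping pair, while a symbol to the right of the root would not.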

I will send my progress. Hope I'm right to post here.
Thanks.

Original comment by jammi.e...@gmail.com on 9 May 2012 at 3:24

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

What's the strategy then? Report a bug or request a new feature?

Original comment by plutones...@gmail.com on 10 May 2012 at 7:04

GoogleCodeExporter commented 9 years ago

Actually, I don't want to report a bug.
Maybe an issue about recognising subscripts and superscripts?
I want to recognise formulas and symbols that are not in UTF-8, but are in LaTeX.
Tesseract is not built for that, but I want to improve it a bit in that
direction through training.

To follow up on my last comment: I just don't know enough about the training
process to use it wisely.

Greetings

Original comment by jammi.e...@gmail.com on 14 May 2012 at 3:21

Attachments:

picturetoLaTeX.png

GoogleCodeExporter commented 9 years ago

Has there been any progress on this issue? I am interested in scanning
handwritten notes with math equations and converting them to a LaTeX file.

Original comment by maikol.s...@gmail.com on 12 Jan 2015 at 4:10

GoogleCodeExporter commented 9 years ago

Issue 1372 has been merged into this issue.

Original comment by zde...@gmail.com on 12 Apr 2015 at 3:06

akorentlab / tesseract-ocr

improving OCR to mathematical fonts #270