DayBreakZhang / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Enhancement: please provide PDF-to-PDF OCR #1486

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Take a PDF file obtained by scanning a document. Many copy machines can now
produce PDF output directly.
2. Try to run tesseract on it.

What is the expected output? What do you see instead?

Tesseract fails because the input format is unsupported. It would be nice for
tesseract to read PDF files directly.

What version of the product are you using? On what operating system?

3.03 RC on linux

Please provide any additional information below.

Clearly, one can work around the lack of PDF input support by converting the
input to TIFF before starting tesseract, e.g. with ImageMagick. However, this
often degrades the image, because the conversion has to pick a rendering
resolution while the actual resolution of the image embedded in the PDF file
is generally unknown. It would be better to have this managed internally by
tesseract. Furthermore, some copy machines produce PDFs in which each page
contains multiple images, because the copier segments the page and stores
different parts of it at different resolutions. In these cases it would be
nice for tesseract to use every last bit of the available resolution.
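
To make the resolution concern concrete, here is a minimal sketch of the workaround, not anything tesseract provides. It assumes the pixel size of each embedded image and the length it covers on the page (in PDF points, 72 per inch) are already known, for example from a PDF inspection tool. The helper names `effective_dpi`, `page_dpi`, `rasterize_command` and `ocr_command` are hypothetical, and the file names are placeholders. The idea is to rasterize at the highest embedded resolution so the conversion does not downsample any part of the page.

```python
POINTS_PER_INCH = 72.0  # default PDF user-space units per inch


def effective_dpi(pixels, extent_points):
    """Resolution of an embedded image: its pixel count along one axis
    divided by the length (in inches) it occupies on the page."""
    if extent_points <= 0:
        raise ValueError("extent_points must be positive")
    return pixels * POINTS_PER_INCH / extent_points


def page_dpi(images):
    """Highest resolution among the images on a page.

    `images` is a list of (pixels, extent_points) pairs, one per
    embedded image (assumed to be measured along the same axis)."""
    return max(effective_dpi(px, pts) for px, pts in images)


def rasterize_command(pdf_path, tiff_path, dpi):
    """ImageMagick command line that renders the PDF at `dpi`.
    (ImageMagick 7 installs the same tool as `magick`.)"""
    return ["convert", "-density", str(round(dpi)), pdf_path, tiff_path]


def ocr_command(tiff_path, output_base):
    """tesseract command line using the `pdf` config, which writes a
    searchable PDF to `<output_base>.pdf`."""
    return ["tesseract", tiff_path, output_base, "pdf"]


if __name__ == "__main__":
    # Hypothetical page, 8.5 in (612 pt) wide, holding a 300 dpi text
    # layer (2550 px) and a 150 dpi background layer (1275 px).
    dpi = page_dpi([(2550, 612.0), (1275, 612.0)])
    print(" ".join(rasterize_command("scan.pdf", "scan.tif", dpi)))
    print(" ".join(ocr_command("scan.tif", "scan_ocr")))
```

The script only prints the two command lines; running them requires ImageMagick (with Ghostscript for PDF input) and tesseract to be installed. Rendering at the maximum resolution keeps the sharpest layer intact, at the cost of upsampling the lower-resolution parts of the page.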

Original issue reported on code.google.com by sergio.c...@gmail.com on 10 Jun 2015 at 9:46

GoogleCodeExporter commented 9 years ago
This is out of scope for tesseract.
Tesseract takes images as input, not documents (pdf, docx, odt...).

Original comment by zde...@gmail.com on 10 Jun 2015 at 5:49