PRImA-Research-Lab / prima-page-converter

Command line tool to convert page layout files to the latest PAGE XML format. It supports all previous versions of the PAGE format as well as ALTO XML, FineReader XML, and HOCR
Apache License 2.0

PDF to Page-xml #5

Closed by mrocr 5 years ago

mrocr commented 5 years ago

Would you consider adding the ability to convert a PDF file to a PAGE XML file?

chris1010010 commented 5 years ago

A conversion would not be easy, and at the moment we don't have the resources for it. The only possible way is to convert the PDF to an image and run TesseractToPage (on our website) to OCR it.
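
For anyone wanting to try the image-then-OCR route locally, here is a rough sketch of how the steps might be chained. It is not the TesseractToPage tool mentioned above: it swaps in the plain `tesseract` command line with hOCR output and then hands that hOCR file to this repository's converter. The helper name `build_commands`, the file naming scheme, and the jar name `PageConverter.jar` are made up for illustration, and the `-source-xml` / `-target-xml` options are an assumption that should be checked against the README before use. The function only builds the command lines; it does not run anything.

```python
import shlex
from pathlib import Path


def build_commands(pdf, workdir, page=1, lang="eng", dpi=300,
                   converter_jar="PageConverter.jar"):
    """Build (but do not run) the commands for one PDF page.

    Hypothetical helper: rasterise the page, OCR it to hOCR, then
    convert the hOCR file to PAGE XML.
    """
    # Common base path for all intermediate files, e.g. out/scan-p0001
    base = str(Path(workdir) / f"{Path(pdf).stem}-p{page:04d}")

    # 1. Rasterise a single page with poppler's pdftoppm -> <base>.png
    rasterise = ["pdftoppm", "-f", str(page), "-l", str(page),
                 "-singlefile", "-r", str(dpi), "-png", str(pdf), base]

    # 2. OCR the image with Tesseract, writing <base>.hocr
    ocr = ["tesseract", base + ".png", base, "-l", lang, "hocr"]

    # 3. Convert hOCR to PAGE XML (option names are an assumption,
    #    verify them against the converter's README)
    convert = ["java", "-jar", converter_jar,
               "-source-xml", base + ".hocr",
               "-target-xml", base + ".xml"]

    return [rasterise, ocr, convert]


if __name__ == "__main__":
    for cmd in build_commands("scan.pdf", "out", page=1):
        print(shlex.join(cmd))
```

To actually execute the steps, each list could be passed to `subprocess.run(cmd, check=True)` in order, provided `pdftoppm`, `tesseract`, and Java are installed and the output directory exists. As with any OCR-based route, the embedded PDF text is ignored and the result is only as good as the recognition.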

mrocr commented 5 years ago

Thanks for your consideration

kba commented 4 years ago

https://github.com/PRImA-Research-Lab/prima-page-to-pdf ?

chris1010010 commented 4 years ago

That's the opposite ;-)

kba commented 4 years ago

> That's the opposite ;-)

True, but wouldn't this be a possible workflow:

pdftoimages
[preprocessing]
ocrd-tesserocr-recognize
prima-page-to-pdf to sandwich text and results?

Really curious, haven't had the time yet to deal with this, but it's certainly a desired feature for many users. Many libraries also offer a bulk PDF download which is easier to scrape than the mets.xml (if users even know about that option).

chris1010010 commented 4 years ago

Yes, that's possible of course (see my entry from May). But you throw away a lot of information (the text), and the results will only be as good as Tesseract.

BobLd commented 4 years ago

Maybe have a look at the PageXmlTextExporter class in PdfPig (in C#). See the wiki for more info. It's still an early version but might help...