PRImA-Research-Lab / prima-page-converter

Command-line tool to convert page layout files to the latest PAGE XML format. It supports all previous versions of the PAGE format as well as ALTO XML, FineReader XML, and hOCR.
Apache License 2.0

Convert hocr to page-xml with only line elements #6

Closed mrocr closed 5 years ago

mrocr commented 5 years ago

@chris1010010 Is there a way to convert a .hocr file to PAGE XML so that the output contains only text lines, without paragraphs or words?

chris1010010 commented 5 years ago

@mrocr That's not possible at the moment. What's your use case? Is it to save space?

mrocr commented 5 years ago

@chris1010010 My main goal is to convert PDF files into PAGE XML so that I can use them to train P2PaLA. Converting PDF to PAGE XML would let us create ground truth for training quickly.

Waiting for your reply

mrocr commented 5 years ago

If you could support ALTO version 3.x, that would be great, since that is the version the pdfalto converter produces.

chris1010010 commented 5 years ago

Hi, I'm still trying to understand the problem. Does P2PaLA not work when there are paragraphs or words? You can remove word objects from PAGE using the Windows PageConverter tool: https://www.primaresearch.org/tools/PAGEConverterValidator. Paragraphs are not so easy, because text lines always need a parent region.
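
For anyone who cannot run the Windows tool, removing word objects can also be approximated with a small post-processing script. The following is only a rough sketch, not part of prima-page-converter: the function name `strip_words` is made up for illustration, and it assumes the usual PAGE layout in which `Word` elements are direct children of `TextLine`. Text regions are left in place, since text lines need a parent region.

```python
import xml.etree.ElementTree as ET


def strip_words(page_xml: str) -> str:
    """Return PAGE XML with all Word children of TextLine elements removed.

    Hypothetical helper for illustration only; TextRegion and TextLine
    elements (and the line-level TextEquiv) are kept untouched.
    """
    root = ET.fromstring(page_xml)

    # The PAGE namespace differs between schema versions, so read it from
    # the root tag, which looks like "{namespace-uri}PcGts".
    ns = ""
    if root.tag.startswith("{"):
        uri = root.tag[1:root.tag.index("}")]
        ns = "{" + uri + "}"
        # Serialise with a default namespace instead of "ns0:" prefixes.
        # Note: this registration is global to ElementTree.
        ET.register_namespace("", uri)

    # Collect the lines first so the tree is not modified while iterating.
    for line in list(root.iter(ns + "TextLine")):
        for word in line.findall(ns + "Word"):
            line.remove(word)

    return ET.tostring(root, encoding="unicode")


SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="p.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1"><TextEquiv><Unicode>Hello</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Hello</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    print(strip_words(SAMPLE))
```

Whether P2PaLA accepts the result still depends on what it expects from the region level, so treat the output as a starting point and validate it against the PAGE schema.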

ALTO 3.x should work. If it doesn't, please send me an example.