Closed: mrocr closed this issue 5 years ago
@mrocr That's not possible at the moment. What's your use case? Is it to save space?
@chris1010010 My main goal is to convert PDF files into PAGE XML so that I can use them to train P2PaLA. Converting PDF to PAGE XML would let us create ground truth for training quickly.
Waiting for your reply
It would be great if you could support ALTO version 3.x, since that is the version the pdfalto converter uses.
Hi, I'm still trying to understand the problem. Does P2PaLA not work when there are paragraphs or words? You can remove word objects from PAGE using the Windows PageConverter tool: https://www.primaresearch.org/tools/PAGEConverterValidator. Paragraphs are not so easy, because text lines always need a parent region.
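If the Windows tool is not an option, removing word objects can also be scripted. The following is only a rough sketch in Python using the standard library, not how the PageConverter tool does it. It assumes word objects are `Word` elements nested directly under `TextLine` elements, and it matches elements by local name so that the PAGE schema version in the namespace does not matter. The `SAMPLE` document and the helper names `local_name` and `strip_words` are made up for illustration, and the sample is a reduced fragment, not a schema-valid PAGE file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced PAGE fragment (Coords etc. omitted).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1"><TextEquiv><Unicode>Hello</Unicode></TextEquiv></Word>
        <Word id="w2"><TextEquiv><Unicode>world</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Return the tag name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def strip_words(root):
    """Remove all Word children from every TextLine; return how many were removed."""
    removed = 0
    # Materialise the element list first so the tree is not modified while iterating.
    for line in list(root.iter()):
        if local_name(line.tag) != "TextLine":
            continue
        for child in list(line):
            if local_name(child.tag) == "Word":
                line.remove(child)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
print("Word elements removed:", strip_words(root))
```

The text regions stay in place, since text lines always need a parent region; only the word level is dropped, and the line-level `TextEquiv` is kept. When writing the result back with `ElementTree`, calling `ET.register_namespace("", ...)` with the file's own namespace URI beforehand avoids the default namespace being rewritten with an `ns0:` prefix.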
ALTO 3.x should work. If it doesn't, please send me an example.
@chris1010010 Is there a way to convert a .hocr file to PAGE XML so that the output contains only text lines, without paragraphs or words?