OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

Line detection via ocrd: PAGE XML is overwritten, result is empty #46

Closed wrznr closed 5 years ago

wrznr commented 6 years ago

Using

ocrd process -m ocrd-assets/dist/mets.xml characterize/exif segment-region/tesserocr

works great (i.e. metadata and regions show up in the output PAGE XML). However, adding line detection

ocrd process -m ocrd-assets/dist/mets.xml characterize/exif segment-region/tesserocr segment-line/tesserocr

results in “empty” XML:

<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15/pagecontent.xsd">
        <Page imageFileName="http://localhost:5001/00000005.tif">
        </Page>
</PcGts>

wrznr commented 6 years ago

The problem seems to be

page = OcrdPage.from_file(self.workspace.download_file(input_file))

in https://github.com/OCR-D/pyocrd/blob/master/ocrd/processor/segment_line/tesserocr.py#20. It constructs a new, empty page instance for the image, so the regions written by the previous step are lost, while it should fall back on the existing PAGE XML.
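
To illustrate the suspected difference, here is a minimal standalone sketch using only the standard library. It is not the pyocrd code: `new_page`, `load_page` and `add_lines` are hypothetical helpers, and the "line detection" is a placeholder that adds one `TextLine` per `TextRegion`. It only shows why starting from a fresh page yields no lines, while reusing the previous step's PAGE XML keeps the regions.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15"


def new_page(image_url):
    """Build a fresh, empty PAGE document for an image (the suspected buggy path)."""
    root = ET.Element(f"{{{PAGE_NS}}}PcGts")
    ET.SubElement(root, f"{{{PAGE_NS}}}Page", {"imageFileName": image_url})
    return root


def load_page(existing_xml, image_url):
    """Hypothetical helper: reuse the previous step's PAGE XML when there is one."""
    if existing_xml is not None:
        return ET.fromstring(existing_xml)
    return new_page(image_url)


def add_lines(page_root):
    """Placeholder for line segmentation: one TextLine per existing TextRegion.

    Returns the number of regions that received a line.
    """
    regions = page_root.findall(f".//{{{PAGE_NS}}}TextRegion")
    for region in regions:
        line_id = region.get("id", "r") + "_l0"
        ET.SubElement(region, f"{{{PAGE_NS}}}TextLine", {"id": line_id})
    return len(regions)


IMAGE_URL = "http://localhost:5001/00000005.tif"

# Made-up output of a region segmentation step, for illustration only.
previous = (
    f'<PcGts xmlns="{PAGE_NS}">'
    f'<Page imageFileName="{IMAGE_URL}">'
    '<TextRegion id="r0"/><TextRegion id="r1"/>'
    "</Page></PcGts>"
)

# Starting from a new page: there are no regions, so no lines can be added.
lines_added_buggy = add_lines(new_page(IMAGE_URL))

# Reusing the existing PAGE XML: every region from the previous step gets a line.
lines_added_fixed = add_lines(load_page(previous, IMAGE_URL))
```

If this is indeed the cause, the fix in the processor would be to parse the PAGE XML input file where one exists and only create a new page when the input is a bare image.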

kba commented 5 years ago

This issue is now six months old and the process subcommand has been outdated for a while, so it is hard to reproduce right now. But the bug is still there; we should fix it once #199 is in and add a test to make sure it stays fixed.