OCR-D / page-to-alto

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)
Apache License 2.0

failure on recursive regions #7

Closed (bertsky closed this issue 3 years ago)

bertsky commented 3 years ago
Traceback (most recent call last):
  File "/data/ocr-d/ocrd_all/venv/bin/page-to-alto", line 8, in <module>
    sys.exit(main())
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/ocrd_page_to_alto/cli.py", line 31, in main
    converter.convert()
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/ocrd_page_to_alto/convert.py", line 140, in convert
    self.convert_reading_order()
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/ocrd_page_to_alto/convert.py", line 152, in convert_reading_order
    self.alto_printspace.find('.//*[@ID="%s"]' % id_cur).set('IDNEXT', id_next)
AttributeError: 'NoneType' object has no attribute 'set'

bertsky commented 3 years ago

Sorry, empty pages are not the problem. This happens on recursive structures:

<TextRegion id="r0">
  <TextRegion id="r1">...</TextRegion>
</TextRegion>

I guess the state in alto_printspace at that point does not yet contain the full sequence that page_page.getAllRegions() provides.

On the one hand, it's good that we see a failure if regions get lost. But then how do we fix the recursion?
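To illustrate the failure mode, here is a minimal sketch using only the standard library. It assumes the ALTO side ended up without the nested region while the reading order still lists it. The element names, the `reading_order` list, and the helper `link_reading_order` are hypothetical stand-ins, not the converter's actual code.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for alto_printspace: the nested region r1 was never added.
printspace = ET.fromstring(
    '<PrintSpace><TextBlock ID="r0"/><TextBlock ID="r2"/></PrintSpace>'
)

# Hypothetical reading order as reported from the PAGE side, nested region included.
reading_order = ["r0", "r1", "r2"]


def link_reading_order(root, ids):
    """Set IDNEXT on each element that precedes another in `ids`.

    Returns the list of IDs that could not be found, instead of
    calling .set() on None as in the traceback above.
    """
    missing = []
    for id_cur, id_next in zip(ids, ids[1:]):
        elem = root.find('.//*[@ID="%s"]' % id_cur)
        if elem is None:
            # find() returns None when no element matches the ID
            missing.append(id_cur)
            continue
        elem.set('IDNEXT', id_next)
    return missing


missing = link_reading_order(printspace, reading_order)
if missing:
    print("regions missing from the ALTO output:", missing)
```

Reporting the missing IDs keeps the "fail loudly if regions get lost" behaviour, since the caller can still raise an error, but with a message that names the lost regions.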

bertsky commented 3 years ago

> I guess the state in alto_printspace at that point does not yet contain the full sequence that page_page.getAllRegions() provides.

Could that simply be depth 1 vs 0?

https://github.com/kba/page-to-alto/blob/4f1e93d0f2173a4f8865090e60f8a8697d715131/ocrd_page_to_alto/convert.py#L150

https://github.com/kba/page-to-alto/blob/4f1e93d0f2173a4f8865090e60f8a8697d715131/ocrd_page_to_alto/convert.py#L286

bertsky commented 3 years ago

> Could that simply be depth 1 vs 0?

Indeed!
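
For readers following along, a rough sketch of how a depth mismatch can produce two different region sequences for the same page. The function `all_region_ids` is hypothetical, and the convention that `depth=0` means "no limit" while `depth=1` means "top level only" is an assumption made for this sketch, not a statement about the converter's code at the two linked lines.

```python
import xml.etree.ElementTree as ET

# Simplified PAGE-like structure with one nested region (no namespaces).
page = ET.fromstring(
    '<Page>'
    '<TextRegion id="r0"><TextRegion id="r1"/></TextRegion>'
    '<TextRegion id="r2"/>'
    '</Page>'
)


def all_region_ids(parent, depth=0, _level=1):
    """Collect region ids in document order.

    depth=0 recurses without limit (assumed convention);
    depth=n stops after n nesting levels.
    """
    ids = []
    for region in parent.findall('TextRegion'):
        ids.append(region.get('id'))
        if depth == 0 or _level < depth:
            ids.extend(all_region_ids(region, depth, _level + 1))
    return ids


top_level_only = all_region_ids(page, depth=1)
all_levels = all_region_ids(page, depth=0)

# IDs present in one traversal but not the other are the ones a later lookup cannot find.
lost = [i for i in all_levels if i not in top_level_only]
print(top_level_only, all_levels, lost)
```

If one step of the conversion walks the regions with one depth and another step uses the other, the second sequence contains IDs the first never created, which matches the `NoneType` error in the traceback.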