Closed stweil closed 4 years ago
The error occurred with the standard workflow and urn:nbn:de:bsz:180-digad-8419.
The error still occurs with revision 974459e064c56eff2a5483571a93b7262e98b902.
Since Tesseract only gives us bboxes here, the invalid polygon must be from the region. I need to know the exact workflow – what do you mean by standard workflow?
Also, this might be another instance of "won't fix because PAGE coordinates must be correct on the input side" (we cannot make all processors robust to all sorts of coordinate invalidities/inconsistencies). So be prepared to wait for a fix in the page segmenter instead...
"Standard" means one of the workflows suggested at https://ocr-d.de/en/workflows. I use this script:
#!/bin/bash
set -x
set -e
export LANG=C.UTF-8
URN=urn:nbn:de:bsz:180-digad-8419
METS=https://digi.bib.uni-mannheim.de/mets/$URN
date --iso-8601=seconds
time -p ocrd workspace --directory $URN clone $METS
cd $URN
time -p ocrd process \
"olena-binarize -I MAX -O OCR-D-BIN -P impl sauvola" \
"anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
"olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P impl kim" \
"cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
"tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
"tesserocr-segment-region -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG" \
"segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true" \
"cis-ocropy-deskew -I OCR-D-SEG-REPAIR -O OCR-D-SEG-REG-DESKEW -P level-of-operation region" \
"cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP -P level-of-operation region" \
"tesserocr-segment-line -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE" \
"tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model fast/Fraktur_50000000.334_450937" \
"fileformat-transform -I OCR-D-OCR-TESS -O FULLTEXT -P from-to \"page alto\"" \
"fileformat-transform -I OCR-D-OCR-TESS -O OCR-D-OCR-TEXT -P from-to \"page text\""
date --iso-8601=seconds
Thanks @stweil for the neatly encapsulated script. Unfortunately though, I cannot reproduce the problem. Which versions of ocrd_anybaseocr, ocrd_cis and ocrd_segment have you been running?
I used latest ocrd_all with ocrd_tesserocr updated to latest git release. A fresh run reproduced the problem ... All data is available here.
I have tried again with (Dockerized) OCR-D/ocrd_all@dd35c37 (built at 2020-08-28T18:02:22Z) and ocrd_tesserocr 5761661 (that's your 974459e plus the release commit) – it runs smoothly.
Perhaps it's an effect of differences between Ubuntu 18.04 (Docker, my host) and Debian (your host) in Shapely's base libraries?
Can you compare the generated files on your side with my data (see link above) to see where they differ?
I'll repeat the test as soon as @kba has finished a new ocrd_all release.
The error still occurs. Tested with ocrd_all branch OCR-D/update-2020-09-07 on Debian buster.
BTW, your script cannot have worked like that on the previous ocrd_all release (based on core 2.15), because that was not able to cope with OAI-PMH responses. And it does not work verbatim with the current version either, because you output to FULLTEXT at the end, but that already exists after ocrd workspace clone. Also, for ocrd process, I wonder how you avoid OCR-D/core#589 (I have to use OCR-D/core#594).
Can you compare the generated files on your side with my data (see link above) to see where they differ?
Unfortunately, I have no permissions for your mets.xml. I can download the fileGrp directories (if I ignore robots.txt), though. Looks like your Olena already has slightly different results (barely visible differences), followed by slight (1-2 pixel) differences in the cropping and (below 1°) deskewing. That might explain why the error is not triggered on my host and on the Docker release.
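For anyone repeating this comparison between two downloaded copies of a workspace, a minimal sketch using only the Python standard library might look like the following. The function name compare_workspaces and the directory layout are assumptions for illustration, not part of OCR-D. It only reports which files differ byte-wise; judging 1-2 pixel differences in images still needs an image diff tool.

```python
import filecmp
from pathlib import Path


def compare_workspaces(dir_a, dir_b):
    """Recursively list files that differ or exist on one side only.

    Returns a sorted list of (status, relative_path) tuples.
    Hypothetical helper: the name and return format are illustrative.
    """
    result = []

    def walk(cmp, prefix):
        # Compare file contents byte by byte (shallow=False), not just stat info.
        _, mismatch, errors = filecmp.cmpfiles(
            cmp.left, cmp.right, cmp.common_files, shallow=False)
        for name in mismatch:
            result.append(("differs", (prefix / name).as_posix()))
        for name in errors:
            result.append(("error", (prefix / name).as_posix()))
        for name in cmp.left_only:
            result.append(("only in A", (prefix / name).as_posix()))
        for name in cmp.right_only:
            result.append(("only in B", (prefix / name).as_posix()))
        # Descend into directories present on both sides (e.g. fileGrp folders).
        for name, sub in cmp.subdirs.items():
            walk(sub, prefix / name)

    walk(filecmp.dircmp(str(dir_a), str(dir_b)), Path("."))
    return sorted(result)
```

Calling compare_workspaces("mine/urn", "theirs/urn") (both paths hypothetical) would then show the first fileGrp at which the two runs start to diverge.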
Unfortunately, I have no permissions for your mets.xml.
I am sorry. That's a known problem (see https://github.com/OCR-D/core/issues/403). Access should work now.
And it does not work verbatim with the current version either, because you output to FULLTEXT at the end, but that already exists after ocrd workspace clone.
I have created FULLTEXT with a different OCR process in the meantime, so the script needs a slight update (either write to a different file group, remove the old FULLTEXT, or simply omit that processor).
@bertsky, I get the same error on another host with Debian bullseye and a local build of Python 3.7.9 for a different book using this script:
#!/bin/bash
set -x
set -e
export LANG=C.UTF-8
PPN=PPN1024726142
METS=http://gei-digital.gei.de/viewer/metsresolver?id=$PPN
date --iso-8601=seconds
time -p ocrd workspace --directory $PPN clone $METS
cd $PPN
time -p ocrd process \
"olena-binarize -I MAX -O OCR-D-BIN -P impl sauvola" \
"anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
"olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P impl kim" \
"cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
"tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
"tesserocr-segment-region -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG" \
"segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true" \
"cis-ocropy-deskew -I OCR-D-SEG-REPAIR -O OCR-D-SEG-REG-DESKEW -P level-of-operation region" \
"cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP -P level-of-operation region" \
"tesserocr-segment-line -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE" \
"tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model fast/Fraktur_50000000.334_450937" \
"fileformat-transform -I OCR-D-OCR-TESS -O FULLTEXT2 -P from-to \"page alto\"" \
"fileformat-transform -I OCR-D-OCR-TESS -O OCR-D-OCR-TEXT -P from-to \"page text\""
date --iso-8601=seconds
@stweil could you please repeat from tesserocr-segment-region onwards – after pulling #152 and https://github.com/OCR-D/ocrd_segment/pull/43 (perhaps using --overwrite on the same workspace)?
Here is the result from a fresh run:
10:45:23.889 INFO processor.TesserocrSegmentRegion - Detected region 'region0006': 174,1285 955,1356 946,1463 165,1392 (FLOWING_TEXT)
Traceback (most recent call last):
File "/home/stweil/src/github/OCR-D/venv-20200912/bin/ocrd-tesserocr-segment-region", line 8, in <module>
sys.exit(ocrd_tesserocr_segment_region())
File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/cli.py", line 16, in ocrd_tesserocr_segment_region
return ocrd_cli_wrap_processor(TesserocrSegmentRegion, *args, **kwargs)
File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 69, in run_processor
processor.process()
File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 173, in process
self._process_page(layout, page, page_image, page_coords, input_file.pageId)
File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 223, in _process_page
polygon2 = polygon_for_parent(polygon, page)
File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 360, in polygon_for_parent
interp = asPolygon(np.round(interp.exterior.coords))
NameError: name 'np' is not defined
NameError: name 'np' is not defined
Does "import numpy as np" fix that? Could be, I was too thorough in cleaning up imports in the last round of refactoring...
Sorry, I had forgotten to include that change in the commit.
But the ordeal is not over yet: there is still one case I can see that can fail – when a polygon is invalid but simplification does not change anything, regardless of the tolerance level. (I have to detect that and re-order the point sequence...)
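To make "invalid" concrete: a common reason for a polygon outline to be invalid is that two non-adjacent edges cross, which typically happens when the point sequence is in the wrong order. The following is a minimal, dependency-free sketch of that check. It is not the code from the pull request and not Shapely's is_valid (which covers more cases); the function names are made up for illustration, and edges that merely touch or overlap collinearly are deliberately not reported.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def _segments_cross(p1, p2, q1, q2):
    """True if segments p1-p2 and q1-q2 properly cross at an interior point."""
    return (_orient(p1, p2, q1) * _orient(p1, p2, q2) < 0
            and _orient(q1, q2, p1) * _orient(q1, q2, p2) < 0)


def has_self_intersection(points):
    """Check a closed outline given as a list of (x, y) tuples.

    The last point is implicitly joined to the first one.
    """
    n = len(points)
    edges = [(points[i], points[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Neighbouring edges share an endpoint by construction; skip them.
            if j == i + 1 or (i == 0 and j == n - 1):
                continue
            if _segments_cross(*edges[i], *edges[j]):
                return True
    return False


# Outline of region0006 from the log above: a slightly rotated rectangle.
region = [(174, 1285), (955, 1356), (946, 1463), (165, 1392)]
# The same four points with two of them swapped form a "bow tie".
bowtie = [(174, 1285), (946, 1463), (955, 1356), (165, 1392)]
```

With these inputs, has_self_intersection(region) is False and has_self_intersection(bowtie) is True. The bow tie has only four points, so there is nothing for a simplification step to remove, whatever the tolerance.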
But the ordeal is not over yet: there is still one case I can see that can fail – when a polygon is invalid but simplification does not change anything, regardless of the tolerance level. (I have to detect that and re-order the point sequence...)
https://github.com/OCR-D/ocrd_tesserocr/pull/152/commits/6bbe873d7eb21f68cc649d98731f9209093d18be should suffice.
The workflow for PPN1024726142 now passes - nearly. There is a new problem when creating the ALTO files which is caused by a negative x coordinate. See issue #153 for more details.
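A quick way to spot such coordinates before the ALTO conversion is to scan the points strings for values outside the page. The sketch below assumes the "x,y x,y ..." format shown in the log output above, and treats 0 and the page width or height themselves as still valid; both the helper names and that boundary convention are assumptions, not taken from issue #153.

```python
def parse_points(points):
    """Parse a points string such as '174,1285 955,1356' into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def out_of_bounds(points, width, height):
    """Return the points lying outside the rectangle [0, width] x [0, height]."""
    return [(x, y) for x, y in parse_points(points)
            if not (0 <= x <= width and 0 <= y <= height)]
```

For example, out_of_bounds("-3,10 50,10 50,40 -3,40", 100, 100) returns the two points with the negative x value, while the region0006 outline from the log yields an empty list for any page at least 955 by 1463 pixels in size.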
This issue was fixed in the latest code.