Processor tesserocr-segment-line terminates with exception (TopologyException: Input geom 1 is invalid)

stweil commented 4 years ago

21:19:10.443 INFO processor.TesserocrSegmentLine - INPUT FILE 65 / phys396119
21:19:10.577 INFO processor.TesserocrSegmentLine - Page 'phys396119' images will use DPI estimated from segmentation
21:19:10.850 ERROR shapely.geos - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 0 107 at 0 107
Traceback (most recent call last):
  File "/home/stweil/src/github/OCR-D/venv-20200904/bin/ocrd-tesserocr-segment-line", line 8, in <module>
    sys.exit(ocrd_tesserocr_segment_line())
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd_tesserocr/cli.py", line 26, in ocrd_tesserocr_segment_line
    return ocrd_cli_wrap_processor(TesserocrSegmentLine, *args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 69, in run_processor
    processor.process()
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd_tesserocr/segment_line.py", line 119, in process
    interline = line_poly.intersection(region_poly)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/shapely/geometry/base.py", line 676, in intersection
    return geom_factory(self.impl['intersection'](self, other))
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/shapely/topology.py", line 70, in __call__
    self._check_topology(err, this, other)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/shapely/topology.py", line 38, in _check_topology
    self.fn.__name__, repr(geom)))
shapely.errors.TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7f89253f7c88>

stweil commented 4 years ago

The error occurred with the standard workflow and urn:nbn:de:bsz:180-digad-8419.

stweil commented 4 years ago

The error still occurs with revision 974459e064c56eff2a5483571a93b7262e98b902.

bertsky commented 4 years ago

Since Tesseract only gives us bboxes here, the invalid polygon must be from the region. I need to know the exact workflow – what do you mean by standard workflow?

Also, this might be another instance of "won't fix because PAGE coordinates must be correct on the input side" (we cannot make all processors robust to all sorts of coordinate invalidities/inconsistencies). So be prepared to wait for a fix in the page segmenter instead...
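For illustration, the kind of invalidity GEOS reports in the log (a ring that crosses itself) can be detected with a few lines of plain Python. This is only a sketch under stated assumptions: `ring_self_intersects` and its helpers are hypothetical names, not part of Shapely or ocrd_tesserocr, and the check only catches proper edge crossings, which is a subset of what Shapely's `is_valid` covers. The sample rings are made up.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1 left turn, -1 right turn, 0 collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def _segments_cross(p1, p2, q1, q2):
    """True if segments p1-p2 and q1-q2 cross at a single interior point."""
    return (_orient(p1, p2, q1) * _orient(p1, p2, q2) < 0
            and _orient(q1, q2, p1) * _orient(q1, q2, p2) < 0)


def ring_self_intersects(points):
    """Hypothetical helper: check all pairs of non-adjacent edges of a closed ring."""
    n = len(points)
    edges = [(points[i], points[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last edge share a vertex, so they are adjacent
            if _segments_cross(*edges[i], *edges[j]):
                return True
    return False


# Made-up sample data: a "bowtie" whose edges cross, and a plain square.
bowtie = [(0, 0), (10, 10), (10, 0), (0, 10)]
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(ring_self_intersects(bowtie), ring_self_intersects(square))
```

Running such a check on the region polygons before calling `intersection` would at least show which region carries the invalid outline.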

stweil commented 4 years ago

"Standard" means one of the workflows suggested at https://ocr-d.de/en/workflows. I use this script:

#!/bin/bash

set -x
set -e

export LANG=C.UTF-8

URN=urn:nbn:de:bsz:180-digad-8419
METS=https://digi.bib.uni-mannheim.de/mets/$URN

date --iso-8601=seconds

time -p ocrd workspace --directory $URN clone $METS

cd $URN

time -p ocrd process \
  "olena-binarize -I MAX -O OCR-D-BIN -P impl sauvola" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P impl kim" \
  "cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "tesserocr-segment-region -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG" \
  "segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true" \
  "cis-ocropy-deskew -I OCR-D-SEG-REPAIR -O OCR-D-SEG-REG-DESKEW -P level-of-operation region" \
  "cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP -P level-of-operation region" \
  "tesserocr-segment-line -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE" \
  "tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model fast/Fraktur_50000000.334_450937" \
  "fileformat-transform -I OCR-D-OCR-TESS -O FULLTEXT -P from-to \"page alto\"" \
  "fileformat-transform -I OCR-D-OCR-TESS -O OCR-D-OCR-TEXT -P from-to \"page text\""

date --iso-8601=seconds

bertsky commented 4 years ago

Thanks @stweil for the neatly encapsulated script. Unfortunately though, I cannot reproduce the problem. Which versions of ocrd_anybaseocr, ocrd_cis and ocrd_segment have you been running?

stweil commented 4 years ago

I used latest ocrd_all with ocrd_tesserocr updated to latest git release.

stweil commented 4 years ago

A fresh run reproduced the problem ...

All data is available here.

bertsky commented 4 years ago

> I used latest ocrd_all with ocrd_tesserocr updated to latest git release. A fresh run reproduced the problem ... All data is available here.

I have tried again with (Dockerized) OCR-D/ocrd_all@dd35c37 (built at 2020-08-28T18:02:22Z) and ocrd_tesserocr 5761661 (that's your 974459e plus the release commit) – it runs smoothly.

Perhaps it's an effect of differences between Ubuntu 18.04 (Docker, my host) and Debian (your host) in Shapely's base libraries?

stweil commented 4 years ago

Can you compare the generated files on your side with my data (see link above) to see where they differ?

stweil commented 4 years ago

I'll repeat the test as soon as @kba has finished a new ocrd_all release.

stweil commented 4 years ago

The error still occurs. Tested with ocrd_all branch OCR-D/update-2020-09-07 on Debian buster.

bertsky commented 4 years ago

BTW, your script cannot have worked like that on the previous ocrd_all release (based on core 2.15), because that was not able to cope with OAI-PMH responses. And it does not work verbatim with the current version either, because you output to FULLTEXT at the end, but that already exists after ocrd workspace clone. Also, for ocrd process, I wonder how you avoid OCR-D/core#589 (I have to use OCR-D/core#594).

> Can you compare the generated files on your side with my data (see link above) to see where they differ?

Unfortunately, I have no permissions for your mets.xml. I can download the fileGrp directories (if I ignore robots.txt), though. Looks like your Olena already has slightly different results (barely visible differences), followed by slight (1-2 pixel) differences in the cropping and (below 1°) deskewing. That might explain why the error is not triggered on my host and on the Docker release.

stweil commented 4 years ago

> Unfortunately, I have no permissions for your mets.xml.

I am sorry. That's a known problem (see https://github.com/OCR-D/core/issues/403). Access should work now.

stweil commented 4 years ago

> And it does not work verbatim with the current version either, because you output to FULLTEXT at the end, but that already exists after ocrd workspace clone.

I have created FULLTEXT with a different OCR process in the meantime, so the script needs a slight update (either write to a different file group, remove the old FULLTEXT or simply omit that processor).

stweil commented 4 years ago

@bertsky, I get the same error on another host with Debian bullseye and a local build of Python 3.7.9 for a different book using this script:

#!/bin/bash

set -x
set -e

export LANG=C.UTF-8

PPN=PPN1024726142
METS=http://gei-digital.gei.de/viewer/metsresolver?id=$PPN

date --iso-8601=seconds

time -p ocrd workspace --directory $PPN clone $METS

cd $PPN

time -p ocrd process \
  "olena-binarize -I MAX -O OCR-D-BIN -P impl sauvola" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P impl kim" \
  "cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "tesserocr-segment-region -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG" \
  "segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true" \
  "cis-ocropy-deskew -I OCR-D-SEG-REPAIR -O OCR-D-SEG-REG-DESKEW -P level-of-operation region" \
  "cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP -P level-of-operation region" \
  "tesserocr-segment-line -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE" \
  "tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model fast/Fraktur_50000000.334_450937" \
  "fileformat-transform -I OCR-D-OCR-TESS -O FULLTEXT2 -P from-to \"page alto\"" \
  "fileformat-transform -I OCR-D-OCR-TESS -O OCR-D-OCR-TEXT -P from-to \"page text\""

date --iso-8601=seconds

bertsky commented 4 years ago

@stweil could you please repeat from tesserocr-segment-region onwards – after pulling #152 and https://github.com/OCR-D/ocrd_segment/pull/43 (perhaps using --overwrite on the same workspace)?

stweil commented 4 years ago

Here is the result from a fresh run:

10:45:23.889 INFO processor.TesserocrSegmentRegion - Detected region 'region0006': 174,1285 955,1356 946,1463 165,1392 (FLOWING_TEXT)
Traceback (most recent call last):
  File "/home/stweil/src/github/OCR-D/venv-20200912/bin/ocrd-tesserocr-segment-region", line 8, in <module>
    sys.exit(ocrd_tesserocr_segment_region())
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/cli.py", line 16, in ocrd_tesserocr_segment_region
    return ocrd_cli_wrap_processor(TesserocrSegmentRegion, *args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 69, in run_processor
    processor.process()
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 173, in process
    self._process_page(layout, page, page_image, page_coords, input_file.pageId)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 223, in _process_page
    polygon2 = polygon_for_parent(polygon, page)
  File "/home/stweil/src/github/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 360, in polygon_for_parent
    interp = asPolygon(np.round(interp.exterior.coords))
NameError: name 'np' is not defined

kba commented 4 years ago

> NameError: name 'np' is not defined

Does

import numpy as np

fix that? Could be, I was too thorough in cleaning up imports in the last round of refactoring...

bertsky commented 4 years ago

Sorry, I had forgotten to include that change in the commit.

bertsky commented 4 years ago

But the ordeal is not over yet: there is still one case I can see that can fail – when a polygon is invalid but simplification does not change anything, regardless of the tolerance level. (I have to detect that and re-order the point sequence...)

bertsky commented 4 years ago

> But the ordeal is not over yet: there is still one case I can see that can fail – when a polygon is invalid but simplification does not change anything, regardless of the tolerance level. (I have to detect that and re-order the point sequence...)

https://github.com/OCR-D/ocrd_tesserocr/pull/152/commits/6bbe873d7eb21f68cc649d98731f9209093d18be should suffice.
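The failure mode described above, where no simplification tolerance helps because the points themselves are fine but are visited in a crossing order, can be illustrated with a four-point "bowtie". The sketch below re-orders the points by angle around their centroid, which yields a simple ring for convex point sets. This is an assumption made for illustration only and not necessarily what the linked commit does; `reorder_by_angle` and `signed_area` are hypothetical helpers and the coordinates are made up.

```python
import math


def signed_area(points):
    """Shoelace formula: positive for counter-clockwise rings (y axis pointing up)."""
    n = len(points)
    return 0.5 * sum(points[i][0] * points[(i + 1) % n][1]
                     - points[(i + 1) % n][0] * points[i][1]
                     for i in range(n))


def reorder_by_angle(points):
    """Hypothetical repair: sort points by their angle around the centroid."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sorted(points, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))


# Made-up bowtie: the two lobes have opposite orientation, so their areas cancel.
bowtie = [(0, 0), (10, 10), (10, 0), (0, 10)]
fixed = reorder_by_angle(bowtie)
print(signed_area(bowtie), fixed, signed_area(fixed))
```

Sorting by angle is only safe when the outline is convex (or star-shaped around the centroid); for general region outlines a different re-ordering strategy would be needed.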

stweil commented 4 years ago

The workflow for PPN1024726142 now passes - nearly. There is a new problem when creating the ALTO files which is caused by a negative x coordinate. See issue #153 for more details.
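A negative coordinate like the one mentioned above is easy to catch before serialization. As a sketch only (this is not the fix discussed in #153; `clamp_to_page` is a hypothetical helper, and the polygon and page size are made-up values), points can be clipped to the page frame:

```python
def clamp_to_page(points, width, height):
    """Hypothetical helper: clip each (x, y) to the range [0, width] x [0, height]."""
    return [(min(max(x, 0), width), min(max(y, 0), height)) for x, y in points]


# Made-up polygon with one negative x and one y beyond the page height.
polygon = [(-3, 107), (955, 1356), (946, 1463), (165, 1392)]
clamped = clamp_to_page(polygon, width=1000, height=1400)
print(clamped)
```

Clamping can collapse points onto the page border, so the result may still need a validity check afterwards.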

stweil commented 4 years ago

This issue was fixed in the latest code.

OCR-D / ocrd_tesserocr

Processor tesserocr-segment-line terminates with exception (TopologyException: Input geom 1 is invalid) #149