OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

ocrd-tesserocr-segment-region fails with AttributeError: 'LineString' object has no attribute 'exterior' #151

Closed: stweil closed this issue 4 years ago

stweil commented 4 years ago

Error message:

19:25:24.366 INFO processor.TesserocrSegmentRegion - INPUT FILE 88 / PHYS_0089
19:25:24.637 INFO processor.TesserocrSegmentRegion - Page 'PHYS_0089' images will use 400 DPI from image meta-data
19:25:24.637 INFO processor.TesserocrSegmentRegion - Detecting regions in page 'PHYS_0089'
19:25:25.936 INFO processor.TesserocrSegmentRegion - Detected region 'region0000': 899,571 2899,575 2898,932 898,928 (FLOWING_TEXT)
19:25:25.936 INFO processor.TesserocrSegmentRegion - Detected region 'region0001': 1192,943 2672,946 2672,1105 1192,1102 (FLOWING_TEXT)
19:25:25.936 INFO processor.TesserocrSegmentRegion - Detected region 'region0002': 1631,1180 2246,1182 2246,1266 1631,1264 (FLOWING_TEXT)
19:25:25.937 INFO processor.TesserocrSegmentRegion - Detected region 'region0003': 884,1333 2990,1337 2985,4141 879,4137 (FLOWING_TEXT)
Traceback (most recent call last):
  File "/OCR-D/venv-20200912/bin/ocrd-tesserocr-segment-region", line 8, in <module>
    sys.exit(ocrd_tesserocr_segment_region())
  File "/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/OCR-D/venv-20200912/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/cli.py", line 16, in ocrd_tesserocr_segment_region
    return ocrd_cli_wrap_processor(TesserocrSegmentRegion, *args, **kwargs)
  File "/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 69, in run_processor
    processor.process()
  File "/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 172, in process
    self._process_page(layout, page, page_image, page_coords, input_file.pageId)
  File "/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 222, in _process_page
    polygon2 = polygon_for_parent(polygon, page)
  File "/OCR-D/venv-20200912/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 339, in polygon_for_parent
    return interp.exterior.coords[:-1] # keep open
AttributeError: 'LineString' object has no attribute 'exterior'

stweil commented 4 years ago

Script used:

#!/bin/bash

set -x
set -e

export LANG=C.UTF-8

URN=urn:nbn:de:bsz:180-digad-35210
METS=https://digi.bib.uni-mannheim.de/mets/$URN

date --iso-8601=seconds

time -p ocrd workspace --directory $URN clone $METS

cd $URN

time -p ocrd process \
  "olena-binarize -I MAX -O OCR-D-BIN -P impl sauvola" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P impl kim" \
  "cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "tesserocr-segment-region -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG" \
  "segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true" \
  "cis-ocropy-deskew -I OCR-D-SEG-REPAIR -O OCR-D-SEG-REG-DESKEW -P level-of-operation region" \
  "cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP -P level-of-operation region" \
  "tesserocr-segment-line -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE" \
  "tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model fast/Fraktur_50000000.334_450937" \
  "fileformat-transform -I OCR-D-OCR-TESS -O OCR-D-OCR-ALTO -P from-to \"page alto\"" \
  "fileformat-transform -I OCR-D-OCR-TESS -O OCR-D-OCR-TEXT -P from-to \"page text\""

date --iso-8601=seconds

bertsky commented 4 years ago

Thanks @stweil. I can see it – fix is coming...

bertsky commented 4 years ago

I have not tested this on your dataset, so please check out #152.
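
For readers following the traceback above: the failing line calls `.exterior` on the result of a geometric operation, and only polygons have that attribute. A minimal sketch of how this can happen, assuming shapely is installed and that a detected region only touches its parent outline instead of overlapping it. The helper name `clip_to_parent` and the coordinates are hypothetical. This is not the actual `polygon_for_parent` code, and the guard shown is one possible approach, not necessarily what #152 does.

```python
from shapely.geometry import Polygon


def clip_to_parent(child_points, parent_points):
    """Clip a child outline to its parent.

    Returns the clipped outline as an open list of (x, y) tuples,
    or None if the two outlines share no area.
    Hypothetical helper for illustration only.
    """
    child = Polygon(child_points)
    parent = Polygon(parent_points)
    if child.within(parent):
        return list(child_points)
    inter = child.intersection(parent)
    # Outlines that only touch intersect in a LineString or Point.
    # Those have no .exterior, so treat zero-area results as "no overlap".
    if inter.is_empty or inter.area == 0:
        return None
    if inter.geom_type != 'Polygon':
        # e.g. MultiPolygon or GeometryCollection: fall back to the hull
        inter = inter.convex_hull
    return list(inter.exterior.coords)[:-1]  # drop closing point, keep open


page = [(0, 0), (100, 0), (100, 100), (0, 100)]
touching = [(100, 10), (150, 10), (150, 50), (100, 50)]    # shares only an edge
overlapping = [(50, 10), (150, 10), (150, 50), (50, 50)]   # shares real area

edge = Polygon(touching).intersection(Polygon(page))
print(edge.geom_type, hasattr(edge, 'exterior'))
print(clip_to_parent(touching, page))
print(clip_to_parent(overlapping, page))
```

Checking the area before touching `.exterior` keeps degenerate intersections from raising, at the cost of silently dropping such regions, so a caller would probably want to log them.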