OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

segment-line: Self-intersection at or near point ... #123

Closed jbarth-ubhd closed 4 years ago

jbarth-ubhd commented 4 years ago
ocrd process \
  "olena-binarize -I OCR-D-IMG -O OCR-D-BIN -p '{\"impl\": \"sauvola-ms-split\"}'" \
  "cis-ocropy-denoise -I OCR-D-BIN -O OCR-D-BIN-DENOISE -p '{\"level-of-operation\":\"page\"}'" \
  "anybaseocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW" \
  "anybaseocr-crop -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-CROP" \
  "cis-ocropy-segment -I OCR-D-CROP -O OCR-D-SEG-REG -p '{\"level-of-operation\":\"page\"}'" \
  "tesserocr-segment-line -I OCR-D-SEG-REG -O OCR-D-SEG-LINE" \
  "cis-ocropy-clip -I OCR-D-SEG-LINE -O OCR-D-SEG-LINE-CLIP -p '{\"level-of-operation\":\"line\"}'" \
  "cis-ocropy-dewarp -I OCR-D-SEG-LINE-CLIP -O OCR-D-SEG-LINE-CLIP-DEWARP" \
  "tesserocr-recognize -I OCR-D-SEG-LINE-CLIP-DEWARP -O OCR-D-OCR -p '{\"textequiv_level\":\"glyph\",\"overwrite_words\":true,\"model\":\"GT4HistOCR_50000000.75_322098+GT4HistOCR_50000000.78_258336+GT4HistOCR_5000000-20.95_147211\"}'"

Original image: https://digi.ub.uni-heidelberg.de/diglitData/jb/02_-_arndt1710_-_000_096.tif (58 MB)

16:14:18.969 INFO processor.TesserocrSegmentLine - INPUT FILE 1 / P_0002
16:14:19.028 ERROR ocrd.workspace - page "P_0002" image (binarized,despeckled,deskewed,cropped; 3031x6660) has not been reshaped properly (3089x6687) during rotation
16:14:19.029 INFO processor.TesserocrSegmentLine - Page 'P_0002' images will use 1200 DPI from image meta-data
16:14:22.742 ERROR shapely.geos - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 2023 443 at 2023 443
Traceback (most recent call last):
  File "/home/jb/ocrd_all/venv/bin/ocrd-tesserocr-segment-line", line 8, in <module>
    sys.exit(ocrd_tesserocr_segment_line())
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_tesserocr/cli.py", line 26, in ocrd_tesserocr_segment_line
    return ocrd_cli_wrap_processor(TesserocrSegmentLine, *args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/decorators.py", line 54, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/processor/base.py", line 57, in run_processor
    processor.process()
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd_tesserocr/segment_line.py", line 114, in process
    line_poly = line_poly.intersection(region_poly).convex_hull
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/shapely/geometry/base.py", line 649, in intersection
    return geom_factory(self.impl['intersection'](self, other))
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/shapely/topology.py", line 70, in __call__
    self._check_topology(err, this, other)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/shapely/topology.py", line 38, in _check_topology
    self.fn.__name__, repr(geom)))
shapely.errors.TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7f271e147390>
Traceback (most recent call last):
  File "/home/jb/ocrd_all/venv/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/cli/process.py", line 26, in process_cli
    run_tasks(mets, log_level, page_id, tasks)
  File "/home/jb/ocrd_all/venv/lib/python3.6/site-packages/ocrd/task_sequence.py", line 131, in run_tasks
    raise Exception("%s exited with non-zero return value %s" % (task.executable, returncode))
Exception: ocrd-tesserocr-segment-line exited with non-zero return value 1
bertsky commented 4 years ago

@jbarth-ubhd first, please allow me to comment on your choice of workflow:

"anybaseocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW"

I recommend avoiding deskewing from ocrd_anybaseocr. It's just a rebrand of ocropus/ocrolib facilities, but it does not respect our coordinate consistency principle (by rotating the image without also enlarging it, thereby throwing away information at the corners and making follow-up steps in the workflow unpredictable – cf https://github.com/kba/ocrd_anybaseocr/issues/47).
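To illustrate why rotating without enlarging loses the corners, here is a minimal sketch of the canvas size a rotated image needs if nothing is to be cut off. The function name `rotated_canvas_size` is hypothetical and the rounding convention is an assumption; this is not taken from OCR-D core or ocrd_anybaseocr.

```python
import math


def rotated_canvas_size(width, height, angle_deg):
    """Return the (width, height) of the axis-aligned box that fully
    contains a width x height image rotated by angle_deg degrees.

    Hypothetical helper for illustration only; rounding to the nearest
    pixel is an assumption, real implementations may round differently.
    """
    angle = math.radians(angle_deg)
    cos_a = abs(math.cos(angle))
    sin_a = abs(math.sin(angle))
    new_width = width * cos_a + height * sin_a
    new_height = width * sin_a + height * cos_a
    return round(new_width), round(new_height)


if __name__ == "__main__":
    # Even a small skew angle makes the required canvas larger than the input.
    print(rotated_canvas_size(3000, 6000, 1.0))
```

Any deskewing step that keeps the original width and height for a non-zero angle therefore has to crop, which is the kind of size mismatch the "has not been reshaped properly" error in the log above complains about.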

Instead, if you use ocrd-cis-ocropy-deskew you get a rebrand of ocropus/ocrolib that is not only more correct, but also more accurate: it adds a confidence threshold on top of the old implementation.

"cis-ocropy-segment -I OCR-D-CROP -O OCR-D-SEG-REG -p '{\"level-of-operation\":\"page\"}'" 

I strongly recommend against this on the page level. This is a very crude attempt at building regions from lines. There are much better implementations for that. (However, line segmentation within regions is quite competitive with this processor.)

Use ocrd-tesserocr-segment-region, possibly in combination with ocrd-segment-repair plausibilize=true (for post-processing) and ocrd-cis-ocropy-clip on the region level (to suppress non-text within text regions).

"tesserocr-segment-line -I OCR-D-SEG-REG -O OCR-D-SEG-LINE" 
"cis-ocropy-clip -I OCR-D-SEG-LINE -O OCR-D-SEG-LINE-CLIP -p '{\"level-of-operation\":\"line\"}'" 

If you use a bbox-only line segmentation, I highly recommend doing polygonalization right after that, not just clipping. Clipping is a dumb last resort operation. In contrast, polygonalization can be very accurate.

I recommend ocrd-cis-ocropy-resegment instead. (But also try using ocrd-cis-ocropy-segment on the region level, which is already polygonal, instead of Tesseract line segmentation plus postprocessing.)

Cf. https://hackmd.io/@FKFH0M1sR2SdJZwK5U8Cfg/S1YQ4NeNr#/3/4 ff.


Now as to the bug you found: thanks for reporting – I can reproduce. Even #120, which is related, does not help.

It appears that the results from ocrd-cis-ocropy-segment on the page level are riddled with coordinate invalidities and inconsistencies. (I should probably take this variant down completely.)

I could catch those cases here in the line segmentation of course. But this – again – raises the larger question of whether or not processors should be robust to invalid input. Instead of adding this kind of extra robustness to nearly all existing processors (and thereby bloating their code-base), IMHO we should enforce correct output with our validators.
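For illustration, a minimal sketch of the kind of check a validator (or a defensive processor) could run on polygon coordinates. It is pure Python and only detects proper edge crossings, not touching or overlapping edges, so it is a simplification. The function names are hypothetical and not part of ocrd or ocrd_tesserocr. In practice Shapely's `is_valid` property would be the more complete test.

```python
def _orientation(a, b, c):
    """Sign of the cross product (b - a) x (c - a):
    1 for a left turn, -1 for a right turn, 0 if collinear."""
    value = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (value > 0) - (value < 0)


def _segments_cross(p1, p2, q1, q2):
    """True if segments p1-p2 and q1-q2 cross at a single interior point."""
    return (_orientation(p1, p2, q1) * _orientation(p1, p2, q2) < 0
            and _orientation(q1, q2, p1) * _orientation(q1, q2, p2) < 0)


def has_self_intersection(points):
    """Check an (implicitly closed) polygon outline for crossing edges.

    points: list of (x, y) tuples without a repeated closing point.
    Brute force over all edge pairs, O(n^2); fine for a sketch.
    """
    n = len(points)
    edges = [(points[i], points[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last edge share a vertex
            if _segments_cross(edges[i][0], edges[i][1],
                               edges[j][0], edges[j][1]):
                return True
    return False


if __name__ == "__main__":
    square = [(0, 0), (2, 0), (2, 2), (0, 2)]
    bowtie = [(0, 0), (2, 2), (2, 0), (0, 2)]
    print(has_self_intersection(square))
    print(has_self_intersection(bowtie))
```

A polygon like the bowtie above is the kind of input that makes GEOS refuse the `intersection` call in the traceback; where such a check lives (in each processor or in a validator) is exactly the design question raised here.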

@kba @wrznr please comment!

jbarth-ubhd commented 4 years ago

Just before I forget to mention this later: this workflow is the »Good results for all pages« recommended on https://ocr-d.de/en/workflows .

bertsky commented 4 years ago

Just before I forget to mention this later: this workflow is the »Good results for all pages« recommended on https://ocr-d.de/en/workflows .

My goodness! I wish I knew how they came up with that methodologically...

wrznr commented 4 years ago

@bertsky Not much left to comment on: I always tell people not to use the segmentation facilities implemented in OCRopus. We should focus on Tesseract (morphology-based) and asap. kraken (data-driven).

processors should be robust to invalid input

No, they do not have to. At least not for input which has been generated automatically by non-recommended workflows. However, we should think about extending kwalitee in the direction of recommended workflows.

bertsky commented 4 years ago

I always tell people not to use the segmentation facilities implemented in OCRopus. We should focus on Tesseract (morphology-based) and asap. kraken (data-driven).

I am confused:

processors should be robust to invalid input

No they do not have to

Fine, then this is agreed. Input can be assumed to be valid and consistent; otherwise there are no guarantees of accurate or even complete output. Maybe this should be stated more prominently in the specs!

At least not for input which has been generated automatically by non-recommended workflows

Note: this is a currently recommended workflow (unfortunately).

wrznr commented 4 years ago

I am confused

No need to! I did not say anything about Ocropy. However, I think we should focus on one morphology-based segmentation method rather than being busy keeping multiple ones up to date (i.e. with changes in core and spec). Tesseract seems to me the most likely candidate.

kraken

https://github.com/mittagessen/kraken/tree/blla Please note the small reservation ("asap.") in my proposal.

wrznr commented 4 years ago

Note: this is a currently recommended workflow (unfortunately).

A discussion about those recommendations is imminent. It is also a question of who recommends what. Obviously the two of us (who could be called the most active users of the OCR-D tools for productive text digitization) do not recommend Ocropy for region segmentation.

bertsky commented 4 years ago

I did not say anything about Ocropy

I misread your statement. You were only concerned with page/region segmentation. (I still do recommend Ocropy's line segmentation, because it's polygonal.)

However, I think we should focus on one morphology-based segmentation method

Agreed, but if we had a good X-Y cut or other rule-based approach available (and wrapped) without too much effort, I would be happy to keep it alive as an alternative implementation. Each method has its weaknesses. (Ideally we can later combine different segmentations anyway.)

bertsky commented 4 years ago

Agreed, but if we had a good X-Y cut or other rule-based approach available (and wrapped) without too much effort, I would be happy to keep it alive as an alternative implementation. Each method has its weaknesses. (Ideally we can later combine different segmentations anyway.)

This is now addressed (you could say solved) by cisocrgroup/ocrd_cis#47.

It also avoids the root cause of this issue in ocrd-tesserocr-segment-line (bad input from ocrd_cis), namely invalid polygons.