OCR-D / ocrd_segment

OCR-D-compliant page segmentation
MIT License
Processor ocrd-segment-repair exits with exception #45

Closed stweil closed 4 years ago

stweil commented 4 years ago

Log output:

12:06:01.982 INFO ocrd.task_sequence.run_tasks - Start processing task 'segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -p '{"plausibilize": true, "sanitize": false, "plausibilize_merge_min_overlap": 0.9}''
Traceback (most recent call last):
  File "/venv-20200919/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/venv-20200919/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/venv-20200919/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/venv-20200919/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/venv-20200919/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/venv-20200919/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/venv-20200919/lib/python3.7/site-packages/ocrd/cli/process.py", line 28, in process_cli
    run_tasks(mets, log_level, page_id, tasks, overwrite)
  File "/venv-20200919/lib/python3.7/site-packages/ocrd/task_sequence.py", line 149, in run_tasks
    raise Exception("%s exited with non-zero return value %s. STDOUT:\n%s\nSTDERR:\n%s" % (task.executable, returncode, out, err))
Exception: ocrd-segment-repair exited with non-zero return value 1. STDOUT:

STDERR:
12:06:02.420 INFO processor.RepairSegmentation - INPUT FILE 0 / PHYS_0001
12:06:02.423 INFO ocrd.page_validator - Validating input file 'FILE_0001_OCR-D-SEG-REG'
12:06:02.439 INFO processor.RepairSegmentation - INPUT FILE 1 / PHYS_0002
12:06:02.440 INFO ocrd.page_validator - Validating input file 'FILE_0002_OCR-D-SEG-REG'
Traceback (most recent call last):
  File "/venv-20200919/local/sub-venv/headless-tf1/bin/ocrd-segment-repair", line 8, in <module>
    sys.exit(ocrd_segment_repair())
  File "/venv-20200919/local/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/venv-20200919/local/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/venv-20200919/local/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/venv-20200919/local/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/venv-20200919/local/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_segment/cli.py", line 16, in ocrd_segment_repair
    return ocrd_cli_wrap_processor(RepairSegmentation, *args, **kwargs)
  File "/venv-20200919/local/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/venv-20200919/local/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 69, in run_processor
    processor.process()
  File "/venv-20200919/local/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_segment/repair.py", line 94, in process
    parents = list(set([region.parent_object_ for region in page.get_AllRegions(classes=['Text'])]))
  File "/venv-20200919/local/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_models/ocrd_page_generateds.py", line 2905, in __hash__
    return hash(self.id)
AttributeError: 'PageType' object has no attribute 'id'

bertsky commented 4 years ago

Sorry, I should have tested better. I used set to get a unique list of parents for regions here, but the top-level PageType cannot be put into a set or dict (yet): its __hash__ relies on .id, which PageType does not have.
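For illustration, the failure can be reproduced without OCR-D installed. This is a minimal sketch: `Region` and `Page` are hypothetical stand-ins for the generated PAGE classes, and only the `__hash__` body and the `set` expression mirror the traceback above.

```python
class Region:
    """Hypothetical stand-in for a PAGE region element, which has an id."""

    def __init__(self, id, parent):
        self.id = id
        self.parent_object_ = parent

    def __hash__(self):
        return hash(self.id)


class Page:
    """Hypothetical stand-in for the top-level page element, which has no id."""

    def __init__(self, imageFilename):
        self.imageFilename = imageFilename

    def __hash__(self):
        # Same body as in the traceback; fails because there is no .id here.
        return hash(self.id)


page = Page("page_0001.tif")
regions = [Region("r1", page), Region("r2", page)]

error_message = None
try:
    # Building the set hashes each parent, which triggers Page.__hash__.
    parents = list(set(region.parent_object_ for region in regions))
except AttributeError as err:
    error_message = str(err)

print(error_message)
```

Any container that hashes its members (set, dict keys) hits the same error, so the problem is not specific to the repair processor.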

@kba I think we could easily fix this by extending ocrd_models.ocrd_page_user_methods.__hash__ such that for PageType it does not try to use .id but .pcGtsId (or by amending PageType directly with an id property equivalent to pcGtsId).
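The second option (an id property that aliases another attribute) could look roughly like the sketch below. `PageLike` is a hypothetical stand-in, not the generated class, and whether the real top-level class actually carries `pcGtsId` is taken from the comment above rather than verified here.

```python
class PageLike:
    """Hypothetical stand-in for a top-level element without a native id."""

    def __init__(self, pcGtsId):
        self.pcGtsId = pcGtsId

    @property
    def id(self):
        # Assumption: expose pcGtsId under the name that __hash__ expects.
        return self.pcGtsId

    def __hash__(self):
        # Unchanged generic hash; it now works because .id resolves.
        return hash(self.id)


top = PageLike("page-1")
# The same parent seen twice collapses to a single set member.
unique_parents = list(set([top, top]))
print(len(unique_parents))
```

The appeal of this variant is that the generic `__hash__` stays untouched; the cost is an extra attribute on the class that the PAGE schema does not define.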

kba commented 4 years ago

> @kba I think we could easily fix this by extending ocrd_models.ocrd_page_user_methods.__hash__ such that for PageType it does not try to use .id but .pcGtsId (or by amending PageType directly with an id property equivalent to pcGtsId).

On it.

kba commented 4 years ago

Fixed in https://github.com/OCR-D/core/pull/610: __hash__ now uses imageFilename as the hashable attribute for PageType, pcGtsId for PcGtsType, and id when available; otherwise it raises an exception stating that the element is unhashable.
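The fallback order described here can be sketched as follows. This is not the code from the pull request: the four classes are hypothetical stand-ins that only borrow the names, `page_element_hash` is an illustrative helper, and the choice of `TypeError` is an assumption (the comment only says "an exception").

```python
def page_element_hash(element):
    """Hash by imageFilename, pcGtsId or id, depending on the element type."""
    type_name = type(element).__name__
    if type_name == "PageType":
        return hash(element.imageFilename)
    if type_name == "PcGtsType":
        return hash(element.pcGtsId)
    if hasattr(element, "id"):
        return hash(element.id)
    # Assumption: TypeError, mirroring Python's own "unhashable type" error.
    raise TypeError("unhashable type: '%s'" % type_name)


class PageType:  # stand-in, identified by its image file name
    def __init__(self, imageFilename):
        self.imageFilename = imageFilename
    __hash__ = page_element_hash


class PcGtsType:  # stand-in, identified by pcGtsId
    def __init__(self, pcGtsId):
        self.pcGtsId = pcGtsId
    __hash__ = page_element_hash


class TextRegionType:  # stand-in for any element with an id
    def __init__(self, id):
        self.id = id
    __hash__ = page_element_hash


class BorderType:  # stand-in for an element with none of the three attributes
    __hash__ = page_element_hash


page = PageType("page_0001.tif")
parents = list(set([page, page, TextRegionType("r1")]))
print(len(parents))

try:
    hash(BorderType())
    unhashable_raised = False
except TypeError:
    unhashable_raised = True
print(unhashable_raised)
```

Raising explicitly for elements without a usable attribute turns the confusing AttributeError from the original traceback into an error that names the actual problem.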