OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

Side effects of running tesserocr-recognize as a worker #196

Closed MehmedGIT closed 11 months ago

MehmedGIT commented 11 months ago

Running ocrd-tesserocr-recognize as a processing worker has two side effects, listed below. Note that the logging error (side effect 2) does not occur when running in a Docker environment.

1) Dump of the ocrd tool json file when starting the worker with: ocrd-tesserocr-recognize worker --database ... --queue ...

Output:

```
{ "executable": "ocrd-tesserocr-recognize", "categories": [ "Text recognition and optimization" ], "description": "Segment and/or recognize text with Tesseract (using annotated derived images, or masking and cropping images from coordinate polygons) on any level of the PAGE hierarchy.", "input_file_grp": [ "OCR-D-SEG-PAGE", "OCR-D-SEG-REGION", "OCR-D-SEG-TABLE", "OCR-D-SEG-LINE", "OCR-D-SEG-WORD" ], "output_file_grp": [ "OCR-D-SEG-REGION", "OCR-D-SEG-TABLE", "OCR-D-SEG-LINE", "OCR-D-SEG-WORD", "OCR-D-SEG-GLYPH", "OCR-D-OCR-TESS" ], "steps": [ "layout/segmentation/region", "layout/segmentation/line", "recognition/text-recognition" ], "parameters": { "dpi": { "type": "number", "format": "float", "description": "pixel density in dots per inch (overrides any meta-data in the images)", "default": 0 }, "padding": { "type": "number", "format": "integer", "default": 0, "description": "Extend detected region/cell/line/word rectangles by this many (true) pixels, or extend existing region/line/word images (i.e. the annotated AlternativeImage if it exists or the higher-level image cropped to the bounding box and masked by the polygon otherwise) by this many (background/white) pixels on each side before recognition." }, "segmentation_level": { "type": "string", "enum": [ "region", "cell", "line", "word", "glyph", "none" ], "default": "word", "description": "Highest PAGE XML hierarchy level to remove existing annotation from and detect segments for (before iterating downwards); if ``none``, does not attempt any new segmentation; if ``cell``, starts at table regions, detecting text regions (cells). Ineffective when lower than ``textequiv_level``."
}, "textequiv_level": { "type": "string", "enum": [ "region", "cell", "line", "word", "glyph", "none" ], "default": "word", "description": "Lowest PAGE XML hierarchy level to re-use or detect segments for and add the TextEquiv results to (before projecting upwards); if ``none``, adds segmentation down to the glyph level, but does not attempt recognition at all; if ``cell``, stops short before text lines, adding text of text regions inside tables (cells) or on page level only." }, "overwrite_segments": { "type": "boolean", "default": false, "description": "If ``segmentation_level`` is not none, but an element already contains segments, remove them and segment again. Otherwise use the existing segments of that element." }, "overwrite_text": { "type": "boolean", "default": true, "description": "If ``textequiv_level`` is not none, but a segment already contains TextEquivs, remove them and replace with recognised text. Otherwise add new text as alternative. (Only the first entry is projected upwards.)" }, "shrink_polygons": { "type": "boolean", "default": false, "description": "When detecting any segments, annotate polygon coordinates instead of bounding box rectangles by projecting the convex hull of all symbols." }, "block_polygons": { "type": "boolean", "default": false, "description": "When detecting regions, annotate polygon coordinates instead of bounding box rectangles by querying Tesseract accordingly." }, "find_tables": { "type": "boolean", "default": true, "description": "When detecting regions, recognise tables as table regions (Tesseract's ``textord_tabfind_find_tables=1``)." }, "find_staves": { "type": "boolean", "default": false, "description": "When detecting regions, recognize music staves as non-text, suppressing it in the binary image (Tesseract's ``pageseg_apply_music_mask``). Note that this might wrongly detect tables as staves."
}, "sparse_text": { "type": "boolean", "default": false, "description": "When detecting regions, use 'sparse text' page segmentation mode (finding as much text as possible in no particular order): only text regions, single lines without vertical or horizontal space." }, "raw_lines": { "type": "boolean", "default": false, "description": "When detecting lines, do not attempt additional segmentation (baseline+xheight+ascenders/descenders prediction) on line images. Can increase accuracy for certain workflows. Disable when line segments/images may contain components of more than 1 line, or larger gaps/white-spaces." }, "char_whitelist": { "type": "string", "default": "", "description": "When recognizing text, enumeration of character hypotheses (from the model) to allow exclusively; overruled by blacklist if set." }, "char_blacklist": { "type": "string", "default": "", "description": "When recognizing text, enumeration of character hypotheses (from the model) to suppress; overruled by unblacklist if set." }, "char_unblacklist": { "type": "string", "default": "", "description": "When recognizing text, enumeration of character hypotheses (from the model) to allow inclusively." }, "tesseract_parameters": { "type": "object", "default": {}, "description": "Dictionary of additional Tesseract runtime variables (cf. tesseract --print-parameters), string values." }, "xpath_parameters": { "type": "object", "default": {}, "description": "Set additional Tesseract runtime variables according to results of XPath queries into the segment. (As a convenience, `@language` and `@script` also match their upwards `@primary*` and `@secondary*` variants where applicable.) (Example: {'ancestor::TextRegion/@type=\"page-number\"': {'char_whitelist': '0123456789-'}, 'contains(@custom,\"ISBN\")': {'char_whitelist': '0123456789-'}})" }, "xpath_model": { "type": "object", "default": {}, "description": "Prefer models mapped according to results of XPath queries into the segment.
(As a convenience, `@language` and `@script` also match their upwards `@primary*` and `@secondary*` variants where applicable.) If no queries / mappings match (or under the default empty parameter), then fall back to `model`. If there are multiple matches, combine their results. (Example: {'starts-with(@script,\"Latn\")': 'Latin', 'starts-with(@script,\"Grek\")': 'Greek', '@language=\"Latin\"': 'lat', '@language=\"Greek\"': 'grc+ell', 'ancestor::TextRegion/@type=\"page-number\"': 'eng'})" }, "auto_model": { "type": "boolean", "default": false, "description": "Prefer models performing best (by confidence) per segment (if multiple given in `model`). Repeats the OCR of the best model once (i.e. slower). (Use as a fallback to xpath_model if you do not trust script/language detection.)" }, "model": { "type": "string", "format": "uri", "content-type": "application/octet-stream", "description": "The tessdata text recognition model to apply (an ISO 639-3 language specification or some other basename, e.g. deu-frak or Fraktur)." }, "oem": { "type": "string", "enum": [ "TESSERACT_ONLY", "LSTM_ONLY", "TESSERACT_LSTM_COMBINED", "DEFAULT" ], "default": "DEFAULT", "description": "Tesseract OCR engine mode to use:\n* Run Tesseract only - fastest,\n* Run just the LSTM line recognizer. (>=v4.00),\n*Run the LSTM recognizer, but allow fallback to Tesseract when things get difficult. (>=v4.00),\n*Run both and combine results - best accuracy."
} }, "resource_locations": [ "module" ], "resources": [ { "url": "https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_fast/Fraktur_50000000.334_450937.traineddata", "name": "Fraktur_GT4HistOCR.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model trained on GT4HistOCR", "size": 1058487 }, { "url": "https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ONB/tessdata_best/ONB_1.195_300718_989100.traineddata", "name": "ONB.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model based on Austrian National Library newspaper data", "size": 4358948 }, { "url": "https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata", "name": "frak2021.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model based on Austrian National Library newspaper data", "size": 3421140 }, { "url": "https://github.com/tesseract-ocr/tessdata_fast/raw/main/equ.traineddata", "name": "equ.traineddata", "parameter_usage": "without-extension", "description": "Tesseract legacy model for mathematical equations", "size": 2251950 }, { "url": "https://github.com/tesseract-ocr/tessdata_fast/raw/main/osd.traineddata", "name": "osd.traineddata", "parameter_usage": "without-extension", "description": "Tesseract legacy model for orientation and script detection", "size": 10562727 }, { "url": "https://github.com/tesseract-ocr/tessdata_fast/raw/main/eng.traineddata", "name": "eng.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model for contemporary (computer typesetting and offset printing) English", "size": 4113088 }, { "url": "https://github.com/tesseract-ocr/tessdata_fast/raw/main/deu.traineddata", "name": "deu.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model for contemporary (computer typesetting and offset printing) German", "size": 1525436
}, { "url": "https://github.com/tesseract-ocr/tessdata_fast/raw/main/frk.traineddata", "name": "frk.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model for historical (Fraktur typesetting and letterpress printing) German", "size": 6423052 }, { "url": "https://github.com/tesseract-ocr/tessdata_fast/raw/main/script/Fraktur.traineddata", "name": "Fraktur.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model for historical Latin script with Fraktur typesetting", "size": 10915632 }, { "url": "https://github.com/tesseract-ocr/tessdata_fast/raw/main/script/Latin.traineddata", "name": "Latin.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model for contemporary and historical Latin script", "size": 89384811 }, { "url": "https://github.com/tesseract-ocr/tesseract/archive/main.tar.gz", "name": "configs", "description": "Tesseract configs (parameter sets) for use with the standalone tesseract CLI", "size": 1915529, "type": "archive", "path_in_archive": "tesseract-main/tessdata/configs" } ] }
```

2) After consuming OCR-D processing messages, the processing worker repeatedly emits the following error:

--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1028, in emit
    stream.write(msg + self.terminator)
ValueError: I/O operation on closed file.
Call stack:
  File "/home/mm/venv37-ocrd/bin/ocrd-tesserocr-recognize", line 8, in <module>
    sys.exit(ocrd_tesserocr_recognize())
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/ocrd_tesserocr/cli.py", line 43, in ocrd_tesserocr_recognize
    return ocrd_cli_wrap_processor(TesserocrRecognize, *args, **kwargs)
  File "/home/mm/Desktop/core/ocrd/ocrd/decorators/__init__.py", line 64, in ocrd_cli_wrap_processor
    check_and_run_network_agent(processorClass, subcommand, address, database, queue)
  File "/home/mm/Desktop/core/ocrd/ocrd/decorators/__init__.py", line 168, in check_and_run_network_agent
    processing_worker.start_consuming()
  File "/home/mm/Desktop/core/ocrd_network/ocrd_network/processing_worker.py", line 168, in start_consuming
    self.rmq_consumer.start_consuming()
  File "/home/mm/Desktop/core/ocrd_network/ocrd_network/rabbitmq_utils/consumer.py", line 76, in start_consuming
    self._channel.start_consuming()
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 1883, in start_consuming
    self._process_data_events(time_limit=None)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 2044, in _process_data_events
    self.connection.process_data_events(time_limit=time_limit)
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 851, in process_data_events
    self._dispatch_channel_events()
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 567, in _dispatch_channel_events
    impl_channel._get_cookie()._dispatch_events()
  File "/home/mm/venv37-ocrd/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 1511, in _dispatch_events
    evt.properties, evt.body)
  File "/home/mm/Desktop/core/ocrd_network/ocrd_network/processing_worker.py", line 148, in on_consumed_message
    self.process_message(processing_message=processing_message)
  File "/home/mm/Desktop/core/ocrd_network/ocrd_network/processing_worker.py", line 249, in process_message
    self.log.info(f'Result message: {result_message.__dict__}')
Message: "Result message: {'job_id': '72383f9e-1e55-412a-b033-adb654c10422', 'state': 'SUCCESS', 'workspace_id': None, 'path_to_mets': '/home/mm/Desktop/ocrd_network_files/example_ws2/data/mets.xml'}"
Arguments: ()

Fixing the unexpected JSON dump might also fix the logging error above.
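For readers unfamiliar with the `--- Logging error ---` report: it is what Python's `logging` module writes to stderr when a handler fails to emit a record, instead of raising in the caller. The following is a minimal sketch that reproduces that kind of report with a `StreamHandler` whose stream has already been closed. It does not reproduce the worker setup; the logger name and the in-memory stream are assumptions made for illustration, and what closed the stream in the actual worker is not established here.

```python
import contextlib
import io
import logging


def log_to_closed_stream() -> str:
    """Log through a handler whose stream is closed; return what lands on stderr."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    logger = logging.getLogger("closed_stream_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the record away from root handlers
    logger.addHandler(handler)

    # Stand-in for whatever closed the real stream in the worker (unknown here).
    stream.close()

    captured = io.StringIO()
    with contextlib.redirect_stderr(captured):
        # The failed write is caught by the handler and reported on stderr;
        # the call itself does not raise.
        logger.info("Result message: %s", {"state": "SUCCESS"})

    logger.removeHandler(handler)
    return captured.getvalue()


report = log_to_closed_stream()
print(report)
```

The caller keeps running after the failed write, which matches the job in the report above still ending in state SUCCESS.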

kba commented 11 months ago
> 1) Dump of the ocrd tool json file when starting the worker with: ocrd-tesserocr-recognize worker --database ... --queue ...

This is because in https://github.com/OCR-D/core/blob/master/ocrd/ocrd/decorators/__init__.py#L155 we have

    processor = ProcessorClass(workspace=None, dump_json=True)

Any idea why that is there?

MehmedGIT commented 11 months ago

Ah, yes... This is a leftover from a hack we used in the past to get the ocrd tool JSON. If I remember correctly, it was not working properly without setting the dump_json flag. I will check whether simply removing it is enough to solve the issue.

MehmedGIT commented 11 months ago

I'm confused about why this seems to affect only ocrd_tesserocr. I will investigate more tomorrow.

kba commented 11 months ago

> I'm confused about why this seems to affect only ocrd_tesserocr. I will investigate more tomorrow.

I'm confused as well. If you find out more, please share. I'm focusing on debugging the logging behavior in Docker.

MehmedGIT commented 11 months ago

Good news: simply removing the dump_json flag fixes the logging error as well. I am creating a PR in core.
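To illustrate the first side effect in isolation, here is a minimal sketch of a constructor that prints its tool description when a `dump_json`-style flag is set. `ToyProcessor` and its `ocrd_tool` dictionary are hypothetical stand-ins, not the OCR-D core `Processor` class, and the sketch makes no claim about how core implements the flag or why removing it also resolves the logging error.

```python
import contextlib
import io
import json


class ToyProcessor:
    """Hypothetical stand-in for a processor class; not the OCR-D core API."""

    ocrd_tool = {"executable": "toy-recognize", "parameters": {}}

    def __init__(self, workspace=None, dump_json=False):
        self.workspace = workspace
        if dump_json:
            # Side effect: merely constructing the object writes to stdout.
            print(json.dumps(self.ocrd_tool, indent=2))


def stdout_of(factory) -> str:
    """Run factory() and return everything it wrote to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        factory()
    return buffer.getvalue()


# With the flag, instantiation dumps the JSON; without it, nothing is printed.
with_flag = stdout_of(lambda: ToyProcessor(workspace=None, dump_json=True))
without_flag = stdout_of(lambda: ToyProcessor(workspace=None))

# The tool description stays available as data without printing it.
tool = ToyProcessor.ocrd_tool
```

In this toy version, dropping the flag at the call site is enough to silence the dump while the tool description remains accessible as a plain attribute.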