Closed MehmedGIT closed 11 months ago
- Dump of the ocrd tool json file when starting the worker with:
ocrd-tesserocr-recognize worker --database ... --queue ...
This is because in https://github.com/OCR-D/core/blob/master/ocrd/ocrd/decorators/__init__.py#L155 we have
processor = ProcessorClass(workspace=None, dump_json=True)
Any idea why that is there?
Ah, yes... This is a leftover for a hack we used to have for getting the ocrd tool json in the past. If I remember correctly, it was not working properly without setting the dump_json
flag. I should see if simply removing it is enough to solve the issue.
I'm confused why that potentially fails only the ocrd_tesserocr
. Will investigate more tomorrow.
I'm confused why that potentially fails only the
ocrd_tesserocr
. Will investigate more tomorrow.
So am I, confused, if you find out more, pls share. I'm focussing on debugging the logging behavior in docker.
Good news - simply removing the flag fixes the logging error as well. I am creating a PR in the core.
Running
tesserocr-recognize
as a processing worker has some side effects. It is worth mentioning that the logging error does not occur when running in a docker environment.1) Dump of the ocrd tool json file when starting the worker with:
ocrd-tesserocr-recognize worker --database ... --queue ...
Output
``` { "executable": "ocrd-tesserocr-recognize", "categories": [ "Text recognition and optimization" ], "description": "Segment and/or recognize text with Tesseract (using annotated derived images, or masking and cropping images from coordinate polygons) on any level of the PAGE hierarchy.", "input_file_grp": [ "OCR-D-SEG-PAGE", "OCR-D-SEG-REGION", "OCR-D-SEG-TABLE", "OCR-D-SEG-LINE", "OCR-D-SEG-WORD" ], "output_file_grp": [ "OCR-D-SEG-REGION", "OCR-D-SEG-TABLE", "OCR-D-SEG-LINE", "OCR-D-SEG-WORD", "OCR-D-SEG-GLYPH", "OCR-D-OCR-TESS" ], "steps": [ "layout/segmentation/region", "layout/segmentation/line", "recognition/text-recognition" ], "parameters": { "dpi": { "type": "number", "format": "float", "description": "pixel density in dots per inch (overrides any meta-data in the images)", "default": 0 }, "padding": { "type": "number", "format": "integer", "default": 0, "description": "Extend detected region/cell/line/word rectangles by this many (true) pixels, or extend existing region/line/word images (i.e. the annotated AlternativeImage if it exists or the higher-level image cropped to the bounding box and masked by the polygon otherwise) by this many (background/white) pixels on each side before recognition." }, "segmentation_level": { "type": "string", "enum": [ "region", "cell", "line", "word", "glyph", "none" ], "default": "word", "description": "Highest PAGE XML hierarchy level to remove existing annotation from and detect segments for (before iterating downwards); if ``none``, does not attempt any new segmentation; if ``cell``, starts at table regions, detecting text regions (cells). Ineffective when lower than ``textequiv_level``." }, "textequiv_level": { "type": "string", "enum": [ "region", "cell", "line", "word", "glyph", "none" ], "default": "word", "description": "Lowest PAGE XML hierarchy level to re-use or detect segments for and add the TextEquiv results to (before projecting upwards); if ``none``, adds segmentation down to the glyph level, but does not attempt recognition at all; if ``cell``, stops short before text lines, adding text of text regions inside tables (cells) or on page level only." }, "overwrite_segments": { "type": "boolean", "default": false, "description": "If ``segmentation_level`` is not none, but an element already contains segments, remove them and segment again. Otherwise use the existing segments of that element." }, "overwrite_text": { "type": "boolean", "default": true, "description": "If ``textequiv_level`` is not none, but a segment already contains TextEquivs, remove them and replace with recognised text. Otherwise add new text as alternative. (Only the first entry is projected upwards.)" }, "shrink_polygons": { "type": "boolean", "default": false, "description": "When detecting any segments, annotate polygon coordinates instead of bounding box rectangles by projecting the convex hull of all symbols." }, "block_polygons": { "type": "boolean", "default": false, "description": "When detecting regions, annotate polygon coordinates instead of bounding box rectangles by querying Tesseract accordingly." }, "find_tables": { "type": "boolean", "default": true, "description": "When detecting regions, recognise tables as table regions (Tesseract's ``textord_tabfind_find_tables=1``)." }, "find_staves": { "type": "boolean", "default": false, "description": "When detecting regions, recognize music staves as non-text, suppressing it in the binary image (Tesseract's ``pageseg_apply_music_mask``). Note that this might wrongly detect tables as staves." }, "sparse_text": { "type": "boolean", "default": false, "description": "When detecting regions, use 'sparse text' page segmentation mode (finding as much text as possible in no particular order): only text regions, single lines without vertical or horizontal space." }, "raw_lines": { "type": "boolean", "default": false, "description": "When detecting lines, do not attempt additional segmentation (baseline+xheight+ascenders/descenders prediction) on line images. Can increase accuracy for certain workflows. Disable when line segments/images may contain components of more than 1 line, or larger gaps/white-spaces." }, "char_whitelist": { "type": "string", "default": "", "description": "When recognizing text, enumeration of character hypotheses (from the model) to allow exclusively; overruled by blacklist if set." }, "char_blacklist": { "type": "string", "default": "", "description": "When recognizing text, enumeration of character hypotheses (from the model) to suppress; overruled by unblacklist if set." }, "char_unblacklist": { "type": "string", "default": "", "description": "When recognizing text, enumeration of character hypotheses (from the model) to allow inclusively." }, "tesseract_parameters": { "type": "object", "default": {}, "description": "Dictionary of additional Tesseract runtime variables (cf. tesseract --print-parameters), string values." }, "xpath_parameters": { "type": "object", "default": {}, "description": "Set additional Tesseract runtime variables according to results of XPath queries into the segment. (As a convenience, `@language` and `@script` also match their upwards `@primary*` and `@secondary*` variants where applicable.) (Example: {'ancestor::TextRegion/@type=\"page-number\"': {'char_whitelist': '0123456789-'}, 'contains(@custom,\"ISBN\")': {'char_whitelist': '0123456789-'}})" }, "xpath_model": { "type": "object", "default": {}, "description": "Prefer models mapped according to results of XPath queries into the segment. (As a convenience, `@language` and `@script` also match their upwards `@primary*` and `@secondary*` variants where applicable.) If no queries / mappings match (or under the default empty parameter), then fall back to `model`. If there are multiple matches, combine their results. (Example: {'starts-with(@script,\"Latn\")': 'Latin', 'starts-with(@script,\"Grek\")': 'Greek', '@language=\"Latin\"': 'lat', '@language=\"Greek\"': 'grc+ell', 'ancestor::TextRegion/@type=\"page-number\"': 'eng'})" }, "auto_model": { "type": "boolean", "default": false, "description": "Prefer models performing best (by confidence) per segment (if multiple given in `model`). Repeats the OCR of the best model once (i.e. slower). (Use as a fallback to xpath_model if you do not trust script/language detection.)" }, "model": { "type": "string", "format": "uri", "content-type": "application/octet-stream", "description": "The tessdata text recognition model to apply (an ISO 639-3 language specification or some other basename, e.g. deu-frak or Fraktur)." }, "oem": { "type": "string", "enum": [ "TESSERACT_ONLY", "LSTM_ONLY", "TESSERACT_LSTM_COMBINED", "DEFAULT" ], "default": "DEFAULT", "description": "Tesseract OCR engine mode to use:\n* Run Tesseract only - fastest,\n* Run just the LSTM line recognizer. (>=v4.00),\n*Run the LSTM recognizer, but allow fallback to Tesseract when things get difficult. (>=v4.00),\n*Run both and combine results - best accuracy." } }, "resource_locations": [ "module" ], "resources": [ { "url": "https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_fast/Fraktur_50000000.334_450937.traineddata", "name": "Fraktur_GT4HistOCR.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model trained on GT4HistOCR", "size": 1058487 }, { "url": "https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ONB/tessdata_best/ONB_1.195_300718_989100.traineddata", "name": "ONB.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model based on Austrian National Library newspaper data", "size": 4358948 }, { "url": "https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata", "name": "frak2021.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model based on Austrian National Library newspaper data", "size": 3421140 }, { "url": "https://github.com/tesseract-ocr/tessdata_fast/raw/main/equ.traineddata", "name": "equ.traineddata", "parameter_usage": "without-extension", "description": "Tesseract legacy model for mathematical equations", "size": 2251950 }, { "url": "https://github.com/tesseract-ocr/tessdata_fast/raw/main/osd.traineddata", "name": "osd.traineddata", "parameter_usage": "without-extension", "description": "Tesseract legacy model for orientation and script detection", "size": 10562727 }, { "url": "https://github.com/tesseract-ocr/tessdata_fast/raw/main/eng.traineddata", "name": "eng.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model for contemporary (computer typesetting and offset printing) English", "size": 4113088 }, { "url": "https://github.com/tesseract-ocr/tessdata_fast/raw/main/deu.traineddata", "name": "deu.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model for contemporary (computer typesetting and offset printing) German", "size": 1525436 }, { "url": "https://github.com/tesseract-ocr/tessdata_fast/raw/main/frk.traineddata", "name": "frk.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model for historical (Fraktur typesetting and letterpress printing) German", "size": 6423052 }, { "url": "https://github.com/tesseract-ocr/tessdata_fast/raw/main/script/Fraktur.traineddata", "name": "Fraktur.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model for historical Latin script with Fraktur typesetting", "size": 10915632 }, { "url": "https://github.com/tesseract-ocr/tessdata_fast/raw/main/script/Latin.traineddata", "name": "Latin.traineddata", "parameter_usage": "without-extension", "description": "Tesseract LSTM model for contemporary and historical Latin script", "size": 89384811 }, { "url": "https://github.com/tesseract-ocr/tesseract/archive/main.tar.gz", "name": "configs", "description": "Tesseract configs (parameter sets) for use with the standalone tesseract CLI", "size": 1915529, "type": "archive", "path_in_archive": "tesseract-main/tessdata/configs" } ] } ```2) After consuming ocrd processing messages, the processing worker produces multiples of the following error:
Fixing the unusual json dump could also fix the error above.