Closed mikegerber closed 5 years ago
IIRC, it looks for lines only within regions. You need to run ocrd-tesserocr-segment-region first.
Ah, that would explain it.
Segmenting lines from the regions doesn't work either. Am I using it correctly?
cd `mktemp -d`
wget https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/8d8aa287-94ca-48e3-84a8-1ee602871550/data/lohenstein_agrippina_1665.ocrd.zip
dtrx lohenstein_agrippina_1665.ocrd.zip
cd lohenstein_agrippina_1665.ocrd/data
ocrd-tesserocr-segment-region -l DEBUG -m mets.xml -I OCR-D-IMG -O OCR-D-SEG-REGION
ocrd-tesserocr-segment-line -l DEBUG -m mets.xml -I OCR-D-REGION -O OCR-D-SEG-LINE
cat OCR-D-SEG-LINE/OCR-D-SEG-LINE_0001
⇒
16:17:07.477 INFO root - Overriding log level globally to DEBUG
16:17:07.477 DEBUG ocrd.resolver - workspace_from_url
mets_url='file:///tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml'
baseurl='file:///tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data'
dst_dir='None'
16:17:07.477 DEBUG ocrd.resolver - Copying mets url 'file:///tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml' to '/tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml'
16:17:07.477 DEBUG ocrd.resolver - Target and source mets are identical
16:17:07.477 DEBUG ocrd.processor - Running processor <class 'ocrd_tesserocr.segment_region.TesserocrSegmentRegion'>
16:17:07.478 DEBUG ocrd.processor - Processor instance <ocrd_tesserocr.segment_region.TesserocrSegmentRegion object at 0x7f6194e9c080> (ocrd-tesserocr-segment-region v0.1.3 doing layout/segmentation/region)
16:17:07.569 DEBUG processor.TesserocrSegmentRegion - Detecting regions with tesseract
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0000': 587,97 687,97 687,154 587,154
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0001': 277,146 1070,146 1070,225 277,225
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0002': 199,219 823,219 823,269 199,269
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0003': 200,263 744,263 744,307 200,307
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0004': 199,338 874,338 874,389 199,389
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0005': 162,386 790,386 790,602 162,602
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0006': 203,569 485,569 485,606 203,606
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0007': 199,578 1040,578 1040,744 199,744
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0008': 16,713 679,713 679,879 16,879
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0009': 205,861 908,861 908,921 205,921
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0010': 125,920 497,920 497,969 125,969
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0011': 207,951 852,951 852,1003 207,1003
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0012': 0,956 793,956 793,1246 0,1246
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0013': 44,1173 639,1173 639,1295 44,1295
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0014': 573,1280 692,1280 692,1329 573,1329
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0015': 44,1216 70,1216 70,1530 44,1530
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0016': 870,1253 1087,1253 1087,1319 870,1319
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0017': 191,1320 1091,1320 1091,1542 191,1542
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0018': 259,1524 1091,1524 1091,1585 259,1585
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0019': 0,1580 688,1580 688,1643 0,1643
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0020': 0,1599 1091,1599 1091,1757 0,1757
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0021': 801,1646 1097,1646 1097,1697 801,1697
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0022': 1171,1532 1213,1532 1213,1572 1171,1572
16:17:07.859 DEBUG ocrd.workspace - outputfile file_grp=OCR-D-SEG-REGION local_filename=OCR-D-SEG-REGION/OCR-D-SEG-REGION_0001 content=True
16:17:07.861 DEBUG processor.TesserocrSegmentRegion - Detecting regions with tesseract
16:17:08.246 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0000': 92,144 1145,144 1145,732 92,732
16:17:08.246 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0001': 133,731 1079,731 1079,976 133,976
16:17:08.247 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0002': 45,972 1097,972 1097,1690 45,1690
16:17:08.247 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0003': 81,1680 1066,1680 1066,1776 81,1776
16:17:08.247 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0004': 540,1820 550,1820 550,1827 540,1827
16:17:08.247 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0005': 932,1864 1211,1864 1211,1898 932,1898
16:17:08.247 DEBUG ocrd.workspace - outputfile file_grp=OCR-D-SEG-REGION local_filename=OCR-D-SEG-REGION/OCR-D-SEG-REGION_0002 content=True
16:17:08.248 DEBUG processor.TesserocrSegmentRegion - Detecting regions with tesseract
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0000': 0,114 1094,114 1094,520 0,520
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0001': 0,503 1080,503 1080,728 0,728
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0002': 142,765 1025,765 1025,857 142,857
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0003': 0,722 1084,722 1084,989 0,989
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0004': 494,943 956,943 956,974 494,974
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0005': 947,967 956,967 956,979 947,979
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0006': 82,954 1059,954 1059,1025 82,1025
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0007': 0,1002 1074,1002 1074,1449 0,1449
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0008': 0,1464 1096,1464 1096,1774 0,1774
16:17:08.566 DEBUG ocrd.workspace - outputfile file_grp=OCR-D-SEG-REGION local_filename=OCR-D-SEG-REGION/OCR-D-SEG-REGION_0003 content=True
16:17:08.570 INFO ocrd.workspace - Saving mets '/tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml'
16:17:09.047 INFO root - Overriding log level globally to DEBUG
16:17:09.048 DEBUG ocrd.resolver - workspace_from_url
mets_url='file:///tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml'
baseurl='file:///tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data'
dst_dir='None'
16:17:09.048 DEBUG ocrd.resolver - Copying mets url 'file:///tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml' to '/tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml'
16:17:09.048 DEBUG ocrd.resolver - Target and source mets are identical
16:17:09.048 DEBUG ocrd.processor - Running processor <class 'ocrd_tesserocr.segment_line.TesserocrSegmentLine'>
16:17:09.048 DEBUG ocrd.processor - Processor instance <ocrd_tesserocr.segment_line.TesserocrSegmentLine object at 0x7f8fce35b5f8> (ocrd-tesserocr-segment-line v0.1.3 doing layout/segmentation/line)
16:17:09.132 INFO ocrd.workspace - Saving mets '/tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml'
cat: OCR-D-SEG-LINE/OCR-D-SEG-LINE_0001: No such file or directory
My bad:
ocrd-tesserocr-segment-line -l DEBUG -m mets.xml -I OCR-D-REGION -O OCR-D-SEG-LINE
should have read:
ocrd-tesserocr-segment-line -l DEBUG -m mets.xml -I OCR-D-SEG-REGION -O OCR-D-SEG-LINE
This gives results! :tada:
Should the first call, using a non-existent file group, have given an error message?
For input, yes. Issue in core?
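Until the processor itself reports a missing input group, a pre-flight check along these lines could catch the typo before the run. This is only a sketch: has_file_grp is a hypothetical helper, not part of OCR-D, and it assumes the file groups appear in mets.xml as fileGrp elements with a double-quoted USE attribute on a single line.

```shell
#!/bin/sh
# Hypothetical helper, not part of OCR-D: succeed only if the METS file
# declares a file group whose USE attribute equals the given name.
# Assumes double-quoted attributes and a name without regex metacharacters.
has_file_grp() {
    mets=$1
    grp=$2
    grep -Eq "<(mets:)?fileGrp[^>]*[[:space:]]USE=\"${grp}\"" "$mets"
}

# Intended use (left as a comment because it needs the OCR-D tools installed):
#   has_file_grp mets.xml OCR-D-SEG-REGION \
#       || { echo "no such input file group" >&2; exit 1; }
#   ocrd-tesserocr-segment-line -m mets.xml -I OCR-D-SEG-REGION -O OCR-D-SEG-LINE
```

With the mistyped group OCR-D-REGION the check would fail and the processor would not be started, instead of finishing silently without output.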
So, for ocrd-tesserocr-segment-line the issue is:
% ocrd-tesserocr-segment-line --help
Usage: ocrd-tesserocr-segment-line [OPTIONS]
Options:
-V, --version Show version
-l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
Log level
-J, --dump-json Dump tool description as JSON and exit
-p, --parameter PATH
-g, --page-id TEXT ID(s) of the pages to process
-O, --output-file-grp TEXT File group(s) used as output.
-I, --input-file-grp TEXT File group(s) used as input.
-w, --working-dir TEXT Working Directory
-m, --mets TEXT METS URL to validate
--help Show this message and exit.
This gives no description and no help that I would need to give it regions as input (nor that it gives lines as output, but I can assume that given the name of the tool.) :thinking:
(This behaviour is in contrast to ocrd-ocropy-segment which gives lines from images, without the regions, hence my confusion.)
This gives no description and no help that I would need to give it regions as input (nor that it gives lines as output, but I can assume that given the name of the tool.)
This is exactly the output which is required by the spec. You get what you want to know by running ocrd-tesserocr-segment-line -J:
{
  "executable": "ocrd-tesserocr-segment-line",
  "categories": [
    "Layout analysis"
  ],
  "description": "Segment page into regions with tesseract",
  "input_file_grp": [
    "OCR-D-SEG-BLOCK"
  ],
  "output_file_grp": [
    "OCR-D-SEG-LINE"
  ],
  "steps": [
    "layout/segmentation/line"
  ],
  "parameters": {}
}
Unfortunately, the current ocrd-tool.json confuses the descriptions of ocrd-tesserocr-segment-region and ocrd-tesserocr-segment-line. I will fix that soon!
(This behaviour is in contrast to ocrd-ocropy-segment which gives lines from images, without the regions, hence my confusion.)
That's because ocropy has no notion of regions, which is a problem with PAGE actually – we always have to define a "dummy" region (and cannot work with other region segmentations at the moment).
@kba could the cli wrapper use the json to augment --help?
Sure, documentation is one of the reasons why we have ocrd-tool.json. Should be straightforward to output the description (or a yet to be defined usage field) above the parameters.
I do like the fact that the help for the parameters is always the same though, for uniformity. If you know one tool, you should know them all.
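As a rough illustration of the idea, and not how the CLI wrapper in core does or would do it, the description can already be pulled out of the -J output and printed ahead of the uniform --help text. tool_description is a hypothetical helper; it is line-oriented and assumes pretty-printed JSON with one key per line, so a real JSON parser would be more robust.

```shell
#!/bin/sh
# Hypothetical helper: read a tool description (the JSON printed by -J) on
# stdin and print the value of the first "description" key.
# Line-oriented sketch: assumes pretty-printed JSON with one key per line.
tool_description() {
    sed -n 's/^[[:space:]]*"description":[[:space:]]*"\(.*\)",\{0,1\}[[:space:]]*$/\1/p' | head -n 1
}

# Intended use (left as a comment because it needs the OCR-D tools installed):
#   ocrd-tesserocr-segment-line -J | tool_description
#   ocrd-tesserocr-segment-line --help
```

Taking only the first match keeps the top-level description and skips any description fields that individual parameters might carry further down.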
@mikegerber Can we consider the original problem as solved?
Follow up issues opened in OCR-D/spec#115 and OCR-D/core#253
ocrd-tesserocr-segment-line does not give results for any of the files I tested. For example:
yields: