OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

ocrd-tesserocr-segment-line does not find any lines #47

Closed mikegerber closed 5 years ago

mikegerber commented 5 years ago

ocrd-tesserocr-segment-line does not give results for any of the files I tested. For example:

cd `mktemp -d`
wget https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/8d8aa287-94ca-48e3-84a8-1ee602871550/data/lohenstein_agrippina_1665.ocrd.zip
dtrx lohenstein_agrippina_1665.ocrd.zip
cd lohenstein_agrippina_1665.ocrd/data
ocrd-tesserocr-segment-line -l DEBUG -m mets.xml -I OCR-D-IMG -O OCR-D-SEG-LINE
cat OCR-D-SEG-LINE/OCR-D-SEG-LINE_0001

yields:

<?xml version="1.0" encoding="UTF-8"?>
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15">
    <pc:Metadata>
        <pc:Creator>OCR-D/core 1.0.0b9</pc:Creator>
        <pc:Created>2019-06-20T16:08:42.841929</pc:Created>
        <pc:LastChange>2019-06-20T16:08:42.841929</pc:LastChange>
    </pc:Metadata>
    <pc:Page imageFilename="OCR-D-IMG/OCR-D-IMG_0001" imageWidth="1214" imageHeight="1916"/>
</pc:PcGts>
% pip list | grep tesserocr     
ocrd-tesserocr             0.2.2       
tesserocr                  2.4.0

kba commented 5 years ago

IIRC it looks for lines only within regions. You need to run ocrd-tesserocr-segment-region first.
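
A minimal sketch of why the result is empty, assuming the line segmenter only iterates over `TextRegion` elements already present in its input PAGE file. The helper name `count_text_regions` is hypothetical and not part of ocrd_tesserocr; the sample document is a trimmed copy of the output reported above.

```python
import xml.etree.ElementTree as ET

# PAGE namespace exactly as it appears in the reported output.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15"}


def count_text_regions(page_xml: str) -> int:
    """Return the number of TextRegion elements anywhere in a PAGE document."""
    root = ET.fromstring(page_xml)
    return len(root.findall(".//pc:TextRegion", PAGE_NS))


# Trimmed copy of the reported result: a Page element with no regions inside.
EMPTY_PAGE = """<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15">
    <pc:Page imageFilename="OCR-D-IMG/OCR-D-IMG_0001" imageWidth="1214" imageHeight="1916"/>
</pc:PcGts>"""

# With zero regions, a segmenter that walks regions has nothing to split into lines.
print(count_text_regions(EMPTY_PAGE))  # prints 0
```

If the count is zero for a page, running region segmentation first is the likely missing step.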

mikegerber commented 5 years ago

Ah, that would explain it.

mikegerber commented 5 years ago

Segmenting lines from the regions doesn't work either. Am I using it correctly?

cd `mktemp -d`
wget https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/8d8aa287-94ca-48e3-84a8-1ee602871550/data/lohenstein_agrippina_1665.ocrd.zip
dtrx lohenstein_agrippina_1665.ocrd.zip
cd lohenstein_agrippina_1665.ocrd/data
ocrd-tesserocr-segment-region -l DEBUG -m mets.xml -I OCR-D-IMG    -O OCR-D-SEG-REGION
ocrd-tesserocr-segment-line   -l DEBUG -m mets.xml -I OCR-D-REGION -O OCR-D-SEG-LINE
cat OCR-D-SEG-LINE/OCR-D-SEG-LINE_0001

16:17:07.477 INFO root - Overriding log level globally to DEBUG
16:17:07.477 DEBUG ocrd.resolver - workspace_from_url
mets_url='file:///tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml'
baseurl='file:///tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data'
dst_dir='None'
16:17:07.477 DEBUG ocrd.resolver - Copying mets url 'file:///tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml' to '/tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml'
16:17:07.477 DEBUG ocrd.resolver - Target and source mets are identical
16:17:07.477 DEBUG ocrd.processor - Running processor <class 'ocrd_tesserocr.segment_region.TesserocrSegmentRegion'>
16:17:07.478 DEBUG ocrd.processor - Processor instance <ocrd_tesserocr.segment_region.TesserocrSegmentRegion object at 0x7f6194e9c080> (ocrd-tesserocr-segment-region v0.1.3 doing layout/segmentation/region)
16:17:07.569 DEBUG processor.TesserocrSegmentRegion - Detecting regions with tesseract
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0000': 587,97 687,97 687,154 587,154
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0001': 277,146 1070,146 1070,225 277,225
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0002': 199,219 823,219 823,269 199,269
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0003': 200,263 744,263 744,307 200,307
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0004': 199,338 874,338 874,389 199,389
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0005': 162,386 790,386 790,602 162,602
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0006': 203,569 485,569 485,606 203,606
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0007': 199,578 1040,578 1040,744 199,744
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0008': 16,713 679,713 679,879 16,879
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0009': 205,861 908,861 908,921 205,921
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0010': 125,920 497,920 497,969 125,969
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0011': 207,951 852,951 852,1003 207,1003
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0012': 0,956 793,956 793,1246 0,1246
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0013': 44,1173 639,1173 639,1295 44,1295
16:17:07.858 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0014': 573,1280 692,1280 692,1329 573,1329
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0015': 44,1216 70,1216 70,1530 44,1530
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0016': 870,1253 1087,1253 1087,1319 870,1319
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0017': 191,1320 1091,1320 1091,1542 191,1542
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0018': 259,1524 1091,1524 1091,1585 259,1585
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0019': 0,1580 688,1580 688,1643 0,1643
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0020': 0,1599 1091,1599 1091,1757 0,1757
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0021': 801,1646 1097,1646 1097,1697 801,1697
16:17:07.859 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0022': 1171,1532 1213,1532 1213,1572 1171,1572
16:17:07.859 DEBUG ocrd.workspace - outputfile file_grp=OCR-D-SEG-REGION local_filename=OCR-D-SEG-REGION/OCR-D-SEG-REGION_0001 content=True
16:17:07.861 DEBUG processor.TesserocrSegmentRegion - Detecting regions with tesseract
16:17:08.246 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0000': 92,144 1145,144 1145,732 92,732
16:17:08.246 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0001': 133,731 1079,731 1079,976 133,976
16:17:08.247 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0002': 45,972 1097,972 1097,1690 45,1690
16:17:08.247 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0003': 81,1680 1066,1680 1066,1776 81,1776
16:17:08.247 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0004': 540,1820 550,1820 550,1827 540,1827
16:17:08.247 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0005': 932,1864 1211,1864 1211,1898 932,1898
16:17:08.247 DEBUG ocrd.workspace - outputfile file_grp=OCR-D-SEG-REGION local_filename=OCR-D-SEG-REGION/OCR-D-SEG-REGION_0002 content=True
16:17:08.248 DEBUG processor.TesserocrSegmentRegion - Detecting regions with tesseract
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0000': 0,114 1094,114 1094,520 0,520
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0001': 0,503 1080,503 1080,728 0,728
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0002': 142,765 1025,765 1025,857 142,857
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0003': 0,722 1084,722 1084,989 0,989
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0004': 494,943 956,943 956,974 494,974
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0005': 947,967 956,967 956,979 947,979
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0006': 82,954 1059,954 1059,1025 82,1025
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0007': 0,1002 1074,1002 1074,1449 0,1449
16:17:08.566 DEBUG processor.TesserocrSegmentRegion - Detected region 'region0008': 0,1464 1096,1464 1096,1774 0,1774
16:17:08.566 DEBUG ocrd.workspace - outputfile file_grp=OCR-D-SEG-REGION local_filename=OCR-D-SEG-REGION/OCR-D-SEG-REGION_0003 content=True
16:17:08.570 INFO ocrd.workspace - Saving mets '/tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml'
16:17:09.047 INFO root - Overriding log level globally to DEBUG
16:17:09.048 DEBUG ocrd.resolver - workspace_from_url
mets_url='file:///tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml'
baseurl='file:///tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data'
dst_dir='None'
16:17:09.048 DEBUG ocrd.resolver - Copying mets url 'file:///tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml' to '/tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml'
16:17:09.048 DEBUG ocrd.resolver - Target and source mets are identical
16:17:09.048 DEBUG ocrd.processor - Running processor <class 'ocrd_tesserocr.segment_line.TesserocrSegmentLine'>
16:17:09.048 DEBUG ocrd.processor - Processor instance <ocrd_tesserocr.segment_line.TesserocrSegmentLine object at 0x7f8fce35b5f8> (ocrd-tesserocr-segment-line v0.1.3 doing layout/segmentation/line)
16:17:09.132 INFO ocrd.workspace - Saving mets '/tmp/tmp.Vv00uNRUb9/lohenstein_agrippina_1665.ocrd/data/mets.xml'
cat: OCR-D-SEG-LINE/OCR-D-SEG-LINE_0001: No such file or directory

mikegerber commented 5 years ago

My bad:

ocrd-tesserocr-segment-line   -l DEBUG -m mets.xml -I OCR-D-REGION -O OCR-D-SEG-LINE

should have read:

ocrd-tesserocr-segment-line   -l DEBUG -m mets.xml -I OCR-D-SEG-REGION -O OCR-D-SEG-LINE

This gives results! :tada:

Should the first call, using a non-existent file group, have given an error message?

kba commented 5 years ago

> Should the first call, using a non-existent file group, have given an error message?

For input, yes. Issue in core?
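
A rough sketch of what such an input check could look like, assuming the file groups are the `mets:fileGrp` elements of the METS file, identified by their `USE` attribute. The functions `file_groups` and `require_file_group` and the sample METS are made up for illustration; this is not how OCR-D/core implements it.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}


def file_groups(mets_xml: str) -> set:
    """Collect the USE attribute of every mets:fileGrp in the document."""
    root = ET.fromstring(mets_xml)
    groups = root.findall(".//mets:fileGrp", METS_NS)
    return {grp.get("USE") for grp in groups if grp.get("USE")}


def require_file_group(mets_xml: str, name: str) -> None:
    """Raise ValueError if the requested input file group is missing."""
    available = file_groups(mets_xml)
    if name not in available:
        raise ValueError(
            f"input file group {name!r} not found in METS; "
            f"available: {sorted(available)}"
        )


# Hypothetical, heavily reduced METS with the two groups from this thread.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG"/>
    <mets:fileGrp USE="OCR-D-SEG-REGION"/>
  </mets:fileSec>
</mets:mets>"""

require_file_group(SAMPLE_METS, "OCR-D-SEG-REGION")  # passes silently
```

Calling `require_file_group(SAMPLE_METS, "OCR-D-REGION")` raises a `ValueError` that lists the available groups, which would have made the typo obvious.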

mikegerber commented 5 years ago

So, for ocrd-tesserocr-segment-line the issue is:

% ocrd-tesserocr-segment-line --help     
Usage: ocrd-tesserocr-segment-line [OPTIONS]

Options:
  -V, --version                   Show version
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  -J, --dump-json                 Dump tool description as JSON and exit
  -p, --parameter PATH
  -g, --page-id TEXT              ID(s) of the pages to process
  -O, --output-file-grp TEXT      File group(s) used as output.
  -I, --input-file-grp TEXT       File group(s) used as input.
  -w, --working-dir TEXT          Working Directory
  -m, --mets TEXT                 METS URL to validate
  --help                          Show this message and exit.

This gives no description and no hint that I need to give it regions as input (nor that it produces lines as output, though I can assume that from the name of the tool). :thinking:

(This behaviour is in contrast to ocrd-ocropy-segment which gives lines from images, without the regions, hence my confusion.)

bertsky commented 5 years ago

> This gives no description and no hint that I need to give it regions as input (nor that it produces lines as output, though I can assume that from the name of the tool).

This is exactly the output which is required by the spec. You get what you want to know by running ocrd-tesserocr-segment-line -J:

{
 "executable": "ocrd-tesserocr-segment-line",
 "categories": [
  "Layout analysis"
 ],
 "description": "Segment page into regions with tesseract",
 "input_file_grp": [
  "OCR-D-SEG-BLOCK"
 ],
 "output_file_grp": [
  "OCR-D-SEG-LINE"
 ],
 "steps": [
  "layout/segmentation/line"
 ],
 "parameters": {}
}

Unfortunately, the current ocrd-tool.json mixes up the descriptions of ocrd-tesserocr-segment-region and ocrd-tesserocr-segment-line. I will fix that soon!

> (This behaviour is in contrast to ocrd-ocropy-segment which gives lines from images, without the regions, hence my confusion.)

That's because ocropy has no notion of regions, which is a problem with PAGE actually – we always have to define a "dummy" region (and cannot work with other region segmentations at the moment).

mikegerber commented 5 years ago

@kba could the cli wrapper use the json to augment --help?

kba commented 5 years ago

> @kba could the cli wrapper use the json to augment --help?

Sure, documentation is one of the reasons why we have ocrd-tool.json. It should be straightforward to output the description (or a yet-to-be-defined usage field) above the parameters.

I do like the fact that the help for the parameters is always the same though, for uniformity. If you know one tool, you should know them all.
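
The idea of printing the tool description above the uniform option list could be sketched roughly as follows. The function `help_preamble` is hypothetical and not part of OCR-D/core; the JSON is the `-J` output quoted earlier in this thread, including its currently mixed-up description.

```python
import json

# Tool description as dumped by `ocrd-tesserocr-segment-line -J` in this thread.
TOOL_JSON = """{
 "executable": "ocrd-tesserocr-segment-line",
 "categories": ["Layout analysis"],
 "description": "Segment page into regions with tesseract",
 "input_file_grp": ["OCR-D-SEG-BLOCK"],
 "output_file_grp": ["OCR-D-SEG-LINE"],
 "steps": ["layout/segmentation/line"],
 "parameters": {}
}"""


def help_preamble(tool: dict) -> str:
    """Build a short text block to print above the generic option list."""
    lines = [
        f"{tool['executable']}: {tool.get('description', '(no description)')}",
        "Input file group(s):  " + ", ".join(tool.get("input_file_grp", [])),
        "Output file group(s): " + ", ".join(tool.get("output_file_grp", [])),
    ]
    return "\n".join(lines)


print(help_preamble(json.loads(TOOL_JSON)))
```

Because only a preamble is generated, the option list itself stays identical across tools.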

wrznr commented 5 years ago

@mikegerber Can we consider the original problem as solved?

kba commented 5 years ago

Follow-up issues opened in OCR-D/spec#115 and OCR-D/core#253.