OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

ocrd zip bag file group inclusion/exclusion flags are broken (v2.65.0) #1224

Open MehmedGIT opened 1 month ago

MehmedGIT commented 1 month ago

I have tried to include only the DEFAULT and the OCR-D-OCR file groups in the zip bag. The error triggered says that the OCR-D-BINPAGE/FILE_0001_OCR-D-BINPAGE.xml file does not exist.

There are potentially two bugs: 1) the file itself exists but is not found, and 2) the check is performed although it should not be, since that file group was excluded.

ocrd zip bag -d /vd18_data/PPN689276648_39pages -m /vd18_data/PPN689276648_39pages/mets.xml -i PPN689276648 -q DEFAULT -q OCR-D-OCR -j 8

```
mm@MM-Notebook:/vd18_data$ ocrd zip bag -d /vd18_data/PPN689276648_39pages -m /vd18_data/PPN689276648_39pages/mets.xml -i PPN689276648 -q DEFAULT -q OCR-D-OCR -j 8
13:36:01.006 INFO ocrd.workspace_bagger - Bagging /vd18_data/PPN689276648_39pages to /vd18_data/PPN689276648_39pages.ocrd.zip (temp dir /tmp/ocrd-bagit-za5mu642)
13:36:01.007 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.008 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.008 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.008 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.008 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.009 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.009 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.010 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.010 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.010 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.011 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.011 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.011 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.012 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.012 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.012 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.013 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.013 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.013 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.014 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.014 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.014 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.015 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.015 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.015 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.016 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.016 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.016 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.017 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.017 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.017 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.017 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.018 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.018 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.018 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.019 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.019 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.019 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.020 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.021 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.021 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.022 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.022 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.022 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.022 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.023 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.023 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.023 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.023 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.024 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.024 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.024 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.024 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.025 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.025 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.025 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.025 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.025 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.026 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.026 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.026 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.026 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.027 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.027 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.027 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.027 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.028 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.028 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.028 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.028 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.028 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.029 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.029 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.029 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.029 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.030 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.030 INFO ocrd.workspace_bagger - Bagging OcrdFile
13:36:01.030 INFO ocrd.workspace_bagger - Bagging OcrdFile
Traceback (most recent call last):
  File "/home/mm/venv38-all/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/home/mm/venv38-all/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mm/venv38-all/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mm/venv38-all/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mm/venv38-all/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mm/venv38-all/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mm/venv38-all/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mm/repos/core/build/__editable__.ocrd-2.65.0-py3-none-any/ocrd/cli/zip.py", line 56, in bag
    workspace_bagger.bag(
  File "/home/mm/repos/core/build/__editable__.ocrd-2.65.0-py3-none-any/ocrd/workspace_bagger.py", line 181, in bag
    total_bytes, total_files = self._bag_mets_files(workspace, bagdir, ocrd_mets, processes, include_fileGrp, exclude_fileGrp)
  File "/home/mm/repos/core/build/__editable__.ocrd-2.65.0-py3-none-any/ocrd/workspace_bagger.py", line 98, in _bag_mets_files
    pcgts = page_from_file(page_file)
  File "/home/mm/repos/core/build/__editable__.ocrd-2.65.0-py3-none-any/ocrd_modelfactory/__init__.py", line 103, in page_from_file
    raise FileNotFoundError("File not found: '%s' (%s)" % (input_file.local_filename, input_file))
FileNotFoundError: File not found: 'OCR-D-BINPAGE/FILE_0001_OCR-D-BINPAGE.xml' ( )
```
Content of the directory:

```
mm@MM-Notebook:/vd18_data$ ls -la ./PPN689276648_39pages/
total 1144
drwxrwxr-x 13 mm mm    4096 Mai 16 15:43 .
drwxr-xr-x 18 mm mm    4096 Mai 21 13:15 ..
drwxrwxr-x  2 mm mm    4096 Mai 16 12:13 DEFAULT
-rw-rw-r--  1 mm mm 1002007 Mai 21 13:33 mets.xml
drwxrwxr-x  2 mm mm    4096 Mai 16 15:42 OCR-D-BINPAGE
drwxrwxr-x  2 mm mm   12288 Mai 16 15:42 OCR-D-CLIP
drwxrwxr-x  2 mm mm    4096 Mai 16 15:42 OCR-D-DENOISE-OCROPY
drwxrwxr-x  2 mm mm    4096 Mai 16 15:42 OCR-D-DESKEW-OCROPY
drwxrwxr-x  2 mm mm  106496 Mai 16 15:43 OCR-D-DEWARP
-rw-rw-r--  1 mm mm     555 Mai 16 15:43 ocrd.log
drwxrwxr-x  2 mm mm    4096 Mai 16 15:43 OCR-D-OCR
drwxrwxr-x  2 mm mm    4096 Mai 16 15:42 OCR-D-SEG-BLOCK-TESSERACT
drwxrwxr-x  2 mm mm    4096 Mai 16 15:42 OCR-D-SEGMENT-OCROPY
drwxrwxr-x  2 mm mm    4096 Mai 16 15:42 OCR-D-SEGMENT-REPAIR
drwxrwxr-x  2 mm mm    4096 Mai 16 15:42 OCR-D-SEG-PAGE-ANYOCR
mm@MM-Notebook:/vd18_data$ ls ./PPN689276648_39pages/OCR-D-BINPAGE/
FILE_0001_OCR-D-BINPAGE.IMG-BIN.png  FILE_0009_OCR-D-BINPAGE.IMG-BIN.png  FILE_0017_OCR-D-BINPAGE.IMG-BIN.png  FILE_0025_OCR-D-BINPAGE.IMG-BIN.png  FILE_0033_OCR-D-BINPAGE.IMG-BIN.png
FILE_0001_OCR-D-BINPAGE.xml          FILE_0009_OCR-D-BINPAGE.xml          FILE_0017_OCR-D-BINPAGE.xml          FILE_0025_OCR-D-BINPAGE.xml          FILE_0033_OCR-D-BINPAGE.xml
FILE_0002_OCR-D-BINPAGE.IMG-BIN.png  FILE_0010_OCR-D-BINPAGE.IMG-BIN.png  FILE_0018_OCR-D-BINPAGE.IMG-BIN.png  FILE_0026_OCR-D-BINPAGE.IMG-BIN.png  FILE_0034_OCR-D-BINPAGE.IMG-BIN.png
FILE_0002_OCR-D-BINPAGE.xml          FILE_0010_OCR-D-BINPAGE.xml          FILE_0018_OCR-D-BINPAGE.xml          FILE_0026_OCR-D-BINPAGE.xml          FILE_0034_OCR-D-BINPAGE.xml
FILE_0003_OCR-D-BINPAGE.IMG-BIN.png  FILE_0011_OCR-D-BINPAGE.IMG-BIN.png  FILE_0019_OCR-D-BINPAGE.IMG-BIN.png  FILE_0027_OCR-D-BINPAGE.IMG-BIN.png  FILE_0035_OCR-D-BINPAGE.IMG-BIN.png
FILE_0003_OCR-D-BINPAGE.xml          FILE_0011_OCR-D-BINPAGE.xml          FILE_0019_OCR-D-BINPAGE.xml          FILE_0027_OCR-D-BINPAGE.xml          FILE_0035_OCR-D-BINPAGE.xml
FILE_0004_OCR-D-BINPAGE.IMG-BIN.png  FILE_0012_OCR-D-BINPAGE.IMG-BIN.png  FILE_0020_OCR-D-BINPAGE.IMG-BIN.png  FILE_0028_OCR-D-BINPAGE.IMG-BIN.png  FILE_0036_OCR-D-BINPAGE.IMG-BIN.png
FILE_0004_OCR-D-BINPAGE.xml          FILE_0012_OCR-D-BINPAGE.xml          FILE_0020_OCR-D-BINPAGE.xml          FILE_0028_OCR-D-BINPAGE.xml          FILE_0036_OCR-D-BINPAGE.xml
FILE_0005_OCR-D-BINPAGE.IMG-BIN.png  FILE_0013_OCR-D-BINPAGE.IMG-BIN.png  FILE_0021_OCR-D-BINPAGE.IMG-BIN.png  FILE_0029_OCR-D-BINPAGE.IMG-BIN.png  FILE_0037_OCR-D-BINPAGE.IMG-BIN.png
FILE_0005_OCR-D-BINPAGE.xml          FILE_0013_OCR-D-BINPAGE.xml          FILE_0021_OCR-D-BINPAGE.xml          FILE_0029_OCR-D-BINPAGE.xml          FILE_0037_OCR-D-BINPAGE.xml
FILE_0006_OCR-D-BINPAGE.IMG-BIN.png  FILE_0014_OCR-D-BINPAGE.IMG-BIN.png  FILE_0022_OCR-D-BINPAGE.IMG-BIN.png  FILE_0030_OCR-D-BINPAGE.IMG-BIN.png  FILE_0038_OCR-D-BINPAGE.IMG-BIN.png
FILE_0006_OCR-D-BINPAGE.xml          FILE_0014_OCR-D-BINPAGE.xml          FILE_0022_OCR-D-BINPAGE.xml          FILE_0030_OCR-D-BINPAGE.xml          FILE_0038_OCR-D-BINPAGE.xml
FILE_0007_OCR-D-BINPAGE.IMG-BIN.png  FILE_0015_OCR-D-BINPAGE.IMG-BIN.png  FILE_0023_OCR-D-BINPAGE.IMG-BIN.png  FILE_0031_OCR-D-BINPAGE.IMG-BIN.png  FILE_0039_OCR-D-BINPAGE.IMG-BIN.png
FILE_0007_OCR-D-BINPAGE.xml          FILE_0015_OCR-D-BINPAGE.xml          FILE_0023_OCR-D-BINPAGE.xml          FILE_0031_OCR-D-BINPAGE.xml          FILE_0039_OCR-D-BINPAGE.xml
FILE_0008_OCR-D-BINPAGE.IMG-BIN.png  FILE_0016_OCR-D-BINPAGE.IMG-BIN.png  FILE_0024_OCR-D-BINPAGE.IMG-BIN.png  FILE_0032_OCR-D-BINPAGE.IMG-BIN.png
FILE_0008_OCR-D-BINPAGE.xml          FILE_0016_OCR-D-BINPAGE.xml          FILE_0024_OCR-D-BINPAGE.xml          FILE_0032_OCR-D-BINPAGE.xml
```

The ocrd workspace list-group command correctly lists the existing file groups:

mm@MM-Notebook:/vd18_data/PPN689276648_39pages$ ocrd workspace list-group
PRESENTATION
MIN
MAX
DEFAULT
THUMBS
OCR-D-BINPAGE
OCR-D-SEG-PAGE-ANYOCR
OCR-D-DENOISE-OCROPY
OCR-D-DESKEW-OCROPY
OCR-D-SEG-BLOCK-TESSERACT
OCR-D-SEGMENT-REPAIR
OCR-D-CLIP
OCR-D-SEGMENT-OCROPY
OCR-D-DEWARP
OCR-D-OCR

I have also tried the reverse, excluding every group I do not want, but I still get the same error output.

ocrd zip bag -d /vd18_data/PPN689276648_39pages -m /vd18_data/PPN689276648_39pages/mets.xml -i PPN689276648 -Q MIN -Q MAX -Q PRESENTATION -Q THUMBS -Q OCR-D-BINPAGE -Q OCR-D-DENOISE-OCROPY -Q OCR-D-DEWARP -Q OCR-D-SEGMENT-OCROPY -Q OCR-D-SEG-PAGE-ANYOCR -Q OCR-D-CLIP -Q OCR-D-DESKEW-OCROPY -Q OCR-D-SEG-BLOCK-TESSERACT -Q OCR-D-SEGMENT-REPAIR -j 8

The more interesting part is that if I exclude only the file groups that do not yet exist on the local file system (i.e., MIN, MAX, THUMBS or PRESENTATION), it works just fine and the created zip bag is correct.
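For reference, the selection behaviour expected from the -q (include) and -Q (exclude) flags can be sketched in Python as below. This is only an illustration of the intended semantics, not the code in ocrd.workspace_bagger: is_selected and select_files are hypothetical helpers, and the DEFAULT and OCR-D-OCR file names are made up for the example.

```python
from typing import List, Optional, Sequence, Tuple


def is_selected(file_grp: str,
                include: Optional[Sequence[str]] = None,
                exclude: Optional[Sequence[str]] = None) -> bool:
    """Return True if files of `file_grp` should go into the bag.

    Hypothetical helper: an empty/None `include` means "everything",
    and `exclude` always wins over `include`.
    """
    if include and file_grp not in include:
        return False
    if exclude and file_grp in exclude:
        return False
    return True


def select_files(files: Sequence[Tuple[str, str]],
                 include: Optional[Sequence[str]] = None,
                 exclude: Optional[Sequence[str]] = None) -> List[Tuple[str, str]]:
    """Filter (file_grp, path) pairs before any file is opened or parsed."""
    return [(grp, path) for grp, path in files
            if is_selected(grp, include, exclude)]


# Illustrative entries only; just the OCR-D-BINPAGE path is taken from the report.
files = [
    ("DEFAULT", "DEFAULT/FILE_0001_DEFAULT.jpg"),
    ("OCR-D-BINPAGE", "OCR-D-BINPAGE/FILE_0001_OCR-D-BINPAGE.xml"),
    ("OCR-D-OCR", "OCR-D-OCR/FILE_0001_OCR-D-OCR.xml"),
]
selected = select_files(files, include=["DEFAULT", "OCR-D-OCR"])
print([grp for grp, _ in selected])
```

Under these semantics, no later step (such as parsing PAGE files) would ever see OCR-D-BINPAGE/FILE_0001_OCR-D-BINPAGE.xml when only DEFAULT and OCR-D-OCR are included.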

To reproduce - PPN689276648_39pages.zip

I will investigate and report more if I can detect where it goes wrong in the code.

bertsky commented 1 month ago

Could this be related to https://github.com/OCR-D/core/issues/1149 (as internally, the bagger also just uses Resolver.download_to_directory as does clone/workspace_from_url)?

MehmedGIT commented 1 month ago

Not sure yet.

joschrew commented 1 month ago

I guess the problem is the METS file. When you exclude file groups, the corresponding files are still referenced in the METS, so you get an error when the code iterates over the METS. I think that when excluding, the METS should be regenerated from everything that is to be included; this does not seem to be done. So as a kind of "workaround", the unwanted file groups could be deleted before bagging instead of excluding them when bagging. This might not be a good workaround, though, because you cannot simply bag parts of a workspace and keep the rest.
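The idea of regenerating the METS from only the included file groups could look roughly like the following sketch. It uses only the standard library rather than the OCR-D API, assumes a flat (non-nested) mets:fileSec, and the function name prune_file_groups as well as the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
METS = "{http://www.loc.gov/METS/}"

# Tiny made-up METS document with two file groups and one physical page.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="DEFAULT">
      <mets:file ID="FILE_0001_DEFAULT">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="DEFAULT/FILE_0001_DEFAULT.jpg"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-BINPAGE">
      <mets:file ID="FILE_0001_OCR-D-BINPAGE">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="OCR-D-BINPAGE/FILE_0001_OCR-D-BINPAGE.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001">
        <mets:fptr FILEID="FILE_0001_DEFAULT"/>
        <mets:fptr FILEID="FILE_0001_OCR-D-BINPAGE"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def prune_file_groups(root, keep):
    """Drop every top-level mets:fileGrp whose USE is not in `keep`,
    plus all mets:fptr elements that point at the dropped files.
    Returns the set of removed file IDs."""
    removed_ids = set()
    for file_sec in list(root.iter(METS + "fileSec")):
        for grp in list(file_sec.findall("mets:fileGrp", NS)):
            if grp.get("USE") not in keep:
                removed_ids.update(f.get("ID") for f in grp.iter(METS + "file"))
                file_sec.remove(grp)
    # Materialize the traversal first, since children are removed while looping.
    for parent in list(root.iter()):
        for fptr in list(parent.findall("mets:fptr", NS)):
            if fptr.get("FILEID") in removed_ids:
                parent.remove(fptr)
    return removed_ids


root = ET.fromstring(SAMPLE)
removed = prune_file_groups(root, keep={"DEFAULT", "OCR-D-OCR"})
remaining = [grp.get("USE") for grp in root.iter(METS + "fileGrp")]
print(remaining, sorted(removed))
```

A bagger working along these lines would write the pruned tree into the bag instead of the original METS, so that later iterations over the METS only encounter files that were actually copied.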

MehmedGIT commented 1 month ago

> So as a kind of "workaround", the unwanted file groups could be deleted before bagging instead of excluding them when bagging. This might not be a good workaround, though, because you cannot simply bag parts of a workspace and keep the rest.

That may be the only solution actually. Creating a zip bag with a mets file that contains local references to non-existing files in the zip itself (as a result of the exclusion) could cause more problems when the zip is extracted back.
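To make the concern concrete, a dangling-reference check on an extracted bag could be sketched as follows. This is a standard-library illustration rather than anything from OCR-D core; missing_local_files is a hypothetical helper, the sample METS is made up, and any href containing "://" is simply assumed to be remote.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Made-up METS: one local file that exists, one local file that was
# "excluded" from the bag, and one remote reference.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="MAX">
      <mets:file ID="FILE_0001_MAX">
        <mets:FLocat LOCTYPE="URL" xlink:href="https://example.org/MAX/0001.jpg"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="DEFAULT">
      <mets:file ID="FILE_0001_DEFAULT">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="DEFAULT/FILE_0001_DEFAULT.jpg"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-BINPAGE">
      <mets:file ID="FILE_0001_OCR-D-BINPAGE">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="OCR-D-BINPAGE/FILE_0001_OCR-D-BINPAGE.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def missing_local_files(mets_path):
    """Return the hrefs of mets:FLocat entries that look like local paths
    but do not exist relative to the directory of the METS file."""
    base = os.path.dirname(os.path.abspath(mets_path))
    missing = []
    for flocat in ET.parse(mets_path).getroot().iter(METS + "FLocat"):
        href = flocat.get(XLINK_HREF, "")
        if not href or "://" in href:  # skip empty and remote references
            continue
        if not os.path.exists(os.path.join(base, href)):
            missing.append(href)
    return missing


with tempfile.TemporaryDirectory() as tmp:
    # Simulate an extracted bag that contains only the DEFAULT group.
    os.makedirs(os.path.join(tmp, "DEFAULT"))
    open(os.path.join(tmp, "DEFAULT", "FILE_0001_DEFAULT.jpg"), "wb").close()
    mets_path = os.path.join(tmp, "mets.xml")
    with open(mets_path, "w", encoding="utf-8") as handle:
        handle.write(SAMPLE)
    dangling = missing_local_files(mets_path)

print(dangling)
```

A non-empty result here would correspond to the situation described above: the METS inside the zip still points at local files that were left out of the bag.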