Open MehmedGIT opened 5 months ago
Could this be related to https://github.com/OCR-D/core/issues/1149 (as internally, the bagger also just uses Resolver.download_to_directory as does clone/workspace_from_url)?
Not sure yet.
I guess the problem is the METS file. When you exclude file groups, the corresponding files are still present in the METS, and thus you get an error when iterating over the METS, which is what the code does. I think that when excluding, the METS should be regenerated from everything that is to be included. This does not seem to be done. So as a kind of "workaround", the unwanted file groups could be deleted before bagging, instead of excluding them when bagging. This might not be a good workaround though, because you cannot simply bag parts of a workspace and keep the rest.
> So as a kind of "workaround", the unwanted file groups could be deleted before bagging, instead of excluding them when bagging. This might not be a good workaround though, because you cannot simply bag parts of a workspace and keep the rest.
That may be the only solution actually. Creating a zip bag with a mets file that contains local references to non-existing files in the zip itself (as a result of the exclusion) could cause more problems when the zip is extracted back.
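To illustrate the "regenerate the METS from what is included" idea discussed above, here is a minimal sketch that works on a copy of the METS with the standard library only. It is not how the bagger in OCR-D/core is implemented; `prune_file_groups` is a hypothetical helper, and it assumes flat (non-nested) `mets:fileGrp` elements identified by their `USE` attribute.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
# Keep readable prefixes when serializing instead of ns0/ns1.
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def prune_file_groups(mets_xml, keep):
    """Return METS XML with every fileGrp whose USE is not in `keep` removed.

    Hypothetical helper: assumes flat fileGrp elements directly below fileSec.
    """
    root = ET.fromstring(mets_xml)
    removed_ids = set()

    # 1. Drop the excluded file groups and remember the IDs of their files.
    for file_sec in root.iter("{%s}fileSec" % METS_NS):
        for grp in list(file_sec.findall("{%s}fileGrp" % METS_NS)):
            if grp.get("USE") not in keep:
                for f in grp.iter("{%s}file" % METS_NS):
                    removed_ids.add(f.get("ID"))
                file_sec.remove(grp)

    # 2. Drop structMap pointers that would now dangle.
    #    list(...) so that removing children does not disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "{%s}fptr" % METS_NS and child.get("FILEID") in removed_ids:
                parent.remove(child)

    return ET.tostring(root, encoding="unicode")
```

With something like this applied to the METS copy inside the bag, the original workspace would stay untouched, and the zipped METS would no longer reference files that were left out.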
I have tried to include only the `DEFAULT` and the `OCR-D-OCR` file groups in the zip bag. The error triggered says that the `OCR-D-BINPAGE/FILE_0001_OCR-D-BINPAGE.xml` file does not exist.

There are potentially 2 bugs:
1) The file itself exists but is not found
2) The check is performed although it should not - since that file group was excluded.
ocrd zip bag -d /vd18_data/PPN689276648_39pages -m /vd18_data/PPN689276648_39pages/mets.xml -i PPN689276648 -q DEFAULT -q OCR-D-OCR -j 8
```c#
mm@MM-Notebook:/vd18_data$ ocrd zip bag -d /vd18_data/PPN689276648_39pages -m /vd18_data/PPN689276648_39pages/mets.xml -i PPN689276648 -q DEFAULT -q OCR-D-OCR -j 8
13:36:01.006 INFO ocrd.workspace_bagger - Bagging /vd18_data/PPN689276648_39pages to /vd18_data/PPN689276648_39pages.ocrd.zip (temp dir /tmp/ocrd-bagit-za5mu642)
13:36:01.007 INFO ocrd.workspace_bagger - Bagging OcrdFile
```

Content of the directory:

```c#
mm@MM-Notebook:/vd18_data$ ls -la ./PPN689276648_39pages/
total 1144
drwxrwxr-x 13 mm mm    4096 Mai 16 15:43 .
drwxr-xr-x 18 mm mm    4096 Mai 21 13:15 ..
drwxrwxr-x  2 mm mm    4096 Mai 16 12:13 DEFAULT
-rw-rw-r--  1 mm mm 1002007 Mai 21 13:33 mets.xml
drwxrwxr-x  2 mm mm    4096 Mai 16 15:42 OCR-D-BINPAGE
drwxrwxr-x  2 mm mm   12288 Mai 16 15:42 OCR-D-CLIP
drwxrwxr-x  2 mm mm    4096 Mai 16 15:42 OCR-D-DENOISE-OCROPY
drwxrwxr-x  2 mm mm    4096 Mai 16 15:42 OCR-D-DESKEW-OCROPY
drwxrwxr-x  2 mm mm  106496 Mai 16 15:43 OCR-D-DEWARP
-rw-rw-r--  1 mm mm     555 Mai 16 15:43 ocrd.log
drwxrwxr-x  2 mm mm    4096 Mai 16 15:43 OCR-D-OCR
drwxrwxr-x  2 mm mm    4096 Mai 16 15:42 OCR-D-SEG-BLOCK-TESSERACT
drwxrwxr-x  2 mm mm    4096 Mai 16 15:42 OCR-D-SEGMENT-OCROPY
drwxrwxr-x  2 mm mm    4096 Mai 16 15:42 OCR-D-SEGMENT-REPAIR
drwxrwxr-x  2 mm mm    4096 Mai 16 15:42 OCR-D-SEG-PAGE-ANYOCR
mm@MM-Notebook:/vd18_data$ ls ./PPN689276648_39pages/OCR-D-BINPAGE/
FILE_0001_OCR-D-BINPAGE.IMG-BIN.png  FILE_0009_OCR-D-BINPAGE.IMG-BIN.png  FILE_0017_OCR-D-BINPAGE.IMG-BIN.png  FILE_0025_OCR-D-BINPAGE.IMG-BIN.png  FILE_0033_OCR-D-BINPAGE.IMG-BIN.png
FILE_0001_OCR-D-BINPAGE.xml          FILE_0009_OCR-D-BINPAGE.xml          FILE_0017_OCR-D-BINPAGE.xml          FILE_0025_OCR-D-BINPAGE.xml          FILE_0033_OCR-D-BINPAGE.xml
FILE_0002_OCR-D-BINPAGE.IMG-BIN.png  FILE_0010_OCR-D-BINPAGE.IMG-BIN.png  FILE_0018_OCR-D-BINPAGE.IMG-BIN.png  FILE_0026_OCR-D-BINPAGE.IMG-BIN.png  FILE_0034_OCR-D-BINPAGE.IMG-BIN.png
FILE_0002_OCR-D-BINPAGE.xml          FILE_0010_OCR-D-BINPAGE.xml          FILE_0018_OCR-D-BINPAGE.xml          FILE_0026_OCR-D-BINPAGE.xml          FILE_0034_OCR-D-BINPAGE.xml
FILE_0003_OCR-D-BINPAGE.IMG-BIN.png  FILE_0011_OCR-D-BINPAGE.IMG-BIN.png  FILE_0019_OCR-D-BINPAGE.IMG-BIN.png  FILE_0027_OCR-D-BINPAGE.IMG-BIN.png  FILE_0035_OCR-D-BINPAGE.IMG-BIN.png
FILE_0003_OCR-D-BINPAGE.xml          FILE_0011_OCR-D-BINPAGE.xml          FILE_0019_OCR-D-BINPAGE.xml          FILE_0027_OCR-D-BINPAGE.xml          FILE_0035_OCR-D-BINPAGE.xml
FILE_0004_OCR-D-BINPAGE.IMG-BIN.png  FILE_0012_OCR-D-BINPAGE.IMG-BIN.png  FILE_0020_OCR-D-BINPAGE.IMG-BIN.png  FILE_0028_OCR-D-BINPAGE.IMG-BIN.png  FILE_0036_OCR-D-BINPAGE.IMG-BIN.png
FILE_0004_OCR-D-BINPAGE.xml          FILE_0012_OCR-D-BINPAGE.xml          FILE_0020_OCR-D-BINPAGE.xml          FILE_0028_OCR-D-BINPAGE.xml          FILE_0036_OCR-D-BINPAGE.xml
FILE_0005_OCR-D-BINPAGE.IMG-BIN.png  FILE_0013_OCR-D-BINPAGE.IMG-BIN.png  FILE_0021_OCR-D-BINPAGE.IMG-BIN.png  FILE_0029_OCR-D-BINPAGE.IMG-BIN.png  FILE_0037_OCR-D-BINPAGE.IMG-BIN.png
FILE_0005_OCR-D-BINPAGE.xml          FILE_0013_OCR-D-BINPAGE.xml          FILE_0021_OCR-D-BINPAGE.xml          FILE_0029_OCR-D-BINPAGE.xml          FILE_0037_OCR-D-BINPAGE.xml
FILE_0006_OCR-D-BINPAGE.IMG-BIN.png  FILE_0014_OCR-D-BINPAGE.IMG-BIN.png  FILE_0022_OCR-D-BINPAGE.IMG-BIN.png  FILE_0030_OCR-D-BINPAGE.IMG-BIN.png  FILE_0038_OCR-D-BINPAGE.IMG-BIN.png
FILE_0006_OCR-D-BINPAGE.xml          FILE_0014_OCR-D-BINPAGE.xml          FILE_0022_OCR-D-BINPAGE.xml          FILE_0030_OCR-D-BINPAGE.xml          FILE_0038_OCR-D-BINPAGE.xml
FILE_0007_OCR-D-BINPAGE.IMG-BIN.png  FILE_0015_OCR-D-BINPAGE.IMG-BIN.png  FILE_0023_OCR-D-BINPAGE.IMG-BIN.png  FILE_0031_OCR-D-BINPAGE.IMG-BIN.png  FILE_0039_OCR-D-BINPAGE.IMG-BIN.png
FILE_0007_OCR-D-BINPAGE.xml          FILE_0015_OCR-D-BINPAGE.xml          FILE_0023_OCR-D-BINPAGE.xml          FILE_0031_OCR-D-BINPAGE.xml          FILE_0039_OCR-D-BINPAGE.xml
FILE_0008_OCR-D-BINPAGE.IMG-BIN.png  FILE_0016_OCR-D-BINPAGE.IMG-BIN.png  FILE_0024_OCR-D-BINPAGE.IMG-BIN.png  FILE_0032_OCR-D-BINPAGE.IMG-BIN.png
FILE_0008_OCR-D-BINPAGE.xml          FILE_0016_OCR-D-BINPAGE.xml          FILE_0024_OCR-D-BINPAGE.xml          FILE_0032_OCR-D-BINPAGE.xml
```

The ocrd workspace correctly lists the existing file groups
I have also tried to do the reverse - exclude every group I do not want - but I still get the same error output.
The more interesting part is that if I exclude just the file groups that do not yet exist on the local file system (i.e., `MIN`, `MAX`, `THUMBS` or `PRESENTATION`), that works just fine and the created zip bag is correct.

To reproduce - PPN689276648_39pages.zip
I will investigate and report more if I can detect where it goes wrong in the code.
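To help tell the two potential bugs apart, a check that is independent of the bagger may be useful. The sketch below is only an assumption about how one might do that with the standard library: `missing_files_by_group` is a hypothetical helper that reads the METS directly, resolves local `xlink:href` values relative to the directory of `mets.xml`, skips remote URLs, and reports per file group which referenced files are absent on disk. It assumes flat `mets:fileGrp` elements.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

METS = "{http://www.loc.gov/METS/}"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def missing_files_by_group(mets_path, include=None):
    """Map each file group (USE) to the local hrefs that are missing on disk.

    Hypothetical helper. If `include` is given, only those groups are checked,
    which mirrors what one would expect when the other groups are excluded.
    """
    mets_path = Path(mets_path)
    base = mets_path.parent  # local hrefs are assumed relative to mets.xml
    missing = {}
    for grp in ET.parse(mets_path).getroot().iter(METS + "fileGrp"):
        use = grp.get("USE")
        if include is not None and use not in include:
            continue  # excluded group: do not check its files at all
        for flocat in grp.iter(METS + "FLocat"):
            href = flocat.get(XLINK_HREF, "")
            if "://" in href:
                continue  # remote reference, nothing to check locally
            if not (base / href).is_file():
                missing.setdefault(use, []).append(href)
    return missing
```

Run on the workspace from this issue, an entry for `OCR-D-BINPAGE` would suggest the file really cannot be resolved from the METS location (bug 1), while an empty result with `include={"DEFAULT", "OCR-D-OCR"}` would suggest the bagger checks groups it should be skipping (bug 2).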