OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

RFC: Make workspace cloning more robust #429

Open stweil opened 4 years ago

stweil commented 4 years ago

Currently, cloning a workspace with ocrd workspace clone --download aborts if some files cannot be downloaded.

It would help if, instead of aborting, the download finished all the other files.

Example: http://gei-digital.gei.de/viewer/metsresolver?id=PPN1024726142. Obviously the TIFF images are available only locally, not for download over the Internet.

kba commented 4 years ago

It would help if, instead of aborting, the download finished all the other files.

Yes, and it would be in line with the recent change for mets.xml (skip instead of raise).
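A minimal sketch of what "skip instead of raise" could look like for file downloads. This is not the actual ocrd implementation: download_all and the fetch callable are hypothetical names, and the real code may structure this differently.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def download_all(
    urls: Iterable[str],
    fetch: Callable[[str], None],
) -> Tuple[List[str], Dict[str, str]]:
    """Try every URL; record failures instead of aborting on the first one.

    `fetch` is a hypothetical callable that downloads one URL and raises
    on failure. It stands in for whatever the real downloader does.
    """
    downloaded: List[str] = []
    skipped: Dict[str, str] = {}
    for url in urls:
        try:
            fetch(url)
        except Exception as exc:
            # Skip instead of raise: remember why it failed and move on.
            skipped[url] = str(exc)
            continue
        downloaded.append(url)
    return downloaded, skipped
```

The caller can then log or report the skipped entries once all the other files are finished, rather than losing the whole clone to a single unreachable file.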

kba commented 4 years ago

http://gei-digital.gei.de/viewer/metsresolver?id=PPN1024726142

Do you have another example? I cannot reach that one.

stweil commented 4 years ago

Nor can I. That looks like a temporary failure of the GEI website. So either wait, or look for other Intranda libraries - they might all be similar. I could also provide a local copy, but that will not help much as long as that website is down.

stweil commented 4 years ago

Here is an extract with one of the entries which cause a fatal exception:

<mets:file ID="FILE_0028_PRESENTATION" MIMETYPE="image/tiff">
  <mets:FLocat LOCTYPE="URL" xlink:href="file:///opt/digiverso/viewer/tiff/PPN1024726142/00000028.tif"/>
</mets:file>

It usually does not make sense to attempt a download for a file: URL, so such URLs could simply be copied as-is, even when a download was requested.
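A rough sketch of that idea, using only the Python standard library: split the FLocat references into those worth downloading (http/https) and those to keep as-is (file: and anything else). The function names is_remote and plan_downloads are hypothetical, not part of ocrd, and the mets:fileGrp wrapper with namespace declarations is added here only so the extract above parses on its own.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

METS_FLOCAT = "{http://www.loc.gov/METS/}FLocat"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# The extract from above, wrapped in a fileGrp that declares the namespaces
# (the wrapper is an assumption made for this example).
SAMPLE = """\
<mets:fileGrp xmlns:mets="http://www.loc.gov/METS/"
              xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:file ID="FILE_0028_PRESENTATION" MIMETYPE="image/tiff">
    <mets:FLocat LOCTYPE="URL" xlink:href="file:///opt/digiverso/viewer/tiff/PPN1024726142/00000028.tif"/>
  </mets:file>
</mets:fileGrp>
"""


def is_remote(url):
    """Treat only http(s) URLs as downloadable."""
    return urlsplit(url).scheme in ("http", "https")


def plan_downloads(mets_xml):
    """Return (to_download, keep_as_is) lists of FLocat hrefs."""
    root = ET.fromstring(mets_xml)
    to_download, keep_as_is = [], []
    for flocat in root.iter(METS_FLOCAT):
        href = flocat.get(XLINK_HREF)
        if href is None:
            continue  # no location given, nothing to decide
        if is_remote(href):
            to_download.append(href)
        else:
            keep_as_is.append(href)  # e.g. file: URLs are copied unchanged
    return to_download, keep_as_is


to_download, keep_as_is = plan_downloads(SAMPLE)
```

With the extract above, nothing is scheduled for download and the file: URL ends up in keep_as_is.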

stweil commented 4 years ago

GEI is online again. I tried several of their METS files, and they all include references to local files, which of course cannot be cloned.

kba commented 4 years ago

OK, I can reproduce the problem; it is on my TODO list.