OCR-D / spec

Specification of the @OCR-D technical architecture, interface definitions and data exchange format(s)
https://ocr-d.de/en/spec/

ocrd_tool: allow object for path_in_archive of resources #235

Open kba opened 1 year ago

kba commented 1 year ago

While debugging bertsky/ocrd_detectron2#14, I realized that my assumption that every archive contains only a single resource was wrong. Each detectron2 model consists of a PyTorch network and a YAML description, both shipped in the same archive. At the moment, this requires two nearly identical resource entries and downloads the same archive twice.

With this change (and a corresponding implementation in core), it would be possible to simplify

- description: DocBank via LayoutLM X101-FPN config
  name: DocBank_X101.yaml
  type: archive
  path_in_archive: X101/X101.yaml
  size: 526
  url: https://layoutlm.blob.core.windows.net/docbank/model_zoo/X101.zip
- description: DocBank via LayoutLM X101-FPN config
  name: DocBank_X101.pth
  type: archive
  path_in_archive: X101/model.pth
  size: 835606605
  url: https://layoutlm.blob.core.windows.net/docbank/model_zoo/X101.zip

to

- description: DocBank via LayoutLM X101-FPN config
  name: DocBank_X101.pth
  type: archive
  path_in_archive:
    DocBank_X101.pth: X101/model.pth
    DocBank_X101.yaml: X101/X101.yaml
  size: 783884362
  url: https://layoutlm.blob.core.windows.net/docbank/model_zoo/X101.zip
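
A minimal sketch of how a consumer might accept both forms, assuming the object maps target file names to paths inside the archive, as in the example above. The helper names `normalize_path_in_archive` and `extract_resource` are hypothetical and not taken from OCR-D core. The sketch only handles regular files in ZIP archives.

```python
import zipfile
from pathlib import Path
from typing import Dict, List


def normalize_path_in_archive(resource: dict) -> Dict[str, str]:
    """Return a mapping {target file name: path inside the archive}."""
    path_in_archive = resource["path_in_archive"]
    if isinstance(path_in_archive, str):
        # Current form: one archive member, stored under the resource's name.
        return {resource["name"]: path_in_archive}
    # Proposed form: several archive members, each with its own target name.
    return dict(path_in_archive)


def extract_resource(archive: zipfile.ZipFile, resource: dict, destdir: Path) -> List[Path]:
    """Copy every mapped archive member into destdir under its target name."""
    written = []
    for target_name, member in normalize_path_in_archive(resource).items():
        target = destdir / target_name
        target.write_bytes(archive.read(member))
        written.append(target)
    return written
```

Normalizing the string form to a one-entry mapping keeps existing resource entries valid, so the archive is opened once no matter how many files are taken from it.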

Also, the progress bar would work again this way, because the size attribute would always refer to the archive, not to the file or folder inside the archive.
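
To illustrate the progress bar point, here is a small sketch that assumes progress is computed as bytes received divided by the resource's size. The function `download_progress` is a hypothetical helper, not the implementation in core.

```python
def download_progress(bytes_received: int, resource: dict) -> float:
    """Fraction of the download completed, clamped to the range [0, 1]."""
    total = resource["size"]
    if total <= 0:
        # Unknown or invalid size: no meaningful progress can be reported.
        return 0.0
    return min(bytes_received / total, 1.0)
```

With the current YAML entry (size: 526), such a calculation would report completion after the first 526 bytes of the archive download. With size set to the archive size, the fraction tracks the actual transfer.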