OCR-D / quiver-benchmarks

Benchmarking OCR-D workflows in Docker
MIT License

Download of GT fails partially #17

Open stweil opened 10 months ago

stweil commented 10 months ago

The script scripts/prepare.sh downloads the GT for Reichsanzeiger, but fails to download the GT defined in data_srcs/default_data_sources.txt. Here is the output of bash -x scripts/prepare.sh:

+ mkdir -p gt
+ echo 'Prepare OCR-D Ground Truth …'
Prepare OCR-D Ground Truth …
+ IFS=
+ read -r URL
++ echo https://github.com/tboenig/16_frak_simple
++ cut -d/ -f4
+ OWNER=tboenig
++ echo https://github.com/tboenig/16_frak_simple
++ cut -d/ -f5
+ REPO=16_frak_simple
+ [[ ! -f gt/16_frak_simple.zip ]]
+ echo 'Downloading 16_frak_simple …'
Downloading 16_frak_simple …
++ curl -L -H 'Accept: application/vnd.github+json' -H 'X-GitHub-Api-Version: 2022-11-28' https://api.github.com/repos/tboenig/16_frak_simple/releases/latest
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9963  100  9963    0     0  48600      0 --:--:-- --:--:-- --:--:-- 48600
+ RESULT='{
  "url": "https://api.github.com/repos/tboenig/16_frak_simple/releases/126253523",
  "assets_url": "https://api.github.com/repos/tboenig/16_frak_simple/releases/126253523/assets",
  "upload_url": "https://uploads.github.com/repos/tboenig/16_frak_simple/releases/126253523/assets{?name,label}",
  "html_url": "https://github.com/tboenig/16_frak_simple/releases/tag/v1.1.1",
  "id": 126253523,
  "author": {
    "login": "github-actions[bot]",
    "id": 41898282,
    "node_id": "MDM6Qm90NDE4OTgyODI=",
    "avatar_url": "https://avatars.githubusercontent.com/in/15368?v=4",
    "gravatar_id": "",
    "url": "https://api.github.com/users/github-actions%5Bbot%5D",
    "html_url": "https://github.com/apps/github-actions",
    "followers_url": "https://api.github.com/users/github-actions%5Bbot%5D/followers",
    "following_url": "https://api.github.com/users/github-actions%5Bbot%5D/following{/other_user}",
    "gists_url": "https://api.github.com/users/github-actions%5Bbot%5D/gists{/gist_id}",
    "starred_url": "https://api.github.com/users/github-actions%5Bbot%5D/starred{/owner}{/repo}",
    "subscriptions_url": "https://api.github.com/users/github-actions%5Bbot%5D/subscriptions",
    "organizations_url": "https://api.github.com/users/github-actions%5Bbot%5D/orgs",
    "repos_url": "https://api.github.com/users/github-actions%5Bbot%5D/repos",
    "events_url": "https://api.github.com/users/github-actions%5Bbot%5D/events{/privacy}",
    "received_events_url": "https://api.github.com/users/github-actions%5Bbot%5D/received_events",
    "type": "Bot",
    "site_admin": false
  },
  "node_id": "RE_kwDOIFGkSM4HhnnT",
  "tag_name": "v1.1.1",
  "target_commitish": "main",
  "name": "Release 81_v1.1.1",
  "draft": false,
  "prerelease": false,
  "created_at": "2023-10-23T14:29:21Z",
  "published_at": "2023-10-23T14:30:58Z",
  "assets": [
    {
      "url": "https://api.github.com/repos/tboenig/16_frak_simple/releases/assets/131958447",
      "id": 131958447,
      "node_id": "RA_kwDOIFGkSM4H3Yav",
      "name": "kistler_kraeuter_1500.ocrd.zip",
      "label": "",
      "uploader": {
        "login": "github-actions[bot]",
        "id": 41898282,
        "node_id": "MDM6Qm90NDE4OTgyODI=",
        "avatar_url": "https://avatars.githubusercontent.com/in/15368?v=4",
        "gravatar_id": "",
        "url": "https://api.github.com/users/github-actions%5Bbot%5D",
        "html_url": "https://github.com/apps/github-actions",
        "followers_url": "https://api.github.com/users/github-actions%5Bbot%5D/followers",
        "following_url": "https://api.github.com/users/github-actions%5Bbot%5D/following{/other_user}",
        "gists_url": "https://api.github.com/users/github-actions%5Bbot%5D/gists{/gist_id}",
        "starred_url": "https://api.github.com/users/github-actions%5Bbot%5D/starred{/owner}{/repo}",
        "subscriptions_url": "https://api.github.com/users/github-actions%5Bbot%5D/subscriptions",
        "organizations_url": "https://api.github.com/users/github-actions%5Bbot%5D/orgs",
        "repos_url": "https://api.github.com/users/github-actions%5Bbot%5D/repos",
        "events_url": "https://api.github.com/users/github-actions%5Bbot%5D/events{/privacy}",
        "received_events_url": "https://api.github.com/users/github-actions%5Bbot%5D/received_events",
        "type": "Bot",
        "site_admin": false
      },
      "content_type": "application/zip",
      "state": "uploaded",
      "size": 19379136,
      "download_count": 14,
      "created_at": "2023-10-23T14:31:01Z",
      "updated_at": "2023-10-23T14:31:02Z",
      "browser_download_url": "https://github.com/tboenig/16_frak_simple/releases/download/v1.1.1/kistler_kraeuter_1500.ocrd.zip"
    },
    {
      "url": "https://api.github.com/repos/tboenig/16_frak_simple/releases/assets/131958445",
      "id": 131958445,
[...]
++ jq -r '.assets | .[].browser_download_url'
+ ZIP_URL='https://github.com/tboenig/16_frak_simple/releases/download/v1.1.1/kistler_kraeuter_1500.ocrd.zip
https://github.com/tboenig/16_frak_simple/releases/download/v1.1.1/luther_auszlegunge_1520.ocrd.zip
https://github.com/tboenig/16_frak_simple/releases/download/v1.1.1/metadata-v81.zip
https://github.com/tboenig/16_frak_simple/releases/download/v1.1.1/trota_mordtbrenner_1540.ocrd.zip'
+ curl -L -o gt/16_frak_simple.zip 'https://github.com/tboenig/16_frak_simple/releases/download/v1.1.1/kistler_kraeuter_1500.ocrd.zip
https://github.com/tboenig/16_frak_simple/releases/download/v1.1.1/luther_auszlegunge_1520.ocrd.zip
https://github.com/tboenig/16_frak_simple/releases/download/v1.1.1/metadata-v81.zip
https://github.com/tboenig/16_frak_simple/releases/download/v1.1.1/trota_mordtbrenner_1540.ocrd.zip'
curl: (3) URL using bad/illegal format or missing URL
[...]

So instead of calling curl once per URL, the script passes all four URLs, separated by newlines, to curl as a single argument. curl cannot parse that as one URL and aborts with error (3).
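One possible way to avoid this is to iterate over the URL list and call curl once per line. The following is only a minimal sketch, not the actual fix in scripts/prepare.sh: the function names asset_target and download_assets are hypothetical, and storing each asset under its own file name in gt/ (instead of gt/16_frak_simple.zip) is an assumption about what the later steps of the script would accept.

```shell
#!/usr/bin/env bash

# Hypothetical helper: derive the local target path from an asset URL,
# e.g. .../v1.1.1/metadata-v81.zip -> gt/metadata-v81.zip
asset_target() {
    local url=$1 dir=${2:-gt}
    printf '%s/%s\n' "$dir" "${url##*/}"
}

# Hypothetical helper: download every URL of a newline-separated list,
# calling curl once per URL instead of once for the whole list.
download_assets() {
    local urls=$1 dir=${2:-gt} url target
    mkdir -p "$dir"
    while IFS= read -r url; do
        # Skip empty lines (for example a release without assets).
        [[ -z $url ]] && continue
        target=$(asset_target "$url" "$dir")
        if [[ ! -f $target ]]; then
            echo "Downloading ${url##*/} …"
            curl -L --fail -o "$target" "$url"
        fi
    done <<< "$urls"
}
```

With the variables from the trace above, the call would be download_assets "$ZIP_URL" gt. Whether all assets are wanted (metadata-v81.zip is not an .ocrd.zip) is a separate decision that this sketch does not make.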

stweil commented 10 months ago

@tboenig, it looks like a change in the released GT causes the breakage: older releases contained a single zip file, for example bagitDump-v79.zip, while the latest release contains several zip files. The download script is not prepared to handle that case.
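Until the script supports both layouts, it could at least detect the multi-file case and report it clearly instead of handing the whole list to curl. The following is a rough sketch under that assumption; count_urls and release_layout are hypothetical names and not part of scripts/prepare.sh.

```shell
#!/usr/bin/env bash

# Hypothetical helper: count the non-empty lines of a newline-separated URL list.
count_urls() {
    local urls=$1 url n=0
    while IFS= read -r url; do
        [[ -n $url ]] && n=$((n + 1))
    done <<< "$urls"
    echo "$n"
}

# Hypothetical helper: classify a release by its number of asset URLs.
# Prints "none", "single" (old layout) or "multiple" (new layout).
release_layout() {
    case $(count_urls "$1") in
        0) echo none ;;
        1) echo single ;;
        *) echo multiple ;;
    esac
}
```

The script could then branch on $(release_layout "$ZIP_URL"): keep the existing single download for "single", and either loop over the URLs or stop with an explicit message for "multiple".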