bio-guoda / preston

a biodiversity dataset tracker

zenodo as remote fails to retrieve hash://md5/eb5e8f37583644943b86d1d9ebd4ded5 #266

Closed jhpoelen closed 9 months ago

jhpoelen commented 9 months ago

When using

preston cat --remote https://zenodo.org hash://md5/eb5e8f37583644943b86d1d9ebd4ded5

we'd expect to retrieve the content associated with

https://zenodo.org/records/4589980/files/figure.png

as made available via:

Bartonička, T., Bielik, A., & Řehák, Z. (2008). Fig. 4 in Roost Switching and Activity Patterns in the Soprano Pipistrelle,Pipistrellus pygmaeus, during Lactation. In Annales Zoologici Fennici (Vol. 45, Number 6, pp. 503–512). Zenodo. https://doi.org/10.5281/zenodo.4589980


However, we got the error below instead, and no content:

java.io.IOException: problem retrieving [hash://md5/eb5e8f37583644943b86d1d9ebd4ded5]
    at bio.guoda.preston.cmd.ContentQueryUtil.getContent(ContentQueryUtil.java:58)
    at bio.guoda.preston.cmd.ContentQueryUtil.copyContent(ContentQueryUtil.java:32)
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:69)
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:49)
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:44)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at bio.guoda.preston.Preston.run(Preston.java:91)
    at bio.guoda.preston.Preston.main(Preston.java:82)
Caused by: bio.guoda.preston.store.DereferenceException: failed to dereference [hash://md5/eb5e8f37583644943b86d1d9ebd4ded5]
    at bio.guoda.preston.store.AliasDereferencer.dereferenceAliasedHash(AliasDereferencer.java:94)
    at bio.guoda.preston.store.AliasDereferencer.get(AliasDereferencer.java:46)
    at bio.guoda.preston.store.AliasDereferencer.get(AliasDereferencer.java:18)
    at bio.guoda.preston.cmd.ContentQueryUtil.getContent(ContentQueryUtil.java:50)
    ... 14 more
Caused by: bio.guoda.preston.store.DereferenceException: failed to dereference [hash://md5/eb5e8f37583644943b86d1d9ebd4ded5]
    at bio.guoda.preston.store.ContentHashDereferencer.get(ContentHashDereferencer.java:25)
    at bio.guoda.preston.store.ContentHashDereferencer.get(ContentHashDereferencer.java:10)
    at bio.guoda.preston.store.AliasDereferencer.dereferenceAliasedHash(AliasDereferencer.java:92)
    ... 17 more
Caused by: java.lang.IllegalArgumentException: argument "in" is null
    at com.fasterxml.jackson.databind.ObjectMapper._assertNotNull(ObjectMapper.java:4737)
    at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2954)
    at bio.guoda.preston.store.KeyTo1LevelZenodoPath.findFirstHit(KeyTo1LevelZenodoPath.java:46)
    at bio.guoda.preston.store.KeyTo1LevelZenodoPath.toPath(KeyTo1LevelZenodoPath.java:34)
    at bio.guoda.preston.store.KeyTo1LevelZenodoBucket.toPath(KeyTo1LevelZenodoBucket.java:24)
    at bio.guoda.preston.store.KeyValueStoreWithDereferencing.get(KeyValueStoreWithDereferencing.java:26)
    at bio.guoda.preston.store.KeyValueStoreWithDereferencing.get(KeyValueStoreWithDereferencing.java:11)
    at bio.guoda.preston.store.KeyValueStoreStickyFailover.get(KeyValueStoreStickyFailover.java:42)
    at bio.guoda.preston.store.KeyValueStoreStickyFailover.get(KeyValueStoreStickyFailover.java:13)
    at bio.guoda.preston.store.KeyValueStoreWithFallback.get(KeyValueStoreWithFallback.java:31)
    at bio.guoda.preston.store.KeyValueStoreWithFallback.get(KeyValueStoreWithFallback.java:8)
    at bio.guoda.preston.store.BlobStoreAppendOnly.get(BlobStoreAppendOnly.java:44)
    at bio.guoda.preston.store.BlobStoreAppendOnly.get(BlobStoreAppendOnly.java:11)
    at bio.guoda.preston.store.ContentHashDereferencer.get(ContentHashDereferencer.java:22)
    ... 19 more
java.lang.RuntimeException: java.io.IOException: problem retrieving [hash://md5/eb5e8f37583644943b86d1d9ebd4ded5]
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:52)
    at bio.guoda.preston.cmd.CmdGet.run(CmdGet.java:44)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1939)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at bio.guoda.preston.Preston.run(Preston.java:91)
    at bio.guoda.preston.Preston.main(Preston.java:82)

jhpoelen commented 9 months ago

Note that the content was retrieved successfully before (see #210).

@slint Could it be that the Zenodo API changed or became unavailable after the major upgrade described in https://blog.zenodo.org/2023/10/19/2023-10-19-upgrade-issues/ or https://blog.zenodo.org/2023/10/13/2023-10-13-zenodo-rdm/ ?

Is querying by md5 hash still possible via query urls like:

https://zenodo.org/api/records/?q=_files.checksum:%22md5:d11ddcecf3d5cbc627439260bdbfda72%22&all_versions=true

?

Currently, this query returns a 404:

curl -I "https://zenodo.org/api/records/?q=_files.checksum:%22md5:d11ddcecf3d5cbc627439260bdbfda72%22&all_versions=true"
HTTP/1.1 404 NOT FOUND
server: nginx
date: Thu, 19 Oct 2023 14:22:17 GMT
content-type: application/json


jhpoelen commented 9 months ago

I much enjoy Zenodo's effort to keep maintaining their useful resource . . . and still have humor . . .

from https://blog.zenodo.org/2023/10/19/2023-10-19-upgrade-issues/ -

[...] Final lesson learnt: Friday 13th[**] might not have been a good day for a major release after all. [...]

jhpoelen commented 9 months ago

note that the result below suggests that the API is no longer available for some reason -

curl -I "https://zenodo.org/api/records/"\
 | head -n5

yielded

HTTP/1.1 404 NOT FOUND
server: nginx
date: Thu, 19 Oct 2023 14:36:41 GMT
content-type: application/json
content-length: 148

jhpoelen commented 9 months ago

@gsautter @mguidoti are you also experiencing Zenodo API issues?

slint commented 9 months ago

Hi @jhpoelen, sorry for the troubles, the issue is because the filters/terms you're using haven't been translated correctly to their new equivalents on our side.

If you change _files.checksum to files.entries.checksum and all_versions to allversions, the query works:

curl "https://zenodo.org/api/records?q=files.entries.checksum:%22md5:d11ddcecf3d5cbc627439260bdbfda72%22&allversions=1"

# {"hits": {"hits": [{"created": "2022-06-01T21:55:07.419980+00:00", "modified": "2023-08-25T16:22:19.284245+00:00", "id": 6604060, "conceptrecid": "3950589", "doi": "10.5281/zenodo.6604060", "conceptdoi": "10.5281/zenodo.3950589", ... }

This change shouldn't be required though on your side, since our aim is to have full compatibility with legacy terms/parameters, and we're currently in the process of identifying and adding any of these missing mappings.

EDIT: Forgot to mention that the trailing slash on /api/records/ is also not handled (/api/records works though), and is also being ported over.
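
The term and parameter changes described above can be sketched as a small URL builder. This is only an illustration, not Preston's implementation: the function name `checksum_query_url` and the `legacy` flag are hypothetical, and the field and parameter names are taken from this thread rather than from Zenodo documentation.

```python
from urllib.parse import quote

# Base endpoint without a trailing slash, since /api/records/ returned a 404.
ZENODO_RECORDS = "https://zenodo.org/api/records"


def checksum_query_url(md5_hex: str, legacy: bool = False) -> str:
    """Build a records query URL that searches by md5 file checksum.

    legacy=True uses the pre-upgrade spellings quoted in this thread
    (_files.checksum, all_versions); the default uses the new ones.
    """
    field = "_files.checksum" if legacy else "files.entries.checksum"
    versions = "all_versions=true" if legacy else "allversions=1"
    # Encode the surrounding double quotes as %22 but keep the colon readable.
    term = quote(f'"md5:{md5_hex}"', safe=":")
    return f"{ZENODO_RECORDS}?q={field}:{term}&{versions}"


print(checksum_query_url("d11ddcecf3d5cbc627439260bdbfda72"))
print(checksum_query_url("d11ddcecf3d5cbc627439260bdbfda72", legacy=True))
```

The first printed URL has the same form as the working curl example above; the second differs from the original failing query only in the missing trailing slash.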

jhpoelen commented 9 months ago

@slint thanks for your prompt reply and suggestions. I am trying to figure out how to implement some kind of fail-over mechanism to accommodate the slight API variations.
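
One possible shape for such a fail-over, sketched in Python rather than in Preston's Java: try each query variant in order and keep the first one that yields a response body. The names `first_successful` and `fake_fetch` and the example.org URLs are hypothetical, and the fetch function is injected so the sketch runs without network access.

```python
from typing import Callable, Iterable, Optional

Fetch = Callable[[str], Optional[bytes]]


def first_successful(urls: Iterable[str], fetch: Fetch) -> Optional[bytes]:
    """Return the first non-None body, trying the URLs in the given order."""
    for url in urls:
        try:
            body = fetch(url)
        except OSError:
            # Network and HTTP errors count as a miss; try the next variant.
            continue
        if body is not None:
            return body
    # Every variant failed; the caller decides how to report that.
    return None


# Stand-in for a real HTTP client: only the "new" variant has a response.
fake_responses = {"https://example.org/new": b'{"hits": {"hits": []}}'}


def fake_fetch(url: str) -> Optional[bytes]:
    return fake_responses.get(url)


result = first_successful(
    ["https://example.org/old", "https://example.org/new"], fake_fetch
)
print(result)
```

With a real client based on `urllib.request.urlopen`, a 404 raises `urllib.error.HTTPError`, which is a subclass of `OSError`, so it would be handled by the same `except` clause. Returning `None` explicitly also avoids handing a null stream to the JSON parser, which is what the innermost exception in the stack trace above reports.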

jhpoelen commented 9 months ago

@slint While working towards supporting the current Zenodo API, I noticed that the API responses are slightly different, at least in the "files" section.

Previously, I would see

        "files": [
          {
            "bucket": "cbb44724-b635-4c75-94e8-c7a824efbc72",
            "checksum": "md5:eb5e8f37583644943b86d1d9ebd4ded5",
            "key": "figure.png",
            "links": {
              "self": "https://zenodo.org/api/files/cbb44724-b635-4c75-94e8-c7a824efbc72/figure.png"
            },
            "size": 32594,
            "type": "png"
          },
          {
            "bucket": "cbb44724-b635-4c75-94e8-c7a824efbc72",
            "checksum": "md5:75b362eb1058eff2dcf836cd4293c4ff",
            "key": "figure.svg",
            "links": {
              "self": "https://zenodo.org/api/files/cbb44724-b635-4c75-94e8-c7a824efbc72/figure.svg"
            },
            "size": 33721,
            "type": "svg"
          }
        ],

but now, I see

 "files": [
          {
            "id": "83e146e0-cdd4-4001-a82f-a5cf181731e0",
            "filename": "figure.png",
            "filesize": 32594,
            "checksum": "eb5e8f37583644943b86d1d9ebd4ded5",
            "links": {
              "self": "https://zenodo.org/api/records/4589980/files/figure.png",
              "download": "https://zenodo.org/api/records/4589980/files/figure.png/content"
            }
          },
          {
            "id": "09e4eea2-9a4c-4a0e-8c53-265c5a554154",
            "filename": "figure.svg",
            "filesize": 33721,
            "checksum": "75b362eb1058eff2dcf836cd4293c4ff",
            "links": {
              "self": "https://zenodo.org/api/records/4589980/files/figure.svg",
              "download": "https://zenodo.org/api/records/4589980/files/figure.svg/content"
            }
          }
        ]

Note how the checksum is no longer prefixed with its algorithm (e.g., md5). I prefer the explicit prefix, and hope you'll find a way to reintroduce it.

Also note how the "self" link now points to meta-data related to the file:

curl "https://zenodo.org/api/records/4589980/files/figure.svg" | jq .

yields -

{
  "key": "figure.svg",
  "storage_class": "L",
  "checksum": "md5:75b362eb1058eff2dcf836cd4293c4ff",
  "size": 33721,
  "created": "2021-03-09T08:05:08.814424+00:00",
  "updated": "2021-03-09T08:08:22.072185+00:00",
  "status": "completed",
  "metadata": null,
  "mimetype": "image/svg+xml",
  "version_id": "1d87cfae-4655-48a6-921f-ec86526e0e56",
  "file_id": "09e4eea2-9a4c-4a0e-8c53-265c5a554154",
  "bucket_id": "cbb44724-b635-4c75-94e8-c7a824efbc72",
  "links": {
    "self": "https://zenodo.org/api/records/4589980/files/figure.svg",
    "content": "https://zenodo.org/api/records/4589980/files/figure.svg/content"
  }
}

whereas https://zenodo.org/api/records/4589980/files/figure.svg/content retrieves the associated content.

Is this intended?
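
A client that has to cope with both response shapes could normalize them before matching. The sketch below is an assumption-laden illustration, not Preston's code: the helper names are hypothetical, a checksum without a prefix is assumed to be md5, and the sample entries are trimmed copies of the responses quoted above.

```python
from typing import Optional


def normalize_checksum(value: str, default_algorithm: str = "md5") -> str:
    """Return 'algorithm:hex', adding a prefix when the API omits it.

    Assumption: an unprefixed checksum is md5, as in the responses above.
    """
    return value if ":" in value else f"{default_algorithm}:{value}"


def content_url(entry: dict) -> Optional[str]:
    """Pick the link that serves the file content for either shape."""
    links = entry.get("links", {})
    # New shape: 'download' serves content, 'self' serves file metadata.
    # Legacy shape: only 'self' exists, and it served the content.
    return links.get("download") or links.get("self")


def find_content_url(files: list, md5_hex: str) -> Optional[str]:
    """Return the content URL of the first file whose md5 matches."""
    wanted = f"md5:{md5_hex}"
    for entry in files:
        if normalize_checksum(entry.get("checksum", "")) == wanted:
            return content_url(entry)
    return None


legacy_files = [{
    "checksum": "md5:eb5e8f37583644943b86d1d9ebd4ded5",
    "links": {"self": "https://zenodo.org/api/files/cbb44724-b635-4c75-94e8-c7a824efbc72/figure.png"},
}]
new_files = [{
    "checksum": "eb5e8f37583644943b86d1d9ebd4ded5",
    "links": {
        "self": "https://zenodo.org/api/records/4589980/files/figure.png",
        "download": "https://zenodo.org/api/records/4589980/files/figure.png/content",
    },
}]

print(find_content_url(legacy_files, "eb5e8f37583644943b86d1d9ebd4ded5"))
print(find_content_url(new_files, "eb5e8f37583644943b86d1d9ebd4ded5"))
```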

jhpoelen commented 9 months ago

With support for the new (post-2023-10-13) Zenodo API implemented, Preston is able to produce the expected results again -

preston cat --remote https://zenodo.org hash://md5/eb5e8f37583644943b86d1d9ebd4ded5

yields

[image]

with

preston cat --remote https://zenodo.org hash://md5/eb5e8f37583644943b86d1d9ebd4ded5 | md5sum
eb5e8f37583644943b86d1d9ebd4ded5  -
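
The same check can be done programmatically. The following is a minimal sketch, assuming hash URIs of the form hash://<algorithm>/<hex digest> as used in this thread; the function name `matches_hash_uri` is hypothetical and not part of Preston.

```python
import hashlib


def matches_hash_uri(content: bytes, hash_uri: str) -> bool:
    """Check that content hashes to the digest named in a hash:// URI."""
    prefix = "hash://"
    if not hash_uri.startswith(prefix):
        raise ValueError(f"not a hash URI: {hash_uri}")
    # Split 'md5/eb5e...' into the algorithm name and the expected digest.
    algorithm, _, expected_hex = hash_uri[len(prefix):].partition("/")
    actual_hex = hashlib.new(algorithm, content).hexdigest()
    return actual_hex == expected_hex.lower()


# Self-contained demo: derive the URI from the bytes, then verify it.
sample = b"example content"
sample_uri = "hash://md5/" + hashlib.md5(sample).hexdigest()
print(matches_hash_uri(sample, sample_uri))
print(matches_hash_uri(b"tampered content", sample_uri))
```

In practice the bytes would come from the output of `preston cat`, and the URI would be the one passed on the command line.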

jhpoelen commented 9 months ago

@cboettig @mielliott @mbjones @seltmann

I think this Zenodo API change is a great example of how content-addressing helps references stay stable despite expected changes in URLs, Web APIs, and other location-based ways to point to resources.

If folks refer to their figure by hash://md5/eb5e8f37583644943b86d1d9ebd4ded5 instead of https://zenodo.org/api/files/cbb44724-b635-4c75-94e8-c7a824efbc72/figure.png (Zenodo API now points to https://zenodo.org/api/records/4589980/files/figure.png/content), they wouldn't have to worry too much about all these expected changes in infrastructures. Instead, intermediary content libraries / resolvers / services can help to continue to get the referenced content. Also, the infrastructure folks would have to worry less about making sure that all the URLs they've ever issued are redirected to their associated content for years to come. I imagine that a ton of work is needed to keep these redirects working and to keep track of giant lists of historic URLs. And even with all that work, 404s are likely to keep happening.

Note that I was still able to get a copy of a scientific dataset, like a historical CO2 Record from the Vostok Ice Core, by asking for its content id: https://linker.bio/hash://md5/e27c99a7f701dab97b7d09c467acf468 via DataOne instead of Zenodo, showing how redundancy helps to increase reliability.
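
The redundancy argument can be sketched as a resolver that asks several stores for the same content id and accepts the first answer whose hash checks out. This is an illustrative sketch only: the `resolve` function and the in-memory stand-in stores are hypothetical, and real stores would be remote services such as the ones named above.

```python
import hashlib
from typing import Callable, Optional, Sequence

Store = Callable[[str], Optional[bytes]]


def resolve(hash_uri: str, stores: Sequence[Store]) -> Optional[bytes]:
    """Return content for hash_uri from the first store that has a valid copy."""
    algorithm, _, expected_hex = hash_uri[len("hash://"):].partition("/")
    for store in stores:
        content = store(hash_uri)
        if content is None:
            continue  # this store does not have it, or is unavailable
        # Because the id is the hash, any store can be verified the same way.
        if hashlib.new(algorithm, content).hexdigest() == expected_hex:
            return content
    return None


data = b"co2 record"
data_uri = "hash://md5/" + hashlib.md5(data).hexdigest()


def unavailable_store(uri: str) -> Optional[bytes]:
    return None  # simulates a remote whose API changed


def corrupted_store(uri: str) -> Optional[bytes]:
    return b"something else"  # simulates a wrong or damaged copy


def healthy_store(uri: str) -> Optional[bytes]:
    return data if uri == data_uri else None


print(resolve(data_uri, [unavailable_store, corrupted_store, healthy_store]))
```

Since verification depends only on the content id, the caller does not need to trust or even know which store answered.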