Closed jhpoelen closed 9 months ago
Note that previously (see #210) the content was retrieved.
@slint Could it be that the Zenodo API changed or somehow became unavailable after the major upgrade https://blog.zenodo.org/2023/10/19/2023-10-19-upgrade-issues/ or https://blog.zenodo.org/2023/10/13/2023-10-13-zenodo-rdm/ ?
Is querying by md5 hash still possible via query urls like:
https://zenodo.org/api/records/?q=_files.checksum:%22md5:d11ddcecf3d5cbc627439260bdbfda72%22&all_versions=true
?
Currently, this request returns a 404:
curl -I "https://zenodo.org/api/records/?q=_files.checksum:%22md5:d11ddcecf3d5cbc627439260bdbfda72%22&all_versions=true"
HTTP/1.1 404 NOT FOUND
server: nginx
date: Thu, 19 Oct 2023 14:22:17 GMT
content-type: application/json
I much enjoy Zenodo's effort to keep maintaining their useful resource . . . and still have humor . . .
from https://blog.zenodo.org/2023/10/19/2023-10-19-upgrade-issues/ -
[...] Final lesson learnt: Friday 13th[**] might not have been a good day for a major release after all. [...]
Note that the result below suggests that the API is no longer available for some reason -
curl -I "https://zenodo.org/api/records/"\
| head -n5
yielded
HTTP/1.1 404 NOT FOUND
server: nginx
date: Thu, 19 Oct 2023 14:36:41 GMT
content-type: application/json
content-length: 148
@gsautter @mguidoti are you also experiencing Zenodo API issues?
Hi @jhpoelen, sorry for the trouble. The issue is that the filters/terms you're using haven't been translated correctly to their new equivalents on our side.
If you change _files.checksum to files.entries.checksum and ?all_versions to allversions=true, the query works:
curl "https://zenodo.org/api/records?q=files.entries.checksum:%22md5:d11ddcecf3d5cbc627439260bdbfda72%22&allversions=1"
# {"hits": {"hits": [{"created": "2022-06-01T21:55:07.419980+00:00", "modified": "2023-08-25T16:22:19.284245+00:00", "id": 6604060, "conceptrecid": "3950589", "doi": "10.5281/zenodo.6604060", "conceptdoi": "10.5281/zenodo.3950589", ... }
This change shouldn't be required though on your side, since our aim is to have full compatibility with legacy terms/parameters, and we're currently in the process of identifying and adding any of these missing mappings.
EDIT: Forgot to mention that the trailing slash on /api/records/ is also not handled (/api/records works though), and is also being ported over.
@slint thanks for your prompt reply and suggestions. I am trying to figure out how to implement some kind of fail-over mechanism to accommodate the slight API variations.
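One way such a fail-over could look is sketched below in Python. This is only an illustration under stated assumptions, not how Preston implements it: the function names and the injectable `fetch` callable are hypothetical, and the only API details used are the two sets of query terms quoted in this thread.

```python
from urllib.parse import quote, urlencode

# Base URL without a trailing slash, since /api/records/ returned a 404 above.
ZENODO_API = "https://zenodo.org/api/records"


def candidate_query_urls(md5_hex, base=ZENODO_API):
    """Return query URLs to try in order: post-upgrade terms first, then legacy."""
    variants = [
        # Terms suggested in the reply above for the upgraded API.
        ("files.entries.checksum", {"allversions": "true"}),
        # Legacy terms that worked before the upgrade.
        ("_files.checksum", {"all_versions": "true"}),
    ]
    urls = []
    for field, extra in variants:
        params = {"q": f'{field}:"md5:{md5_hex}"', **extra}
        # quote_via=quote percent-encodes the double quotes as %22.
        urls.append(f"{base}?{urlencode(params, quote_via=quote)}")
    return urls


def first_successful(urls, fetch):
    """Try each URL with `fetch`, a hypothetical callable returning (status, body).

    Returns (url, body) for the first HTTP 200, or (None, None) if all fail.
    """
    for url in urls:
        status, body = fetch(url)
        if status == 200:
            return url, body
    return None, None
```

Passing the HTTP client in as `fetch` keeps the fallback order easy to test without network access.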
@slint In working towards supporting the current Zenodo API, I noticed that the API responses are slightly different, at least in the "files" section.
Previously, I would see
"files": [
{
"bucket": "cbb44724-b635-4c75-94e8-c7a824efbc72",
"checksum": "md5:eb5e8f37583644943b86d1d9ebd4ded5",
"key": "figure.png",
"links": {
"self": "https://zenodo.org/api/files/cbb44724-b635-4c75-94e8-c7a824efbc72/figure.png"
},
"size": 32594,
"type": "png"
},
{
"bucket": "cbb44724-b635-4c75-94e8-c7a824efbc72",
"checksum": "md5:75b362eb1058eff2dcf836cd4293c4ff",
"key": "figure.svg",
"links": {
"self": "https://zenodo.org/api/files/cbb44724-b635-4c75-94e8-c7a824efbc72/figure.svg"
},
"size": 33721,
"type": "svg"
}
],
but now, I see
"files": [
{
"id": "83e146e0-cdd4-4001-a82f-a5cf181731e0",
"filename": "figure.png",
"filesize": 32594,
"checksum": "eb5e8f37583644943b86d1d9ebd4ded5",
"links": {
"self": "https://zenodo.org/api/records/4589980/files/figure.png",
"download": "https://zenodo.org/api/records/4589980/files/figure.png/content"
}
},
{
"id": "09e4eea2-9a4c-4a0e-8c53-265c5a554154",
"filename": "figure.svg",
"filesize": 33721,
"checksum": "75b362eb1058eff2dcf836cd4293c4ff",
"links": {
"self": "https://zenodo.org/api/records/4589980/files/figure.svg",
"download": "https://zenodo.org/api/records/4589980/files/figure.svg/content"
}
}
]
Note how the checksums are no longer prefixed with their algorithm (e.g., md5). I prefer the explicit prefix, and hope you'll find a way to reintroduce it.
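A client that has to cope with both response shapes could normalize each "files" entry before using it. The following Python sketch assumes only the field names visible in the two examples above. The function name and the output keys are hypothetical, and treating a bare checksum as md5 is an assumption based on these examples.

```python
def normalize_file_entry(entry, default_algorithm="md5"):
    """Map a legacy or post-upgrade "files" entry onto one common shape."""
    checksum = entry["checksum"]
    if ":" not in checksum:
        # Assumption: un-prefixed checksums are md5, as in the examples above.
        checksum = f"{default_algorithm}:{checksum}"
    return {
        # Legacy entries use "key"/"size"; newer ones use "filename"/"filesize".
        "name": entry.get("key") or entry.get("filename"),
        "size": entry.get("size", entry.get("filesize")),
        "checksum": checksum,
    }
```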
Also note how the "self" link now points to metadata related to the file:
curl "https://zenodo.org/api/records/4589980/files/figure.svg" | jq .
yields -
{
"key": "figure.svg",
"storage_class": "L",
"checksum": "md5:75b362eb1058eff2dcf836cd4293c4ff",
"size": 33721,
"created": "2021-03-09T08:05:08.814424+00:00",
"updated": "2021-03-09T08:08:22.072185+00:00",
"status": "completed",
"metadata": null,
"mimetype": "image/svg+xml",
"version_id": "1d87cfae-4655-48a6-921f-ec86526e0e56",
"file_id": "09e4eea2-9a4c-4a0e-8c53-265c5a554154",
"bucket_id": "cbb44724-b635-4c75-94e8-c7a824efbc72",
"links": {
"self": "https://zenodo.org/api/records/4589980/files/figure.svg",
"content": "https://zenodo.org/api/records/4589980/files/figure.svg/content"
}
}
whereas https://zenodo.org/api/records/4589980/files/figure.svg/content retrieves the associated content.
Is this intended?
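Given the link layouts shown above, a client might pick the content URL roughly as in this Python sketch. The function name is hypothetical. The sketch assumes that a "download" or "content" link, when present, is the one that returns the file bytes, and that the legacy "self" link pointed directly at the content, as this thread suggests.

```python
def content_url(entry):
    """Pick the link expected to return file content from a "files" entry."""
    links = entry.get("links", {})
    # Post-upgrade record listings expose "download"; per-file metadata exposes "content".
    for name in ("download", "content"):
        if name in links:
            return links[name]
    # Assumption: in the legacy format, "self" pointed directly at the content.
    return links.get("self")
```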
By implementing support for the new (post-2023-10-13) Zenodo API, Preston is now able to produce the expected results again -
preston cat --remote https://zenodo.org hash://md5/eb5e8f37583644943b86d1d9ebd4ded5
yields the figure image (not reproduced here), with
preston cat --remote https://zenodo.org hash://md5/eb5e8f37583644943b86d1d9ebd4ded5 | md5sum
eb5e8f37583644943b86d1d9ebd4ded5 -
@cboettig @mielliott @mbjones @seltmann
I think this Zenodo API change is a great example of how content-addressing helps to cushion the expected changes in URLs, Web APIs and other location-based ways of pointing to resources.
If folks refer to their figure by hash://md5/eb5e8f37583644943b86d1d9ebd4ded5 instead of https://zenodo.org/api/files/cbb44724-b635-4c75-94e8-c7a824efbc72/figure.png (Zenodo API now points to https://zenodo.org/api/records/4589980/files/figure.png/content), they wouldn't have to worry too much about all these expected changes in infrastructures. Instead, intermediary content libraries / resolvers / services can help to continue to get the referenced content. Also, the infrastructure folks would have to worry less about making sure that all the URLs they've ever issued are redirected to their associated content for years to come. I imagine that a ton of work is needed to keep these redirects working and to keep track of giant lists of historic URLs. And even with all that work, 404s are likely to keep happening.
Note that I was still able to get a copy of a scientific dataset, like a historical CO2 Record from the Vostok Ice Core, by asking for its content id: https://linker.bio/hash://md5/e27c99a7f701dab97b7d09c467acf468 via DataOne instead of Zenodo, showing how redundancy helps to increase reliability.
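The idea of resolving a content id through whichever source still has it can be sketched as follows in Python. This is a minimal illustration, not Preston's or linker.bio's actual implementation. The function names are hypothetical, and each source is assumed to be a callable that takes a hash URI and returns bytes (or None if it has no copy).

```python
import hashlib


def parse_hash_uri(uri):
    """Split a URI like hash://md5/<hex digest> into (algorithm, digest)."""
    prefix = "hash://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a hash URI: {uri}")
    algorithm, _, digest = uri[len(prefix):].partition("/")
    if not algorithm or not digest:
        raise ValueError(f"malformed hash URI: {uri}")
    return algorithm, digest.lower()


def matches(uri, content):
    """Check that `content` (bytes) hashes to the digest named in `uri`."""
    algorithm, digest = parse_hash_uri(uri)
    return hashlib.new(algorithm, content).hexdigest() == digest


def resolve(uri, sources):
    """Return content from the first source whose bytes match the hash.

    `sources` is a list of hypothetical callables (uri -> bytes or None),
    e.g. one backed by Zenodo and one by DataOne.
    """
    for source in sources:
        try:
            content = source(uri)
        except OSError:
            continue  # source unavailable; try the next one
        if content is not None and matches(uri, content):
            return content
    return None
```

Because the hash is checked on the client side, it does not matter which location served the bytes, or whether that location's URL scheme has changed.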
when using
we'd expect to retrieve content associated with -
https://zenodo.org/records/4589980/files/figure.png
as made available via:
Bartonička, T., Bielik, A., & Řehák, Z. (2008). Fig. 4 in Roost Switching and Activity Patterns in the Soprano Pipistrelle, Pipistrellus pygmaeus, during Lactation. In Annales Zoologici Fennici (Vol. 45, Number 6, pp. 503–512). Zenodo. https://doi.org/10.5281/zenodo.4589980
However, we got the error result below instead . . . and no content -