fatiando / pooch

A friend to fetch your data files
https://www.fatiando.org/pooch

`KeyError: 'key'` in ZenodoRepository.download_url() after Zenodo migration #371

Closed khaeru closed 11 months ago

khaeru commented 11 months ago

Zenodo recently migrated to InvenioRDM, as described here (cf. #350). Since then, the service has had sporadic downtime; see the header message on https://zenodo.org ("Oct 15 08:30 UTC: We are continuing to work on resolving identified issues") and the downtime reported here.

It appears that Zenodo's API responses have changed in a way that causes errors in pooch. Excerpting from the below output, I see:

{
  'files': [
    {
      'id': '96ec5297-801c-4fe8-b797-2804e88784c6',
      'filename': 'MESSAGEix-GLOBIOM_1.1_R11_no-policy_baseline.xlsx',
      'filesize': 135453950,
      'checksum': '222193405c25c3c29cc21cbae5e035f4',
      'links': {'self': 'https://zenodo.org/api/records/5793870/files/96ec5297-801c-4fe8-b797-2804e88784c6'}
    }
  ]
}

The single record in the "files" collection does not have a key 'key'.
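To illustrate the mismatch, here is a minimal sketch of a lookup that tolerates both shapes. It is not pooch's code and not necessarily the fix that should be applied: `index_files_by_name` is a hypothetical helper, and the assumption that the file name sits under either 'key' (what pooch currently expects, per the traceback below) or 'filename' (what the response above contains) is based only on what is quoted in this issue.

```python
def index_files_by_name(api_response):
    """Map file name -> file entry for a record response (hypothetical helper)."""
    files = {}
    for item in api_response["files"]:
        # pooch currently reads item["key"]; the response above uses "filename".
        name = item.get("key", item.get("filename"))
        if name is None:
            raise ValueError(f"file entry has neither 'key' nor 'filename': {item!r}")
        files[name] = item
    return files


# Trimmed copy of the entry excerpted above.
response = {
    "files": [
        {
            "id": "96ec5297-801c-4fe8-b797-2804e88784c6",
            "filename": "MESSAGEix-GLOBIOM_1.1_R11_no-policy_baseline.xlsx",
            "filesize": 135453950,
            "checksum": "222193405c25c3c29cc21cbae5e035f4",
        }
    ]
}
files = index_files_by_name(response)
print(sorted(files))
```

Raising on an entry with neither field keeps an unexpected third shape from failing silently.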

Zenodo published What's changed? and What's new? pages describing the migration, but they don't indicate any API changes. Its API documentation doesn't indicate a new version. So I am not sure if:

  1. This is a well-advertised, expected, permanent change of Zenodo's API, to which pooch has not yet adapted, or perhaps more likely
  2. This is an unintended, erroneous change that may (sooner or later, if they are aware) be corrected by Zenodo. (More evidence for this case: the URL https://zenodo.org/api/records/5793870/files/96ec5297-801c-4fe8-b797-2804e88784c6 appearing in the record above gives a 404 error.)

Regardless of which is the case, a fix would be welcome! However, I recognize that in either case it is difficult to adapt to an API change that is either undocumented or accidental.

Full code that generated the error

import pooch

args = dict(
    base_url="doi:10.5281/zenodo.5793870",
    registry={
        "MESSAGEix-GLOBIOM_1.1_R11_no-policy_baseline.xlsx": (
            "md5:222193405c25c3c29cc21cbae5e035f4"
        ),
    },
)

p = pooch.create(path=".", **args)

result = p.fetch(list(args["registry"].keys())[0])

print(result)

Also, to help diagnose/debug, I have edited pooch.downloaders.ZenodoRepository.download_url(), inserting the line:

print(f"{self.api_response = }")

Full error message

Downloading file 'MESSAGEix-GLOBIOM_1.1_R11_no-policy_baseline.xlsx' from 'doi:10.5281/zenodo.5793870/MESSAGEix-GLOBIOM_1.1_R11_no-policy_baseline.xlsx' to '/home/khaeru/vc/iiasa/models'.                                                    
self.api_response = {'created': '2023-05-23T21:20:08.602906+00:00', 'modified': '2023-06-15T09:49:10.161370+00:00', 'id': 5793870, 'conceptrecid': '5793869', 'doi': '10.5281/zenodo.5793870', 'conceptdoi': '10.5281/zenodo.5793869', 'doi_url': 'https://doi.org/10.5281/zenodo.5793870', 'metadata': {'title': 'MESSAGEix-GLOBIOM R11 no-policy baseline', 'doi': '10.5281/zenodo.5793870', 'publication_date': '2023-05-23', 'description': '<p>This dataset contains the parameterization of a no-policy baseline scenario of the global 11-regional <a href="https://docs.messageix.org/projects/global/en/">MESSAGEix-GLOBIOM</a> integrated assessment model. <a href="https://docs.messageix.org/projects/models/en/latest/pkg-data/node.html#region-aggregation-r11">Regions</a>, <a href="https://docs.messageix.org/projects/models/en/latest/pkg-data/year.html">time periods</a>, <a href="https://docs.messageix.org/projects/models/en/latest/pkg-data/codelists.html#commodities-commodity-yaml">commodities</a>, <a href="https://docs.messageix.org/projects/models/en/latest/pkg-data/codelists.html#commodities-commodity-yaml">technologies</a> and <a href="https://docs.messageix.org/projects/models/en/latest/pkg-data/relation.html">relations</a> included in this model are described in a separate <a href="https://docs.messageix.org/projects/models/">repository</a>. The dataset relies on the <a href="https://docs.messageix.org/en/stable/">MESSAGEix modeling framework</a> (<a href="https://doi.org/10.1016/j.envsoft.2018.11.012">Huppmann et al. 2019</a>) and can be imported into MESSAGEix via the <a href="https://docs.messageix.org/en/stable/api.html?highlight=read_xls#message_ix.Scenario.read_excel">read_excel()</a> functionality for which a <a href="https://github.com/iiasa/message_ix/blob/main/tutorial/westeros/westeros_baseline_using_xlsx_import_part1.ipynb">tutorial</a> is available. After the import the scenario can be solved and modified to create new scenarios. 
Note that the published scenario as included in the <a href="https://zenodo.org/record/5553976">ENGAGE global scenarios dataset</a> has been run with a release candidate of <a href="https://docs.messageix.org/en/stable/whatsnew.html#v3-4-0-2022-01-27">version 3.4.0</a> of MESSAGEix.</p>', 'access_right': 'open', 'creators': [{'name': 'Fricko, Oliver', 'affiliation': 'International Institute for Applied Systems Analysis (IIASA)', 'orcid': '0000-0002-6835-9883'}, {'name': 'Frank, Stefan', 'affiliati
on': 'International Institute for Applied Systems Analysis (IIASA)', 'orcid': '0000-0001-5702-8547'}, {'name': 'Gidden, Matthew', 'affiliation': 'International Institute for Applied Systems Analysis (IIASA)', 'orcid': '0000-0003-0687-414X'}, {'name': 'Huppmann, Daniel', 'af
filiation': 'International Institute for Applied Systems Analysis (IIASA)', 'orcid': '0000-0002-7729-7389'}, {'name': 'Johnson, Nils A.', 'affiliation': 'Electric Power Research Institute (EPRI)'}, {'name': 'Kishimoto, Paul Natsuo', 'affiliation': 'International Institute f
or Applied Systems Analysis (IIASA)', 'orcid': '0000-0002-8578-753X'}, {'name': 'Kolp, Peter', 'affiliation': 'International Institute for Applied Systems Analysis (IIASA)', 'orcid': '0000-0003-0122-2839'}, {'name': 'Lovat, Francesco', 'affiliation': 'Danish Energy Agency',
 'orcid': '0000-0002-4331-980X'}, {'name': 'McCollum, David L.', 'affiliation': 'Oak Ridge National Labortory (ORNL) and International Institute for Applied Systems Analysis (IIASA)', 'orcid': '0000-0003-1293-0179'}, {'name': 'Min, Jihoon', 'affiliation': 'International Ins
titute for Applied Systems Analysis (IIASA)', 'orcid': '0000-0002-0020-1174'}, {'name': 'Rao, Shilpa', 'affiliation': 'Norwegian Institute of Public Health', 'orcid': '0000-0003-4012-9063'}, {'name': 'Riahi, Keywan', 'affiliation': 'International Institute for Applied Syste
ms Analysis (IIASA)', 'orcid': '0000-0001-7193-3498'}, {'name': 'Rogner, Holger', 'affiliation': 'International Institute for Applied Systems Analysis (IIASA)', 'orcid': '0000-0002-1045-9830'}, {'name': 'van Ruijven, Bas', 'affiliation': 'International Institute for Applied
 Systems Analysis (IIASA)', 'orcid': '0000-0003-1232-5892'}, {'name': 'Vinca, Adriano', 'affiliation': 'International Institute for Applied Systems Analysis (IIASA)', 'orcid': '0000-0002-3051-178X'}, {'name': 'Zakeri, Behnam', 'affiliation': 'International Institute for App
lied Systems Analysis (IIASA)', 'orcid': '0000-0001-9647-2878'}, {'name': 'Augustynczik, Andrey Lessa Derci', 'affiliation': 'International Institute for Applied Systems Analysis (IIASA)'}, {'name': 'Deppermann, Andre', 'affiliation': 'International Institute for Applied Sy
stems Analysis (IIASA)', 'orcid': '0000-0002-7943-4842'}, {'name': 'Ermolieva, Tatiana', 'affiliation': 'International Institute for Applied Systems Analysis (IIASA)'}, {'name': 'Gusti, Mykola', 'affiliation': 'International Institute for Applied Systems Analysis (IIASA)', 
'orcid': '0000-0002-2576-9217'}, {'name': 'Lauri, Pekka', 'affiliation': 'International Institute for Applied Systems Analysis (IIASA)', 'orcid': '0000-0002-5472-2039'}, {'name': 'Heyes, Chris', 'affiliation': 'International Institute for Applied Systems Analysis (IIASA)', 
'orcid': '0000-0001-5254-493X'}, {'name': 'Schoepp, Wolfgang', 'affiliation': 'International Institute for Applied Systems Analysis (IIASA)', 'orcid': '0000-0001-5990-423X'}, {'name': 'Klimont, Zbigniew', 'affiliation': 'International Institute for Applied Systems Analysis 
(IIASA)', 'orcid': '0000-0003-2630-198X'}, {'name': 'Havlik, Petr', 'affiliation': 'International Institute for Applied Systems Analysis (IIASA)', 'orcid': '0000-0001-5551-5085'}, {'name': 'Krey, Volker', 'affiliation': 'International Institute for Applied Systems Analysis 
(IIASA)', 'orcid': '0000-0003-0307-3515'}], 'keywords': ['integrated assessment model', 'scenario', 'no-policy baseline'], 'related_identifiers': [{'identifier': '10.1016/j.envsoft.2018.11.012', 'relation': 'cites', 'resource_type': 'publication-article', 'scheme': 'doi'}, 
{'identifier': '10.22022/iacc/03-2021.17115', 'relation': 'cites', 'resource_type': 'publication-report', 'scheme': 'doi'}, {'identifier': '10.1038/s41558-021-01215-2', 'relation': 'isSupplementTo', 'resource_type': 'publication-article', 'scheme': 'doi'}, {'identifier': '1
0.1038/s41558-021-01218-z', 'relation': 'isSupplementTo', 'resource_type': 'publication-article', 'scheme': 'doi'}, {'identifier': '10.1088/1748-9326/ac09ae', 'relation': 'isSupplementTo', 'resource_type': 'publication-article', 'scheme': 'doi'}, {'identifier': '10.1038/s41
893-021-00772-w', 'relation': 'isSupplementTo', 'resource_type': 'publication-article', 'scheme': 'doi'}, {'identifier': '10.5281/zenodo.5553976', 'relation': 'isSupplementTo', 'resource_type': 'dataset', 'scheme': 'doi'}], 'version': '1.1', 'language': 'eng', 'grants': [{'
id': '10.13039/501100000780::821471'}], 'license': 'cc-by-sa-4.0', 'imprint_publisher': 'Zenodo', 'communities': [{'identifier': 'engage-climate'}, {'identifier': 'iiasa'}, {'identifier': 'iiasa-ece'}, {'identifier': 'message-ix'}], 'upload_type': 'dataset', 'prereserve_doi
': {'doi': '10.5281/zenodo.5793870', 'recid': 5793870}}, 'title': 'MESSAGEix-GLOBIOM R11 no-policy baseline', 'links': {'self': 'https://zenodo.org/api/records/5793870', 'self_html': 'https://zenodo.org/records/5793870', 'self_doi': 'https://zenodo.org/doi/10.5281/zenodo.57
93870', 'doi': 'https://doi.org/10.5281/zenodo.5793870', 'parent': 'https://zenodo.org/api/records/5793869', 'parent_html': 'https://zenodo.org/records/5793869', 'parent_doi': 'https://zenodo.org/doi/10.5281/zenodo.5793869', 'self_iiif_manifest': 'https://zenodo.org/api/iii
f/record:5793870/manifest', 'self_iiif_sequence': 'https://zenodo.org/api/iiif/record:5793870/sequence/default', 'files': 'https://zenodo.org/api/records/5793870/files', 'media_files': 'https://zenodo.org/api/records/5793870/media-files', 'archive': 'https://zenodo.org/api/
records/5793870/files-archive', 'archive_media': 'https://zenodo.org/api/records/5793870/media-files-archive', 'latest': 'https://zenodo.org/api/records/5793870/versions/latest', 'latest_html': 'https://zenodo.org/records/5793870/latest', 'draft': 'https://zenodo.org/api/re
cords/5793870/draft', 'versions': 'https://zenodo.org/api/records/5793870/versions', 'access_links': 'https://zenodo.org/api/records/5793870/access/links', 'access_users': 'https://zenodo.org/api/records/5793870/access/users', 'access_request': 'https://zenodo.org/api/recor
ds/5793870/access/request', 'access': 'https://zenodo.org/api/records/5793870/access', 'reserve_doi': 'https://zenodo.org/api/records/5793870/draft/pids/doi', 'communities': 'https://zenodo.org/api/records/5793870/communities', 'communities-suggestions': 'https://zenodo.org
/api/records/5793870/communities-suggestions', 'requests': 'https://zenodo.org/api/records/5793870/requests'}, 'record_id': 5793870, 'owner': 233639, 'files': [{'id': '96ec5297-801c-4fe8-b797-2804e88784c6', 'filename': 'MESSAGEix-GLOBIOM_1.1_R11_no-policy_baseline.xlsx', 'f
ilesize': 135453950, 'checksum': '222193405c25c3c29cc21cbae5e035f4', 'links': {'self': 'https://zenodo.org/api/records/5793870/files/96ec5297-801c-4fe8-b797-2804e88784c6'}}], 'state': 'done', 'submitted': True}                              
Traceback (most recent call last):
  File "/home/khaeru/vc/iiasa/models/bug.py", line 14, in <module>
    result = p.fetch(list(args["registry"].keys())[0])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/khaeru/.venv/3.11/lib/python3.11/site-packages/pooch/core.py", line 588, in fetch
    stream_download(
  File "/home/khaeru/.venv/3.11/lib/python3.11/site-packages/pooch/core.py", line 803, in stream_download
    downloader(url, tmp, pooch)
  File "/home/khaeru/.venv/3.11/lib/python3.11/site-packages/pooch/downloaders.py", line 605, in __call__
    download_url = data_repository.download_url(file_name)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/khaeru/.venv/3.11/lib/python3.11/site-packages/pooch/downloaders.py", line 805, in download_url
    files = {item["key"]: item for item in self.api_response["files"]}
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/khaeru/.venv/3.11/lib/python3.11/site-packages/pooch/downloaders.py", line 805, in <dictcomp>
    files = {item["key"]: item for item in self.api_response["files"]}
             ~~~~^^^^^^^
KeyError: 'key'


paddyroddy commented 11 months ago

I have the same problem, and couldn't work it out. Thanks for identifying!

santisoler commented 11 months ago

Thanks @khaeru for opening this issue. I can reproduce the error with the script you shared, and I've also confirmed that our tests are failing because of the change in the Zenodo API.

I'll try to work on a quick bugfix and make a new release. At the moment I see that https://developers.zenodo.org is down, so I won't be able to get further information about the new API and will have to rely on the JSON structure and try to cover most cases.

I'll probably ping you and @paddyroddy so you can test the bugfix against your use cases.

Thanks again for the detailed report! It's super helpful!

PS: I'm leaving the url for the API response for the repository in the example, just to have a quick way to access it when memory isn't helping: https://zenodo.org/api/records/5793870
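For anyone who wants to inspect that response without editing pooch, a rough standard-library sketch follows. `fetch_record` and `file_entry_fields` are hypothetical helper names, not part of pooch, and the sample dictionary is a trimmed copy of the entry quoted in this issue rather than a live response.

```python
import json
import urllib.request


def fetch_record(url, timeout=30):
    """Download and decode the JSON description of a Zenodo record."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def file_entry_fields(record):
    """Return the sorted field names of each entry in the record's "files" list."""
    return [sorted(entry) for entry in record.get("files", [])]


# Offline demonstration using the entry quoted earlier in this issue.
sample = {
    "files": [
        {
            "id": "96ec5297-801c-4fe8-b797-2804e88784c6",
            "filename": "MESSAGEix-GLOBIOM_1.1_R11_no-policy_baseline.xlsx",
            "filesize": 135453950,
            "checksum": "222193405c25c3c29cc21cbae5e035f4",
            "links": {
                "self": "https://zenodo.org/api/records/5793870/files/96ec5297-801c-4fe8-b797-2804e88784c6"
            },
        }
    ]
}
print(file_entry_fields(sample))
```

To check the live record, call `file_entry_fields(fetch_record("https://zenodo.org/api/records/5793870"))`; what it returns depends on the state of the Zenodo API at the time.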

paddyroddy commented 11 months ago

Thanks, @santisoler. For reference, mine is just from here: https://github.com/astro-informatics/sleplet/blob/12efdcd8d1b65900de7cea736c20d60c224aa9f4/src/sleplet/_data/setup_pooch.py#L11-L17

import pooch

_ZENODO_DATA_DOI = "10.5281/zenodo.7767698"
_POOCH = pooch.create(
    path=pooch.os_cache("sleplet"),
    base_url=f"doi:{_ZENODO_DATA_DOI}/",
    registry=None,
)
_POOCH.load_registry_from_doi()

santisoler commented 11 months ago

Pooch v1.8.0 has been released, including the bugfix we merged in #375 to solve this issue.

The new release is already available in PyPI: https://pypi.org/project/pooch/

Availability through conda-forge might take a few hours.

Thanks again @khaeru for opening this issue, and thanks to all of you who reported back.

paddyroddy commented 11 months ago

Thank you for the quick fix!

santisoler commented 10 months ago

Just to keep everyone in the loop: I received another reply from Zenodo. They restored the old behaviour of the API, so our downloader is using the "legacy" version of the API. For the repository that @khaeru shared in the example, now we have the following list of files:

"files": [
    {
      "id": "878b8528-7706-436e-9536-b2a1a838ce14",
      "key": "santisoler/pooch-test-data-v1.zip",
      "size": 893,
      "checksum": "md5:6cdda261f5646a4089966fd0bf505233",
      "links": {"self": "https://zenodo.org/api/records/7632643/files/santisoler/pooch-test-data-v1.zip/content"}
    }
],

Note that:

Since the support for this API was kept, our downloader is still working just fine!
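Comparing the two file entries quoted in this thread, the visible differences are the field names ("key" and "size" versus "filename" and "filesize") and the "md5:" prefix on the checksum. A rough sketch of reducing both to one shape is below. `normalize_file_entry` is a hypothetical helper and not how Pooch's downloader is implemented; treating a bare checksum as MD5 is an assumption based only on the record in this issue, whose registry hash was md5:222193405c25c3c29cc21cbae5e035f4.

```python
def normalize_file_entry(entry):
    """Reduce either response shape to a common dict (hypothetical helper)."""
    name = entry["key"] if "key" in entry else entry["filename"]
    size = entry["size"] if "size" in entry else entry["filesize"]
    checksum = entry["checksum"]
    if ":" not in checksum:
        # Assumption: a checksum without an algorithm prefix is MD5.
        checksum = f"md5:{checksum}"
    return {
        "name": name,
        "size": size,
        "checksum": checksum,
        "url": entry["links"]["self"],
    }


legacy_entry = {
    "id": "878b8528-7706-436e-9536-b2a1a838ce14",
    "key": "santisoler/pooch-test-data-v1.zip",
    "size": 893,
    "checksum": "md5:6cdda261f5646a4089966fd0bf505233",
    "links": {
        "self": "https://zenodo.org/api/records/7632643/files/santisoler/pooch-test-data-v1.zip/content"
    },
}
migrated_entry = {
    "id": "96ec5297-801c-4fe8-b797-2804e88784c6",
    "filename": "MESSAGEix-GLOBIOM_1.1_R11_no-policy_baseline.xlsx",
    "filesize": 135453950,
    "checksum": "222193405c25c3c29cc21cbae5e035f4",
    "links": {
        "self": "https://zenodo.org/api/records/5793870/files/96ec5297-801c-4fe8-b797-2804e88784c6"
    },
}

for entry in (legacy_entry, migrated_entry):
    print(normalize_file_entry(entry))
```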

paddyroddy commented 10 months ago

Unrelated, but I've noticed that the API seems to rate-limit a lot more than it used to, i.e. my CI is breaking. Have you noticed the same?

santisoler commented 10 months ago

I've rerun Pooch's tests both locally and through GitHub Actions and I haven't noticed that.

If this persists, I would recommend getting in touch with Zenodo; they are super responsive.