catalyst-cooperative / pudl-archiver

A tool for capturing snapshots of public data sources and archiving them on Zenodo for programmatic use.
MIT License

Fix broken archivers #285

Closed e-belfer closed 3 months ago

e-belfer commented 4 months ago

In #276 we encountered the following archiver errors:

All FERC archivers fail with the following error:

2024-02-20 14:16:35,954 [webCache:cacheDownloadRenamingError] [Errno 2] No such file or directory: '/home/runner/.config/arelle/cache/http/www.xbrl.org/2003/xbrl-linkbase-2003-12-31.xsd.tmp' -> '/home/runner/.config/arelle/cache/http/www.xbrl.org/2003/xbrl-linkbase-2003-12-31.xsd' 
Unsuccessful renaming of downloaded file to active file /home/runner/.config/arelle/cache/http/www.xbrl.org/2003/xbrl-linkbase-2003-12-31.xsd 
Please remove with file manager. - 

File "/home/runner/micromamba/envs/pudl-cataloger/lib/python3.11/site-packages/aiohttp/client.py", line 449, in _request
    url = self._build_url(str_or_url)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/micromamba/envs/pudl-cataloger/lib/python3.11/site-packages/aiohttp/client.py", line 376, in _build_url
    url = URL(str_or_url)
          ^^^^^^^^^^^^^^^
  File "/home/runner/micromamba/envs/pudl-cataloger/lib/python3.11/site-packages/yarl/_url.py", line 179, in __new__
    raise TypeError("Constructor parameter should be str")
TypeError: Constructor parameter should be str

The `eiawater` archiver gets a "record has been deleted" response that seems incorrect. I have contacted Zenodo about it.

### Tasks
- [ ] Fix FERC archivers
- [x] Fix `eiawater` archive after Zenodo response

zaneselvans commented 3 months ago

@e-belfer I just noticed that the EIA cooling water archives are not currently getting turned into annual zipfiles. Does that conflict with the recent change in the PUDL repo that removed the EIA-860M not-a-zipfile special case?

e-belfer commented 3 months ago

We are not currently extracting these files at all, so we'll just have to handle it when we do. My suggestion is just to subclass the load method there to expect a zipfile, not a massive change. If it becomes a more generalized case we can either zip or handle Excel files again.

jdangerx commented 3 months ago

We are running into problems because arelle's taxonomy loading doesn't support concurrency. We can work around this in a couple of ways:

  1. we can make each FERC dataset run in its own GHA runner
  2. we can add retries for the FileExistsError
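
For option 2, a minimal sketch of what a retry wrapper could look like. This is not code from the archiver: `retry_on_error`, `max_attempts`, and `base_delay` are hypothetical names, and it assumes the taxonomy load can simply be called again once a concurrent writer has finished with the cache file.

```python
import time
from functools import wraps


def retry_on_error(exceptions=(FileExistsError,), max_attempts=3, base_delay=0.5):
    """Retry the wrapped callable when it raises one of `exceptions`.

    Hypothetical helper, not part of pudl-archiver or arelle.
    Waits base_delay * 2**(attempt - 1) seconds between attempts.
    """

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    # Out of attempts: let the original error propagate.
                    if attempt == max_attempts:
                        raise
                    # Back off so the competing writer can finish.
                    time.sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator
```

Whatever function triggers the taxonomy load would get the decorator. The log above shows an `[Errno 2] No such file or directory` message, so it may be worth passing `exceptions=(FileExistsError, FileNotFoundError)` if that error also surfaces as an exception. Retries only paper over the race; option 1 avoids it entirely.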