catalyst-cooperative / pudl-archiver

A tool for capturing snapshots of public data sources and archiving them on Zenodo for programmatic use.
MIT License

Publish archives for the month of May 2024 #336

Closed by zschira 1 month ago

zschira commented 2 months ago

Archiver run results

Most errors were acceptable validation errors (new years of data which have grown by >25% in size since the last run). There also seems to be a FERC error, possibly rate limiting, and nrelatb and phmsagas have legitimate errors that need investigating.

- [x] eia176 - Unchanged
- [x] eia191 - Published (fixed datapackage)
- [x] eia757a - Unchanged
- [x] eia860 - Published (fixed datapackage)
- [x] eia860m - Failed because eia860m-2024.zip increased in size by > 25%. This seems expected, so I've manually published. (fixed datapackage)
- [x] eia861 - Unchanged
- [x] eia923 - Failed because eia923-2024.zip increased in size by > 25%. This seems expected, so I've manually published. (fixed datapackage)
- [x] eia930 - Failed because eia930-2024half1.zip increased in size by > 25%. This seems expected, so I've manually published. (fixed datapackage)
- [x] eiaaeo - Unchanged
- [x] eiawater - Unchanged
- [x] eia_bulk_elec - Published (fixed datapackage)
- [x] epacamd_eia - Published (fixed datapackage)
- [x] ferc1 - 2023 data grew by >25%, published
- [x] ferc2 - Failed because ferc2-xbrl-2023.zip increased in size by > 25%. This seems expected, so I've manually published. (fixed datapackage)
- [x] ferc6 - Same as ferc1
- [x] ferc60 - 2023 data grew by >25%, published
- [x] ferc714 - 2023 data grew by >25%, published
- [x] mshamines
- [ ] nrelatb - When trying to get the latest version of the deposition, Zenodo returns an error with the message: "The record has been deleted"
- [x] phmsagas - Validation failure
- [x] epacems - Published
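
Several of the failures above come down to one rule: a file is flagged when its size changes by more than 25% between runs. A minimal sketch of that kind of check follows. This is not the archiver's actual validation code; the function names are hypothetical, and flagging shrinkage as well as growth is an assumption (the failures above only mention increases).

```python
def size_change_fraction(old_bytes: int, new_bytes: int) -> float:
    """Return the relative size change of a file between two archive runs."""
    if old_bytes <= 0:
        raise ValueError("old_bytes must be positive")
    return (new_bytes - old_bytes) / old_bytes


def exceeds_threshold(old_bytes: int, new_bytes: int, threshold: float = 0.25) -> bool:
    """Flag a file whose size changed by more than `threshold`.

    Assumption: a change in either direction counts, so the absolute
    value of the relative change is compared against the threshold.
    """
    return abs(size_change_fraction(old_bytes, new_bytes)) > threshold
```

Under this sketch, a partial-year file that goes from one quarter of data to two would roughly double in size and be flagged, which is the kind of expected growth that was manually approved above.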

Validation failures

For each run that failed because of validation test failures (seen in the GHA logs), add it to the tasklist. Download the run summary JSON by going into the "Upload run summaries" tab of the GHA run for each dataset, and follow the link. Investigate the validation failure.

If the validation failure is deemed ok after manual review (e.g., Q2 of 2024 data doubles the size of a file that only had Q1 data previously, but the new data looks as expected), go ahead and approve the archive and leave a note explaining your decision in the task list.

If the validation failure is blocking (e.g., file format incorrect, whole dataset changes size by 200%), make an issue to resolve it.

- [ ] https://github.com/catalyst-cooperative/pudl-archiver/issues/337

Other failures

For each run that failed because of another reason (e.g., underlying data changes, code failures), create an issue describing the failure and take necessary steps to resolve it.

- [x] https://github.com/catalyst-cooperative/pudl-archiver/issues/338
- [ ] https://github.com/catalyst-cooperative/pudl-archiver/issues/339

Relevant logs

archiver run

aesharpe commented 2 months ago

I found an issue with the 860m datapackage.json file btw. Going to go take a look at the archiver. Long story short, the path gets recorded as:

            "path": "https://zenodo.org/records/ID_NUMBER/files/eia860m-2023.zip",

instead of

            "path": "https://zenodo.org/records/10966105/files/eia860m-2023.zip",
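
A quick way to spot this in a downloaded datapackage.json might look like the sketch below. It assumes the file has a top-level "resources" list whose entries each carry a "path" string, as in the snippet above; the function name and the abbreviated example document are made up for illustration and are not part of the archiver.

```python
import json

PLACEHOLDER = "ID_NUMBER"


def find_placeholder_paths(datapackage: dict, placeholder: str = PLACEHOLDER) -> list[str]:
    """Return every resource path that still contains the unfilled placeholder."""
    marker = f"/records/{placeholder}/"
    return [
        resource["path"]
        for resource in datapackage.get("resources", [])
        if marker in resource.get("path", "")
    ]


# Abbreviated, hypothetical datapackage with one bad and one good path.
example = json.loads(
    """
    {
      "resources": [
        {"path": "https://zenodo.org/records/ID_NUMBER/files/eia860m-2023.zip"},
        {"path": "https://zenodo.org/records/10966105/files/eia860m-2023.zip"}
      ]
    }
    """
)

bad_paths = find_placeholder_paths(example)
```

Here `bad_paths` would hold only the first path, so a non-empty result means the record ID was never substituted.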

zaneselvans commented 2 months ago

Uh oh. The ID_NUMBER thing already has an issue (#332), but previously it only seemed to happen on the sandbox server. @jdangerx was interested in taking a look.

zaneselvans commented 1 month ago

I went ahead and published the May 3rd epacems draft archive because the most recent published version had the ID_NUMBER error.

Is there any reason not to publish ferc6? It seems like it was having the same issue as the other FERC archives, which is resolved.

I'm pretty sure the phmsagas archives are just filename changes, which result in all of the files being deleted and re-created with the new names, so that one can probably be published as well.

zschira commented 1 month ago

I sent a note to Zenodo about problems with the nrelatb deposition. Everything else is published, so closing.