catalyst-cooperative / pudl-archiver

A tool for capturing snapshots of public data sources and archiving them on Zenodo for programmatic use.
MIT License

Publish August 1st 2024 archives #398

github-actions[bot] closed this issue 3 months ago

github-actions[bot] commented 3 months ago

Summary of results:

See the job run logs and results here. Second run of CEMS and NREL ATB data here.

Review and publish archives

For each of the following archives, find the run status in the GitHub archiver run. If the validation tests pass, manually review the archive: if no changes are detected, delete the draft; if changes are detected, review the archive following the guidelines in step 3 of README.md, then publish the new version. Then check the box here to confirm publication status, adding a note on the status (e.g., "v1 published", "no changes detected, draft deleted"). A minimal sketch of the corresponding Zenodo API calls follows the checklist.

- [x] eia176 - No changes, draft deleted.
- [x] eia191 - v9.0.0 published
- [x] eia757a - No changes detected, draft deleted
- [x] eia860 - No changes detected, draft deleted
- [x] eia860m - v23.0.0 published
- [x] eia861 - No changes detected, draft deleted
- [x] eia923 - v20.0.0 published
- [x] eia930 - v7.0.0 published
- [x] eiaaeo - No changes detected, draft deleted
- [x] eiawater - No changes detected, draft deleted
- [x] eia_bulk_elec - v11.0.0 published
- [x] epacamd_eia - No changes detected, draft deleted
- [x] ferc1 - These are all partition changes due to the changes made in #362; FERC 2023 XBRL also seems to have additional data. v15.0.0 published. See #399 for future tracking of the partition change issue if it continues to be a problem.
- [x] ferc2 - Again, these are all partition changes, with the exception of additional 2023 XBRL data. v10.0.0 published.
- [x] ferc6 - Same as above, v7.0.0 published.
- [x] ferc60 - Same as above, v8.0.0 published
- [x] ferc714 - Same as above, v11.0.0 published
- [x] mshamines - v8.0.0 published
- [x] nrelatb - Slight size change in the 2024 Parquet file, but the data was manually inspected and is identical. v3.0.0 published.
- [x] phmsagas - v8.0.0 published
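
For reference, the publish and "no changes" delete steps map onto two Zenodo deposition API calls. This is a minimal sketch, not the archiver's own tooling; the `ZENODO_TOKEN` environment variable and the deposition IDs are placeholders you'd fill in from the draft links above.

```python
"""Minimal sketch: publish a reviewed draft or discard a no-change draft
via the Zenodo deposition API. ZENODO_TOKEN and the deposition ID are
placeholders, not values from the archiver run."""
import os

import requests

ZENODO_API = "https://zenodo.org/api/deposit/depositions"
TOKEN = os.environ["ZENODO_TOKEN"]  # personal access token with deposit scope


def publish_draft(deposition_id: int) -> None:
    """Publish a manually reviewed draft deposition."""
    resp = requests.post(
        f"{ZENODO_API}/{deposition_id}/actions/publish",
        params={"access_token": TOKEN},
    )
    resp.raise_for_status()


def delete_draft(deposition_id: int) -> None:
    """Discard an unpublished draft when no changes were detected."""
    resp = requests.delete(
        f"{ZENODO_API}/{deposition_id}",
        params={"access_token": TOKEN},
    )
    resp.raise_for_status()
```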

Validation failures

For each run that failed because of validation test failures (seen in the GHA logs), add it to the task list. Download the run summary JSON by going to the "Upload run summaries" tab of the GHA run for each dataset and following the link. Investigate the validation failure.
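
When digging into a failure, the downloaded summary can be skimmed with a few lines of Python. This is a minimal sketch only; the field names (`tests`, `name`, `success`, `notes`) are assumptions about the summary layout rather than a documented schema, so adjust them to whatever the actual file contains.

```python
"""Minimal sketch for skimming a downloaded run summary JSON. The keys used
here (tests, name, success, notes) are assumed, not a documented schema."""
import json
import sys
from pathlib import Path

summary = json.loads(Path(sys.argv[1]).read_text())

# Print only the validation tests that failed, along with any notes.
tests = summary if isinstance(summary, list) else summary.get("tests", [])
for test in tests:
    if not test.get("success", True):
        print(f"FAILED: {test.get('name')}")
        for note in test.get("notes", []):
            print(f"  - {note}")
```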

If the validation failure is deemed ok after manual review (e.g., Q2 of 2024 data doubles the size of a file that only had Q1 data previously, but the new data looks as expected), go ahead and approve the archive and leave a note explaining your decision in the task list.

If the validation failure is blocking (e.g., file format incorrect, whole dataset changes size by 200%), make an issue to resolve it.

Other failures

For each run that failed because of another reason (e.g., underlying data changes, code failures), create an issue describing the failure and take necessary steps to resolve it.

- [x] epacems - The logs show the 2023 file upload was interrupted, and the file is significantly smaller than it was previously. A re-run fixed this issue. v12.0.0 published.
zaneselvans commented 3 months ago

It seems a little fishy to me that the ferc1 run took 2 hours but the ferc2 run only took 3 minutes, given that their archives should end up being about the same size and almost all of the ferc2 files got updated.

e-belfer commented 3 months ago

Still working my way through the archives; I'll take a look.

e-belfer commented 3 months ago

Everything has been inspected and published.

zaneselvans commented 3 months ago

Would it be easy to automate checking for the kind of failed upload that CEMS experienced this time around? Like checking that all the files in the datapackage are actually in the draft deposition and have the same checksums?

e-belfer commented 3 months ago

The datapackage and checksums are produced at the end from the files uploaded, so I'm not exactly sure what you're proposing? We already check file size against the last upload. This seems to be some kind of problem with the way that 502 errors are getting retried.

zaneselvans commented 3 months ago

I was imagining that we could calculate the file size and/or checksums locally, and compare to the file sizes and/or checksums that are reported on Zenodo, and if they don't match, raise an error.

Are you saying that the filesizes & checksums that end up in the datapackage.json are being populated based on the information on Zenodo, rather than the local files?

e-belfer commented 3 months ago

Ah yes, that would be a pretty straightforward validation! I'll write up an issue.
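
Something along these lines could serve as a starting point. It's a minimal sketch, assuming the locally built files are still on disk and using the standard Zenodo deposition files endpoint; `ZENODO_TOKEN`, `verify_draft`, and the directory layout are placeholders, not the archiver's actual interfaces.

```python
"""Minimal sketch of the proposed check: compare each local file's md5 against
the checksum Zenodo reports for the draft deposition. Names are placeholders."""
import hashlib
import os
from pathlib import Path

import requests

ZENODO_API = "https://zenodo.org/api/deposit/depositions"
TOKEN = os.environ["ZENODO_TOKEN"]


def local_md5(path: Path) -> str:
    """Compute the md5 of a local file in 1 MiB chunks."""
    md5 = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest()


def verify_draft(deposition_id: int, local_dir: Path) -> None:
    """Raise if any file uploaded to the draft doesn't match its local copy."""
    resp = requests.get(
        f"{ZENODO_API}/{deposition_id}/files",
        params={"access_token": TOKEN},
    )
    resp.raise_for_status()
    # Zenodo may prefix checksums with "md5:"; strip it before comparing.
    remote = {
        f["filename"]: f["checksum"].removeprefix("md5:") for f in resp.json()
    }
    for path in sorted(local_dir.iterdir()):
        expected = local_md5(path)
        actual = remote.get(path.name)
        if actual != expected:
            raise ValueError(
                f"{path.name}: local md5 {expected} != Zenodo checksum {actual}"
            )
```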