catalyst-cooperative / pudl-archiver

A tool for capturing snapshots of public data sources and archiving them on Zenodo for programmatic use.
MIT License

Update archives for the month of April #314

Closed e-belfer closed 2 months ago

e-belfer commented 3 months ago

Review and publish archives

For each of the following archives, find the run status in the GitHub archiver run. If the validation tests pass and no changes are detected, delete the draft. If changes are detected, manually review the archive following the guidelines in step 3 of README.md, then publish the new version. Then check the box here to confirm publication status, adding a note on the outcome (e.g., "v1 published", "no changes detected, draft deleted"):

- [x] eia176 - no changes detected, not published and draft deleted
- [x] eia191 - v4 published
- [x] eia757a - no changes detected, not published and draft deleted
- [x] eia861 - no changes detected, not published and draft deleted
- [x] eia923 - v14 published
- [x] eia930 - v2 published
- [x] eiawater - no changes detected, draft deleted
- [x] eia_bulk_elec - v6 published
- [x] mshamines - v4 published
- [x] nrelatb - no changes detected, deleted draft
- [x] epacems - v7 published
- [x] eia860 - no changes other than `datapackage.json` format, deleted draft
- [x] epacamd_eia - no changes other than `datapackage.json` format, deleted draft

Validation failures

For each run that failed because of validation test failures (seen in the GHA logs), add it to the task list. Download the run summary JSON by going into the "Upload run summaries" tab of the GHA run for each dataset and following the link. Then investigate the validation failure.

If the validation failure is deemed ok (e.g., Q2 of 2024 data doubles the size of a file that only had Q1 data previously), go ahead and approve the archive and leave a note explaining your decision in the task list.

If the validation failure is blocking (e.g., file format incorrect, whole dataset changes size by 200%), make an issue to resolve it.

- [x] eia860m - added a second month of data and almost doubled size, all as expected. v17 published
- [ ] https://github.com/catalyst-cooperative/pudl-archiver/issues/318
- [x] ferc2 - When running locally - "The following files have absolute changes in file size >|25%|: {'ferc2-2007.zip': 0.7273698385642252}"
- [x] ferc6 - 2021-2023 taxonomies contain invalid files, all years of xbrl data have changed in size - re-ran after #323 merged, no validation issue
- [x] ferc60 - 2021 and 2022 taxonomies contain invalid files - re-ran after #323 merged, no validation issue
- [x] ferc714 - When running locally - "The following files have absolute changes in file size >|25%|: {'ferc714-xbrl-2020.zip': 0.4996762784976085, 'ferc714-xbrl-2021.zip': 0.7985890429364467}"
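For context on the fractional numbers in these messages: the `>|25%|` check presumably compares each file's new size to its old size and flags anything whose relative change exceeds 0.25 in magnitude. A minimal sketch of that kind of check (hypothetical, not the archiver's actual code, and assuming the number is `(new - old) / old`):

```python
def file_size_change(old_bytes: int, new_bytes: int) -> float:
    """Fractional change in file size, relative to the old size."""
    return (new_bytes - old_bytes) / old_bytes

# A file that shrinks from 40 KB to ~10.9 KB changes by about -0.727,
# whose absolute value exceeds the 0.25 threshold and fails validation.
change = file_size_change(40_000, 10_905)
assert abs(change) > 0.25
```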

Other failures

For each run that failed because of another reason (e.g., underlying data changes, code failures), create an issue describing the failure and take necessary steps to resolve it.

- [ ] eiaaeo - known issue with DOI marked deleted, waiting on Zenodo response
- [ ] https://github.com/catalyst-cooperative/pudl-archiver/issues/319

Relevant logs

Link to logs from GHA run

zaneselvans commented 2 months ago

What do the file size change numbers mean? Are they the new file size relative to the old file size? Or is that the magnitude of the change observed?

zaneselvans commented 2 months ago

While trying to debug the FERC-714 and FERC-2 archiving validation errors, I got a weird response from the Zenodo Sandbox server that doesn't seem to be happening on the Production server. In the datapackage.json we end up with ID_NUMBER in the path URL rather than the actual record ID, so somehow we're either not getting the ID we need, or failing to substitute the real ID in before uploading the files. But the fact that it happens on the Sandbox and not on Production suggests it's not a problem we're introducing?


```json
{
    "profile": "data-resource",
    "name": "ferc2-xbrl-2023.zip",
    "path": "https://sandbox.zenodo.org/records/ID_NUMBER/files/ferc2-xbrl-2023.zip",
    "title": "ferc2-xbrl-2023.zip",
    "parts": {
        "year": 2023,
        "data_format": "xbrl"
    },
    "encoding": "utf-8",
    "mediatype": "application/zip",
    "format": ".zip",
    "bytes": 8175132,
    "hash": "c9d3346e331535af8a7f622065ef17bf"
}
```
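One way this could happen is if a template path containing a literal `ID_NUMBER` placeholder gets serialized before the real record ID is substituted in. A minimal sketch of the kind of substitution step that appears to be getting skipped (function and structure names are hypothetical):

```python
def resolve_resource_paths(datapackage: dict, record_id: int) -> dict:
    """Replace the ID_NUMBER placeholder in each resource path with the real record ID."""
    for resource in datapackage.get("resources", []):
        resource["path"] = resource["path"].replace("ID_NUMBER", str(record_id))
    return datapackage

dp = {
    "resources": [
        {
            "name": "ferc2-xbrl-2023.zip",
            "path": "https://sandbox.zenodo.org/records/ID_NUMBER/files/ferc2-xbrl-2023.zip",
        }
    ]
}
resolved = resolve_resource_paths(dp, 123456)
# path becomes https://sandbox.zenodo.org/records/123456/files/ferc2-xbrl-2023.zip
```

If the Sandbox API is returning a draft without a final record ID at the point where this substitution runs, the placeholder would survive into the uploaded datapackage.json.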
zaneselvans commented 2 months ago

FERC-714

Looking at the FERC-714 validation failure, now only the 2020 data is changing too much:

"The following files have absolute changes in file size >|25%|: {'ferc714-xbrl-2020.zip': 0.49967830175699846}"

However, the original file is only ~500K, whereas a full year of XBRL data is ~50MB, and all of the pre-existing XBRL files shrank in size, so some kind of change across the board seems to have made everything smaller.

Also, 2020 is not XBRL data that we are currently extracting, so I am going to go ahead and approve this as not currently concerning.
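The reasoning above can be made concrete: a ~50% change on a ~500 KB file is only ~250 KB in absolute terms, tiny compared to a full year of XBRL data, so a percentage-only threshold flags it even though very little data actually changed. A quick illustration (all sizes approximate, taken from the estimates above):

```python
old_size = 500_000       # ferc714-xbrl-2020.zip, roughly 500 KB
full_year = 50_000_000   # a typical full year of XBRL data, roughly 50 MB

absolute_change = old_size * 0.4997  # the flagged ~50% change, about 250 KB

# The flagged change amounts to under 1% of a full year's worth of data.
assert absolute_change / full_year < 0.01
```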

zaneselvans commented 2 months ago

FERC-2

Upon re-running the FERC-2 archiver, it ended up passing just fine. It's still strange that the ferc2-2007.zip file would have shrunk by some huge amount (or changed at all!), so maybe that was due to a glitch in downloading? In any case, I'll go ahead and approve the new archive.

zaneselvans commented 2 months ago

Given that we don't expect the EIA AEO to be updated in 2024 at all, and we have reached out to Zenodo about the deleted DOI several times, I'm going to go ahead and close this issue. We'll have another one next week 😄