Archiver workflow sometimes creates unpublished draft depositions

zaneselvans commented 1 year ago

After adding several new archivers to the run-archivers workflow in the course of #65 and running all the workflows multiple times, I'm seeing some behavior that confuses me. The newly added archivers are:

eia176
eia923 (didn't make it into the list before)
eiawater
mshamines
phmsagas

My expectation was that if any changes were detected for any of the archived datasets (old or new) a new deposition would be created on Zenodo with the new data in it, and regardless of whether a new archive was published, no draft deposition would remain after the archiving process had completed.

I'm now able to run the run-archivers workflow manually via the workflow-dispatch, and all of the actions "succeed" ✅. What happens with the various archivers seemed to differ:

`censusdp1tract`

Initially the archiver was failing, for reasons I didn't understand.
I searched for Census DP1 archives on Zenodo Sandbox and found 2 different lineages of archives, one of which was started in January, and wondered if the new archiver might somehow be incompatible with the old lineage (started in 2020), so I switched the sandbox DOI over to the newer lineage, but this didn't fix it. It did seem to try and create a v2.0.0 archive though, which it hadn't been doing before.
Then I looked for draft archives, and saw that there were unpublished drafts for both the old and new lineages. I deleted them, and re-ran the run-archivers workflow, and this time it succeeded ✅.
However, it also created a new unpublished draft, which contains files with identical MD5 checksums as the current published version, but a new version (v2.0.0) and a new DOI.

`eia176` (first time w/ workflow)

The archiver runs and succeeds ✅, and (correctly) does not create a new published deposition.
However, like the censusdp1tract archiver, it seems to create an unpublished draft, even though the MD5 checksums for all the files are identical. The new unpublished draft has a different DOI and a new version (v3.0.0) compared to the previously published deposition.

`eia860`

Running the archiver did not result in either a newly published archive, or a new unpublished draft.
However, an old unpublished draft appears to be hanging around from October, 2022, with the same MD5 sums, but a new reserved DOI and a new version.

`eia860m`

Running the archiver did not result in either a newly published archive, or a new unpublished draft.
However, an old unpublished draft appears to be hanging around from February 6th, 2023, with the same MD5 sums, but a new reserved DOI and a new version.

`eia861`

Running the archiver did not result in either a newly published archive, or a new unpublished draft.
However, an old unpublished draft appears to be hanging around from February 2nd, 2023, with the same MD5 sums, but a new reserved DOI and a new version.

`eia923` (first time w/ workflow)

Running the archiver did not result in a newly published archive, but did create a new unpublished draft.
The new unpublished draft looks like it has the same MD5 sums, but a new reserved DOI and a new version.

`eia_bulk_elec`

Running the archiver did not result in either a newly published archive, or a new unpublished draft.
However, an old unpublished draft appears to be hanging around from February 3rd, 2023, with the same MD5 sums, but a new reserved DOI and a new version.

`eiawater` (first time w/ workflow)

Running the archiver did not result in a newly published archive, but did create a new unpublished draft.
The new unpublished draft looks like it has the same MD5 sums, but a new reserved DOI and a new version.

`epacamd_eia`

Running the archiver did not result in a new published archive or a new draft.
However, an old draft from September 8th, 2022 is still hanging around.
Also, the unpublished draft only contains the epacamd_eia.zip file, and not the datapackage.json, and the MD5 sum for the zipfile is different from that of the most recently published archive.

`epacems`

Running the archiver did not result in a new published archive, even though it definitely should have. The current archive is more than 2 years out of date.
It also didn't create a new draft, but there was an old draft from April, 2021 which contains no uploaded files.
I deleted the old draft and re-ran the archiving workflow, to see what would happen with the substantial downloads required to make a new CEMS archive.
With no blocking draft deposition, the CEMS archiver is running much longer. This seems to suggest that the existence of the draft deposition prevented the download of new raw data from EPA entirely, which means it wouldn't have been able to compare the checksums of the raw files with those stored on Zenodo to even figure out if it needed to make a new archive.
It looks like EPA has taken down the bulk EPA CEMS files entirely. When they archiver ran it got a 404 on the old index page. It looks like the CEMS data may now only be available via an API. EPA has a repo with API examples and the API documentation

`ferc1`

A new deposition is created every time the archiver is run, because the contents of the data downloaded from RSS is different every time (probably there are some timestamps or something).
In addition to the new deposition being created with a new version and DOI, a draft deposition is created and left unpublished. This draft deposition appears identical to the most recently published version -- the MD5 checksums for the files are the same, and the reserved DOI in the draft deposition is identical to the actual DOI for the new deposition that got published.

`ferc2`

As with FERC Form 1, a new deposition is created every time the archiver is run, again with all the XBRL (RSS) derived files having different checksums.
However, unlike the FERC Form 1, no draft version of the deposition is left hanging around. The drafts seem to get published and disappear, as expected.

`ferc6`

As with FERC Form 1, a new deposition is created every time the archiver is run, again with all the XBRL (RSS) derived files having different checksums.
However, unlike the FERC Form 1, no draft version of the deposition is left hanging around. The drafts seem to get published and disappear, as expected.

`ferc60`

As with FERC Form 1, a new deposition is created every time the archiver is run, again with all the XBRL (RSS) derived files having different checksums.
However, unlike the FERC Form 1, no draft version of the deposition is left hanging around. The drafts seem to get published and disappear, as expected.

`ferc714`

As with FERC Form 1, a new deposition is created every time the archiver is run, again with all the XBRL (RSS) derived files having different checksums.
However, unlike the FERC Form 1, no draft version of the deposition is left hanging around. The drafts seem to get published and disappear, as expected.

`mshamines` (first time w/ workflow)

No new published version is created.
An unpublished draft deposition is created, with a new version and DOI, but apparently the same MD5 sums.

`phmsagas` (first time w/ workflow)

No new published version is created.
An unpublished draft deposition is created, with a new version and DOI, but apparently the same MD5 sums.

Patterns?

It seems like any time there's a pre-existing unpublished draft, it blocks a new archive from being created.
It also seems like maybe when no update is required because there's been no change to any of the files, the files are still getting uploaded to Zenodo unnecessarily, creating an unpublished draft (I think if none of the file checksums change, Zenodo may not allow a new version to be published?)
It seems like maybe the RSS/XBRL sources are working as expected because every single archive requires an update.
However the FERC Form 1 doesn't seem to follow this pattern.

zaneselvans commented 1 year ago

@zschira what do you think of all this?

zschira commented 1 year ago

It also seems like maybe when no update is required because there's been no change to any of the files, the files are still getting uploaded to Zenodo unnecessarily, creating an unpublished draft (I think if none of the file checksums change, Zenodo may not allow a new version to be published?)

Creating a new draft is expected behavior, this happens at the start of a run, and if nothing changes the draft will not be published. This draft will have the exact same contents as the previous version as this is zenodo's behavior when creating a new draft. Nothing should be uploaded to create this new version. The existence of a draft should also not block future runs. If the archiver finds an existing unpublished draft it will just re-use it. I think we could change this behavior to not create a draft until there are detected changes, but it should be able to reuse drafts without a problem

It seems like maybe the RSS/XBRL sources are working as expected because every single archive requires an update.

I did some digging on why every run is requiring updates, and it looks like the names of the XBRL files are changing. These names come from the guid field from the rss feed, which is supposed to be a unique identifier, but as far as I can tell FERC is just putting random ID's in this field every time you download the RSS feed. This is definitely not desired, so I can email FERC about it, and/or we can change where we get the filing name from.

It looks like EPA has taken down the bulk EPA CEMS files entirely. When they archiver ran it got a 404 on the old index page. It looks like the CEMS data may now only be available via an API. EPA has a repo with API examples and the API documentation

This must be a quite recent change, we'll need to update to use the new API. This does highlight a place we might want to error out though. I ran the EPACEMS archiver and it's logging a warning that no files to download could be found, but that should probably result in a failure.

General thoughts

In general, I think some of these sandbox depositions are just in a weird state right now, potentially from all the testing that we've been doing recently. For example, I ran the epacamd_eia archiver and it says nothing should change even though the checksum of the zipfile on the most recent published version doesn't match the checksum of the downloaded file. This is because the draft deposition it's using has the new zipfile, so the archiver doesn't think there are any changes. @zaneselvans mentions the draft is from September 8th, so perhaps this version was created with the old pudl-zenodo-storage tool that didn't automatically publish new versions. We could try to go through dataset by dataset and clean up these archives, or because it's the sandbox we could just run --initialize on each dataset and start fresh to get out of any weird state situations.

I think changing the archiver to not create a new draft deposition until after it has detected changes would be good, and maybe help to avoid some of these weird state issues. It would also be nice if there's a programmatic way to delete draft depositions. I think just deleting any draft depositions and not trying to reuse them might avoid some weirdness. I haven't been able to figure out how to do this though, perhaps @zaneselvans has seen something?

Also curious if @jdangerx has any thoughts?

jdangerx commented 1 year ago

@zschira did a good job covering the high level stuff - here's some additional thoughts:

The unpublished draft problem occurs when something like this happens:

old data is in our published deposition
we start an archiver run, making a new draft that has the old deposition data
we download a bunch of new data, see that that's not what's in the draft, and update the draft deposition to have the new data
publish fails for some reason
old data is still the latest published version, but draft deposition has all the new data
we start an archiver run, picking up the draft deposition
we download a bunch of new data, see that that is exactly what is in the draft, and then decide "there are no changes to be made"
now we are stuck in "unpublishable new data" land until the data changes yet again.

I think ideally, when we start an archiver run, we'd delete the old draft and create a new one - then we can avoid this issue altogether. I think we had run into issues using the discard endpoint for this before (I think that's for discarding changes in an edit session for an already published deposition) but we might be able to use the delete endpoint to make sure the new version is actually new - something like this in the Zenodo depositor's get_new_version method:

+        response = await self.request(
+            "POST", url, log_label="Creating new version", headers=self.auth_write
+        )
+        old_deposition = Deposition(**response)
+        # Get url to newest deposition
+        new_deposition_url = old_deposition.links.latest_draft
+
+        # immediately delete new version, in case this was actually an old draft
+        headers = {
+            "Content-Type": "application/json",
+        } | self.auth_write
+        response = await self.request("DELETE", new_deposition_url, headers=headers)
+
+        # re-get a new version
        response = await self.request(
            "POST", url, log_label="Creating new version", headers=self.auth_write
        )
        old_deposition = Deposition(**response)
        # Get url to newest deposition
        new_deposition_url = old_deposition.links.latest_draft

        # existing metadata logic elided from this code snippet 

        response = await self.request(
            "PUT",
            new_deposition_url,
            log_label=f"Updating version number from {previous} ({old_deposition.id_}) to {version_info}",
            data=data,
            headers=headers,
        )
        return Deposition(**response)

jdangerx commented 1 year ago

Small PR to delete depositions at the right time incoming.

catalyst-cooperative / pudl-archiver