catalyst-cooperative / pudl-archiver

A tool for capturing snapshots of public data sources and archiving them on Zenodo for programmatic use.
MIT License

Archiver workflow sometimes creates unpublished draft depositions #67

Closed zaneselvans closed 1 year ago

zaneselvans commented 1 year ago

After adding several new archivers to the run-archivers workflow in the course of #65 and running all the workflows multiple times, I'm seeing some behavior that confuses me. The newly added archivers are the ones marked "(first time w/ workflow)" in the list below.

My expectation was that if any changes were detected for any of the archived datasets (old or new), a new deposition would be created on Zenodo with the new data in it, and that regardless of whether a new archive was published, no draft deposition would remain after the archiving process completed.

I'm now able to run the run-archivers workflow manually via the workflow_dispatch trigger, and all of the actions "succeed" ✅. What happens with the various archivers seems to differ:

censusdp1tract

eia176 (first time w/ workflow)

eia860

eia860m

eia861

eia923 (first time w/ workflow)

eia_bulk_elec

eiawater (first time w/ workflow)

epacamd_eia

epacems

ferc1

ferc2

ferc6

ferc60

ferc714

mshamines (first time w/ workflow)

phmsagas (first time w/ workflow)

Patterns?

zaneselvans commented 1 year ago

@zschira what do you think of all this?

zschira commented 1 year ago

> It also seems like maybe when no update is required because there's been no change to any of the files, the files are still getting uploaded to Zenodo unnecessarily, creating an unpublished draft (I think if none of the file checksums change, Zenodo may not allow a new version to be published?)

Creating a new draft is expected behavior: it happens at the start of a run, and if nothing changes the draft is simply never published. The draft will have the exact same contents as the previous version, because that is Zenodo's behavior when creating a new draft; nothing gets uploaded to create it. The existence of a draft should also not block future runs, since the archiver will just re-use an existing unpublished draft if it finds one. I think we could change this behavior to not create a draft until changes are detected, but it should be able to reuse drafts without a problem.
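
For context, this is roughly what that new-version step looks like against Zenodo's REST API (a simplified synchronous sketch with illustrative names; the real depositor goes through its own async request wrapper):

```python
import requests

ZENODO_API = "https://sandbox.zenodo.org/api"  # sandbox here; production drops "sandbox."

def get_new_version_draft(deposition_id: int, token: str) -> dict:
    """Create (or pick up) the new-version draft for a published deposition.

    The draft starts out with exactly the same files as the published version,
    and nothing is uploaded by this call.
    """
    resp = requests.post(
        f"{ZENODO_API}/deposit/depositions/{deposition_id}/actions/newversion",
        params={"access_token": token},
    )
    resp.raise_for_status()
    # links.latest_draft points at the unpublished draft for this deposition.
    draft_url = resp.json()["links"]["latest_draft"]
    resp = requests.get(draft_url, params={"access_token": token})
    resp.raise_for_status()
    return resp.json()
```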

> It seems like maybe the RSS/XBRL sources are working as expected because every single archive requires an update.

I did some digging on why every run is requiring updates, and it looks like the names of the XBRL files are changing. These names come from the guid field of the RSS feed, which is supposed to be a unique identifier, but as far as I can tell FERC is just putting random IDs in this field every time you download the RSS feed. This is definitely not desired, so I can email FERC about it, and/or we can change where we get the filing name from.
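
A quick way to see the churn (just a diagnostic sketch; the feed URL is a placeholder for whichever FERC RSS feed the archiver actually hits):

```python
import feedparser

# Placeholder; substitute the real FERC XBRL RSS feed URL.
FEED_URL = "https://example.com/ferc/rssfeed"

def snapshot_guids(url: str) -> dict[str, str]:
    """Map each entry's title to its guid for a single fetch of the feed."""
    feed = feedparser.parse(url)
    return {entry.title: entry.id for entry in feed.entries}

first = snapshot_guids(FEED_URL)
second = snapshot_guids(FEED_URL)

# If guids were stable identifiers this set would be empty; in practice the
# same filings come back with different guids on every download.
changed = {title for title in first.keys() & second.keys() if first[title] != second[title]}
print(f"{len(changed)} of {len(first)} entries changed guid between fetches")
```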

> It looks like EPA has taken down the bulk EPA CEMS files entirely. When the archiver ran it got a 404 on the old index page. It looks like the CEMS data may now only be available via an API. EPA has a repo with API examples and the API documentation.

This must be quite a recent change, so we'll need to update the archiver to use the new API. It does highlight a place we might want to error out, though: I ran the EPACEMS archiver and it logs a warning that no files to download could be found, but that should probably result in a failure.
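
Something along these lines (hypothetical function and exception names, just sketching the idea) would turn that warning into a hard failure so the workflow run actually goes red:

```python
import logging

logger = logging.getLogger("pudl_archiver")

class NoFilesFoundError(RuntimeError):
    """Raised when an archiver's listing step finds zero downloadable files."""

def check_download_links(dataset: str, links: list[str]) -> list[str]:
    """Fail loudly instead of warning when there is nothing to download."""
    if not links:
        # A warning lets the GitHub Action "succeed" while archiving nothing;
        # raising makes the broken upstream index impossible to miss.
        raise NoFilesFoundError(f"No downloadable files found for {dataset}.")
    logger.info("Found %d files for %s", len(links), dataset)
    return links
```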

General thoughts

In general, I think some of these sandbox depositions are just in a weird state right now, potentially from all the testing that we've been doing recently. For example, I ran the epacamd_eia archiver and it says nothing should change, even though the checksum of the zipfile on the most recent published version doesn't match the checksum of the downloaded file. This is because the draft deposition it's using already has the new zipfile, so the archiver doesn't think there are any changes. @zaneselvans mentions the draft is from September 8th, so perhaps this version was created with the old pudl-zenodo-storage tool that didn't automatically publish new versions. We could try to go through dataset by dataset and clean up these archives, or, because it's the sandbox, we could just run --initialize on each dataset and start fresh to get out of any weird state situations.
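
For the checksum comparison specifically, the check boils down to something like this (a sketch; Zenodo reports an md5 for each file in a deposition's files listing, sometimes prefixed with "md5:"):

```python
import hashlib
from pathlib import Path

def local_md5(path: Path) -> str:
    """Hex md5 digest of a downloaded file."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def file_changed(downloaded: Path, remote_files: list[dict]) -> bool:
    """True if the downloaded file differs from the copy already on the deposition.

    remote_files is the "files" list from a Zenodo deposition; each entry carries
    a "filename" and a "checksum" (md5, possibly prefixed with "md5:").
    """
    remote = {f["filename"]: f["checksum"].removeprefix("md5:") for f in remote_files}
    existing = remote.get(downloaded.name)
    return existing is None or existing != local_md5(downloaded)
```

The important detail is which deposition remote_files comes from: compared against the draft (which already holds the new zipfile) nothing looks changed, while compared against the latest published version it would.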

I think changing the archiver to not create a new draft deposition until after it has detected changes would be good, and might help to avoid some of these weird state issues. It would also be nice if there were a programmatic way to delete draft depositions; just deleting any existing drafts and not trying to reuse them might avoid some weirdness. I haven't been able to figure out how to do this, though. Perhaps @zaneselvans has seen something?

Also curious if @jdangerx has any thoughts?

jdangerx commented 1 year ago

@zschira did a good job covering the high-level stuff; here are some additional thoughts:

The unpublished draft problem occurs when something like this happens:

  1. old data is in our published deposition
  2. we start an archiver run, making a new draft that has the old deposition data
  3. we download a bunch of new data, see that that's not what's in the draft, and update the draft deposition to have the new data
  4. publish fails for some reason
  5. old data is still the latest published version, but draft deposition has all the new data
  6. we start an archiver run, picking up the draft deposition
  7. we download a bunch of new data, see that that is exactly what is in the draft, and then decide "there are no changes to be made"
  8. now we are stuck in "unpublishable new data" land until the data changes yet again (the stuck state is sketched just after this list).
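
To make that stuck state concrete (hypothetical names; the dicts are filename-to-md5 maps like the ones in Zenodo's files listings):

```python
def run_is_stuck(
    downloaded: dict[str, str],  # md5s of freshly downloaded data
    draft: dict[str, str],       # md5s of files on the unpublished draft
    published: dict[str, str],   # md5s of the latest published version
) -> bool:
    """The "unpublishable new data" state from steps 5-8: the draft already
    matches the new data, so the archiver sees nothing to do, but the latest
    published version is still stale."""
    return downloaded == draft and draft != published
```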

I think ideally, when we start an archiver run, we'd delete the old draft and create a new one - then we can avoid this issue altogether. I think we had run into issues using the discard endpoint for this before (I think that's for discarding changes in an edit session for an already published deposition) but we might be able to use the delete endpoint to make sure the new version is actually new - something like this in the Zenodo depositor's get_new_version method:

+        response = await self.request(
+            "POST", url, log_label="Creating new version", headers=self.auth_write
+        )
+        old_deposition = Deposition(**response)
+        # Get url to newest deposition
+        new_deposition_url = old_deposition.links.latest_draft
+
+        # immediately delete new version, in case this was actually an old draft
+        headers = {
+            "Content-Type": "application/json",
+        } | self.auth_write
+        response = await self.request("DELETE", new_deposition_url, headers=headers)
+
+        # re-get a new version
        response = await self.request(
            "POST", url, log_label="Creating new version", headers=self.auth_write
        )
        old_deposition = Deposition(**response)
        # Get url to newest deposition
        new_deposition_url = old_deposition.links.latest_draft

        # existing metadata logic elided from this code snippet 

        response = await self.request(
            "PUT",
            new_deposition_url,
            log_label=f"Updating version number from {previous} ({old_deposition.id_}) to {version_info}",
            data=data,
            headers=headers,
        )
        return Deposition(**response)
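
(If I'm reading Zenodo's docs right, DELETE is only allowed on unpublished depositions, so this can't remove a published version; that makes the delete-then-recreate step safe to run unconditionally.)
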
jdangerx commented 1 year ago

Small PR to delete depositions at the right time incoming.