Closed: zaneselvans closed this issue 1 year ago.
@zschira what do you think of all this?
It also seems like when no update is required because none of the files have changed, the files are still getting uploaded to Zenodo unnecessarily, creating an unpublished draft. (I think if none of the file checksums change, Zenodo may not allow a new version to be published?)
Creating a new draft is expected behavior: it happens at the start of a run, and if nothing changes the draft will not be published. The draft will have the exact same contents as the previous version, since that is Zenodo's behavior when creating a new draft; nothing needs to be uploaded to create it. The existence of a draft should also not block future runs. If the archiver finds an existing unpublished draft it will just reuse it. I think we could change this behavior to not create a draft until there are detected changes, but it should be able to reuse drafts without a problem.
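For what it's worth, the reuse behavior described above boils down to a simple selection over the existing versions. A minimal sketch, assuming made-up names (`DepositionVersion` and `find_reusable_draft` are stand-ins, not the archiver's actual classes; Zenodo's deposition records do expose a `submitted` flag):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DepositionVersion:
    """Minimal stand-in for a Zenodo deposition record."""

    id_: int
    submitted: bool  # False means this version is an unpublished draft


def find_reusable_draft(versions: list[DepositionVersion]) -> DepositionVersion | None:
    """Return an existing unpublished draft to reuse, if any exists."""
    drafts = [v for v in versions if not v.submitted]
    return drafts[0] if drafts else None


versions = [
    DepositionVersion(id_=100, submitted=True),   # published v1
    DepositionVersion(id_=101, submitted=False),  # leftover draft from a prior run
]
print(find_reusable_draft(versions).id_)  # reuses the leftover draft: 101
```

If no unpublished draft exists, the caller would fall through to creating a new version instead.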
It seems like maybe the RSS/XBRL sources only appear to be working as expected because every single archive requires an update.
I did some digging on why every run is requiring updates, and it looks like the names of the XBRL files are changing. These names come from the guid field in the RSS feed, which is supposed to be a unique identifier, but as far as I can tell FERC is just putting a random ID in this field every time you download the RSS feed. This is definitely not desired, so I can email FERC about it, and/or we can change where we get the filing name from.
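If we do change where the filing name comes from, one option would be to derive it deterministically from something FERC can't easily randomize, such as the filing's download URL. A hypothetical sketch (`stable_filing_name` and the example URL are illustrative, not the archiver's current behavior):

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlparse


def stable_filing_name(download_url: str) -> str:
    """Derive a deterministic filing name from the download URL rather
    than the volatile RSS <guid>.

    The URL path shouldn't change between RSS fetches, so hashing it
    yields the same archive filename on every run.
    """
    path = PurePosixPath(urlparse(download_url).path)
    digest = hashlib.sha256(str(path).encode()).hexdigest()[:16]
    return f"{path.stem}-{digest}.xbrl"


name1 = stable_filing_name("https://example.com/download/filing123.xbrl")
name2 = stable_filing_name("https://example.com/download/filing123.xbrl")
print(name1 == name2)  # True: same URL yields the same name, run after run
```

With a scheme like this, re-downloading the feed would no longer make every archive look changed.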
It looks like EPA has taken down the bulk EPA CEMS files entirely. When the archiver ran it got a 404 on the old index page. It looks like the CEMS data may now only be available via an API. EPA has a repo with API examples, as well as API documentation.
This must be a quite recent change; we'll need to update to use the new API. This does highlight a place where we might want to error out, though. I ran the EPACEMS archiver and it logs a warning that no files to download could be found, but that should probably result in a failure.
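A sketch of what erroring out could look like (`NoFilesFoundError` and `collect_resources` are hypothetical names, not the archiver's current API):

```python
class NoFilesFoundError(RuntimeError):
    """Raised when an archiver run finds zero downloadable resources."""


def collect_resources(found: list) -> list:
    """Fail hard instead of warning when nothing was found, so a
    vanished upstream source (like the old CEMS index page) fails the
    run instead of "succeeding" with an empty archive."""
    if not found:
        raise NoFilesFoundError("Archiver found no files to download.")
    return found


print(collect_resources(["emissions-2022.zip"]))  # ['emissions-2022.zip']
```

The CI run would then go red the moment an upstream source disappears, rather than a year later when someone notices the archive is empty.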
In general, I think some of these sandbox depositions are just in a weird state right now, potentially from all the testing that we've been doing recently. For example, I ran the epacamd_eia archiver and it says nothing should change, even though the checksum of the zipfile on the most recent published version doesn't match the checksum of the downloaded file. This is because the draft deposition it's using already has the new zipfile, so the archiver doesn't see any changes. @zaneselvans mentions the draft is from September 8th, so perhaps this version was created with the old pudl-zenodo-storage tool that didn't automatically publish new versions. We could try to go through dataset by dataset and clean up these archives, or, because it's the sandbox, we could just run --initialize on each dataset and start fresh to get out of any weird-state situations.
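To sidestep the stale-draft confusion, change detection could always compare the downloaded file against the most recently published version's checksum rather than the draft's. A rough sketch (`file_md5` and `needs_update` are hypothetical helpers, not the archiver's real code):

```python
import hashlib
import tempfile


def file_md5(path: str) -> str:
    """MD5 of a local file, computed in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()


def needs_update(local_path: str, published_checksum: str) -> bool:
    """Compare the downloaded file against the *published* version's
    checksum. Comparing against a possibly-stale draft (as described
    above) can make a real change look like a no-op."""
    return file_md5(local_path) != published_checksum


# Demo with a throwaway file; the real inputs would be the downloaded
# zipfile and the checksum reported on the published deposition.
with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".zip") as tmp:
    tmp.write(b"example archive contents")
    local = tmp.name

published = hashlib.md5(b"example archive contents").hexdigest()
print(needs_update(local, published))  # False: contents match the published checksum
```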
I think changing the archiver to not create a new draft deposition until after it has detected changes would be good, and might help avoid some of these weird-state issues. It would also be nice if there were a programmatic way to delete draft depositions. I think just deleting any draft depositions rather than trying to reuse them might avoid some weirdness. I haven't been able to figure out how to do this, though; perhaps @zaneselvans has seen something?
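For reference, Zenodo's REST API does document deleting an unsubmitted deposition with an HTTP DELETE. A minimal standard-library sketch (not the archiver's depositor code; `draft_url` and `delete_draft` are illustrative names):

```python
import urllib.request


def draft_url(base_url: str, deposition_id: int) -> str:
    """URL for a single deposition in Zenodo's deposition API."""
    return f"{base_url}/api/deposit/depositions/{deposition_id}"


def delete_draft(base_url: str, deposition_id: int, token: str) -> int:
    """Delete an unpublished draft deposition via HTTP DELETE.

    Zenodo documents this for unsubmitted depositions only; published
    versions cannot be removed this way. Returns the HTTP status
    (Zenodo responds 204 on success).
    """
    req = urllib.request.Request(
        f"{draft_url(base_url, deposition_id)}?access_token={token}",
        method="DELETE",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

This only works on drafts, which is exactly what we'd want here: published depositions stay immutable.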
Also curious if @jdangerx has any thoughts?
@zschira did a good job covering the high-level stuff; here are some additional thoughts:
The unpublished draft problem occurs when something like this happens: a run creates a new draft up front, then exits without publishing it (because nothing changed, or because it failed), leaving a stale draft behind for later runs to pick up.
I think ideally, when we start an archiver run, we'd delete the old draft and create a new one; then we can avoid this issue altogether. I think we had run into issues using the discard endpoint for this before (I think that's for discarding changes in an edit session for an already-published deposition), but we might be able to use the delete endpoint to make sure the new version is actually new. Something like this in the Zenodo depositor's get_new_version method:
+ response = await self.request(
+ "POST", url, log_label="Creating new version", headers=self.auth_write
+ )
+ old_deposition = Deposition(**response)
+ # Get url to newest deposition
+ new_deposition_url = old_deposition.links.latest_draft
+
+ # immediately delete new version, in case this was actually an old draft
+ headers = {
+ "Content-Type": "application/json",
+ } | self.auth_write
+ response = await self.request("DELETE", new_deposition_url, headers=headers)
+
+ # re-get a new version
response = await self.request(
"POST", url, log_label="Creating new version", headers=self.auth_write
)
old_deposition = Deposition(**response)
# Get url to newest deposition
new_deposition_url = old_deposition.links.latest_draft
# existing metadata logic elided from this code snippet
response = await self.request(
"PUT",
new_deposition_url,
log_label=f"Updating version number from {previous} ({old_deposition.id_}) to {version_info}",
data=data,
headers=headers,
)
return Deposition(**response)
Small PR to delete depositions at the right time incoming.
After adding several new archivers to the run-archivers workflow in the course of #65 and running all the workflows multiple times, I'm seeing some behavior that confuses me. The newly added archivers are:

- eia176
- eia923 (didn't make it into the list before)
- eiawater
- mshamines
- phmsagas

My expectation was that if any changes were detected for any of the archived datasets (old or new), a new deposition would be created on Zenodo with the new data in it, and that regardless of whether a new archive was published, no draft deposition would remain after the archiving process had completed.

I'm now able to run the run-archivers workflow manually via workflow-dispatch, and all of the actions "succeed" ✅. What happens with the various archivers seems to differ:

- censusdp1tract: … run-archivers workflow, and this time it succeeded ✅.
- eia176 (first time w/ workflow): … censusdp1tract archiver, it seems to create an unpublished draft, even though the MD5 checksums for all the files are identical. The new unpublished draft has a different DOI and a new version (v3.0.0) compared to the previously published deposition.
- eia860
- eia860m
- eia861
- eia923 (first time w/ workflow)
- eia_bulk_elec
- eiawater (first time w/ workflow)
- epacamd_eia: … epacamd_eia.zip file, and not the datapackage.json, and the MD5 sum for the zipfile is different from that of the most recently published archive.
- epacems
- ferc1
- ferc2
- ferc6
- ferc60
- ferc714
- mshamines (first time w/ workflow)
- phmsagas (first time w/ workflow)

Patterns?