Open alistairewj opened 1 year ago
If publish_complete fails, don't you think publish_rollback would fail too?
Plus, if an error occurs after deleting half of the ActiveProject files, you don't want to then delete the PublishedProject files and lose everything.
If publish_complete fails, don't you think publish_rollback would fail too?
Yeah for sure. Still, in my mind, it makes sense conceptually to have a rollback for a potential fail in publish_complete. Or change publish_complete to not include file I/O.
Plus, if an error occurs after deleting half of the ActiveProject files, you don't want to then delete the PublishedProject files and lose everything.
Yes exactly. So rollback would have to be a bit smarter than it is now for correctly restoring an ActiveProject.
Maybe a better solution is to schedule the file deletion as a background task which can be re-run later.
My inclination would be "proceed with publication, email the administrators to tell them Something Is Wrong."
Or change publish_complete to not include file I/O.
Indeed.
But yes... it's unavoidably messy that "publication" has to be synchronized acoross three to four independent systems (filesystem, DB, email, DOI) and this code should be structured better.
Currently the publishing of a project is not a completely atomic operation for the cloud backend due to the use of cloud API calls. We recently tried to publish a project. The deletion of the old files failed due to an unavailable API, which resulted in the project being in a semi-published state.
Here is my rough sketch of the publish project process.
pass
)We've run into an issue now with the cloud backend where this process can fail due to unavailable APIs when we are removing files in the publish_complete step (509 service unavailable). There's not much we can do to control the availability of the API. I can think of two mitigations:
delete_blobs
with the defaults: https://github.com/MIT-LCP/physionet-build/blob/86b9a6d9002e8bcb7b3459dac830cd57ecf08eed/physionet-django/physionet/gcs.py#L115-L120DEFAULT_RETRY_IF_GENERATION_SPECIFIED
, and we don't specify any generation headers, so I don't think we retry at allProjectFiles.publish_complete()
call within the try/except which contains the rollback on an exception. This seems like it would work.publish_rollback()
method to make sure we don't lose files.publish_complete()
just doespass
, so this would work fine currentlyFirst mitigation seems uncontentious. Curious about everyone's thoughts on the second mitigation.
For reference here are the GCS/local calls side by side:
https://github.com/MIT-LCP/physionet-build/blob/86b9a6d9002e8bcb7b3459dac830cd57ecf08eed/physionet-django/project/projectfiles/gcs.py#L106-L122
https://github.com/MIT-LCP/physionet-build/blob/86b9a6d9002e8bcb7b3459dac830cd57ecf08eed/physionet-django/project/projectfiles/local.py#L114-L123