MIT-LCP / physionet-build

The new PhysioNet platform.
https://physionet.org/
BSD 3-Clause "New" or "Revised" License

Publishing a project is not fully atomic for cloud backend #1860

Open alistairewj opened 1 year ago

alistairewj commented 1 year ago

Currently, publishing a project is not a completely atomic operation for the cloud backend, because it relies on cloud API calls. We recently tried to publish a project: the deletion of the old files failed because the API was unavailable, which left the project in a semi-published state.

Here is my rough sketch of the publish project process.

We've now run into an issue with the cloud backend where this process can fail while we are removing the old files in the publish_complete step, because the API is unavailable (503 Service Unavailable). There's not much we can do to control the availability of the API itself. I can think of two mitigations:

  1. A better retry policy for cloud backend actions. Currently we call delete_blobs with the library defaults (a rough sketch of a tighter policy follows this list): https://github.com/MIT-LCP/physionet-build/blob/86b9a6d9002e8bcb7b3459dac830cd57ecf08eed/physionet-django/physionet/gcs.py#L115-L120
  2. Move the ProjectFiles.publish_complete() call inside the try/except that performs the rollback on an exception. This seems like it would work.
    • For GCS, this means that if we fail to delete the old files, we call rollback. Currently rollback just deletes the newly created bucket, with no guarantee that we haven't left the project files in an undetermined state, so we'd need to update the publish_rollback() method to make sure we don't lose files.
    • For the local file system, publish_complete() is currently just pass, so this change would work fine there.
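
For the first mitigation, here is a rough sketch of what a more aggressive retry policy might look like, assuming the google-cloud-storage client that gcs.py already wraps. The deadline and backoff numbers are placeholders, and the retry argument to delete_blobs is only available in reasonably recent library releases:

```python
from google.cloud.storage.retry import DEFAULT_RETRY

# Placeholder policy: keep retrying transient errors (503s, connection resets)
# with exponential backoff for up to five minutes instead of the library default.
PUBLISH_RETRY = DEFAULT_RETRY.with_deadline(300.0).with_delay(
    initial=1.0, multiplier=2.0, maximum=60.0
)


def delete_blobs_with_retry(bucket, blobs):
    """Delete blobs, retrying each underlying request under PUBLISH_RETRY."""
    bucket.delete_blobs(blobs, retry=PUBLISH_RETRY)
```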

The first mitigation seems uncontentious. I'm curious about everyone's thoughts on the second.

For reference here are the GCS/local calls side by side:

https://github.com/MIT-LCP/physionet-build/blob/86b9a6d9002e8bcb7b3459dac830cd57ecf08eed/physionet-django/project/projectfiles/gcs.py#L106-L122

https://github.com/MIT-LCP/physionet-build/blob/86b9a6d9002e8bcb7b3459dac830cd57ecf08eed/physionet-django/project/projectfiles/local.py#L114-L123

bemoody commented 1 year ago

If publish_complete fails, don't you think publish_rollback would fail too?

Plus, if an error occurs after deleting half of the ActiveProject files, you don't want to then delete the PublishedProject files and lose everything.

alistairewj commented 1 year ago

> If publish_complete fails, don't you think publish_rollback would fail too?

Yeah, for sure. Still, in my mind it makes sense conceptually to have a rollback for a potential failure in publish_complete. Or we could change publish_complete to not include file I/O.

> Plus, if an error occurs after deleting half of the ActiveProject files, you don't want to then delete the PublishedProject files and lose everything.

Yes, exactly. So rollback would have to be a bit smarter than it is now to correctly restore an ActiveProject.

alistairewj commented 1 year ago

Maybe a better solution is to schedule the file deletion as a background task which can be re-run later.
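
A rough sketch of that idea, using django-background-tasks purely as an example queue (the task name, arguments, and schedule are all made up): publication finishes normally, and the cleanup runs as a task that the scheduler retries if the storage API is down.

```python
from background_task import background
from google.cloud import storage


@background(schedule=60)  # hypothetical task: first attempt a minute after publishing
def remove_old_project_files(bucket_name, prefix):
    """Delete the leftover ActiveProject blobs once the project is published.

    If the storage API is unavailable, the exception propagates, the task is
    marked failed, and django-background-tasks retries it later; publication
    itself has already completed by the time this runs.
    """
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blobs = list(client.list_blobs(bucket_name, prefix=prefix))
    bucket.delete_blobs(blobs)
```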

bemoody commented 1 year ago

My inclination would be "proceed with publication, email the administrators to tell them Something Is Wrong."

> Or change publish_complete to not include file I/O.

Indeed.
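
A sketch of how the "proceed and notify" approach could look (the wrapper and its arguments are illustrative; mail_admins is Django's standard helper): the publication stays committed, and a failed cleanup is logged and reported to the admins instead of triggering a rollback.

```python
import logging

from django.core.mail import mail_admins

logger = logging.getLogger(__name__)


def finish_publication_cleanup(cleanup, project_title):
    """Run the post-publication file cleanup; report failures instead of rolling back.

    `cleanup` is whatever callable removes the old ActiveProject files; the
    published project is already live when this runs.
    """
    try:
        cleanup()
    except Exception:
        logger.exception("Old files were not removed after publishing %s", project_title)
        mail_admins(
            subject="Project published, but old files were not cleaned up",
            message=(
                f"Cleanup failed for {project_title}. The published project is "
                "live; the old ActiveProject files still need to be removed, "
                "either manually or by re-running the cleanup."
            ),
        )
```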

bemoody commented 1 year ago

But yes... it's unavoidably messy that "publication" has to be synchronized across three to four independent systems (filesystem, DB, email, DOI), and this code should be structured better.