VEuPathDB / lib-compute-platform

Async compute platform core.

Allow expiration and deletion operations on incomplete jobs. #38

Closed Foxcapades closed 1 year ago

Foxcapades commented 1 year ago

Allow for the expiration or deletion of a job workspace for an in-flight job.

This may require the addition of a new "deleted" flag file to indicate that an incomplete job should not be completed and instead should result in the workspace being wiped out.

This means that when an in-flight job starts or completes, the platform will need to check the workspace and/or database to validate the state of the job before proceeding. If the job has not started yet but the expired flag file exists, or the job no longer exists in the database, stop processing the job as it is no longer needed. If the job has started, the same rules apply on completion: if the job workspace has been marked as expired, or the job has been removed from the database, stop there and do not publish the results of the job to the workspace.

These checks will need to be on either side of the JobExecutor call so that the implementer need not concern themselves with them.
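A minimal sketch of what "checks on either side of the executor call" could look like, written in Python against in-memory stand-ins for S3 and the database. The `Platform` class, the `still_wanted` and `run` methods, and the `.expired` flag file name are all hypothetical and chosen for illustration; they are not the library's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

EXPIRED_FLAG = ".expired"  # hypothetical flag file name, for illustration only


@dataclass
class Platform:
    """In-memory stand-in for the S3 workspaces and the job database."""
    workspaces: Dict[str, Dict[str, bytes]] = field(default_factory=dict)
    db_jobs: Set[str] = field(default_factory=set)

    def still_wanted(self, job_id: str) -> bool:
        # A job is abandoned if its database record is gone, or if its
        # workspace is missing or carries the expired flag.
        ws = self.workspaces.get(job_id)
        return job_id in self.db_jobs and ws is not None and EXPIRED_FLAG not in ws

    def run(self, job_id: str, executor: Callable[[str], Dict[str, bytes]]) -> str:
        if not self.still_wanted(job_id):   # check before the executor call
            return "skipped"
        results = executor(job_id)          # implementer-supplied work
        if not self.still_wanted(job_id):   # check after the executor call
            return "discarded"
        self.workspaces[job_id].update(results)
        return "published"
```

Because both checks live in `run`, the callable passed as `executor` only produces results and never has to inspect the flag file or the database itself.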

Relates to:

Foxcapades commented 1 year ago

There will be race conditions here where a job may be expired at the same time that its results are being written to S3, or the job may be deleted at the same time its results are being written to the database.

ryanrdoherty commented 1 year ago

Steve's idea there was that the expirer adds the expire file while the job is running. Then once the job is done, regardless of success, the platform checks for it and clears out the just-written files if it sees the file.

Foxcapades commented 1 year ago

The race condition would be after that, though it's only really a "breaking" problem for job deletions.

  1. 🔴 Thread 1: async job completes
  2. 🔴 Thread 1: platform checks S3 and sees that the workspace is okay to write to (has not yet been deleted)
  3. 🔵 Thread 2: deletes the workspace
  4. 🔴 Thread 1: writes the results to the workspace

In this scenario, the workspace will have been deleted and then immediately recreated; however, the database record will not exist, leaving behind a workspace that nobody owns.
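The four steps can be replayed deterministically with in-memory stand-ins to show how the orphaned workspace comes about. This is only an illustrative sketch: the dictionaries and the helper names (`workspace_ok`, `delete_job`, `write_results`) are hypothetical, and the "threads" are simply interleaved by hand.

```python
# In-memory stand-ins: workspace contents keyed by job ID, and the set of
# job IDs that have a database record.
workspaces = {"job-1": {".workspace": b""}}
db_jobs = {"job-1"}


def workspace_ok(job_id):
    # The pre-write check: the workspace still exists.
    return job_id in workspaces


def delete_job(job_id):
    # Deletion removes both the workspace and the database record.
    workspaces.pop(job_id, None)
    db_jobs.discard(job_id)


def write_results(job_id, files):
    # Like writing objects under a key prefix: the write succeeds even if
    # the workspace was removed, implicitly recreating it.
    workspaces.setdefault(job_id, {}).update(files)


# Steps 1 and 2, thread 1: the job completes and the check passes.
ok = workspace_ok("job-1")
# Step 3, thread 2: the job is deleted.
delete_job("job-1")
# Step 4, thread 1: results are written based on the now-stale check.
if ok:
    write_results("job-1", {"result.json": b"{}"})

# The workspace exists again, but no database record owns it.
orphaned = "job-1" in workspaces and "job-1" not in db_jobs
```

After this interleaving the recreated workspace holds only the result file and has no `.workspace` marker.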

Foxcapades commented 1 year ago

For expirations, the expired flag will exist along with the data, meaning both campuses will see the job as expired, but we will have junk data floating around.

In this case we could add a third check, after the files are written to S3, that looks for the expire flag and deletes the files we just wrote if the flag was created while we were writing files to S3. That would resolve this issue, but wouldn't resolve the deletion issue.

Foxcapades commented 1 year ago

Actually, that third check resolves all the issues. An important detail that I completely forgot about is the .workspace flag file. If this file is not present, the workspace is invalid. We can use that to determine if the workspace was deleted while we were writing files to S3.
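A rough sketch of the post-write check described in this comment, again using an in-memory dictionary in place of S3. The `publish` function, its `concurrent_action` hook (used only to simulate another thread acting between the pre-write check and the write), the returned status strings, and the `.expired` flag name are all hypothetical; only the `.workspace` marker name comes from this thread.

```python
WORKSPACE_MARKER = ".workspace"  # marker file named in this thread
EXPIRED_FLAG = ".expired"        # hypothetical flag file name


def publish(workspaces, job_id, files, concurrent_action=None):
    """Write results, then re-check the workspace and undo the write if needed."""
    ws = workspaces.get(job_id)
    # Pre-write check: workspace must be valid and not expired.
    if ws is None or WORKSPACE_MARKER not in ws or EXPIRED_FLAG in ws:
        return "discarded"

    if concurrent_action is not None:
        concurrent_action()  # simulated thread 2 (expire or delete)

    # The write implicitly recreates the workspace if it was deleted.
    target = workspaces.setdefault(job_id, {})
    target.update(files)

    # Third check, after the write.
    if WORKSPACE_MARKER not in target:
        # Workspace was deleted mid-write: remove the orphan entirely.
        del workspaces[job_id]
        return "rolled back (deleted)"
    if EXPIRED_FLAG in target:
        # Workspace was expired mid-write: remove only what we just wrote.
        for name in files:
            target.pop(name, None)
        return "rolled back (expired)"
    return "published"
```

In this sketch a missing `.workspace` marker is what distinguishes "deleted while writing" from "expired while writing", so the deletion case removes the whole recreated workspace while the expiration case keeps the flag and drops only the new files.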