galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0

Outputs of deleted jobs are repeatedly uploaded to Galaxy until giving up when using `remote_transfer_tus` #364

Closed · natefoo closed this 6 months ago

natefoo commented 6 months ago

Because Pulsar doesn't really have a great understanding of deleted jobs, it will retry sending outputs and other postprocessing files back to Galaxy for jobs that have been deleted. Under a normal `remote_transfer` action these requests are immediately rejected with a 403 and then retried for however many attempts the administrator has configured for postprocessing retries (which probably shouldn't happen at all in the case of a 403, see #353). For `remote_transfer` this is not great, but not terrible: just a little noisy and annoying.
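
For context, here's a minimal sketch of what the #353 suggestion amounts to: treat a 403 as a permanent rejection instead of a retryable failure. The function and parameter names here are hypothetical illustrations, not Pulsar's actual staging code:

```python
import time

import requests

RETRYABLE_STATUSES = {500, 502, 503, 504}  # transient server-side errors


def post_with_retries(url, files, max_retries=10, delay=5):
    """Retry transient failures, but treat a 403 as a permanent rejection."""
    for attempt in range(1, max_retries + 1):
        response = requests.post(url, files=files)
        if response.status_code == 403:
            # Galaxy has rejected the upload (e.g. the job was deleted);
            # retrying will never succeed, so give up immediately.
            raise PermissionError(f"permanently rejected: {url}")
        if response.status_code in RETRYABLE_STATUSES:
            time.sleep(delay)
            continue
        response.raise_for_status()
        return response
    raise RuntimeError(f"gave up after {max_retries} attempts: {url}")
```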

In the case of `remote_transfer_tus`, the entire output is first uploaded to Galaxy via tus, and then the request is made to the job_files API, which returns the 403. That triggers a retry, at which point the entire output is uploaded to Galaxy again, followed by another 403. Rinse and repeat up to the retry limit, for each and every output.
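
The ordering is what makes this expensive: the data transfer happens first and the job_files call that can 403 happens last, so every retry pays for a full re-upload. A rough, self-contained sketch of the flow, with hypothetical helpers standing in for Pulsar's actual staging code (a real tus upload is chunked and resumable via PATCH requests, simplified here to a single POST):

```python
import requests


def tus_upload(tus_url, output_path):
    """Stand-in for a real tus client (e.g. tuspy); uploads the whole file."""
    with open(output_path, "rb") as fh:
        response = requests.post(tus_url, data=fh)
    response.raise_for_status()
    return response.headers.get("Location", "")


def stage_out_via_tus(output_path, tus_url, job_files_url, params):
    # Step 1: upload the entire output to Galaxy's tus endpoint. This is
    # the expensive part, and it completes before Galaxy checks whether
    # the job still exists.
    upload_session = tus_upload(tus_url, output_path)

    # Step 2: register the completed upload with the job_files API. Only
    # here does a deleted job produce the 403, so each retry repeats the
    # whole upload in step 1 before failing again.
    response = requests.post(job_files_url, data=dict(params, session=upload_session))
    response.raise_for_status()
```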

We have `postprocess_action_max_retries` set to 10. :skull:

UPDATE: I spoke a bit too soon. Pulsar does record the deletion via `manager._record_cancel()`, which causes postprocessing to be skipped; the condition appears to arise when the job is deleted while postprocessing is already underway.
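
Assuming that's the mechanism, the race looks roughly like this: the cancel flag is consulted once before postprocessing starts, but a deletion recorded mid-postprocessing is never observed. An illustrative model only (`_record_cancel` is the name from Pulsar; everything else is hypothetical):

```python
import threading


def upload_to_galaxy(output):
    """Placeholder for the remote_transfer_tus staging of one output."""
    print(f"uploading {output}")


class Manager:
    """Illustrative model of the race, not Pulsar's actual manager."""

    def __init__(self):
        self._cancelled = set()
        self._lock = threading.Lock()

    def _record_cancel(self, job_id):
        with self._lock:
            self._cancelled.add(job_id)

    def is_cancelled(self, job_id):
        with self._lock:
            return job_id in self._cancelled

    def postprocess(self, job_id, outputs):
        # A deletion recorded *before* this point is honored...
        if self.is_cancelled(job_id):
            return
        for output in outputs:
            # ...but a deletion recorded while this loop is staging
            # outputs back to Galaxy is never consulted, so each
            # remaining output is uploaded and 403s through the full
            # retry budget. Re-checking is_cancelled() here (and on
            # each retry) would close the window.
            upload_to_galaxy(output)
```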

natefoo commented 6 months ago

Related, we should probably finally finish up https://github.com/galaxyproject/galaxy/pull/7791 to discourage users from deleting their finished-but-staging-out jobs in the first place.