galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0

Skip postprocessing POST retries when the file does not exist on the pulsar machine #298

Open cat-bro opened 2 years ago

cat-bro commented 2 years ago

Setting max_retries to retry POSTs of output files to Galaxy is extremely useful, since Galaxy is sometimes restarting or too busy to receive the POST. However, the retries also occur when the file does not exist on Pulsar, and this is not useful: if the file does not exist upon completion of a job, it will not exist X retries later. Most often the output files are missing because the job has failed. Depending on the settings and the number of expected outputs, a user might have to wait over an hour to find out that their job has failed. Nonexistence of expected output files could be handled by a separate check prior to the retry loop, with retries skipped in that case.
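
Roughly the behaviour being proposed, as a minimal sketch (the names `post_file`, `max_retries`, and `interval` are illustrative here, not Pulsar's actual API):

```python
import os
import time


def post_output_with_retries(path, post_file, max_retries=10, interval=60):
    """Post an output file back to Galaxy, retrying only for transient errors.

    If the file never existed on the Pulsar side (typically a failed job),
    fail immediately instead of burning through the retry budget.
    """
    if not os.path.exists(path):
        # Output was never produced - retrying will not make it appear.
        raise FileNotFoundError(f"Expected output {path} does not exist on Pulsar")

    for attempt in range(1, max_retries + 1):
        try:
            post_file(path)
            return
        except Exception:
            # Galaxy may be restarting or too busy - this is what retries are for.
            if attempt == max_retries:
                raise
            time.sleep(interval)
```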

neoformit commented 2 years ago

Looking into this

natefoo commented 2 years ago

This was useful for me when I had filesystem problems on the Pulsar side where the filesystem did eventually come back, but I agree that it is far more nuisance than help in the overwhelming majority of cases. I'd typically just prefer to fail and rerun the job for the rare occurrence of filesystem problems rather than have this happen for legitimate job failures.

neoformit commented 2 years ago

I've been working on a fix for this (for a while, in the background), but it creates a nasty UX issue for many of our users: for any job run on Pulsar that fails, the user needs to wait an hour to get the failure message back. With AlphaFold, this also means an hour of Azure GPU time wasted! Perhaps we can add an additional check for a failed status before aborting the retry; either way I would plan on making this configurable.

natefoo commented 2 years ago

The issue is (partly) that in most cases, Pulsar is not really the arbiter of what is failed. It simply dutifully copies things back to Galaxy and then lets Galaxy decide. That said, failing to copy back outputs (after that long delay) is one of the things that does result in Pulsar informing Galaxy that the job failed.

As @cat-bro said, I think we're best off just not retrying when the file does not exist, or at least having a separate configurable: you could have NFS attribute caching issues that would cause you to want to retry a few times, but not extensively like you might if posting to Galaxy fails.
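
Something like the following sketch would cover that: a small, separately configurable retry budget for the missing-file case (enough to ride out NFS attribute caching) that is independent of the larger POST retry schedule. The option names are hypothetical, not existing Pulsar settings:

```python
import os
import time


def wait_for_output(path, missing_file_retries=3, interval=5):
    """Tolerate short-lived filesystem issues (e.g. NFS attribute caching)
    without waiting out the full POST retry schedule.

    Returns True if the file shows up within the small retry budget,
    False if it is genuinely missing (most likely a failed job).
    """
    for _ in range(missing_file_retries):
        if os.path.exists(path):
            return True
        time.sleep(interval)
    return os.path.exists(path)
```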

neoformit commented 2 years ago

Yep, we're in agreement on that. I was suggesting that we could also check for a job-failed status before aborting a missing-file retry loop; that would still allow NFS issues (etc.) to be resolved on a successful job. Do you think there's a way to do that consistently? Or is there no good way for Pulsar to determine that from the job working directory?
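
One possible heuristic, assuming the job's exit code ends up as a file in the working directory (the `return_code` filename below is an assumption, not necessarily what Pulsar writes):

```python
import os


def job_looks_failed(working_directory, exit_code_file="return_code"):
    """Best-effort guess at whether the job itself failed.

    Assumes the runner wrote the process exit code to a file in the job
    working directory; if that file is missing or unreadable, we cannot tell.
    """
    path = os.path.join(working_directory, exit_code_file)
    try:
        with open(path) as fh:
            return int(fh.read().strip()) != 0
    except (OSError, ValueError):
        return False  # unknown - do not skip retries on guesswork
```

A nonzero exit code combined with a missing expected output would then be a strong signal to skip the remaining retries; anything ambiguous would keep the current retry behaviour.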