galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0
37 stars 50 forks source link

Transfer retries fix #329

Open neoformit opened 1 year ago

neoformit commented 1 year ago

Potentially resolves https://github.com/galaxyproject/pulsar/issues/298. Tested manually on one of our dev pulsar nodes by @cat-bro.

neoformit commented 1 year ago

Rebased against master

natefoo commented 1 year ago

Looks good to me, thanks! I suspect this would not result in a "green-but-actually-failed" job since Galaxy will still expect the output to exist in its own finish method, did you happen to include that in testing?

neoformit commented 1 year ago

Haven't tested that Nate, @cat-bro is the only one who got this running on an actual Pulsar server.

Unfortunately, I can't figure out why the tests are failing. Apparently this change causes the output file transfer test to fail, which I guess must mean that the source file path doesn't exist in the test case, which doesn't make any sense to me... I spent some time looking at the tests and couldn't figure it out or debug it.

@mvdbeek I don't suppose you'd be able to take a look at this? I assume the explanation lies in https://github.com/galaxyproject/pulsar/blob/master/pulsar/client/test/check.py. Maybe there is a test that checks whether a transfer will retry when the source file is missing on the first attempt...?

mvdbeek commented 1 year ago

I haven't forgotten about this, but I didn't think this was quite the right layer to fix this at. You'd probably want to parameterize this on the job state.

neoformit commented 1 year ago

Sorry @mvdbeek I'm not sure what you mean. Only skip retrying if the job has failed and the output file doesn't exist? Isn't job state decided by Galaxy and not Pulsar anyway? :thinking:

mvdbeek commented 1 year ago

Isn't job state decided by Galaxy and not Pulsar anyway?

Pulsar knows the exit code, but you're right, we don't really know when that's a failure or not. I'll try getting back to it.

neoformit commented 1 year ago

Thanks! I'm interested to see what you find.