Closed mdpearson closed 4 years ago
@egafni we've tested this pretty thoroughly over here and it should be good to go.
Thanks! Just saw this - could you please resolve the conflicts?
Sorry for the delay on merging - please ping me if it takes this long again. I must have missed the GitHub notification.
Hi Erik,
This PR fixes a number of prod issues we've seen with grid engine in the past quarter or so.
The big one is when qstat falsely returns no output. If qstat returns nothing, this PR will sleep 30 sec and then retry.
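For illustration, the retry-on-empty-output behavior could look roughly like the sketch below. This is not the code from the PR: the name `poll_until_output`, the `retries` limit, and the injectable `fetch` and `sleep` callables are assumptions made for the example. Only the 30-second sleep comes from the description above.

```python
import time
from typing import Callable


def poll_until_output(
    fetch: Callable[[], str],
    retries: int = 3,             # hypothetical limit, not stated in the PR
    delay: float = 30.0,          # the PR sleeps 30 sec between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it returns non-empty output, sleeping between tries."""
    for attempt in range(retries + 1):
        output = fetch()
        if output.strip():
            # Non-empty output is taken at face value.
            return output
        if attempt < retries:
            # Empty output is treated as a possibly false result, so wait and retry.
            sleep(delay)
    raise RuntimeError("no output after %d attempts" % (retries + 1))
```

In practice `fetch` would wrap the actual qstat call. Passing `sleep` as a parameter keeps the helper easy to test without real delays.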
A second issue is when GE commands hang indefinitely. This PR uses subprocess.run() to enforce timeouts.
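A minimal sketch of the timeout idea, assuming Python 3.7+ for `capture_output`. The helper name `run_with_timeout` and the choice to return `None` on timeout are illustrative assumptions, not the PR's actual implementation.

```python
import subprocess
from typing import List, Optional


def run_with_timeout(cmd: List[str], timeout: float) -> Optional[str]:
    """Run cmd and return its stdout, or None if it exceeds timeout seconds."""
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,  # collect stdout and stderr
            text=True,            # decode bytes to str
            timeout=timeout,      # subprocess.run kills the child on expiry
            check=False,          # leave exit-code handling to the caller
        )
    except subprocess.TimeoutExpired:
        # The command hung; report that instead of blocking forever.
        return None
    return result.stdout
```

A caller could then treat `None` as "the scheduler did not answer in time" and decide whether to retry or give up.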
There are logging improvements, esp. when Cosmos is waiting for a long time for jobs to complete. And I've made the teardown code more consistent by making a base implementation of cleanup_task().
I'm still testing this but it's working well enough I thought I'd file a PR and garner your thoughts.
In an attempt to be a good neighbor, I updated the slurm module to use the new commands I wrote for grid engine.