Mizzou-CBMI / COSMOS2

Python Scientific Pipeline Management System
GNU General Public License v3.0

Increase reliability when grid engine CLI tools hang or return invalid data #119

Closed mdpearson closed 4 years ago

mdpearson commented 5 years ago

Hi Erik,

This PR fixes several production issues we've seen with grid engine over the past quarter or so.

The big one is when qstat falsely returns no output. If qstat returns nothing, this PR sleeps for 30 seconds and then retries.
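To illustrate the retry idea, here is a minimal sketch, not the code in this PR. The names `fetch_with_empty_retry`, `fetch`, `retries`, `delay`, and `sleep` are hypothetical; `fetch` stands in for whatever callable actually runs qstat and returns its output as a string.

```python
import time
from typing import Callable


def fetch_with_empty_retry(
    fetch: Callable[[], str],
    retries: int = 1,
    delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(); while it returns only whitespace, wait and call it again.

    Hypothetical helper: `fetch` is assumed to wrap the real qstat call.
    Returns the last output seen, which may still be empty if every
    attempt came back blank.
    """
    output = fetch()
    for _ in range(retries):
        if output.strip():
            break  # got real data, no need to retry
        sleep(delay)  # give the scheduler time to recover
        output = fetch()
    return output
```

Passing `sleep` as a parameter is only there so the wait can be stubbed out in tests; the caller still has to decide what an empty result means once the retries run out.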

A second issue is when GE commands hang indefinitely. This PR uses subprocess.run() to enforce timeouts.
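For reference, a rough sketch of how `subprocess.run()` can enforce a timeout. The wrapper name `run_cli`, its default timeout, and the choice to return `None` on failure are assumptions for illustration, not what this PR necessarily does.

```python
import subprocess
from typing import Optional, Sequence


def run_cli(args: Sequence[str], timeout: float = 60.0) -> Optional[str]:
    """Run a command and return its stdout, or None if it hung or failed.

    Hypothetical wrapper: the 60-second default is an arbitrary choice.
    """
    try:
        proc = subprocess.run(
            args,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # decode output as text
            timeout=timeout,          # child is killed when this expires
            check=True,               # non-zero exit raises CalledProcessError
        )
    except subprocess.TimeoutExpired:
        return None  # command hung past the timeout
    except (subprocess.CalledProcessError, OSError):
        return None  # command failed or could not be started
    return proc.stdout
```

A caller could then use something like `run_cli(["qstat"], timeout=30)` and treat `None` as "no reliable answer from the scheduler" rather than "no jobs running".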

There are logging improvements, especially when Cosmos has been waiting a long time for jobs to complete. I've also made the teardown code more consistent by adding a base implementation of cleanup_task().

I'm still testing this, but it's working well enough that I thought I'd file a PR and get your thoughts.

In an attempt to be a good neighbor, I updated the slurm module to use the new commands I wrote for grid engine.

mdpearson commented 5 years ago

@egafni we've tested this pretty thoroughly over here and it should be good to go.

egafni commented 4 years ago

Thanks! Just saw this - could you please resolve the conflicts?

egafni commented 4 years ago

Sorry for the delay on merging - please ping me if it takes this long again. I must have missed the GitHub notification.