MikeDacre / fyrd

Submit functions and shell scripts to torque and slurm clusters or local machines using python.
https://fyrd.science
MIT License

Add depend_failed keyword argument #51

Open MikeDacre opened 7 years ago

gportella commented 6 years ago

Hi Mike,

I'm really enjoying your tool! I was wondering if you managed to make any progress related to this issue.

I'm missing a way to detect jobs that will never run because their dependency was never satisfied. As far as I can tell, the nodes list in a Queue object is either empty (which I guess implies pending) or contains the list of nodes. I cannot find any way to differentiate between regular pending jobs and jobs with NODELIST(REASON) DependencyNeverSatisfied (in slurm, at least).

I guess one possible solution, besides implementing a keyword, would be to include the (REASON) string in the node list so that the user can find these jobs, perhaps by adapting your parse_queue() in batch_systems/slurm.py.
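A minimal sketch of that idea follows. It is not fyrd's actual parse_queue(); the function names and the dictionary layout are hypothetical. It assumes the queue was listed with `squeue -h -o "%i|%T|%R"`, where the third field holds either the node list or a parenthesised reason.

```python
# Hypothetical sketch: separate the (REASON) string from the node list
# in squeue output. Assumed input format: squeue -h -o "%i|%T|%R"

def parse_squeue(text):
    """Return {job_id: {"state", "nodes", "reason"}} from squeue text."""
    jobs = {}
    for line in text.strip().splitlines():
        job_id, state, nodelist_reason = line.split("|", 2)
        nodes, reason = None, None
        if nodelist_reason.startswith("(") and nodelist_reason.endswith(")"):
            # Pending jobs report a reason in parentheses instead of nodes
            reason = nodelist_reason[1:-1]
        elif nodelist_reason:
            # Running jobs report the (possibly compressed) node list
            nodes = nodelist_reason
        jobs[job_id] = {"state": state.lower(), "nodes": nodes, "reason": reason}
    return jobs


def never_runnable(jobs):
    """List the IDs of pending jobs whose dependency can never be satisfied."""
    return [
        job_id
        for job_id, info in jobs.items()
        if info["state"] == "pending"
        and info["reason"] == "DependencyNeverSatisfied"
    ]


SAMPLE = """\
101|RUNNING|node01
102|PENDING|(Priority)
103|PENDING|(DependencyNeverSatisfied)
"""

print(never_runnable(parse_squeue(SAMPLE)))  # ['103']
```

Keeping the reason in its own field, rather than mixing it into the node list, leaves existing code that reads the node list unaffected.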

best

Guillem

MikeDacre commented 6 years ago

Hi Guillem,

Thanks for the comment. Unfortunately, I left my job and went to medical school, so now my time to work on fyrd is limited. I agree with you that adding a REASON value to the node list is probably a good way to go. I didn't do that because Torque handles reasons differently.

Another solution would be to add a new return value for the job (in the Queue object). Currently, that includes things like 'pending', 'running', and 'completed'. You could add a 'depend_failed' value as well and then add that to the list of failed keywords.
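A rough sketch of what that could look like is below. The names here (FAILED_STATES, job_state, has_failed) are hypothetical and do not reflect fyrd's internals; the set of failed keywords is only an illustrative guess.

```python
# Hypothetical sketch of a 'depend_failed' state; none of these names
# are taken from fyrd itself.

# Illustrative list of states treated as failures
FAILED_STATES = {"failed", "cancelled", "timeout", "depend_failed"}


def job_state(slurm_state, reason=None):
    """Map a Slurm state and optional pending reason to a state keyword."""
    state = slurm_state.lower()
    if state == "pending" and reason == "DependencyNeverSatisfied":
        # The job is still queued but can never start
        return "depend_failed"
    return state


def has_failed(slurm_state, reason=None):
    """True if the job counts as failed, including dead dependencies."""
    return job_state(slurm_state, reason) in FAILED_STATES
```

With this mapping, a job that is pending for an ordinary reason such as Priority stays 'pending', while one stuck on DependencyNeverSatisfied is reported alongside the other failures. A Torque equivalent would need its own rule, since Torque reports reasons differently.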

If you would be willing, I would suggest trying to make the edits yourself and then I can review the pull request. It should be a relatively quick change, but it is unlikely I will have the time for several weeks at least.

Thanks,

Mike

gportella commented 6 years ago

Hi Mike,

I bet medical school is very demanding, so thanks for taking the time to reply.

I found a workaround, at least for my needs. Slurm accepts a kill-on-invalid-dep switch, which kills a job's dependents as soon as the dependency fails. I had written my own class for submitting jobs (admittedly less polished than yours), and I just include this switch in kwargs. After that, these types of jobs show up as failed when using fyrd, as they should, and I can take it from there.
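For illustration, here is a minimal sketch of how such a submission command could be assembled. It is not the class mentioned above and not fyrd code; build_sbatch_command and its parameters are hypothetical. It relies on sbatch's `--dependency=afterok:<jobid>` and `--kill-on-invalid-dep=yes` options, and only builds the argument list without running anything.

```python
# Hypothetical helper: build an sbatch argument list that cancels the
# job if its dependency can never be satisfied.

def build_sbatch_command(script, depends_on=None, kill_on_invalid_dep=True):
    """Return the sbatch command for `script` as a list of arguments."""
    cmd = ["sbatch"]
    if depends_on:
        # afterok: start only after all listed jobs finish successfully
        job_ids = ":".join(str(job_id) for job_id in depends_on)
        cmd.append("--dependency=afterok:" + job_ids)
        if kill_on_invalid_dep:
            # Terminate the job instead of leaving it pending forever
            cmd.append("--kill-on-invalid-dep=yes")
    cmd.append(script)
    return cmd


print(build_sbatch_command("step2.sh", depends_on=[1234]))
# ['sbatch', '--dependency=afterok:1234', '--kill-on-invalid-dep=yes', 'step2.sh']
```

The resulting list can be passed to subprocess.run() on a machine where sbatch is available.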

Just by reading bits of your code and going over the documentation I could not see a way to pass kwargs to slurm. Is that possible? If so, I can remove my slurm class for job submission, since I already have your module as a dependency anyway.

I'm pretty busy myself also, but I'll send you a PR if I find the time to work on it. You know what would be cool? Async/await for jobs. I tried to combine your library and the multiprocessing module to get the output of the jobs, but somehow it crashes. Anyway, I didn't spend too much time on it, and that's another topic...

best,

Guillem

MikeDacre commented 5 years ago

Hi Guillem,

The way to do it using the 'API' in fyrd is to add it to the fyrd/batch_systems/slurm.py file in the parse_strange_options function at the bottom of the file. Otherwise you need to implement something in the primary option parsing that makes sense for both slurm and torque.

Thanks!

-Mike