Open MikeDacre opened 7 years ago
Hi Guillem,
Thanks for the comment. Unfortunately, I left my job and went to medical school, so now my time to work on fyrd is limited. I agree with you that adding a REASON value to the node list is probably a good way to go. I didn't do that because reasons are handled differently by Torque.
Another solution would be to add a new return value for the job (in the Queue object). Currently, that includes things like 'pending', 'running', and 'completed'; you could add a 'depend_failed' value as well and then add that to the list of failed keywords.
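A rough sketch of that second idea, not taken from fyrd's actual code: the set names, the 'failed' state, and the is_failed() helper are all hypothetical and only illustrate where a 'depend_failed' value could slot in.

```python
# Hypothetical state sets; fyrd's real keyword lists may be named and
# populated differently.
ACTIVE_STATES = {"pending", "running"}
GOOD_STATES = {"completed"}
# 'depend_failed' is the proposed new value for jobs whose dependency can
# never be satisfied; 'failed' is an assumed pre-existing failure state.
FAILED_STATES = {"failed", "depend_failed"}


def is_failed(state: str) -> bool:
    """Return True if a job state should be treated as a failure."""
    return state.lower() in FAILED_STATES
```

With something like this, any code that already checks the failed keywords would pick up dependency failures without further changes.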
If you would be willing, I would suggest trying to make the edits yourself and then I can review the pull request. It should be a relatively quick change, but it is unlikely I will have the time for several weeks at least.
Thanks,
Mike
Hi Mike,
I bet medical school is very demanding, so thanks for taking the time to reply.
I sort of found a way around it, at least for my needs. Slurm accepts a kill-on-invalid-dep switch, which kills a job's dependents as soon as the dependency fails. I had written my own class for submitting jobs (admittedly less polished than what you did), and I just include this switch in kwargs.
After that, these types of jobs show up as failed when using fyrd, as they should, and I can take it from there.
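For anyone landing here, this is roughly what that workaround looks like, as a minimal sketch rather than my actual class. The function name and arguments are made up for illustration; the only Slurm-specific parts are the sbatch --dependency and --kill-on-invalid-dep options.

```python
def build_sbatch_command(script_path, dependency=None, kill_on_invalid_dep=True):
    """Build an sbatch argument list (hypothetical helper, not part of fyrd).

    dependency is a Slurm dependency string such as 'afterok:12345'.
    """
    cmd = ["sbatch"]
    if dependency:
        cmd.append(f"--dependency={dependency}")
    # Ask Slurm to terminate the job if its dependency can never be satisfied,
    # instead of leaving it pending forever.
    flag = "yes" if kill_on_invalid_dep else "no"
    cmd.append(f"--kill-on-invalid-dep={flag}")
    cmd.append(script_path)
    return cmd
```

The returned list can be handed straight to subprocess.run(), so no shell quoting is needed.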
Just by reading bits of your code and going over the documentation, I could not see a way to pass kwargs to slurm. Is that possible? If so, I can remove my slurm class for job submission, since I already have your module as a dependency anyway.
I'm pretty busy myself also, but I'll send you a PR if I find the time to work on it. You know what would be cool? Async/await for jobs. I tried to combine your library and the multiprocessing module to get the output of the jobs, but somehow it crashes. Anyway, I didn't spend too much time on it, and that's another topic...
best,
Guillem
Hi Guillem,
The way to do it using the 'API' in fyrd is to add it to the fyrd/batch_systems/slurm.py file in the parse_strange_options function at the bottom of the file. Otherwise you need to implement something in the primary option parsing that makes sense for both slurm and torque.
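As a loose illustration of the kind of handling that could go there: the helper below is hypothetical and does not reproduce the real signature or return values of parse_strange_options, and the 'kill_on_invalid_dep' keyword name is an assumption. It only shows the general pattern of pulling a slurm-only option out of the option dictionary and turning it into a script header line.

```python
def pop_kill_on_invalid_dep(option_dict):
    """Hypothetical helper: convert a 'kill_on_invalid_dep' keyword into an
    #SBATCH header line.

    Returns (header_lines, remaining_options). The input dict is not modified.
    """
    remaining = dict(option_dict)  # work on a copy
    header_lines = []
    if "kill_on_invalid_dep" in remaining:
        value = "yes" if remaining.pop("kill_on_invalid_dep") else "no"
        header_lines.append(f"#SBATCH --kill-on-invalid-dep={value}")
    return header_lines, remaining
```

The remaining options would then go through the normal option parsing, which never sees the slurm-only keyword.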
Thanks!
-Mike
Hi Mike,
I'm really enjoying your tool! I was wondering if you managed to make any progress related to this issue.
I'm missing a way to detect jobs that will never run because their dependency was never satisfied. As far as I can tell, the nodes list in the Queue objects is either empty (which I guess implies pending) or contains the list of nodes. I cannot find any way to differentiate between regular pending jobs and jobs with NODELIST(REASON) DependencyNeverSatisfied (in slurm, at least).
I guess one possible solution, besides implementing a keyword, would be to include the (REASON) string in the node list, such that the user can find them, perhaps by adapting your parse_queue() in batch_systems/slurm.py.
best
Guillem
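To make the request concrete, here is a small sketch of the detection I have in mind. It is independent of fyrd's parse_queue(), and the function name is made up. It assumes the queue text comes from squeue -h -o "%i|%T|%R", where %R prints the node list for running jobs and the reason in parentheses for pending ones.

```python
def find_never_runnable(squeue_text):
    """Return IDs of pending jobs whose reason is DependencyNeverSatisfied.

    Expects one job per line in the form 'jobid|STATE|nodelist-or-(reason)'.
    """
    stuck = []
    for line in squeue_text.splitlines():
        parts = line.strip().split("|")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        job_id, state, reason_or_nodes = parts
        # Pending jobs show the reason wrapped in parentheses.
        reason = reason_or_nodes.strip("()")
        if state == "PENDING" and reason == "DependencyNeverSatisfied":
            stuck.append(job_id)
    return stuck
```

Regular pending jobs (for example with reason Priority or Resources) are left alone, so only jobs that can never start are reported.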