@dcoynel Check out the `find_pbs_jobs()` function for how we handle job dependencies in PBS. We might have to do something similar if SLURM complains when dependencies aren't found.
Indeed, the hardest part of qbatch is correctly abstracting away each queuing system's dependency design. SGE can depend on jobs by name directly, but PBS can only depend on job IDs, so we parse the job list XML and find the job ID for a given name pattern.
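For reference, a minimal sketch of that PBS lookup (assuming Torque's `qstat -x` XML layout; the real `find_pbs_jobs()` may differ in its details):

```python
import subprocess
import xml.etree.ElementTree as ET

def find_pbs_jobs(name_pattern):
    """Return IDs of queued PBS jobs whose names match name_pattern."""
    # Torque's qstat -x dumps the full job list as XML
    xml_out = subprocess.check_output(['qstat', '-x'])
    root = ET.fromstring(xml_out)
    job_ids = []
    for job in root.findall('Job'):
        if job.findtext('Job_Name', '').startswith(name_pattern):
            job_ids.append(job.findtext('Job_Id'))
    return job_ids
```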
For sbatch, hopefully one of these methods will do it.
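SLURM may actually make the name lookup easier, since `squeue` can filter by job name directly; a sketch (the helper name is my own, not something qbatch has today):

```python
import subprocess

def find_slurm_jobs(name):
    # print only the numeric job ID (%i) of each job matching this name
    out = subprocess.check_output(
        ['squeue', '--noheader', '--format=%i', '--name', name])
    return out.decode().split()
```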
Cool. Thanks @dcoynel, this is shaping up!
A few things before it's ready to merge, imo:
I’ve never done testing in Python before; do you use py.test?
I would need some input on this testing error. The `nosetests -v` command works on my cluster.
Cool beans.
I think you just need to add `slurm-wlm` to the list of packages in `.travis.yml` to install during testing.
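Something along these lines (assuming the apt addon; I haven't checked how our current `.travis.yml` is laid out):

```yaml
addons:
  apt:
    packages:
      - slurm-wlm
```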
Looking very interesting. The new Compute Canada clusters just launched with SLURM as their scheduler, so I'll be able to give this a test drive soon.
Ok, it doesn't seem so straightforward. From what I can tell from the Travis log, `munge` has to be correctly set up for SLURM to work:

`Not starting munge (no keys found). Please run /usr/sbin/create-munge-key`

...although the error I get is related to the local piped command.
As for the local piped test failing, I see `sh: printf: I/O error` being printed... Add a `'w'` flag to the `open()` when redirecting `subprocess.call` (without it, `open()` defaults to read-only mode, so the child process can't write to its stdout), i.e.:

`subprocess.call(['which', 'sbatch'], stdout=open(os.devnull, 'w'))`
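For what it's worth, on Python 3.3+ `subprocess.DEVNULL` sidesteps the mode flag entirely; a minimal alternative:

```python
import subprocess

# DEVNULL opens os.devnull with the right mode for us (Python 3.3+)
subprocess.call(['which', 'sbatch'], stdout=subprocess.DEVNULL)
```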
I don't know what `munge` is or how to install it properly. Sorry this is turning out to be a bit difficult. If the `munge` warning you see in the log isn't failing the test, then just leave it for now, because we'll tidy all of this up in #114.
Yes, I agree; sorry for fumbling around. I lack experience with these issues.
No no, thanks @dcoynel for contributing!
From testing, I also have the impression that job dependencies are handled differently. For example, in antsRegistration-MAGeT, if the resample stage is run later, I see dependencies specified that correspond to job names from the time the resample stage was originally submitted (e.g. `mb_register_atlas_template*`). These jobs don't exist, and sbatch fails. Am I getting that correctly?
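If so, would resolving the name pattern to numeric IDs at submission time work? A rough sketch (hypothetical helper, not qbatch's actual code), reusing the `squeue` lookup from above:

```python
import subprocess

def sbatch_with_name_dependency(script, dep_name):
    # sbatch's --dependency only accepts numeric job IDs, so resolve the
    # dependency's job name to IDs right before submitting
    out = subprocess.check_output(
        ['squeue', '--noheader', '--format=%i', '--name', dep_name])
    ids = out.decode().split()
    cmd = ['sbatch']
    if ids:
        # afterok takes colon-separated job IDs
        cmd.append('--dependency=afterok:' + ':'.join(ids))
    cmd.append(script)
    return subprocess.call(cmd)
```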