CoBrALab / qbatch

The Unlicense

first attempt at making qbatch compatible with SLURM #140

Closed dcoynel closed 7 years ago

dcoynel commented 7 years ago

From testing, I have the impression that job dependencies are also handled differently. For example, in antsRegistration-MAGeT, if the resample stage is run later, I see dependencies specified that correspond to the jobs present when the resample stage was submitted (e.g. mb_register_atlas_template*). These jobs no longer exist, and sbatch fails. Am I understanding that correctly?

pipitone commented 7 years ago

@dcoynel Check out the find_pbs_jobs() function for how we handle job dependencies in PBS. We might have to do something similar if SLURM complains when dependencies aren't found.

gdevenyi commented 7 years ago

Indeed, the hardest part of qbatch is correctly abstracting away the dependency design of the queuing system.

SGE can depend on jobs by name. PBS can only depend on job numbers, so we parse the job list XML and find the job ID for a given name pattern.

For sbatch, hopefully one of these methods will do it.
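
To make the PBS-style approach concrete for SLURM, here is a minimal sketch, not the code in this PR. It assumes job names can be listed with squeue and matched against a shell-style pattern, and that nonexistent dependencies should simply be dropped. The helper names (parse_squeue, match_job_ids, dependency_args, find_slurm_jobs) are hypothetical, and pending array jobs are not handled.

```python
import fnmatch
import subprocess


def parse_squeue(output):
    """Parse 'jobid jobname' lines into a list of (jobid, name) tuples."""
    jobs = []
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            jobs.append((parts[0], parts[1].strip()))
    return jobs


def match_job_ids(jobs, pattern):
    """Return the IDs of jobs whose name matches a shell-style pattern."""
    return [jobid for jobid, name in jobs if fnmatch.fnmatch(name, pattern)]


def dependency_args(job_ids):
    """Build sbatch arguments; emit nothing when no matching job exists."""
    if not job_ids:
        return []
    return ['--dependency=afterok:' + ':'.join(job_ids)]


def find_slurm_jobs(pattern):
    """Query the live queue (hypothetical helper; needs squeue on PATH)."""
    output = subprocess.check_output(
        ['squeue', '--noheader', '--format=%i %j'],
        universal_newlines=True)
    return match_job_ids(parse_squeue(output), pattern)
```

Because only jobs still in the queue are returned, a pattern that matches nothing yields no --dependency argument at all, which avoids handing sbatch a job ID that does not exist.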


pipitone commented 7 years ago

Cool. Thanks @dcoynel this is shaping up!

A few things before it's ready to merge, imo:

dcoynel commented 7 years ago

I’ve never done testing in Python before. Do you use py.test?

dcoynel commented 7 years ago

I would need some input on this testing error. The nosetests -v command works on my cluster.

pipitone commented 7 years ago

Cool beans.

I think you just need to add slurm-wlm to the list of packages in .travis.yml to install during testing.

gdevenyi commented 7 years ago

Looking very interesting. Compute Canada's new clusters just launched with SLURM as their scheduler, so I'll be able to give this a test drive soon.

dcoynel commented 7 years ago

OK, it doesn't seem so straightforward. From what I see in the Travis log file, munge has to be set up correctly for SLURM to work.

Not starting munge (no keys found). Please run /usr/sbin/create-munge-key

... although the error I get is related to the local piped command.

pipitone commented 7 years ago

As for the local piped test failing, I see "sh: printf: I/O error" being printed. open() defaults to read-only mode, so writes to the redirected stream fail. Pass the 'w' mode to open() when redirecting the output of subprocess.call, i.e.:

subprocess.call(['which', 'sbatch'], stdout=open(os.devnull,'w')) 

I don't know what munge is or how to install it properly. Sorry this is turning out to be a bit difficult. If the munge warning you see in the log isn't failing the test, then just leave it for now because we'll tidy all of this up in #114.
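
For reference, a minimal sketch of the open() fix suggested above, wrapped so the file handle is also closed afterwards. The helper name command_exists is hypothetical and this is not the code in the PR. It assumes the which utility is installed, and reports False if it is not.

```python
import os
import subprocess


def command_exists(name):
    """Return True if `which` finds the command on PATH (hypothetical helper)."""
    # Open os.devnull in write mode: the default read-only mode makes
    # writes from the child process fail with an I/O error.
    with open(os.devnull, 'w') as devnull:
        try:
            status = subprocess.call(['which', name],
                                     stdout=devnull, stderr=devnull)
        except OSError:
            # `which` itself is not installed.
            return False
    return status == 0
```

A scheduler check could then read command_exists('sbatch') instead of repeating the subprocess call inline.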

dcoynel commented 7 years ago

Yes I agree, sorry for playing around. I lack some experience with those issues.

pipitone commented 7 years ago

No no, thanks @dcoynel for contributing!