I really like this approach, since the current version of slurm-submit has given me a lot of headaches when trying to set different node/thread configurations for individual jobs.
Some things I have noticed that could be improved:
1. Default sbatch arguments are defined in `slurm_utils.py`. I think it would be handy to have them in a separate file instead. This way it's much easier to edit them later on without having to touch a Python file.
2. Some settings, like account and email, are used by every SLURM setup but currently can only be set as `SBATCH_DEFAULTS` during setup. Instead, it could be an optional question asked by cookiecutter. Users can then use this question or set it as `SBATCH_DEFAULTS` according to their personal preferences.

Let me know what you think about this. If you feel either of the two points is worth addressing, I can prepare a PR.
On a side note, what's the general state of this PR in your opinion, @mwort? It's marked as draft but already looks very good. What do you think needs to be addressed before you consider it "finished"?
@mwort thanks a lot for this pull request - this was a much-needed rewrite! I'm just going to look through the code, but so far LGTM. I haven't gotten all the tests to pass, however. This is not a big deal, as most of the failing tests relate to the cookiecutter itself and need to be modified for the current options. One test does involve updating constraints (as per the advanced script), but as you rightly pointed out, this was always a hacky solution.
@danijoo I think your comment regarding defaults makes sense. Including some default questions (account etc) would make sense in the current setup, but see also the discussion in #31.
Thanks @danijoo and @percyfal, the main thing still missing is proper tests. A couple of the previous tests still pass, but IMHO the test suite is a bit of a mess (I'm guessing this is also pbs-torque legacy?) and would need some tidying which I haven't got around to. Feel free to contribute here, or should we merge first and then improve them?
Re your other points:
1. The `sbatch_defaults` is really only intended as a convenience option for cases where you literally just want to define a few default sbatch parameters, i.e. you don't want to bother creating a separate file. What you describe is already possible via the `__default__:` directive in the `default_cluster_config` file (see README; a sketch follows this list).
2. I must say I am not a big fan of overloading the cookiecutter options unnecessarily, and I disagree that some settings are used by every SLURM setup (e.g. I rarely use the email option and prefer setting my account via the `SLURM_ACCOUNT` env variable). My point being, let's not have even more places to configure Snakemake than absolutely necessary. I would even be inclined to drop the `sbatch_defaults` if you think it will lead to more confusion than it offers convenience.
3. It's certainly tempting to add more resources here; I fear a bit though that it will be abused to basically add all sbatch arguments, going against the transferability principle of Snakemake (although [this](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#resources) leaves the choice surprisingly open). But it seems values must be integers, so I guess you couldn't parse things like constraints or partition. `nodes` would definitely be a good candidate though. Interesting and potentially really useful is the ability to parse callables here.
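To make point 1 concrete, a `default_cluster_config` file could look something like the sketch below (rule names and values are made up; `__default__:` holds the defaults and per-rule entries override them):

```yaml
# Hypothetical default_cluster_config file; rule names and values are
# illustrative only.
__default__:
  account: myaccount
  partition: normal
  time: "01:00:00"

bigmem_rule:
  partition: highmem
  mem: 64000
```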
Btw, if you guys think the cookiecutter options should have different names, please suggest alternatives. I am not too sure if they are accurate, e.g. `sbatch_defaults` could simply be `defaults`, `default_cluster_config` could just be `cluster_config`, and `adjust_to_partition` could also be `advanced_conversion` or something more general.
@percyfal I wasn't able to actually verify that the behaviour of the new `adjust_to_partition` matches that of the old advanced script. Would be good if you could run that in your env and possibly fix it if it doesn't.
Hi, thanks for your comments @mwort. Regarding point 3: I agree it's a bit of a problematic point, since having a huge dictionary for resource parsing is an ugly solution. On the other hand, there are a lot of SLURM options I would define as some kind of "resources" that, depending on the job, could be required to be set: all the `mem_per_*`, `gpus-per-*`, `ntasks-per-*` and `cpus-per-*` options come to mind. And as far as I understand, the resources section is the only option we have to set these things on a per-job basis, am I right? Without being able to set these, it's really hard to actually use the scripts for complete workflows with very different jobs.
Also, the Snakemake documentation states resources are for "arbitrary user-defined resources", so it looks appropriate to me.
Another possibility would be to allow basically anything in resources and parse them similarly to this line in the current version of slurm-submit.py.
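As a rough sketch of that idea (the option whitelist and the underscore-to-dash normalisation below are assumptions, not the actual slurm-submit.py code):

```python
# Forward any Snakemake resource whose name maps onto a known sbatch long
# option; everything else (e.g. disk_mb) is silently ignored.
KNOWN_SBATCH_OPTIONS = {"mem", "ntasks", "nodes", "cpus-per-task", "ntasks-per-node"}

def resources_to_sbatch_args(resources):
    args = []
    for name, value in resources.items():
        key = name.replace("_", "-")  # resource names cannot contain dashes
        if key in KNOWN_SBATCH_OPTIONS:
            args.append(f"--{key}={value}")
    return args

print(resources_to_sbatch_args({"mem": 16000, "cpus_per_task": 4, "disk_mb": 1}))
# -> ['--mem=16000', '--cpus-per-task=4']
```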
@danijoo you can set all these parameters as defaults or per-rule in either the `default_cluster_config` file or in the config file you pass to Snakemake via `--cluster-config`.
They could be added, but again we will end up with two places for sbatch arguments.
> Thanks @danijoo and @percyfal, the main thing still missing is proper tests. A couple of the previous tests still pass, but IMHO the test suite is a bit of a mess (I'm guessing this is also pbs-torque legacy?) and would need some tidying which I haven't got around to. Feel free to contribute here, or should we merge first and then improve them?
No, the test suite mess is entirely on me :-) Currently the tests are running on my circleci account, as there is no SnakemakeProfiles organization; @johanneskoester, is there a way of setting this up for profiles? I have very little time this week to look further at tests, so if it's ok with you @mwort, we can merge and set up an issue for improving tests?
> 2. I must say I am not a big fan of overloading the cookiecutter options unnecessarily, and I disagree that some settings are used by every SLURM setup (e.g. I rarely use the email option and prefer setting my account via the `SLURM_ACCOUNT` env variable). My point being, let's not have even more places to configure Snakemake than absolutely necessary. I would even be inclined to drop the `sbatch_defaults` if you think it will lead to more confusion than it offers convenience.
Fine with me; I would keep `sbatch_defaults` as it is unset by default.
> 3. It's certainly tempting to add more resources here; I fear a bit though that it will be abused to basically add all sbatch arguments, going against the transferability principle of Snakemake (although [this](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#resources) leaves the choice surprisingly open). But it seems values must be integers, so I guess you couldn't parse things like constraints or partition. `nodes` would definitely be a good candidate though. Interesting and potentially really useful is the ability to parse callables here.
Yes, resources are handy, in particular the callable functionality. I briefly touch on this in #31.
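For reference, a callable resource in a Snakefile looks roughly like this (rule, file names and values are invented):

```python
# Snakefile sketch; the callable keeps the value an integer while letting
# it grow with the retry attempt.
rule align:
    input: "sample.fastq"
    output: "sample.bam"
    resources:
        mem_mb=lambda wildcards, attempt: 4000 * attempt,
        ntasks=1
    shell: "run_alignment {input} {output}"
```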
> Btw, if you guys think the cookiecutter options should have different names, please suggest alternatives. I am not too sure if they are accurate, e.g. `sbatch_defaults` could simply be `defaults`, `default_cluster_config` could just be `cluster_config`, and `adjust_to_partition` could also be `advanced_conversion` or something more general.
`sbatch_defaults` is good I think; I'd prefer `cluster_config`. `adjust_to_partition` is a bit ambiguous; `advanced_conversion` or maybe even `adjust_resources_to_partition`?
> @percyfal I wasn't able to actually verify that the behaviour of the new `adjust_to_partition` matches that of the old advanced script. Would be good if you could run that in your env and possibly fix it if it doesn't.
All tests work except for the one where the constraint flag is missing; I'm having a look at it now.
Ok, I've looked into the failing test `test_memory_with_constraint`. In the old version, the following arguments were passed to sbatch:

```
--error "logs/slurm-%j.err" --ntasks "1" --output "logs/slurm-%j.out" --partition "normal" --constraint "mem800MB" --mem "800"
```

whereas currently `sbatch_options` is:

```python
{'partition': 'normal', 'output': 'logs/slurm-%j.out', 'error': 'logs/slurm-%j.err', 'constraint': 'mem800MB', 'mem': 500, 'ntasks': 1}
```

Basically it has to do with the order in which the options are updated: the constraint is set in step 5 of the submit script, but `mem` would be updated given this information (`mem800MB`) by the `adjust_to_partition` step, which is step 3. Moving this step to step 6 solves the issue, and all the other advanced tests pass too.
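Boiled down to a self-contained toy (the function and the constraint parsing below are illustrative, not the actual submit-script code), the ordering issue looks like this:

```python
import re

def adjust_to_constraint(options):
    """If a constraint like 'mem800MB' is present, raise mem to match it."""
    m = re.match(r"mem(\d+)MB$", options.get("constraint", ""))
    if m:
        options["mem"] = max(options.get("mem", 0), int(m.group(1)))

options = {"mem": 500, "ntasks": 1}

# Old order: the adjustment ran before the per-rule constraint was applied,
# so mem stayed at 500.
adjust_to_constraint(options)
options["constraint"] = "mem800MB"  # per-rule cluster config (step 5)

# Fixed order: running the adjustment last picks up the constraint.
adjust_to_constraint(options)
print(options)  # {'mem': 800, 'ntasks': 1, 'constraint': 'mem800MB'}
```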
Haha ok, by "mess" I mainly meant there is quite a bit of repetition that could be DRYed out, and given its complexity (necessary because of the cluster emulation) a few more comments or a mini README would be helpful; it took me quite some time to get my head around it. Also, I think it would be useful to be able to run the tests in a "real" SLURM environment, but currently the tests can't really be isolated from Docker. I'm still getting a Snakemake "directory is locked" error in `test_no_timeout` if `test_timeout` is run before it, despite the `--unlock` in that test. And `test_cluster_short_queue` keeps hanging and I can't figure out why. But I think neither is a blocker.
I updated the following:

- added `nodes` to the resource mapping
- renamed `adjust_to_partition` to the more general `advanced_argument_conversion`
- renamed `default_cluster_config` to `cluster_config`
- the `slurm_options` dict

Good to merge if you are ok with it.
Great! Let's merge this and have a separate issue on improving the tests. Thanks for the effort!
This is a major refactor and merge of the slurm-submit scripts to address #31 and #15. See the README for a description of the cookiecutter options and the way arguments are passed to SLURM (sbatch).
This implementation essentially allows setting any sbatch option. Since the `--cluster-config` Snakemake option is deprecated since v5.10 and it is encouraged to use profiles instead, I implemented the `default_cluster_config` option, which allows setting default or per-rule arguments to sbatch. Note that environment variables will be expanded in the path so that generic paths may be used (e.g. `$HOME/snakemake/slurm_config.yaml`).
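Presumably the expansion is done with something along these lines (a guess at the mechanism, not a quote from the code):

```python
import os.path

# Expand $VARS and ~ in the configured path; the exact call used by the
# profile is an assumption.
path = os.path.expandvars(os.path.expanduser("$HOME/snakemake/slurm_config.yaml"))
```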
I merged both slurm-submit scripts, and the automatic adjustment of sbatch arguments is now available if the `adjust_to_partition` option is set. This is IMHO a fairly buggy feature, as it really depends on the SLURM setup (e.g. I wasn't able to get it to run on my cluster), i.e. cluster setups are too heterogeneous to implement a general solution here. But it might serve as a start for people to implement such a conversion for their setup.