Open dmitrishastin opened 2 years ago
Hi Dmitri,
I had no idea that FSL had the capability to auto-submit jobs to SLURM. Do you know if this is a new feature, or a non-default plugin of some kind?
In https://github.com/MRtrix3/mrtrix3/commit/3d83a1222c2d07e37a80917078a3962227591719 we precluded entirely the submission of SGE jobs. That was reasonably easy using an environment variable. If run_first_all is reading directly from fsl_sub and not an envvar, then your proposed solution is likely the best one.
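For context, a rough sketch of that general environment-variable approach (illustrative only, not the actual code from that commit; the run_first_all arguments are just an example):

import os
import subprocess

# Illustrative only: classic fsl_sub chooses SGE when SGE_ROOT is set and
# falls back to local execution otherwise, so stripping the variable from
# the child environment prevents SGE submission from within a script.
child_env = os.environ.copy()
child_env.pop('SGE_ROOT', None)
subprocess.run(['run_first_all', '-i', 'T1.nii.gz', '-o', 'first', '-s', 'L_Hipp,R_Hipp'],
               env=child_env)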
If you'd like to be credited with the fix, I'd encourage you to generate the PR. We have contribution guidelines here that include some of the MRtrix3 nuances. This case would be one that we'd merge directly to master.
Cheers Rob
Hi Rob,
I had no idea that FSL had the capability to auto-submit jobs to SLURM. Do you know if this is a new feature, or a non-default plugin of some kind?
From version 6.0.6, SLURM appears to be supported by default; otherwise it is available as a plugin (see here).
If run_first_all is reading directly from fsl_sub and not an envvar, then your proposed solution is likely the best one.
run_first_all directly references ${FSLDIR}/bin/fsl_sub. However, on second thought, it may be better to scrape for the method indirectly, as fsl_sub itself can be configured differently. If fsl_sub is run from the shell, after a certain version of FSL it should print 'method before check: xxx' and 'method after check: xxx'; the "check" I believe ensures that jobs don't spawn other jobs, so just the "before" output should be fine for scraping. As such, the modification could be:
import os
import subprocess
# Ask fsl_sub which submission method it would use, and enable waiting if it is SLURM
result = subprocess.run([os.path.join(os.environ['FSLDIR'], 'bin', 'fsl_sub')], capture_output=True)
if 'method before check: SLURM' in result.stdout.decode('utf-8'):
    enable_wait = True
I should add that, at least in our set-up, SLURM jobs can also occasionally get stuck with 'launch failed requeued held' status. I might just avoid submission to SLURM entirely by prefixing 5ttgen with export FSL_SUB_METHOD=NONE.
Is this the environment variable you referred to?
It may indeed just be best to leave things as they are and run FIRST jobs in series, in which case I could just leave a post on the community forum so others are aware of this problem/solution? Alternatively, an option could be added to 5ttgen to enforce the environment variable and override FIRST job submission to the cluster, depending on user preference?
Thanks, D
I'm hesitant about scraping, because FMRIB could make changes to how that file works that would be internally consistent but break external parsing of it.
If there's an environment variable that overrules SLURM submission, the safest approach would be to set that in the custom environment in the run module around here. That would disable FSL SLURM submission from inside any MRtrix3 script across the board, which should prevent unwanted errors on master arising from this new FSL feature. If there's adequate motivation to re-enable such, that could subsequently be proposed as a software enhancement on dev, with less time pressure and adequate proof that it works.
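A minimal sketch of what that could look like (the shared_env name is hypothetical, not the actual variable used in the run module; FSL_SUB_METHOD=NONE is the override mentioned above):

import os

# Hypothetical sketch: build the environment handed to every command invoked
# by an MRtrix3 script, forcing fsl_sub to run jobs locally rather than
# submitting them to SLURM (or any other scheduler).
shared_env = os.environ.copy()
shared_env['FSL_SUB_METHOD'] = 'NONE'
# shared_env would then be passed as env= to the relevant subprocess calls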
It would be useful if you could contribute this information to the relevant Wiki on the forum.
Cheers Rob
Just checking: line 54 onwards of the run module makes sure SGE jobs don't get submitted from within an SGE environment. As far as I can tell, it does not automatically preclude the submission of SGE jobs otherwise, am I wrong?
I trust a similar check can be performed with SLURM - I just submitted python3 -c 'import os; print(os.environ.copy()); exit()' to SLURM and confirmed the existence of several suitable variables (e.g., SLURM_JOB_ID). However, looking through online documentation as well as examining my own environment, I don't think there is an obvious variable analogous to SGE_ROOT that would overtly point to SLURM running instructions / root directory, which could then be used to check whether FSL might be submitting to SLURM automatically. I suppose a workaround is to scrape the output of which sbatch (sbatch being the SLURM job submission executable), but that's again indirect.
Happy to contribute either way, let me know what's best based on the above. Thanks.
Just checking: line 54 onwards of the run module makes sure SGE jobs don't get submitted from within an SGE environment. As far as I can tell, it does not automatically preclude the submission of SGE jobs otherwise, am I wrong?
Correct (3d83a1222c2d07e37a80917078a3962227591719, #1155). My memory here is a little hazy; I think that there was some specific problem with an SGE-executed job submitting SGE-executed jobs, but if the MRtrix3 script was not executed within an SGE environment, then permitting FIRST to submit SGE jobs is fine as long as one uses the fsl.check_first() function. That isn't actually the problem here: the issue is not one of SLURM-executed jobs submitting SLURM-executed jobs, but simply of FIRST executing SLURM jobs and fsl.check_first() not being aware of such.
(Indeed, more fundamentally, the issue is that run_first_all is itself not checking its own outputs. IMO, from a software design perspective, run_first_all should be doing exactly what fsl.check_first() does, and should be generating a non-zero return code if not all segmentations were successful, regardless of how the segmentation jobs were submitted internally. But that's outside of my control.)
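To make that design point concrete, a hypothetical sketch (the function name and the '<prefix>-<structure>_first.vtk' naming are assumptions for illustration, not verified FIRST behaviour):

import os
import sys

def check_first_outputs(prefix, structures):
    # Hypothetical sketch: verify that FIRST produced a surface for every
    # requested structure, and signal failure with a non-zero code
    # regardless of how the segmentation jobs were run.
    missing = [s for s in structures if not os.path.isfile('%s-%s_first.vtk' % (prefix, s))]
    if missing:
        sys.stderr.write('FIRST failed to segment: ' + ', '.join(missing) + '\n')
        return 1
    return 0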
So there's two potential solutions:
1. Disable FSL's SLURM submission from within MRtrix3 scripts (via the environment variable), so that FIRST jobs are executed locally.
2. Inform fsl.check_first() of the fact that FIRST is going to submit SLURM jobs, and have it perform the same dynamic monitoring of the presence of files as it already does if such jobs are to be submitted via SGE.
My point above is that:
1. The former could be merged directly to master and hence be included in 3.0.4.
2. The latter would need to go through dev and be included in 3.1.0.
I'm therefore actually proposing both.
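A sketch of the second option, i.e. the kind of dynamic file monitoring described (names, naming convention and intervals are illustrative, not the actual fsl.check_first() implementation):

import os
import time

def wait_for_first_outputs(prefix, structures, timeout=3600.0, interval=5.0):
    # Illustrative polling loop: once FIRST jobs have been handed to a
    # scheduler (SGE or SLURM), wait until every expected surface file
    # appears, or give up after 'timeout' seconds.
    expected = ['%s-%s_first.vtk' % (prefix, s) for s in structures]
    start = time.time()
    while time.time() - start < timeout:
        if all(os.path.isfile(f) for f in expected):
            return True
        time.sleep(interval)
    return False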
Describe the bug
Forgive me if this has already been discussed - I could only find topics addressing FSL FIRST with SGE but not SLURM on the community forum, so thought it was worth highlighting. Having a similar issue: 5ttgen freesurfer / hsvs fails if using FIRST, because jobs get submitted to the cluster and the script moves on before they are completed, resulting in the "0 out of 14 structures segmented" (paraphrasing) error.
To Reproduce
Submit a 5ttgen job with an algorithm that uses FSL FIRST, with FSL configured to submit jobs to SLURM where supported. $FSL_BIN/fsl_sub has the following line:
METHOD=${FSL_SUB_METHOD:=SLURM}
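(For clarity, that ${FSL_SUB_METHOD:=SLURM} parameter expansion makes SLURM the default unless the variable is already set; roughly the following in Python:)

import os

# Rough Python equivalent of the shell default above: use FSL_SUB_METHOD if it
# is set and non-empty, otherwise fall back to SLURM - which is why exporting
# FSL_SUB_METHOD=NONE overrides the cluster submission.
method = os.environ.get('FSL_SUB_METHOD') or 'SLURM'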
Platform/Environment/Version
Please provide the following information:
OS: CentOS 7
MRtrix3 version: 3.0.3-382-gc339da2e-dirty
My solution so far is to modify fsl.py to check whether FSL is set to submit jobs to SLURM and, if so, adopt a behaviour similar to what's already in place for SGE. Happy to open a pull request with this minor change if appropriate / will be grateful for a more obvious solution. Thanks!
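A sketch of the kind of check described (the same idea as the snippet quoted earlier in the thread; enable_wait is a hypothetical flag standing in for whatever fsl.py already does for SGE):

import os
import subprocess

# Ask fsl_sub which submission method it would use, and if that is SLURM,
# fall back to the wait-for-outputs behaviour already used for SGE.
fsl_sub = os.path.join(os.environ['FSLDIR'], 'bin', 'fsl_sub')
result = subprocess.run([fsl_sub], capture_output=True)
if 'method before check: SLURM' in result.stdout.decode('utf-8'):
    enable_wait = True  # hypothetical flag for the existing SGE-style handling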