MRtrix3 / mrtrix3

MRtrix3 provides a set of tools to perform various advanced diffusion MRI analyses, including constrained spherical deconvolution (CSD), probabilistic tractography, track-density imaging, and apparent fibre density
http://www.mrtrix.org
Mozilla Public License 2.0

5ttgen: FSL FIRST and SLURM #2509

Open dmitrishastin opened 1 year ago

dmitrishastin commented 1 year ago

Describe the bug

Forgive me if this has already been discussed - I could only find topics addressing FSL FIRST with SGE, but not SLURM, on the community forum, so I thought it was worth highlighting. I'm having a similar issue: 5ttgen freesurfer / hsvs fails when using FIRST, because the FIRST jobs get submitted to the cluster and the script moves on before they are completed, resulting in the "0 out of 14 structures segmented" (paraphrasing) error.

To Reproduce

Run a 5ttgen algorithm that uses FSL FIRST, with FSL configured to submit jobs to SLURM where supported; in that configuration, $FSL_BIN/fsl_sub contains the following line: METHOD=${FSL_SUB_METHOD:=SLURM}

Platform/Environment/Version

Please provide the following information:


My solution so far is to modify fsl.py to check whether FSL is set to submit jobs to SLURM and, if so, adopt a behaviour similar to what's already in place for SGE:

def check_first(prefix, structures): #pylint: disable=unused-variable
  # os and MRtrixError are available via the module-level imports in fsl.py
  from mrtrix3 import app, path #pylint: disable=import-outside-toplevel
  vtk_files = [ prefix + '-' + struct + '_first.vtk' for struct in structures ]
  existing_file_count = sum([ os.path.exists(filename) for filename in vtk_files ])
  if existing_file_count != len(vtk_files):
    enable_wait = False
    if 'SGE_ROOT' in os.environ and os.environ['SGE_ROOT']:
      app.console('FSL FIRST job may have been run via SGE; awaiting completion')
      enable_wait = True
    elif 'FSL_BIN' in os.environ and os.environ['FSL_BIN']:
      # Check whether fsl_sub defaults to submitting jobs via SLURM
      with open(os.path.join(os.environ['FSL_BIN'], 'fsl_sub')) as f:
        if 'METHOD=${FSL_SUB_METHOD:=SLURM}' in f.read():
          app.console('FSL FIRST job may have been run via SLURM; awaiting completion')
          enable_wait = True
    if enable_wait:
      app.console('(note however that FIRST may fail silently, and hence this script may hang indefinitely)')
      path.wait_for(vtk_files)
    else:
      app.DO_CLEANUP = False
      raise MRtrixError('FSL FIRST has failed; ' + ('only ' if existing_file_count else '') + str(existing_file_count) + ' of ' + str(len(vtk_files)) + ' structures were segmented successfully (check ' + path.to_scratch('first.logs', False) + ')')
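
For context, this is roughly how an algorithm script would drive FIRST and then invoke the check above; the file names and run_first_all arguments here are illustrative assumptions rather than the verbatim 5ttgen code:

from mrtrix3 import fsl, run
# Example subset of FIRST structure names; 5ttgen uses the full list of 14
structures = [ 'L_Accu', 'R_Accu', 'L_Caud', 'R_Caud' ]
# On an FSL installation configured for cluster submission, run_first_all may
# return before the segmentation jobs have actually completed
run.command('run_first_all -s ' + ','.join(structures) + ' -i T1.nii -b -o first')
# check_first() then either waits for the expected .vtk files to appear,
# or raises MRtrixError if they never do
fsl.check_first('first', structures)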

Happy to open a pull request with this minor change if appropriate; otherwise I'd be grateful to hear of a more obvious solution. Thanks!

Lestropie commented 1 year ago

Hi Dmitri,

I had no idea that FSL had the capability to auto-submit jobs to SLURM. Do you know if this is a new feature, or a non-default plugin of some kind?

In https://github.com/MRtrix3/mrtrix3/commit/3d83a1222c2d07e37a80917078a3962227591719 we precluded entirely the submission of SGE jobs. That was reasonably easy using an environment variable. If run_first_all is reading directly from fsl_sub and not an envvar, then your proposed solution is likely the best one.
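
For reference, the mechanism is presumably something along these lines - a minimal sketch assuming the run module builds a modified copy of os.environ for child processes, not the verbatim code from that commit:

import os
# Environment handed to child processes by the run module, with the
# SGE-identifying variable removed so that fsl_sub falls back to
# executing FIRST jobs locally rather than submitting them to SGE
child_env = os.environ.copy()
child_env.pop('SGE_ROOT', None)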

If you'd like to be credited with the fix, I'd encourage you to generate the PR. We have contribution guidelines here that include some of the MRtrix3 nuances. This case would be one that we'd merge directly to master.

Cheers Rob

dmitrishastin commented 1 year ago

Hi Rob,

I had no idea that FSL had the capability to auto-submit jobs to SLURM. Do you know if this is a new feature, or a non-default plugin of some kind?

From version 6.0.6, SLURM appears to be supported by default; otherwise it is available as a plugin (see here).

If run_first_all is reading directly from fsl_sub and not an envvar, then your proposed solution is likely the best one.

run_first_all directly references ${FSLDIR}/bin/fsl_sub. However, on second thought, it may be better to query the method indirectly rather than scrape the script itself, as fsl_sub can be configured differently. If fsl_sub is run from the shell, from a certain version of FSL onwards it should print 'method before check: xxx' and 'method after check: xxx'; the "check", I believe, ensures that jobs don't spawn other jobs, so just the "before" output should be fine for scraping. As such, the modification could be:

import os
import subprocess
# Run fsl_sub with no arguments and scrape its diagnostic output for the submission method
# (deliberately not raising if fsl_sub exits with a non-zero code)
result = subprocess.run([ os.path.join(os.environ['FSLDIR'], 'bin', 'fsl_sub') ],
                        capture_output=True)
if 'method before check: SLURM' in result.stdout.decode('utf-8'):
  enable_wait = True

I should add that, at least in our set-up, SLURM jobs can also occasionally get stuck with a 'launch failed requeued held' status. I might just avoid submission to SLURM entirely by running 5ttgen with FSL_SUB_METHOD=NONE exported beforehand. Is this the environment variable you referred to?

It may indeed just be best to leave things as they are and run FIRST jobs in series, in which case I could simply leave a post on the community forum so others are aware of this problem/solution. Alternatively, an option could be added to 5ttgen to enforce the environment variable and override FIRST job submission to the cluster depending on user preference (a rough sketch of that idea is below).
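
Something along these lines is what I have in mind for the latter (purely a sketch - the option name and exact hook points are made up, not existing 5ttgen code; os, parser and app are assumed to be available in the script's scope):

# Hypothetical option in the algorithm's usage() function:
parser.add_argument('-first_nocluster', action='store_true',
                    help='Prevent FSL FIRST from submitting its jobs to a cluster scheduler')

# ...and before run_first_all is invoked:
if app.ARGS.first_nocluster:
  # Overrides the METHOD=${FSL_SUB_METHOD:=SLURM} default in fsl_sub,
  # forcing FIRST jobs to run locally and in series
  os.environ['FSL_SUB_METHOD'] = 'NONE'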

Thanks D

Lestropie commented 1 year ago

I'm hesitant about scraping, because FMRIB could make changes to how that file works that would be internally consistent yet break any external parsing of it.

If there's an environment variable that overrules SLURM submission, the safest approach would be to set that in the custom environment in the run module around here. That would disable FSL SLURM submission from inside any MRtrix3 script across the board, which should prevent unwanted errors on master arising from this new FSL feature. If there's adequate motivation to re-enable it, that could subsequently be proposed as a software enhancement on dev, with less time pressure and adequate proof that it works.
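
i.e. something like the following, wherever the run module assembles the environment passed to child processes (a sketch of the idea only; variable names are illustrative):

import os
# Override the scheduler selection in the environment used for child processes,
# so that FSL tools invoked from MRtrix3 scripts execute their jobs locally
# rather than submitting them to SLURM
child_env = os.environ.copy()
child_env['FSL_SUB_METHOD'] = 'NONE'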

It would be useful if you could contribute this information to the relevant Wiki on the forum.

Cheers Rob

dmitrishastin commented 1 year ago

Just checking: line 54 onwards of the run module makes sure SGE jobs don't get submitted from within an SGE environment. As far as I can tell, it does not otherwise preclude the submission of SGE jobs - am I wrong?

I trust a similar check can be performed with SLURM - I just submitted python3 -c 'import os; print(os.environ.copy()); exit()' to SLURM and confirmed the existence of several suitable variables (e.g., SLURM_JOB_ID). However, looking through the online documentation as well as examining my own environment, I don't think there is an obvious variable analogous to SGE_ROOT that would overtly point to the SLURM running instructions / root directory, which could then be used to check whether FSL might be submitting to SLURM automatically. I suppose a workaround would be to scrape the output of which sbatch (sbatch being the SLURM job submission executable), but that's again indirect.
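
For illustration, the kind of indirect checks that are available (again just a sketch, none of this is existing MRtrix3 code):

import os
import shutil
# A SLURM-executed job defines scheduler-specific variables such as SLURM_JOB_ID;
# note this only detects being *inside* a SLURM job, not whether fsl_sub would *submit* to one
inside_slurm_job = 'SLURM_JOB_ID' in os.environ
# The presence of sbatch on PATH indicates that a SLURM installation is available,
# equivalent to scraping the output of 'which sbatch'
slurm_available = shutil.which('sbatch') is not None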

Happy to contribute either way, let me know what's best based on the above. Thanks.

Lestropie commented 1 year ago

Just checking: line 54 onwards of the run module makes sure SGE jobs don't get submitted from within an SGE environment. As far as I can tell, it does not otherwise preclude the submission of SGE jobs - am I wrong?

Correct (3d83a1222c2d07e37a80917078a3962227591719, #1155). My memory here is a little hazy; I think there was some specific problem with an SGE-executed job submitting SGE-executed jobs, but if the MRtrix3 script was not executed within an SGE environment, then permitting FIRST to submit SGE jobs is fine as long as one uses the fsl.check_first() function. That isn't actually the problem here: the issue is not one of SLURM-executed jobs submitting SLURM-executed jobs, but simply of FIRST executing SLURM jobs and fsl.check_first() not being aware of such.

(Indeed more fundamentally the issue is that run_first_all is itself not checking its own outputs. IMO, from a software design perspective, run_first_all should be doing exactly what fsl.check_first() does, and should be generating a non-zero return code if not all segmentations were successful, regardless of how the segmentation jobs were submitted internally. But that's outside of my control.)

So there are two potential solutions:

  1. Forbid FIRST from submitting SLURM jobs inside of MRtrix3 scripts.
  2. Inform fsl.check_first() of the fact that FIRST is going to submit SLURM jobs, and have it perform the same dynamic monitoring of the presence of files as it already does if such jobs are to be submitted via SGE.

My point above is that solution 1 is the immediate fix that should go straight to master, whereas solution 2 could subsequently be proposed as an enhancement on dev, with less time pressure and adequate proof that it works.

I'm therefore actually proposing both.