MRtrix3 / mrtrix3

MRtrix3 provides a set of tools to perform various advanced diffusion MRI analyses, including constrained spherical deconvolution (CSD), probabilistic tractography, track-density imaging, and apparent fibre density
http://www.mrtrix.org
Mozilla Public License 2.0

5ttgen bug with recent version fsl_sub and Debian/Ubuntu Gridengine #2595

Open glasserm opened 1 year ago

glasserm commented 1 year ago

In fsl.py, check_first uses SGE_ROOT to detect the SGE queuing system; however, not all SGE installations set SGE_ROOT (e.g., Gridengine on Debian/Ubuntu). If either SGE_ROOT or FSLPARALLEL were checked for, this would work on more folks' systems. I am not a Python coder, but on my system simply replacing SGE_ROOT with FSLPARALLEL resolved the issue. I don't know whether SGE_ROOT is still used by others, however.
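A minimal sketch of the proposed check, assuming that a non-empty value in either variable is enough to suggest that FIRST jobs may have been submitted to a queue. The function name grid_engine_expected is hypothetical and is not part of MRtrix3; whether FSLPARALLEL is a reliable signal in recent FSL versions is also an assumption.

```python
import os


def grid_engine_expected(environ=None):
    """Return True if SGE_ROOT or FSLPARALLEL is set to a non-empty value.

    Hypothetical helper, not MRtrix3 code. `environ` defaults to os.environ
    and is a parameter only so the check can be exercised in isolation.
    """
    env = os.environ if environ is None else environ
    # An unset or empty variable counts as "not set"; any other value counts.
    return any(env.get(name) for name in ("SGE_ROOT", "FSLPARALLEL"))
```

Passing a plain dictionary, e.g. grid_engine_expected({"FSLPARALLEL": "1"}), makes it possible to try the check without touching the real environment.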

Lestropie commented 1 year ago

Thanks for the report Matt. Testing FSLPARALLEL in fsl.check_first() might resolve non-completion in some use cases, though it would potentially result in hanging rather than erroring out in some cases where one or more FIRST jobs fail and SGE is not being used.
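To illustrate the trade-off, here is a sketch of a bounded wait, assuming the script knows which output files FIRST should produce. It is not the existing MRtrix3 behaviour; the name wait_for_files and the default timeout are hypothetical. A time limit converts an indefinite hang into a reportable failure, at the cost of having to choose a limit.

```python
import os
import time


def wait_for_files(paths, timeout=3600.0, poll_interval=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Wait until every path exists or the timeout elapses.

    Returns the list of paths still missing (empty if all appeared).
    Hypothetical helper; `sleep` and `clock` are injectable for testing.
    """
    deadline = clock() + timeout
    while True:
        missing = [p for p in paths if not os.path.exists(p)]
        # Stop as soon as everything exists, or once the deadline has passed.
        if not missing or clock() >= deadline:
            return missing
        sleep(poll_interval)
```

A caller could raise an error listing the returned paths when the result is non-empty, rather than waiting forever for a FIRST job that failed silently.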

It's not clear exactly what issue has been encountered and how:

Is there a discussion that's happened elsewhere that you can link me to?

glasserm commented 1 year ago

Debian/Ubuntu gridengine and FSL can launch jobs on SGE without SGE_ROOT being set. On my system, swapping in FSLPARALLEL allowed the MRtrix code to wait for the jobs to complete; it failed immediately otherwise. I don't see any other environment variables that would work, though I know there is a lot of ongoing modification of fsl_sub in recent versions of FSL, and it is certainly possible that FSLPARALLEL is a holdover from older versions of FSL (it is being set in my .bashrc, and unsetting it does not seem to prevent jobs from going to SGE through fsl_sub either). Perhaps the FSL developers could suggest a better solution for wrapping scripts that call fsl_sub in recent FSL versions? fsl_sub (and first) still returns a job ID, so perhaps monitoring the completion of that would be a solution.

Lestropie commented 1 year ago

I was looking at bash fsl_sub in 6.0.5.2; looks like there's a whole Python module now... SGE_ROOT doesn't even appear in that repository... I wasn't aware of those changes.

Capturing the hold job ID from run_first_all might be an option now. I avoided this in the past as it precludes the complete separation between execution and verification. And I've never had an SGE setup on which to test data myself where FIRST is and is not successful; I've just iteratively revised based on user-reported issues.
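A rough sketch of what job ID monitoring could look like. It rests on two untested assumptions: that the last non-empty line of the captured output is a purely numeric job ID, and that qstat -j with that ID exits non-zero once the scheduler no longer knows the job. All function names here are hypothetical; none come from MRtrix3 or FSL.

```python
import re
import subprocess
import time


def parse_job_id(stdout_text):
    """Return the job ID if the last non-empty line is all digits, else None."""
    lines = [line.strip() for line in stdout_text.splitlines() if line.strip()]
    if not lines:
        return None
    match = re.fullmatch(r"\d+", lines[-1])
    return match.group(0) if match else None


def qstat_reports_job(job_id):
    """Assumption: `qstat -j <id>` returns 0 only while the job is known."""
    result = subprocess.run(["qstat", "-j", str(job_id)],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL,
                            check=False)
    return result.returncode == 0


def wait_for_job(job_id, is_running=qstat_reports_job, poll_interval=5.0,
                 timeout=None, sleep=time.sleep, clock=time.monotonic):
    """Poll until is_running(job_id) is False; return False on timeout."""
    start = clock()
    while is_running(job_id):
        if timeout is not None and clock() - start >= timeout:
            return False
        sleep(poll_interval)
    return True
```

The job leaving the queue only shows that it has finished, not that FIRST succeeded, so the expected output files would still need to be checked afterwards.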

Manually setting SGE_ROOT in your environment to trick fsl.check_first() is an alternative hack fix that doesn't require modification of code. (Edit: Obviously not a universal solution, but will get anyone by until I figure out how I want to change the code)

glasserm commented 1 year ago

I tried that, but it broke the new fsl_sub when I set it to something random. I couldn't figure out what it actually should be because Debian/Ubuntu gridengine is scattered around multiple folders.