Open glasserm opened 1 year ago
Thanks for the report, Matt. Testing FSLPARALLEL in fsl.check_first() might resolve non-completion in some use cases, though it could result in hanging rather than erroring out in cases where one or more FIRST jobs fail and SGE is not being used.
It's not clear exactly what issue has been encountered and how:
- fsl_sub itself checks for SGE_ROOT, so I don't think that an SGE environment could be in use in its absence;
- fsl_sub could use FSLPARALLEL > 1 without using SGE, and when that happens it calls wait, which means that all subprocesses should have completed by the time run_first_all completes. Absence of expected VTK files in such a scenario should therefore ideally result in an error message, rather than waiting indefinitely for files that will never appear.

Is there a discussion that's happened elsewhere that you can link me to?
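To illustrate the "error rather than wait indefinitely" behaviour described above, here is a minimal sketch. It is not the MRtrix3 code: the function name wait_for_files, its parameters, and the default timeout are all hypothetical, and it only assumes that the expected outputs are ordinary files on disk.

```python
import os
import time


def wait_for_files(paths, timeout=60.0, interval=1.0):
    """Block until every path exists, or raise once the timeout expires.

    Hypothetical helper for illustration only; names and defaults are
    assumptions, not part of MRtrix3 or FSL.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Re-check on every pass so files that appear late are picked up
        missing = [p for p in paths if not os.path.isfile(p)]
        if not missing:
            return
        if time.monotonic() >= deadline:
            # Fail loudly instead of waiting for files that will never appear
            raise FileNotFoundError(
                "Expected output files not found: " + ", ".join(missing)
            )
        time.sleep(interval)
```

When no queue is involved and all subprocesses have already finished, the timeout could be set to zero so that a missing file is reported immediately.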
Debian/Ubuntu gridengine and FSL function without SGE_ROOT to launch jobs on SGE. On my system, swapping in FSLPARALLEL allowed the MRtrix3 code to wait for the jobs to complete; otherwise it failed immediately. I don't see any other environment variables that would work, though I know there is a lot of ongoing modification of fsl_sub in recent versions of FSL, and it is certainly possible that FSLPARALLEL is a holdover from older versions of FSL (it is being set in my .bashrc, and unsetting it does not seem to prevent jobs from going to SGE through fsl_sub either). Perhaps the FSL developers could suggest a better solution for wrapping scripts that call fsl_sub in recent FSL versions? fsl_sub (and first) still return a job ID, so perhaps monitoring the completion of that job would be a solution.
I was looking at the bash fsl_sub in 6.0.5.2; looks like there's a whole Python module now... SGE_ROOT doesn't even appear in that repository... I wasn't aware of those changes.
Capturing the hold job ID from run_first_all might be an option now. I avoided this in the past as it precludes the complete separation between execution and verification. And I've never had an SGE setup on which to test, myself, data where FIRST is and is not successful; I've just iteratively revised based on user-reported issues.
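A rough sketch of what monitoring a captured job ID could look like follows. Everything here is an assumption for illustration: the function names are hypothetical, how the job ID is obtained from run_first_all is left out, and the qstat check relies on the usual Grid Engine behaviour that `qstat -j <id>` exits non-zero once the job is no longer known to the scheduler, which has not been verified against the Debian/Ubuntu gridengine setup discussed here.

```python
import subprocess
import time


def sge_job_active(job_id):
    """Return True while the scheduler still knows about the job.

    Assumption: `qstat -j <id>` exits with a non-zero status once the job
    has left the queue. Raises FileNotFoundError if qstat is not installed.
    """
    result = subprocess.run(
        ["qstat", "-j", str(job_id)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_job(job_id, is_active=sge_job_active, timeout=3600.0, interval=10.0):
    """Poll until the job is no longer active, or raise TimeoutError.

    The is_active callable is injectable so a different scheduler check
    (or a test stub) can be substituted.
    """
    deadline = time.monotonic() + timeout
    while is_active(job_id):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} still active after {timeout} s")
        time.sleep(interval)
```

A job that has left the queue has not necessarily succeeded, so the expected output files would still need to be checked afterwards.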
Manually setting SGE_ROOT in your environment to trick fsl.check_first() is an alternative hack fix that doesn't require modification of code.
(Edit: Obviously not a universal solution, but will get anyone by until I figure out how I want to change the code)
I tried that, but it broke the new fsl_sub when I set it to something random. I couldn't figure out what it actually should be because Debian/Ubuntu gridengine is scattered around multiple folders.
In fsl.py --> check_first, SGE_ROOT is used to check for the SGE queuing system; however, not all SGE queuing systems use SGE_ROOT (e.g., Gridengine on Debian/Ubuntu). If either SGE_ROOT or FSLPARALLEL were checked for, this would work on more folks' systems. I am not a Python coder, but on my system simply replacing SGE_ROOT with FSLPARALLEL resolved the issue. I don't know if SGE_ROOT is still used by others, however.
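The suggestion of checking for either variable could be sketched as below. This is not the actual check_first code: the helper name queue_may_be_in_use is hypothetical, and treating any non-empty value of either variable as a positive hint is an assumption.

```python
import os


def queue_may_be_in_use(environ=None):
    """Return True if FIRST jobs may be running via a queue or in parallel.

    Hypothetical helper: either SGE_ROOT or FSLPARALLEL being set to a
    non-empty value is taken as a hint that jobs may still be running
    after run_first_all returns.
    """
    env = os.environ if environ is None else environ
    return bool(env.get("SGE_ROOT")) or bool(env.get("FSLPARALLEL"))
```

For example, `queue_may_be_in_use({"FSLPARALLEL": "1"})` is true even though SGE_ROOT is absent, which matches the Debian/Ubuntu case described above.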