From Zheng:
Lately I submitted jobs that require a lot of memory to the grid, and I think I hit the same problem you did: jobs die when I submit a batch of them. After monitoring bigmem_16.q, I think I found the problem. The host comp98 in bigmem_16.q has an issue: it keeps accepting jobs until it reaches the h_vmem limit, and then the jobs on comp98 are killed. The solution is to exclude this node when you submit jobs to the grid, e.g. qsub -V -cwd -q bigmem_16.q -l h='!comp98' -l h_vmem=16G file_sh
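For reference, here is that command annotated (file_sh stands in for whatever job script gets submitted):

```bash
# Exclude host comp98 when submitting to the bigmem_16.q queue (SGE):
#   -V              export the current environment to the job
#   -cwd            run the job in the current working directory
#   -l h='!comp98'  request any host in the queue EXCEPT comp98
#   -l h_vmem=16G   request 16 GB of virtual memory per job
qsub -V -cwd -q bigmem_16.q -l h='!comp98' -l h_vmem=16G file_sh
```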
qsub is used for submitting the parallel PROBTRACKX jobs, but otherwise the workaround for fsl_sub's constraints is handled by the batch script from commit d8e830e (a sketch of the general pattern is below)
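For illustration only, a minimal sketch of that pattern, NOT the actual script from d8e830e; seeds.txt and the probtrackx2 arguments are placeholders:

```bash
#!/bin/bash
# Hypothetical sketch -- not the script from commit d8e830e.
# Launch one probtrackx2 job per seed mask directly via qsub, so node
# exclusion and memory limits stay under our control instead of fsl_sub's.
# SGE's -b y submits a command directly rather than a wrapper script.
while read -r seed; do
    qsub -b y -V -cwd -q bigmem_16.q -l h='!comp98' -l h_vmem=16G \
        probtrackx2 --seed="$seed" --samples=merged \
        --mask=nodif_brain_mask --dir="out_$(basename "$seed" .nii.gz)"
done < seeds.txt   # seeds.txt: one seed mask path per line (placeholder)
```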
archiving this issue for now
fsl_sub is a wrapper for qsub and causes issues when submitting to the grid. I'll see if I can easily swap out fsl_sub for qsub, so we have better control over parallelization and over which nodes in our grid we submit jobs to
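A rough sketch of what that swap might look like, assuming fsl_sub's -q option for naming the queue and SGE's -b y flag to submit a command without a wrapper script (the probtrackx2 arguments are placeholders):

```bash
# Before: fsl_sub picks the qsub options itself; we only name the queue.
fsl_sub -q bigmem_16.q probtrackx2 --seed=seed.nii.gz --samples=merged \
    --mask=nodif_brain_mask --dir=out

# After: calling qsub directly lets us exclude comp98 and cap memory.
qsub -b y -V -cwd -q bigmem_16.q -l h='!comp98' -l h_vmem=16G \
    probtrackx2 --seed=seed.nii.gz --samples=merged \
    --mask=nodif_brain_mask --dir=out
```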
fsl_sub code: https://github.com/neurolabusc/fsl_sub/blob/master/fsl_sub