McIntosh-Lab / tvb-ukbb

TVB-UKBB Pipeline: TheVirtualBrain implementation of the UK Biobank pipeline

eddy_cuda error when using singularity container. SLURM script and error message attached #164

Closed · yilewang closed this 2 years ago

yilewang commented 2 years ago

Hi! I have a question about the eddy_cuda module... even though I requested a GPU node on TACC, I still fail to execute eddy_cuda... I am wondering whether I need to install an additional NVIDIA driver inside the container to make it run? Thanks!

The error message:

...................Allocated GPU # 0...................
parallel_for failed: no kernel image is available for execution on the device
EDDY:::  cuda/CudaVolume.cu:::  void EDDY::CudaVolume::common_assignment_from_newimage_vol(const NEWIMAGE::volume<float>&, bool):  Exception thrown
EDDY:::  cuda/CudaVolume.h:::  EDDY::CudaVolume::CudaVolume(const NEWIMAGE::volume<float>&, bool):  Exception thrown
EDDY:::  cuda/EddyInternalGpuUtils.cu:::  static void EDDY::EddyInternalGpuUtils::load_prediction_maker(const EDDY::EddyCommandLineOptions&, EDDY::ScanType, const EDDY::ECScanManager&, unsigned int, float, bool, const EDDY::PolationPara&, std::shared_ptr<EDDY::DWIPredictionMaker>, NEWIMAGE::volume<float>&):  Exception thrown
EDDY:::  cuda/EddyGpuUtils.cu:::  static std::shared_ptr<EDDY::DWIPredictionMaker> EDDY::EddyGpuUtils::LoadPredictionMaker(const EDDY::EddyCommandLineOptions&, EDDY::ScanType, const EDDY::ECScanManager&, unsigned int, float, NEWIMAGE::volume<float>&, bool):  Exception thrown
EDDY:::  eddy.cpp:::  EDDY::ReplacementManager* EDDY::Register(const EDDY::EddyCommandLineOptions&, EDDY::ScanType, unsigned int, const std::vector<float, std::allocator<float> >&, EDDY::SecondLevelECModel, bool, EDDY::ECScanManager&, EDDY::ReplacementManager*, NEWMAT::Matrix&, NEWMAT::Matrix&):  Exception thrown
EDDY::: Eddy failed with message EDDY:::  eddy.cpp:::  EDDY::ReplacementManager* EDDY::DoVolumeToVolumeRegistration(const EDDY::EddyCommandLineOptions&, EDDY::ECScanManager&):  Exception thrown

STANDARD ERROR:
+ baseDir=/work/08008/yilewang/ls6/hsam/s123873
+ '[' y == y ']'
+ eddy_cuda --imain=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/DWI.nii.gz --mask=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/rawDTI_B0_brain_mask.nii.gz --topup=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/SynB0/topup --acqp=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/acqparams.txt --index=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/eddy_index.txt --bvecs=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/bvecs --bvals=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/bvals --out=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/data --flm=quadratic --resamp=jac --slm=linear --fwhm=2 --ff=5 --sep_offs_move --nvoxhp=1000 --repol --rms --cnr_maps -v
thrust::system_error thrown in CudaVolume::common_assignment_from_newimage_vol after resize() with message: parallel_for failed: no kernel image is available for execution on the device

The SLURM script:

#!/bin/bash
# simple SLURM script for tvb-ukbb pipeline preprocessing

#-------------------------------------------------------

#set up parameters

#SBATCH -J TVB
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -p gpu-a100
#SBATCH -o job.%j.out
#SBATCH -e job.%j.err
#SBATCH -t 02:00:00

#LD_LIBRARY_PATH=/home/yxw190015/local/gsl-2.6/lib:$LD_LIBRARY_PATH ldd /opt/ohpc/pub/unpackaged/apps/afnibinary/21.0.06/3dROIMaker
#source ~/tvb-pipeline2/tvb-ukbb/init_vars && python ~/tvb-pipeline2/tvb-ukbb/bb_pipeline_tools/bb_pipeline.py s123366_output

# clear LD_PRELOAD so host-preloaded libraries don't leak into the container
export LD_PRELOAD=''

ml load tacc-singularity
ml load cuda
singularity exec --nv -B /work/08008/yilewang/ls6/hsam tvb-ukbb-ls6.sif \
    /opt/fsl/bin/eddy_cuda \
    --imain=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/DWI.nii.gz \
    --mask=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/rawDTI_B0_brain_mask.nii.gz \
    --topup=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/SynB0/topup \
    --acqp=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/acqparams.txt \
    --index=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/eddy_index.txt \
    --bvecs=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/bvecs \
    --bvals=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/bvals \
    --out=/work/08008/yilewang/ls6/hsam/s123873/dMRI/dMRI/data \
    --flm=quadratic --resamp=jac --slm=linear --fwhm=2 --ff=5 \
    --sep_offs_move --nvoxhp=1000 --repol --rms --cnr_maps -v
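
As a quick sanity check for the driver question, this sketch confirms that --nv is binding the host NVIDIA driver into the container (it assumes nvidia-smi is available on the host node, which --nv also binds into the container):

# If this prints the host GPU, the driver binding works and no driver
# needs to be installed inside the image; the failure is then in the
# CUDA kernel images compiled into eddy_cuda itself.
singularity exec --nv tvb-ukbb-ls6.sif nvidia-smi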
yilewang commented 2 years ago

Or, does eddy_cuda require a specific version of CUDA to run the job here? I am using CUDA 11.4 on TACC.

yilewang commented 2 years ago

Also @noahfl, do you mind sharing your Singularity def file? One of the postdocs from TACC who is helping with this technical issue suspects it's a compute capability mismatch between the image and the TACC A100 GPUs... She wants to take a look at it~ Thanks so much!

noahfl commented 2 years ago

Hey Yile. It's definitely a compatibility issue with the A100 GPUs. The only way to enable CUDA acceleration for FSL's EDDY, PROBTRACKX, and BEDPOSTX is to use CUDA 9.2 (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/GPU). A100s are compute capability 8.0 and require CUDA 11.0 or newer, so until the FSL developers provide CUDA 10+ builds of EDDY, PROBTRACKX, and BEDPOSTX we are stuck with non-Ampere GPUs. We use P100s on our HPC system, but V100s should work as well. That's why the container ships with CUDA 9.2.
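
If you want to confirm what the scheduler actually gave you, here's a minimal sketch that prints the allocated GPU's model and compute capability (the compute_cap query field assumes a reasonably recent NVIDIA driver, hence the plain-name fallback):

# Sketch: print the allocated GPU's model and compute capability.
# A CUDA 9.2 build of eddy_cuda carries no kernels for sm_80 (A100),
# which matches the "no kernel image is available" error above.
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader \
    || nvidia-smi --query-gpu=name --format=csv,noheader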

As for the container definition file, use this command to get a definition file for a container:

singularity inspect -d container-name.sif > container-name.def

I'll put it up in a repo for ease of use once I fix it up some more, but this will at least allow you to grab it right now. It also works for grabbing the def file of any container pulled from a remote registry.
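
Once you have the def file you can edit it and rebuild the image in the usual way; a sketch (the edited image name here is just an example, and --fakeroot availability depends on how Singularity is configured on your system):

# Extract the def file, edit it, then rebuild.
singularity inspect -d tvb-ukbb-ls6.sif > tvb-ukbb-ls6.def
# ...edit tvb-ukbb-ls6.def as needed...
singularity build --fakeroot tvb-ukbb-ls6-edited.sif tvb-ukbb-ls6.def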

Btw, these kinds of questions are better suited for the Discussions page. It's a bit easier for us to answer and keep track of them there :smiley:

yilewang commented 2 years ago

Hi Noah! Thanks for the reply! I apologize for not posting this on the Discussions page; I'll do that going forward~ I will check with TACC to see whether they provide any non-Ampere GPUs. I'll close this issue with this comment, and if I have any updates I'll post them on the Discussions page~

Thanks again for helping out!