PennLINC / qsiprep

Preprocessing and reconstruction of diffusion MRI
http://qsiprep.readthedocs.io
BSD 3-Clause "New" or "Revised" License

CUDA driver/runtime version mismatch #75

Closed: eds-slim closed this issue 4 years ago

eds-slim commented 4 years ago

Hi, I'm currently exploring qsiprep v0.6.4 on Ubuntu 18.04 and have encountered a problem with CUDA. Specifically, very early on, the pipeline throws the following error:

191120-19:50:44,55 nipype.workflow WARNING:
     [Node] Error on "qsiprep_wf.single_subject_00012_wf.dwi_preproc_acq_NDDFC_run_01_wf.hmc_sdc_wf.eddy" (/work/qsiprep_wf/single_subject_00012_wf/dwi_preproc_acq_NDDFC_run_01_wf/hmc_sdc_wf/eddy)
191120-19:50:45,885 nipype.workflow ERROR:
     Node eddy failed to run on host Ixion.
191120-19:50:45,886 nipype.workflow ERROR:
     Saving crash info to /work/wdir/bids50/derivatives/qsiprep/sub-00012/log/20191120-194715_25077cc2-befc-4960-8775-7c4f7057509b/crash-20191120-195045-eckhard-eddy-67060859-878b-4c86-8521-bf03601ca462.txt
Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/multiproc.py", line 69, in run_node
    result['result'] = node.run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 473, in run
    result = self._run_interface(execute=True)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 557, in _run_interface
    return self._run_command(execute)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 637, in _run_command
    result = self._interface.run(cwd=outdir)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/interfaces/base/core.py", line 375, in run
    runtime = self._run_interface(runtime)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/interfaces/fsl/epi.py", line 766, in _run_interface
    runtime = super(Eddy, self)._run_interface(runtime)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/interfaces/base/core.py", line 758, in _run_interface
    self.raise_exception(runtime)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/interfaces/base/core.py", line 695, in raise_exception
    ).format(**runtime.dictcopy()))
RuntimeError: Command:
eddy_cuda  --cnr_maps --flm=linear --ff=10.0 --acqp=/work/qsiprep_wf/single_subject_00012_wf/dwi_preproc_acq_NDDFC_run_01_wf/hmc_sdc_wf/gather_inputs/eddy_acqp.txt --bvals=/work/qsiprep_wf/single_subject_00012_wf/dwi_preproc_acq_NDDFC_run_01_wf/hmc_sdc_wf/dwi_merge/vol0000_tcat.bval --bvecs=/work/qsiprep_wf/single_subject_00012_wf/dwi_preproc_acq_NDDFC_run_01_wf/hmc_sdc_wf/dwi_merge/vol0000_tcat.bvec --imain=/work/qsiprep_wf/single_subject_00012_wf/dwi_preproc_acq_NDDFC_run_01_wf/hmc_sdc_wf/dwi_merge/vol0000_tcat.nii.gz --index=/work/qsiprep_wf/single_subject_00012_wf/dwi_preproc_acq_NDDFC_run_01_wf/hmc_sdc_wf/gather_inputs/eddy_index.txt --mask=/work/qsiprep_wf/single_subject_00012_wf/dwi_preproc_acq_NDDFC_run_01_wf/hmc_sdc_wf/distorted_enhance/fill_holes/vol0000_TruncateImageIntensity_RescaleImage_mask_FillHoles.nii.gz --interp=spline --resamp=jac --niter=5 --nvoxhp=1000 --out=/work/qsiprep_wf/single_subject_00012_wf/dwi_preproc_acq_NDDFC_run_01_wf/hmc_sdc_wf/eddy/eddy_corrected --repol --slm=linear
Standard output:

...................Allocated GPU # -1503168656...................
CUDA error after call to EddyGpuUtils::InitGpu
Error message: CUDA driver version is insufficient for CUDA runtime version
Standard error:

Return code: 1

In order to get this far, I had to manually link libcudart.so.7.5 by setting export SINGULARITYENV_LD_LIBRARY_PATH=/libs and specifying -B /usr/local/cuda-7.5/lib64:/libs in the call to singularity. Without this, eddy_cuda wouldn't find the CUDA runtime library and would crash.
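
Put together, the workaround looked roughly like this (a sketch only: the .sif path and the trailing qsiprep arguments are placeholders, and the CUDA library path will differ between hosts):

# make the bind-mounted host CUDA 7.5 runtime visible inside the container
export SINGULARITYENV_LD_LIBRARY_PATH=/libs
# bind the host's CUDA 7.5 libraries to /libs inside the container
singularity run -B /usr/local/cuda-7.5/lib64:/libs /path/to/qsiprep.sif <qsiprep arguments>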

On the host I have CUDA 9.1 and NVIDIA driver version 390.132. Running the offending command (with eddy_cuda replaced by eddy_cuda9.1)

eddy_cuda9.1  --cnr_maps --flm=linear --ff=10.0 --acqp=/work/qsiprep_wf/single_subject_00012_wf/dwi_preproc_acq_NDDFC_run_01_wf/hmc_sdc_wf/gather_inputs/eddy_acqp.txt --bvals=/work/qsiprep_wf/single_subject_00012_wf/dwi_preproc_acq_NDDFC_run_01_wf/hmc_sdc_wf/dwi_merge/vol0000_tcat.bval --bvecs=/work/qsiprep_wf/single_subject_00012_wf/dwi_preproc_acq_NDDFC_run_01_wf/hmc_sdc_wf/dwi_merge/vol0000_tcat.bvec --imain=/work/qsiprep_wf/single_subject_00012_wf/dwi_preproc_acq_NDDFC_run_01_wf/hmc_sdc_wf/dwi_merge/vol0000_tcat.nii.gz --index=/work/qsiprep_wf/single_subject_00012_wf/dwi_preproc_acq_NDDFC_run_01_wf/hmc_sdc_wf/gather_inputs/eddy_index.txt --mask=/work/qsiprep_wf/single_subject_00012_wf/dwi_preproc_acq_NDDFC_run_01_wf/hmc_sdc_wf/distorted_enhance/fill_holes/vol0000_TruncateImageIntensity_RescaleImage_mask_FillHoles.nii.gz --interp=spline --resamp=jac --niter=5 --nvoxhp=1000 --out=/work/qsiprep_wf/single_subject_00012_wf/dwi_preproc_acq_NDDFC_run_01_wf/hmc_sdc_wf/eddy/eddy_corrected --repol --slm=linear

works well.

Does the singularity container have a CUDA 7.5 dependency built in? And how does this square with the observation that eddy_cuda seems to support only versions 8.0 and 9.1?

Thanks for trying to help figure this out!

eds-slim commented 4 years ago

I have now found out about the --nv option, which is clearly relevant. With this option, the pipeline now crashes with:

...................Allocated GPU # 0...................
thrust::system_error thrown in CudaVolume::common_assignment_from_newimage_vol after resize() with message: function_attributes(): after cudaFuncGetAttributes: invalid device function
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  function_attributes(): after cudaFuncGetAttributes: invalid device function
Aborted (core dumped)

mattcieslak commented 4 years ago

I've never successfully run qsiprep in singularity using the cuda version of eddy.

If using a GPU is very important for you, it's possible to run qsiprep without a container as long as you can install all the dependencies (ANTs, DSI Studio, MRtrix, etc.; it's a pain). You might also have some luck by making a Dockerfile that starts with one of the Ubuntu 16.04 images from https://hub.docker.com/r/nvidia/cuda. In theory, you can just replace the FROM statement at the beginning of the Dockerfile.
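
To make that concrete, the change amounts to swapping the base image at the top of the qsiprep Dockerfile; the tag below is the one used later in this thread, and everything after the FROM line would stay as shipped:

# replace the existing base image of the qsiprep Dockerfile with a CUDA-enabled
# one so that eddy_cuda can find the driver/runtime libraries at run time
FROM nvidia/cuda:9.1-runtime-ubuntu16.04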

If you open a PR I'd be happy to help try to figure it out. For now, the eddy_openmp binary that comes with the image works slowly but reliably.

eds-slim commented 4 years ago

Great, thanks, I'll try to bootstrap the docker file from one of the CUDA-enabled ubuntu images.

While I'm at it, is there a reason you're using FSL 5.0.11 (which comes with CUDA 7.5) rather than a more recent version?

mattcieslak commented 4 years ago

Would you recommend an upgrade? 0.7 will have an upgraded ANTs, Dipy and MRtrix, so maybe this would be a good time to update FSL also.

eds-slim commented 4 years ago

Yes, I would recommend updating to FSL 6. In addition to various technical improvements and optimisations, it also includes eddyqc, which is, as far as I can tell, one of the very few automated quality assessment tools for diffusion data. I bootstrapped the Dockerfile FROM nvidia/cuda:9.1-runtime-ubuntu16.04 without problems and got eddy_cuda9.1 from fsl-6.0.2-centos7_64.tar.gz to run without further modifications. Unfortunately, neither DSI Studio nor Convert3D could be downloaded from the URLs specified in the file, so I haven't been able to test the rest of the pipeline yet. Given the small change necessary to the code, and the potentially huge speedup, I think it might be worthwhile to include CUDA support in the current release or one of the upcoming ones. Thanks for your help!

mattcieslak commented 4 years ago

Could you please try the Dockerfile from here: https://github.com/PennBBL/qsiprep/blob/fsl6/Dockerfile. This has fixes for the missing downloads.

eds-slim commented 4 years ago

The Dockerfile seems to work after replacing the base image with nvidia/cuda:9.1-runtime-ubuntu16.04 and patching line 288 of qsiprep/interfaces/eddy.py to read

self._cmd = 'eddy_cuda9.1' if self.inputs.use_cuda else 'eddy_openmp'

rather than

self._cmd = 'eddy_cuda' if self.inputs.use_cuda else 'eddy_openmp'

The processing pipeline runs to completion and the html report looks reasonable to me. I'm now running some of the reconstruction pipelines, but I suppose they shouldn't be affected by the changes to eddy and FSL.
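
As an aside, a slightly more tolerant variant of that line could probe the PATH for whichever eddy_cuda binary is installed instead of hard-coding the 9.1 suffix; this is only a sketch against the interface shown above, not the fix that was eventually merged:

from shutil import which  # standard library; used to check what is on the PATH

if self.inputs.use_cuda:
    # prefer a version-suffixed CUDA binary, fall back to the generic name,
    # and finally to the CPU implementation if no CUDA build is available
    self._cmd = next(
        (cmd for cmd in ('eddy_cuda9.1', 'eddy_cuda8.0', 'eddy_cuda') if which(cmd)),
        'eddy_openmp',
    )
else:
    self._cmd = 'eddy_openmp'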

mattcieslak commented 4 years ago

That is awesome!! Can you confirm that it ran more quickly? I am thinking it makes the most sense to build the official qsiprep image using nvidia/cuda:9.1-runtime-ubuntu16.04, so that those with GPUs can use it if they want to and the openmp version should still work. Did you need to do anything special to build the docker image?

eds-slim commented 4 years ago

It felt much quicker; I'll try to time a few runs over the weekend. Surprisingly, no further tweaking of the Dockerfile or build process was necessary, so including CUDA support in the official image seems the right thing to do. With FSL upgraded to v6 it might also be a good idea to save the eddy_qc output somewhere and/or include it in the html report, but that's probably not top priority.
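
For anyone who wants the QC output before it is wired into the report, FSL 6's eddy_quad can be run by hand against the eddy outputs; the call is roughly the following, with the angle-bracketed paths standing in for the corresponding files from the eddy command earlier in this thread:

# <eddy_out_base> is the --out basename passed to eddy (the .../hmc_sdc_wf/eddy/eddy_corrected prefix)
eddy_quad <eddy_out_base> \
    -idx <eddy_index.txt> -par <eddy_acqp.txt> \
    -m <brain_mask.nii.gz> -b <bvals>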

eds-slim commented 4 years ago

I ran a few very non-scientific timing tests with the latest version built from the cuda:9.1 image and including FSL 6. With "use_cuda": false in eddy_params.json, the preprocessing (not including connectome reconstruction) was about 2-3 times slower than with "use_cuda": true (~70 min vs. ~30 min for a single subject, both runs using 8 threads).
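
For reference, a minimal eddy_params.json that switches on the GPU path might look like the sketch below. Only "use_cuda" is quoted from this thread; the other keys are assumed to mirror the nipype Eddy options visible in the eddy command line earlier, so treat them as illustrative rather than the exact file qsiprep ships:

{
    "flm": "linear",
    "slm": "linear",
    "interp": "spline",
    "nvoxhp": 1000,
    "fudge_factor": 10,
    "niter": 5,
    "method": "jac",
    "repol": true,
    "cnr_maps": true,
    "use_cuda": true,
    "output_type": "NIFTI_GZ"
}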

mattcieslak commented 4 years ago

This is great news. It looks like the CI tests are all passing with the updated FSL. I think this is ready to merge.

mattcieslak commented 4 years ago

Could you provide an example of the singularity command you used to run this? I'd like to add an example to the documentation.

eds-slim commented 4 years ago

Absolutely. I used

singularity run -B /mnt/data/HCHS:/work --nv /tmp/test/qsipreptest-2019-11-22-00089af84b60.sif  --participant_label 00012 -w /work --eddy_config /work/wdir/eddy_params.json --output-resolution 2 --skip_bids --fs-license-file /work/license.txt /work/wdir/bids50/ /work/wdir/bids50/derivatives/ participant

The key feature is obviously the --nv option to the singularity run command.

mattcieslak commented 4 years ago

Merged and working on docker too!
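
For Docker, the rough equivalent of Singularity's --nv is the --gpus flag provided via the NVIDIA Container Toolkit; a sketch mirroring the Singularity command above, with an illustrative image name and tag:

docker run --rm --gpus all \
    -v /mnt/data/HCHS:/work \
    pennbbl/qsiprep:latest \
    /work/wdir/bids50 /work/wdir/bids50/derivatives participant \
    --participant_label 00012 -w /work \
    --eddy_config /work/wdir/eddy_params.json \
    --output-resolution 2 --skip_bids \
    --fs-license-file /work/license.txt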