Environment-Cuda-Version

VGANGV commented 1 year ago

Hi, thank you for your research. I am very interested in it, but I got the following error when I tried to train Stage I.

(ddm2) root@2080Ti:~/DDM2# python3 train_noise_model.py -p train -c config/hardi_150.json
1.8.0 10.2
export CUDA_VISIBLE_DEVICES=0
23-07-01 00:03:35.315 - INFO: [Phase 1] Training noise model!
Loaded data of size: (81, 106, 76, 160)
23-07-01 00:03:38.916 - INFO: MRI dataset [hardi] is created.
Loaded data of size: (81, 106, 76, 160)
23-07-01 00:03:41.523 - INFO: MRI dataset [hardi] is created.
23-07-01 00:03:41.524 - INFO: Initial Dataset Finished
dropout 0.0 encoder dropout 0.0
23-07-01 00:03:45.363 - INFO: Noise Model is created.
23-07-01 00:03:45.363 - INFO: Initial Model Finished
Traceback (most recent call last):
  File "train_noise_model.py", line 72, in <module>
    trainer.optimize_parameters()
  File "/root/DDM2/model/model_stage1.py", line 62, in optimize_parameters
    outputs = self.netG(self.data)
  File "/root/miniconda3/envs/ddm2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/DDM2/model/mri_modules/noise_model.py", line 44, in forward
    return self.p_losses(x, *args, **kwargs)
  File "/root/DDM2/model/mri_modules/noise_model.py", line 36, in p_losses
    x_recon = self.denoise_fn(x_in['condition'])
  File "/root/miniconda3/envs/ddm2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/DDM2/model/mri_modules/unet.py", line 286, in forward
    x = layer(x)
  File "/root/miniconda3/envs/ddm2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/ddm2/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/root/miniconda3/envs/ddm2/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

I created the environment on a 2080Ti using the .yml file provided in the repo, so the experimental setup should be the same as yours. I therefore suspect that something is wrong with my data processing or config file.

I have used the following code to save the Hardi150 data:

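from dipy.data import get_fnames
from dipy.io.image import load_nifti, save_nifti
# Fetch the Stanford HARDI dataset and re-save the volume as hardi.nii.gz
hardi_fname, hardi_bval_fname, hardi_bvec_fname = get_fnames('stanford_hardi')
data, affine = load_nifti(hardi_fname)
save_nifti('hardi.nii.gz', data, affine)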

I then updated dataroot on lines 17 and 30 of config/hardi_150.json accordingly and kept everything else as it is.

Is this the correct way to save the data and use the config file?

I'm a rookie, so my questions are probably very foolish.

I would greatly appreciate it if you could respond.

tiangexiang commented 1 year ago

Hi, thanks for your interest! I believe this problem comes from the environment rather than the data loading (the message "Loaded data of size: (81, 106, 76, 160)" indicates that the data was loaded correctly). Even if we share the same environment config file, there can still be differences in the CUDA version and sometimes even in the hardware architecture. The error means that the installed PyTorch package could not initialize cuDNN to run the network, so this is probably a PyTorch/CUDA version mismatch. Please see this thread for examples: https://stackoverflow.com/questions/66588715/runtimeerror-cudnn-error-cudnn-status-not-initialized-using-pytorch You may be able to resolve this by installing a PyTorch package that matches your own CUDA version, rather than the one specified in our environment config file.
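
To narrow down this kind of mismatch, it can help to print what the installed PyTorch build reports about CUDA and cuDNN, and then try a tiny convolution on the GPU. The following is only a sketch and is not part of the DDM2 repo; the helper name cuda_env_report and the keys of the returned dictionary are made up for illustration.

```python
def cuda_env_report():
    """Collect version facts that are relevant to a cuDNN initialization error.

    Hypothetical helper for illustration only; not part of DDM2.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment at all.
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version this PyTorch build was compiled against (None if CPU-only).
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        # cuDNN version seen by PyTorch (None if cuDNN is unavailable).
        "cudnn_version": torch.backends.cudnn.version(),
    }

    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
        try:
            # Smoke test: one small 2D convolution on the GPU.
            conv = torch.nn.Conv2d(1, 1, kernel_size=3).cuda()
            conv(torch.zeros(1, 1, 8, 8, device="cuda"))
            torch.cuda.synchronize()
            report["conv2d_ok"] = True
        except RuntimeError as exc:
            report["conv2d_ok"] = False
            report["conv2d_error"] = str(exc)

    return report


if __name__ == "__main__":
    for key, value in cuda_env_report().items():
        print(f"{key}: {value}")
```

If conv2d_ok comes back False with the same cuDNN message, the problem is reproducible outside the training script, which points to the PyTorch install rather than the data or config. Comparing built_with_cuda against what the GPU driver supports is then a reasonable next step.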

VGANGV commented 12 months ago

Thank you for your reply, Tiange!

Following your advice, I reinstalled the latest version of PyTorch and successfully completed all three stages.

Thanks again for your patient reply :). I will close this issue.

StanfordMIMI / DDM2

Environment-Cuda-Version #11