Environment conflicts with GPU

anar-rzayev commented 5 months ago

Hi, thanks a lot for your attention to these issues. I wanted to ask for your comment on the following error, which I get when I try to train Stage 1:

24-01-18 01:06:41.203 - INFO: [Phase 1] Training noise model!
24-01-18 01:07:04.744 - INFO: MRI dataset [hardi] is created.
24-01-18 01:07:23.001 - INFO: MRI dataset [hardi] is created.
24-01-18 01:07:23.001 - INFO: Initial Dataset Finished
/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/cuda/__init__.py:104: UserWarning: 
NVIDIA RTX 6000 Ada Generation with CUDA capability sm_89 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA RTX 6000 Ada Generation GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
24-01-18 01:07:23.542 - INFO: Noise Model is created.
24-01-18 01:07:23.542 - INFO: Initial Model Finished
1.8.0 10.2
export CUDA_VISIBLE_DEVICES=2
Loaded data of size: (118, 118, 25, 56)
Loaded data of size: (118, 118, 25, 56)
dropout 0.0 encoder dropout 0.0
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
raw_input shape before slicing: (118, 118, 1, 3)
raw_input shape after slicing: (118, 118, 3)
Traceback (most recent call last):
  File "train_noise_model.py", line 72, in <module>
    trainer.optimize_parameters()
  File "/home/anar/DDM2/model/model_stage1.py", line 62, in optimize_parameters
    outputs = self.netG(self.data)
  File "/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/anar/DDM2/model/mri_modules/noise_model.py", line 44, in forward
    return self.p_losses(x, *args, **kwargs)
  File "/home/anar/DDM2/model/mri_modules/noise_model.py", line 36, in p_losses
    x_recon = self.denoise_fn(x_in['condition'])
  File "/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/anar/DDM2/model/mri_modules/unet.py", line 286, in forward
    x = layer(x)
  File "/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/anar/mambaforge-pypy3/envs/ddm2_image/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: no kernel image is available for execution on the device

Previously, when I was trying to denoise HARDI150 volumes, I did not pin any PyTorch version and used Python>=3.10. After noticing the versions in your original environment.yaml, I pinned specific versions of torch, torchvision, and python, and that is when I started to get the error above. Do you think it is better not to pin a PyTorch version at all, or should the versions match exactly?

The reason I ask is that, in the previous issue where the validation loader was not working, I suspected a version mismatch with the environment file. After hitting the error above, I am no longer sure about that either.

anar-rzayev commented 5 months ago

@tiangexiang Any ideas on this?

tiangexiang commented 5 months ago

Sorry for the late response! The error you reported indicates a mismatch between the PyTorch version and the CUDA version: the warning in your log says the installed build supports sm_37, sm_50, sm_60, and sm_70, but not the sm_89 capability of your GPU. You are also right that the validation loader failure is probably due to a version mismatch. So I do recommend duplicating the exact environment specified in environment.yaml, since it is guaranteed to work (be careful with the CUDA version though! It has to match your own hardware).
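
For anyone hitting the same warning, the mismatch can be checked from Python before starting training. The sketch below is a minimal check and not part of DDM2: `parse_sm` and `build_supports` are hypothetical helper names, and the rule they apply (same major architecture, build minor not newer than the device minor) is a simplification of CUDA binary compatibility that ignores PTX forward compatibility. `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()` are the PyTorch calls that report the two sides of the comparison.

```python
import re


def parse_sm(arch):
    """Turn 'sm_89' into (8, 9); return None for other entries such as 'compute_80'."""
    match = re.fullmatch(r"sm_(\d+)(\d)", arch)
    return (int(match.group(1)), int(match.group(2))) if match else None


def build_supports(capability, arch_list):
    """Simplified check: is there a compiled arch with the same major and minor <= device minor?"""
    major, minor = capability
    for arch in arch_list:
        sm = parse_sm(arch)
        if sm is not None and sm[0] == major and sm[1] <= minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None  # the helpers above still work without PyTorch installed

    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)  # e.g. (8, 9) for sm_89
        arch_list = torch.cuda.get_arch_list()            # archs compiled into this build
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        print("device capability:", capability)
        print("compiled archs:", arch_list)
        print("supported by this build:", build_supports(capability, arch_list))
```

With the values from the log above (capability `(8, 9)` and the arch list `sm_37 sm_50 sm_60 sm_70`), `build_supports` returns `False`, which is consistent with the `no kernel image is available` error.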

anar-rzayev commented 4 months ago

@tiangexiang Thanks for the reply. I checked carefully and, to match my hardware, set cudatoolkit=11.3 together with the corresponding PyTorch versions, as follows:

name: ddm2_experiment
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1
  - _openmp_mutex=4.5
  - _pytorch_select=0.1
  - blas=1.0
  - ca-certificates=2022.3.29
  - certifi=2021.10.8
  - cudatoolkit=11.3
  - freetype=2.11.0
  - giflib=5.2.1
  - intel-openmp=2021.4.0
  - jpeg=9d
  - lcms2=2.12
  - ld_impl_linux-64=2.35.1
  - libffi=3.3
  - libgcc-ng=9.3.0
  - libgomp=9.3.0
  - libpng=1.6.37
  - libstdcxx-ng=9.3.0
  - libtiff=4.2.0
  - libuv=1.40.0
  - libwebp=1.2.2
  - libwebp-base=1.2.2
  - lz4-c=1.9.3
  - mkl=2021.4.0
  - mkl-service=2.4.0
  - mkl_fft=1.3.1
  - mkl_random=1.2.2
  - ncurses=6.3
  - ninja=1.10.2
  - openssl=1.1.1n
  - pip=21.2.4
  - python=3.8.13
  - readline=8.1.2
  - setuptools=58.0.4
  - six=1.16.0
  - sqlite=3.38.2
  - tk=8.6.11
  - typing_extensions=4.1.1
  - wheel=0.37.1
  - xz=5.2.5
  - zlib=1.2.11
  - zstd=1.4.9
  - pip:
    - beautifulsoup4==4.11.1
    - charset-normalizer==2.0.12
    - cycler==0.11.0
    - dipy==1.5.0
    - filelock==3.6.0
    - fonttools==4.31.2
    - gdown==4.4.0
    - h5py==3.6.0
    - idna==3.3
    - imageio==2.16.1
    - joblib==1.1.0
    - kiwisolver==1.4.2
    - matplotlib==3.5.1
    - networkx==2.7.1
    - nibabel==3.2.2
    - numpy==1.22.3
    - opencv-python==4.5.4.58
    - packaging==21.3
    - pandas==1.4.1
    - pillow==9.1.0
    - pydicom==2.3.0
    - pyparsing==3.0.7
    - pysocks==1.7.1
    - python-dateutil==2.8.2
    - pytz==2022.1
    - pywavelets==1.3.0
    - pyyaml==6.0
    - requests==2.27.1
    - scikit-image==0.19.2
    - scikit-learn==1.0.2
    - scipy==1.8.0
    - seaborn==0.11.2
    - soupsieve==2.3.2.post1
    - statannot==0.2.3
    - threadpoolctl==3.1.0
    - tifffile==2022.3.25
    - timm==0.4.12
    - torch==1.8.0
    - torchvision==0.9.0
    - tqdm==4.63.1
    - urllib3==1.26.9
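
A quick way to confirm that the active environment really contains the pinned pip packages is to compare the pins against what is installed. This is a minimal sketch under stated assumptions: `parse_pins`, `find_mismatches`, and `installed_versions` are hypothetical helpers, only `name==version` pip pins are handled (conda-style `name=version` lines are ignored), and a local version tag after `+` is stripped before comparing. `importlib.metadata` is in the standard library from Python 3.8.

```python
from importlib import metadata


def parse_pins(lines):
    """Collect 'name==version' pins from YAML list lines; other lines are ignored."""
    pins = {}
    for line in lines:
        entry = line.strip().lstrip("-").strip()
        if "==" in entry:
            name, version = entry.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, installed):
    """Return {name: (pinned, installed_or_None)} for every pin that is not met."""
    mismatches = {}
    for name, want in pins.items():
        have = installed.get(name)
        # Wheels built for a specific CUDA version may carry a local tag
        # such as "+cu111"; it is ignored for the comparison.
        if have is None or have.split("+")[0] != want:
            mismatches[name] = (want, have)
    return mismatches


def installed_versions(names):
    """Look up the installed version of each distribution, skipping missing ones."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


if __name__ == "__main__":
    pins = parse_pins(["    - torch==1.8.0", "    - torchvision==0.9.0"])
    print(find_mismatches(pins, installed_versions(pins)))
```

An empty dictionary means every pin is met. Any entry shows the pinned version next to the version that is actually installed, or `None` if the package is missing.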

Even with the versions matched, I still had problems in the validation part of training:

Validation
Traceback (most recent call last):
  File "train_noise_model.py", line 92, in <module>
    for _,  val_data in enumerate(val_loader):
  File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/anar/mambaforge-pypy3/envs/ddm2_experiment/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/anar/DDM2/data/mri_dataset.py", line 130, in __getitem__
    raw_input = raw_input[:,:,0]
IndexError: index 0 is out of bounds for axis 2 with size 0

Even trying the latest versions of torch and torchvision did not help at all 🙁
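
The final `IndexError` says that axis 2 of `raw_input` already has size 0 before `raw_input[:,:,0]` runs, whereas the training log shows size 1 on that axis. One way an axis ends up empty is a range slice that starts past the end of the axis: Python and NumPy return an empty result for that instead of raising. Whether this is what happens in `mri_dataset.py` is an assumption, and the helper names and the index 40 below are hypothetical; only the shape `(118, 118, 25, 56)` comes from the log.

```python
def sliced_length(dim_size, start, stop):
    """Length of an axis after applying [start:stop] with Python/NumPy slice rules."""
    return len(range(*slice(start, stop).indices(dim_size)))


def check_slice_index(shape, axis, index):
    """Raise a descriptive IndexError if `index` is not a valid position on `axis`."""
    if not 0 <= index < shape[axis]:
        raise IndexError(
            f"slice index {index} is outside axis {axis} of size {shape[axis]}"
        )
    return index


if __name__ == "__main__":
    shape = (118, 118, 25, 56)  # "Loaded data of size" from the training log

    # An in-range slice keeps one element on axis 2.
    print(sliced_length(shape[2], 12, 13))

    # A slice starting past the end silently leaves axis 2 empty
    # (40 is a hypothetical index, not taken from the issue).
    print(sliced_length(shape[2], 40, 41))

    # Checking the index up front turns the silent empty slice into a clear error.
    try:
        check_slice_index(shape, 2, 40)
    except IndexError as err:
        print(err)
```

If the validation indices in the config point outside the loaded volume, a check like this would fail early and name the offending index, rather than surfacing later inside a DataLoader worker.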

StanfordMIMI / DDM2

Environment conflicts with GPU #19