294coder / Dif-PAN

Diff-PAN: Denoising Diffusion Model for Pansharpening official repository

CUDA out of memory #3

Closed: dariepetcu closed this issue 1 month ago

dariepetcu commented 6 months ago

Hello, I am trying to run your code on the test splits of the wv3/qb/gf2 datasets, which I found in the PanCollection repository. First, when using the FullData h5 file, the code always raises a KeyError on d["gt"] in pan_dataset.py: KeyError: "Unable to synchronously open object (object 'gt' doesn't exist)"
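
For reference, this is roughly how I check what the h5 file actually contains (the file path below is just a placeholder for my local copy, not a file name from your repo):

import h5py

# list the top-level datasets in the test file; in my case the
# full-resolution file lists no 'gt' entry, which is what triggers the KeyError above
with h5py.File("path/to/FullData_test.h5", "r") as f:
    print(list(f.keys()))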

However, my main issue appears when running the reduced-resolution example h5 files, which do not hit the error above. Instead, I get a CUDA out-of-memory error. The program first prints the following output, and then the error shown further down.

2024-04-02 13:25:53 - INFO - log will print out in ./logs/04-02_13-25-pandiff.log
2024-04-02 13:25:53 - INFO - dataset name: qb
2024-04-02 13:25:53 - INFO - dataset norm division: 2047.0
2024-04-02 13:25:53 - INFO - rgb channel: [0, 1, 2]
use attn: res 8
processing wavelets... done.
datasets shape: pan ms lms gt
(20, 1, 256, 256) (20, 4, 64, 64) (20, 4, 256, 256) (20, 4, 256, 256)
output data ranging in [0, 1]
processing wavelets... done.
datasets shape: pan ms lms gt
(20, 1, 256, 256) (20, 4, 64, 64) (20, 4, 256, 256) (20, 4, 256, 256)
output data ranging in [0, 1]

File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/functional.py", line 2561, in group_norm return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 79.38 MiB is free. Process 94447 has 1.05 GiB memory in use. Process 103551 has 2.92 GiB memory in use. Process 3279216 has 1012.00 MiB memory in use. Process 3294244 has 1012.00 MiB memory in use. Including non-PyTorch memory, this process has 17.53 GiB memory in use. Of the allocated memory 17.03 GiB is allocated by PyTorch, and 46.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I have tried lowering the number of groups in the GroupNorm layers, but that did not help, and enabling expandable segments as the trace suggests (set as shown above) also did not help. The GPU I am using is an RTX 4090. The full error trace is below:

Traceback (most recent call last):
  File "/home/petcu/Desktop/ddif/repo/Dif-PAN/diffusion_engine.py", line 512, in <module>
    engine_google(
  File "/home/petcu/Desktop/ddif/repo/Dif-PAN/diffusion_engine.py", line 232, in engine_google
    diff_loss, recon_x = diffusion_dp(res, cond=cond)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/petcu/Desktop/ddif/repo/Dif-PAN/diffusion/diffusion_ddpm_pan.py", line 770, in forward
    return self.p_losses(x, *args, **kwargs)
  File "/home/petcu/Desktop/ddif/repo/Dif-PAN/diffusion/diffusion_ddpm_pan.py", line 720, in p_losses
    model_predict = self.model(x_noisy, t, cond=cond, self_cond=x_self_cond)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/petcu/Desktop/ddif/repo/Dif-PAN/models/sr3_dwt.py", line 196, in forward
    x = layer(
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/petcu/Desktop/ddif/repo/Dif-PAN/models/sr3_dwt.py", line 661, in forward
    x = self.cond_inj(
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/petcu/Desktop/ddif/repo/Dif-PAN/models/sr3_dwt.py", line 390, in forward
    cond = self.body(cond)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/normalization.py", line 287, in forward
    return F.group_norm(
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/functional.py", line 2561, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)

294coder commented 6 months ago
  1. The KeyError: "Unable to synchronously open object (object 'gt' doesn't exist)" occurs because the FULL-resolution pansharpening dataset has no GT; that is also why we do not calculate the reduced-resolution metrics (i.e., SAM, ERGAS) in that case. To run test_fn on these files, you should set the full_res arg to True, which means FULL resolution (see the sketch after this list).

  2. The CUDA OOM issue is not related to GroupNorm, and setting n_groups > 1 may cause some color shifts in pansharpening (in my experiments). You can try making the network smaller so that it fits, or run on a GPU with more memory.
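
A rough illustration of point 1; the exact signature of test_fn in the repo may differ, so treat the argument names below (in particular test_data_path) as assumptions rather than the actual API:

# hypothetical call sketch: run the test routine on the full-resolution h5 file,
# with full_res=True so no 'gt' dataset is expected and reference metrics are skipped
test_fn(
    test_data_path="path/to/FullData_test.h5",  # placeholder path
    full_res=True,  # FULL resolution: file contains pan/ms/lms but no 'gt'
)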