294coder / Dif-PAN

Diff-PAN: Denoising Diffusion Model for Pansharpening official repository

CUDA out of memory #3

Closed: dariepetcu closed this issue 1 month ago

dariepetcu commented 6 months ago

Hello, I am trying to run your code on the test splits of the wv3/qb/gf2 datasets, which I found in the PanCollection repository. First, when using the FullData h5 file, the code always raises a KeyError on d["gt"] in pan_dataset.py: KeyError: "Unable to synchronously open object (object 'gt' doesn't exist)"
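
For reference, this is roughly how I check what the h5 file actually contains (the file path below is just a placeholder for my local copy, not a file name from your repo):

import h5py

# list the top-level datasets in the test file; in my case the
# full-resolution file lists no 'gt' entry, which is what triggers the KeyError above
with h5py.File("path/to/FullData_test.h5", "r") as f:
    print(list(f.keys()))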

However, my main issue appears when running the reduced-resolution example h5 files, which do not hit the error above. Instead, I get a CUDA out-of-memory error. The program first prints the following output, and then the error shown further down.

2024-04-02 13:25:53 - INFO - log will print out in ./logs/04-02_13-25-pandiff.log
2024-04-02 13:25:53 - INFO - dataset name: qb
2024-04-02 13:25:53 - INFO - dataset norm division: 2047.0
2024-04-02 13:25:53 - INFO - rgb channel: [0, 1, 2]
use attn: res 8
processing wavelets... done.
datasets shape: pan ms lms gt
(20, 1, 256, 256) (20, 4, 64, 64) (20, 4, 256, 256) (20, 4, 256, 256)
output data ranging in [0, 1]
processing wavelets... done.
datasets shape: pan ms lms gt
(20, 1, 256, 256) (20, 4, 64, 64) (20, 4, 256, 256) (20, 4, 256, 256)
output data ranging in [0, 1]

File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/functional.py", line 2561, in group_norm return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 79.38 MiB is free. Process 94447 has 1.05 GiB memory in use. Process 103551 has 2.92 GiB memory in use. Process 3279216 has 1012.00 MiB memory in use. Process 3294244 has 1012.00 MiB memory in use. Including non-PyTorch memory, this process has 17.53 GiB memory in use. Of the allocated memory 17.03 GiB is allocated by PyTorch, and 46.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I have tried lowering the number of groups in the GroupNorm layers, but that did not help, and enabling expandable segments as the trace suggests (set as shown above) also did not help. The GPU I am using is an RTX 4090. The full error trace is below:

Traceback (most recent call last):
  File "/home/petcu/Desktop/ddif/repo/Dif-PAN/diffusion_engine.py", line 512, in <module>
    engine_google(
  File "/home/petcu/Desktop/ddif/repo/Dif-PAN/diffusion_engine.py", line 232, in engine_google
    diff_loss, recon_x = diffusion_dp(res, cond=cond)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/petcu/Desktop/ddif/repo/Dif-PAN/diffusion/diffusion_ddpm_pan.py", line 770, in forward
    return self.p_losses(x, *args, **kwargs)
  File "/home/petcu/Desktop/ddif/repo/Dif-PAN/diffusion/diffusion_ddpm_pan.py", line 720, in p_losses
    model_predict = self.model(x_noisy, t, cond=cond, self_cond=x_self_cond)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/petcu/Desktop/ddif/repo/Dif-PAN/models/sr3_dwt.py", line 196, in forward
    x = layer(
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/petcu/Desktop/ddif/repo/Dif-PAN/models/sr3_dwt.py", line 661, in forward
    x = self.cond_inj(
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/petcu/Desktop/ddif/repo/Dif-PAN/models/sr3_dwt.py", line 390, in forward
    cond = self.body(cond)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/modules/normalization.py", line 287, in forward
    return F.group_norm(
  File "/home/petcu/miniconda3/envs/ddif/lib/python3.11/site-packages/torch/nn/functional.py", line 2561, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)

294coder commented 6 months ago
  1. The KeyError: "Unable to synchronously open object (object 'gt' doesn't exist)" occurs because the FULL-resolution pansharpening dataset has no GT; that is also why we do not calculate the reduced-resolution metrics (i.e., SAM, ERGAS) in that case. To run test_fn on these files, you should set the full_res arg to True, which means FULL resolution (see the sketch after this list).

  2. The CUDA OOM issue is not related to GroupNorm, and setting n_groups > 1 may cause some color shifts in pansharpening (in my experiments). You can try making the network smaller so that it fits, or run on a GPU with more memory.
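
A rough illustration of point 1; the exact signature of test_fn in the repo may differ, so treat the argument names below (in particular test_data_path) as assumptions rather than the actual API:

# hypothetical call sketch: run the test routine on the full-resolution h5 file,
# with full_res=True so no 'gt' dataset is expected and reference metrics are skipped
test_fn(
    test_data_path="path/to/FullData_test.h5",  # placeholder path
    full_res=True,  # FULL resolution: file contains pan/ms/lms but no 'gt'
)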