Hi @tsuJohc You may try reducing the size of the backpropagation patches, or simply reduce the size of the input image latents. Detaching some layers' features in the denoising U-Net may also help. Meanwhile, we are planning to implement NVS-Solver with OpenSora, which supports different resolutions and may need less GPU memory.
Thanks for the reply. I will try reducing the size of the image latents, but what does "detaching some layers' features in the denoising U-Net" mean? Could you explain it in more detail?
@tsuJohc It is like the solid line (backpropagation through only several layers) compared to the dashed line (normal backpropagation through all layers).
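To illustrate what that means in PyTorch, here is a minimal sketch (the `early`/`late` modules are a toy stand-in for the U-Net, not code from this repo): detaching an intermediate feature cuts the autograd graph, so backward() only keeps activations and builds gradients for the layers after the cut.

import torch
import torch.nn as nn

# Toy two-stage network standing in for the denoising U-Net.
early = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
late = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1))

x = torch.randn(4, 8)

feat = early(x).detach()   # cut the graph: early layers keep no activations
feat.requires_grad_(True)  # gradients will flow only through `late`
loss = late(feat).sum()
loss.backward()            # cheaper than backprop through the whole network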
Hi, I was fortunate to get an RTX A6000 GPU with 48 GB (the same GPU as in the paper). However, I still hit out-of-memory with the following message when running single-image NVS with posterior sampling in demo.sh:
File "svd_interpolate_single_img.py", line 616,
with torch.no_grad():
...
noise_pred = self.unet(
latent_model_input,
t,
encoder_hidden_states=image_embeddings,
added_time_ids=added_time_ids,
return_dict=False,
)[0]
RuntimeError: CUDA out of memory. Tried to allocate 39.55 GiB (GPU 0; 47.54 GiB total capacity; 6.26 GiB already allocated; 37.05 GiB free; 8.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
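As an aside, the allocator hint at the end of that message is easy to try on its own; a minimal sketch (the 512 MiB value is an arbitrary starting point to tune, not a project recommendation):

import os
# Must be set before torch initializes CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # imported after the env var so the allocator picks it up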
Please try replacing it with the following code, which splits the input into two more patches (six in total).
grads = []  # collect the gradient computed on each patch
for ii in range(6):
    with torch.enable_grad():
        latents.requires_grad_(True)
        latents.retain_grad()
        image_latents.requires_grad_(True)
        latent_model_input = latent_model_input.detach()
        latent_model_input.requires_grad = True
        # Freeze the U-Net weights; only the latents need gradients.
        named_param = list(self.unet.named_parameters())
        for n, p in named_param:
            p.requires_grad = False
        # Six patches over the 72x128 latent grid: rows split at [:40]/[32:]
        # and columns at [:48]/[40:88]/[80:], with 8-pixel overlaps.
        if ii == 0:
            latent_model_input1 = latent_model_input[0:1, :, :, :40, :48]
            latents1 = latents[0:1, :, :, :40, :48]
            temp_cond_latents1 = temp_cond_latents[:2, :, :, :40, :48]
            mask1 = mask[0:1, :, :, :40, :48]
        elif ii == 1:
            latent_model_input1 = latent_model_input[0:1, :, :, 32:, :48]
            latents1 = latents[0:1, :, :, 32:, :48]
            temp_cond_latents1 = temp_cond_latents[:2, :, :, 32:, :48]
            mask1 = mask[0:1, :, :, 32:, :48]
        elif ii == 2:
            latent_model_input1 = latent_model_input[0:1, :, :, :40, 40:88]
            latents1 = latents[0:1, :, :, :40, 40:88]
            temp_cond_latents1 = temp_cond_latents[:2, :, :, :40, 40:88]
            mask1 = mask[0:1, :, :, :40, 40:88]
        elif ii == 3:
            latent_model_input1 = latent_model_input[0:1, :, :, 32:, 40:88]
            latents1 = latents[0:1, :, :, 32:, 40:88]
            temp_cond_latents1 = temp_cond_latents[:2, :, :, 32:, 40:88]
            mask1 = mask[0:1, :, :, 32:, 40:88]
        elif ii == 4:
            latent_model_input1 = latent_model_input[0:1, :, :, :40, 80:]
            latents1 = latents[0:1, :, :, :40, 80:]
            temp_cond_latents1 = temp_cond_latents[:2, :, :, :40, 80:]
            mask1 = mask[0:1, :, :, :40, 80:]
        elif ii == 5:
            latent_model_input1 = latent_model_input[0:1, :, :, 32:, 80:]
            latents1 = latents[0:1, :, :, 32:, 80:]
            temp_cond_latents1 = temp_cond_latents[:2, :, :, 32:, 80:]
            mask1 = mask[0:1, :, :, 32:, 80:]
        image_embeddings1 = image_embeddings[0:1, :, :]
        added_time_ids1 = added_time_ids[0:1, :]
        torch.cuda.empty_cache()
        noise_pred_t = self.unet(
            latent_model_input1,
            t,
            encoder_hidden_states=image_embeddings1,
            added_time_ids=added_time_ids1,
            return_dict=False,
        )[0]
        output = self.scheduler.step_single(
            noise_pred_t, t, latents1, temp_cond_latents1, mask1, lambda_ts,
            step_i=i, lr=lr, weight_clamp=weight_clamp, compute_grad=True,
        )
        grad = output.grad
        grads.append(grad)
# Stitch the six patch gradients back together, dropping the 8-pixel overlaps.
grads1 = torch.cat((grads[0], grads[1][:, :, :, 8:, :]), -2)  # columns :48
grads2 = torch.cat((grads[2], grads[3][:, :, :, 8:, :]), -2)  # columns 40:88
grads3 = torch.cat((grads[4], grads[5][:, :, :, 8:, :]), -2)  # columns 80:
grads4 = torch.cat((grads1, grads2[:, :, :, :, 8:], grads3[:, :, :, :, 8:]), -1)
latents = latents - grads4.half()
Thanks for the quick reply!
I replaced it with the provided code (6 patches), but the out-of-memory error still occurred:
RuntimeError: CUDA out of memory. Tried to allocate 39.55 GiB (GPU 0; 47.54 GiB total capacity; 6.25 GiB already allocated; 38.47 GiB free; 7.39 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This time the run is only about 1.1 GB short of memory (it was about 2.5 GB short last time, with 4 patches). Going from 4 patches to 6 patches helps, but it is not enough.
Therefore: (1) do I need to split into more patches for memory efficiency? (2) Could you tell me the rules for patch division?
The patch division is quite simple. The last two dimensions of latent_model_input are 72 and 128, so you just need to make sure the patches evenly cover those two dimensions. I usually give the patches small overlaps to keep continuity across the image.
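For reference, a small hypothetical helper (patch_slices is made up for illustration, not from this repo) that generates evenly spread slices with a fixed overlap; on the 72 x 128 latents it reproduces the six patches used above:

def patch_slices(size, n_patches, overlap=8):
    """Split a dimension of length `size` into `n_patches` equal spans
    that overlap their neighbours by `overlap` pixels. Assumes
    (size + (n_patches - 1) * overlap) is divisible by n_patches."""
    span = (size + (n_patches - 1) * overlap) // n_patches
    step = span - overlap
    return [slice(i * step, i * step + span) for i in range(n_patches)]

# For the 72 x 128 latents: 2 row strips and 3 column strips,
# matching the 6-patch code above.
rows = patch_slices(72, 2)    # [slice(0, 40), slice(32, 72)]
cols = patch_slices(128, 3)   # [slice(0, 48), slice(40, 88), slice(80, 128)]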
Thanks for the active support on my issue!
I found that my older PyTorch version caused the out-of-memory error when running demo.sh on the RTX A6000 (48 GB). Because I only had CUDA 11.3 in my system environment, I had installed PyTorch 1.12.1. After I upgraded to PyTorch 2.2.1, demo.sh worked fine.
In summary, to avoid out-of-memory errors: (1) use a GPU with at least 48 GB; (2) install a newer PyTorch (e.g. PyTorch 2.2.1); (3) split the backpropagation into more patches as @mengyou2 did. I solved my problem with (1) and (2).
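If it helps anyone else, a quick way to confirm which PyTorch and CUDA build are actually in use before re-running demo.sh:

import torch
print(torch.__version__)              # e.g. 2.2.1
print(torch.version.cuda)             # CUDA version the wheel was built for
print(torch.cuda.get_device_name(0))  # should report the RTX A6000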
Hi, I wonder how I can reduce the GPU usage of posterior-sampling-based NVS. I only have an RTX 3090 (24 GB), and I observe obvious distortion with direct guided sampling.
I tried reducing the resolution as in #6, but still got "torch.cuda.OutOfMemoryError: CUDA out of memory." when executing it.
Do you have any suggestions?