ZHU-Zhiyu / NVS_Solver

Source code of paper "NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer"

How to reduce GPU usage for posterior sampling #13

Open tsuJohc opened 1 week ago

tsuJohc commented 1 week ago

Hi, I wonder how I can reduce the GPU usage of posterior sampling-based NVS. I only have an RTX 3090 (24 GB), and I observe obvious distortion with direct guided sampling.

I tried reducing the resolution as suggested in #6, but still got "torch.cuda.OutOfMemoryError: CUDA out of memory." when executing

# in the optimization part
noise_pred_t = self.unet(
    latent_model_input1,
    t,
    encoder_hidden_states=image_embeddings1,
    added_time_ids=added_time_ids1,
    return_dict=False,
)[0]

Do you have any suggestions on it?

ZHU-Zhiyu commented 1 week ago

Hi @tsuJohc You may try reducing the size of the backpropagation patches, or simply reduce the size of the input image latents. Moreover, detaching some layers' features in the denoising U-Net may also help. Meanwhile, we are planning to implement NVS-Solver on top of OpenSora, which supports different resolutions and may require less GPU memory.
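
As an illustration of reducing the input image latents (a hedged sketch using the stock diffusers StableVideoDiffusionPipeline rather than this repository's custom pipeline; the file name and sizes are placeholders): the SVD VAE downsamples by 8x, so halving both image sides quarters the latent area the U-Net has to process.

import torch
from diffusers import StableVideoDiffusionPipeline
from PIL import Image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")

# SVD's default resolution is 576x1024; 288x512 gives 36x64 latents.
image = Image.open("input.png").resize((512, 288), Image.LANCZOS)
frames = pipe(image, height=288, width=512, decode_chunk_size=4).frames[0]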

tsuJohc commented 6 days ago

Thanks for the reply. I will try reducing the size of the image latents, but what does "detaching some layers' features in the denoising U-Net" mean? Could you explain it in more detail?

ZHU-Zhiyu commented 6 days ago

@tsuJohc [image] Like the solid line (backpropagation through only several layers) compared to the dashed line (normal backpropagation through the whole U-Net).
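
One hypothetical way to realize "backpropagation only through several layers" (a sketch, not the repository's code; the block names follow recent diffusers U-Nets and may differ in your version) is to detach the encoder-side features with forward hooks, so the autograd graph of the early layers is cut and can be freed during the forward pass:

import torch

def _detach(x):
    # Recursively detach tensors inside (possibly nested) tuples.
    if isinstance(x, torch.Tensor):
        return x.detach()
    if isinstance(x, tuple):
        return tuple(_detach(o) for o in x)
    return x

def detach_output_hook(module, inputs, output):
    # A forward hook that returns a value replaces the module's output,
    # so gradients stop flowing back past this point.
    return _detach(output)

# Detach after the encoder blocks; gradients then flow only through
# the mid block and the decoder.
hooks = [blk.register_forward_hook(detach_output_hook)
         for blk in self.unet.down_blocks]

# ... run the guided denoising step that calls self.unet(...) ...

for h in hooks:
    h.remove()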

tsuJohc commented 11 hours ago

Hi, I was fortunate to get an RTX A6000 GPU with 48 GB (the same GPU as in the paper). However, I still hit out-of-memory with the following message when running single-image NVS from demo.sh with posterior sampling.

File "svd_interpolate_single_img.py", line 616,
with torch.no_grad():
    ...        
    noise_pred = self.unet(
        latent_model_input,
        t,
        encoder_hidden_states=image_embeddings,
        added_time_ids=added_time_ids,
        return_dict=False,
    )[0]
RuntimeError: CUDA out of memory. Tried to allocate 39.55 GiB (GPU 0; 47.54 GiB total capacity; 6.26 GiB already allocated; 37.05 GiB free; 8.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
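
(Aside: the allocator hint at the end of the message can be tried as below, but a single 39.55 GiB request is far beyond what fragmentation tuning can recover, so it is unlikely to be the fix here.)

import os
# Must be set before torch initializes CUDA (i.e. before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
import torch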
mengyou2 commented 5 hours ago

https://github.com/ZHU-Zhiyu/NVS_Solver/blob/e8433d60a01eccf7cd967281ec42911aa843c4f2/svd_interpolate_single_img.py#L552-L603

Please try replacing it with the following code; I split it into two more patches (six in total).

grads = []
# (row slice, column slice) for each of the 6 patches: the 72 x 128 latent
# grid is split into 2 rows x 3 columns with an 8-latent overlap.
patches = [
    (slice(None, 40), slice(None, 48)),   # top-left
    (slice(32, None), slice(None, 48)),   # bottom-left
    (slice(None, 40), slice(40, 88)),     # top-middle
    (slice(32, None), slice(40, 88)),     # bottom-middle
    (slice(None, 40), slice(80, None)),   # top-right
    (slice(32, None), slice(80, None)),   # bottom-right
]
for rows, cols in patches:
    with torch.enable_grad():
        latents.requires_grad_(True)
        latents.retain_grad()
        image_latents.requires_grad_(True)
        latent_model_input = latent_model_input.detach()
        latent_model_input.requires_grad = True

        # Freeze the U-Net weights; only the latents need gradients.
        for _, p in self.unet.named_parameters():
            p.requires_grad = False

        # Crop every spatial tensor to the current patch.
        latent_model_input1 = latent_model_input[0:1, :, :, rows, cols]
        latents1 = latents[0:1, :, :, rows, cols]
        temp_cond_latents1 = temp_cond_latents[:2, :, :, rows, cols]
        mask1 = mask[0:1, :, :, rows, cols]
        image_embeddings1 = image_embeddings[0:1, :, :]
        added_time_ids1 = added_time_ids[0:1, :]
        torch.cuda.empty_cache()

        noise_pred_t = self.unet(
            latent_model_input1,
            t,
            encoder_hidden_states=image_embeddings1,
            added_time_ids=added_time_ids1,
            return_dict=False,
        )[0]

        output = self.scheduler.step_single(
            noise_pred_t, t, latents1, temp_cond_latents1, mask1, lambda_ts,
            step_i=i, lr=lr, weight_clamp=weight_clamp, compute_grad=True,
        )
        grads.append(output.grad)

# Stitch the per-patch gradients back into the full 72 x 128 grid; in each
# 8-latent overlap the values come from the earlier patch of the pair.
grads1 = torch.cat((grads[0], grads[1][:, :, :, 8:, :]), -2)  # left column
grads2 = torch.cat((grads[2], grads[3][:, :, :, 8:, :]), -2)  # middle column
grads3 = torch.cat((grads[4], grads[5][:, :, :, 8:, :]), -2)  # right column
grads4 = torch.cat((grads1, grads2[:, :, :, :, 8:], grads3[:, :, :, :, 8:]), -1)
latents = latents - grads4.half()
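
When stitching, the duplicated 8 latents in each overlap are simply dropped from the later patch (the [..., 8:, :] indexing), so every position of the full 72x128 grid receives exactly one gradient value, while the overlap keeps the guidance continuous across patch seams.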
tsuJohc commented 5 hours ago

Thanks for the quick reply!

I replaced it with the provided code (6 patches). However, the out-of-memory error still occurs.

RuntimeError: CUDA out of memory. Tried to allocate 39.55 GiB (GPU 0; 47.54 GiB total capacity; 6.25 GiB already allocated; 38.47 GiB free; 7.39 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This time the allocation falls about 1.1 GB short (it was 2.5 GB short last time, with 4 patches). Going from 4 to 6 patches helps, but it is not enough.

Therefore: (1) do I need to split into more patches for memory efficiency? (2) Could you tell me the rules of the patch division?

mengyou2 commented 2 hours ago

The patch division is quite simple. The last two dimensions of latent_model_input are 72 and 128, so you just need to make sure the patches together cover those two dimensions evenly. I usually give the patches small overlaps to keep the result continuous across the image.
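
For instance, a small helper (not from the repository, just an illustration of the rule) that generates evenly spaced overlapping slices and reproduces the ones used in the code above:

def patch_slices(size, n_patches, overlap):
    # Length of each patch so that n_patches of them, overlapping by
    # `overlap` latents, exactly cover `size` latents.
    step = (size + overlap * (n_patches - 1)) // n_patches
    stride = step - overlap
    return [(i * stride, i * stride + step) for i in range(n_patches)]

rows = patch_slices(72, 2, 8)    # [(0, 40), (32, 72)]
cols = patch_slices(128, 3, 8)   # [(0, 48), (40, 88), (80, 128)]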