ZHU-Zhiyu / NVS_Solver

Source code of paper "NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer"

change resolution for single image mode #6

Closed twblaotang closed 2 weeks ago

twblaotang commented 3 weeks ago

Hi, thank you for your excellent work! I tried your single-image script with the resolution changed to 320x192, but the final output deteriorated.

Below is the code I modified. First, I changed the image and mask resolution in `save_warped_image`:

    K[0,0] = focal; K[1,1] = focal; K[0,2] = 320/2; K[1,2] = 192/2
    image_o = image_o.resize((320,192),PIL.Image.Resampling.NEAREST) 
    new_width, new_height = 320,192  # New dimensions
    depth = cv2.resize(depth, (new_width, new_height), interpolation=cv2.INTER_NEAREST)
...and in the loop:

    mask_erosion = mask_erosion.reshape(192//8,8,320//8,8).transpose(0,2,1,3).reshape(192//8,320//8,64)
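For context, the bookkeeping above can be sketched as a self-contained helper (hypothetical, plain NumPy in place of cv2/PIL): move the principal point to the new image center, nearest-neighbour-resize the depth map, and regroup the mask into 8x8 patches exactly as the `reshape(...).transpose(...)` line does. Both dimensions must stay divisible by 8 to match the VAE's 8x spatial downsampling.

```python
import numpy as np

def rescale_inputs(K, depth, mask, width=320, height=192, patch=8):
    """Hypothetical helper mirroring the snippet above (NumPy only).

    `depth` is at the original resolution; `mask` is already at the
    target (height, width) resolution, as in the loop above.
    """
    assert width % patch == 0 and height % patch == 0, \
        "dims must be divisible by 8 (VAE downsampling / patch grouping)"
    K = K.copy()
    K[0, 2], K[1, 2] = width / 2, height / 2  # new principal point
    # nearest-neighbour resize of the depth map
    ys = np.arange(height) * depth.shape[0] // height
    xs = np.arange(width) * depth.shape[1] // width
    depth = depth[np.ix_(ys, xs)]
    # group the (height, width) mask into (height//8, width//8, 64) patches
    mask = (mask.reshape(height // patch, patch, width // patch, patch)
                .transpose(0, 2, 1, 3)
                .reshape(height // patch, width // patch, patch * patch))
    return K, depth, mask
```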

When calling `pipe`, I added the `width` and `height` params:

    pipe([image_o], temp_cond=cond_image, mask=masks, lambda_ts=lambda_ts, lr=lr, weight_clamp=weight_clamp,
         num_frames=25, decode_chunk_size=8, num_inference_steps=100, height=192, width=320).frames[0]

Finally, in the denoise loop, I removed the patch cropping for the new resolution:

                for ii in range(1):
                    with torch.enable_grad(): 
                        latents.requires_grad_(True)
                        latents.retain_grad()
                        image_latents.requires_grad_(True)         
                        latent_model_input = latent_model_input.detach()
                        latent_model_input.requires_grad = True

                        named_param = list(self.unet.named_parameters())
                        for n,p in named_param:
                            p.requires_grad = False
                        latent_model_input1 = latent_model_input[0:1,:,:,:,:]
                        latents1 = latents[0:1,:,:,:,:]
                        temp_cond_latents1 = temp_cond_latents[:2,:,:,:,:]
                        mask1 = mask[0:1,:,:,:,:]
                        image_embeddings1 = image_embeddings[0:1,:,:]
                        added_time_ids1 = added_time_ids[0:1,:]
                        torch.cuda.empty_cache()
                        print(latents.shape, latent_model_input.shape, latent_model_input1.shape, t.shape, image_embeddings1.shape, added_time_ids1.shape)
                        noise_pred_t = self.unet(
                            latent_model_input1,
                            t,
                            encoder_hidden_states=image_embeddings1,
                            added_time_ids=added_time_ids1,
                            return_dict=False,
                        )[0]

                        output = self.scheduler.step_single(noise_pred_t, t, latents1,temp_cond_latents1,mask1,lambda_ts,step_i=i,lr=lr,weight_clamp=weight_clamp,compute_grad=True)
                        grad = output.grad
                        grads.append(grad)

                latents = latents - grads[0].half()
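The loop above follows a standard gradient-guidance pattern: differentiate a guidance objective with respect to the latents (here inside `scheduler.step_single`), then subtract the resulting gradient. A minimal toy version of that pattern, with a stand-in `loss_fn` in place of the scheduler's objective:

```python
import torch

def guided_update(latents, loss_fn, lr=0.1):
    """Toy sketch of the gradient-guidance step above. `loss_fn` stands in
    for the guidance objective that scheduler.step_single differentiates."""
    latents = latents.detach().requires_grad_(True)
    loss_fn(latents).backward()  # populates latents.grad
    with torch.no_grad():
        return latents - lr * latents.grad

# e.g. a quadratic objective pulls the latents toward zero:
x = torch.ones(3)
x_new = guided_update(x, lambda z: (z ** 2).sum())  # 1 - 0.1 * 2 = 0.8 each
```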

I saved the cond images and masks, and they seem correct. One mask: 000023_mask; one cond image: 000023_cond

But the final output video seems deteriorated: 000023_result

Could you help me figure out what might have gone wrong? Thank you very much.

mengyou2 commented 3 weeks ago

Hi, thanks for your interest in our project. Your resolution-change part looks fine. Have you tried running the pipeline at the normal size, i.e., 1024x576? Does it function properly at that resolution?

twblaotang commented 3 weeks ago

> Hi, thanks for your interest in our project. Your resolution-change part looks fine. Have you tried running the pipeline at the normal size, i.e., 1024x576? Does it function properly at that resolution?

Sure, I tried a 1024x576 resolution and it produced the desired results. Do you have any suggestions? Should any parameters be adjusted along with the resolution?

mengyou2 commented 3 weeks ago

I think it's because the resolution is too small. I followed the same steps as you and ran the pipeline with input resolutions of 192x320 and 384x640 respectively. Here is the result at 192x320: 001, and at 384x640: 002. Enlarging the image resolution by a factor of two gives the correct result.

twblaotang commented 2 weeks ago

Thank you for your answer.