Picsart-AI-Research / Text2Video-Zero

[ICCV 2023 Oral] Text-to-Image Diffusion Models are Zero-Shot Video Generators
https://text2video-zero.github.io/

Condition Text To Video generation on first frame #29

Open PaulOrasan opened 1 year ago

PaulOrasan commented 1 year ago

Hi!

This is a great work with amazing results, good job!

I was hoping you could provide some guidance on the following issue. I'm trying to condition the text-to-video generation on a given first frame.

I noticed that the inference scheme for TextToVideoPipeline accepts xT as a parameter, so I've made some modifications to the code and I'm providing the latent encoding of the first frame instead of sampling random latents.

I'm using the VAE and the SD preprocessing scheme. I've verified that encoding the frame and then decoding the latents reproduces (practically) the same image, so that part works.

My issue is that the full generation produces low-quality results, with a diagonal camera movement from left to right, low resolution, and strange "filter"-like artifacts. I assume it has to do with the backward diffusion steps on the first frame, but I'm stuck on what to do next.

I would really appreciate your input on this one. Thanks!

Here's a snippet of the code I use for encoding:

def encode_latents(self, image: PIL.Image.Image):
    # HWC uint8 -> NCHW float in [-1, 1], matching SD preprocessing
    img = np.array(image.convert("RGB"))
    img = img[None].transpose(0, 3, 1, 2)
    img = torch.from_numpy(img).to(dtype=torch.float32) / 127.5 - 1.0
    # VAE-encode, add a frame dimension, and apply the SD latent scaling factor
    latents = self.vae.encode(img.to(device=self.device, dtype=torch.float16)).latent_dist.sample()
    return torch.unsqueeze(latents, 2) * 0.18215
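For context, a rough sketch of how the encoded latents replace the random initial latents (hypothetical usage: pipe, video_length, and the call signature are assumptions; xT is the parameter mentioned above):

# Hypothetical usage: `pipe` exposes encode_latents() as above and accepts xT.
first_frame_latents = pipe.encode_latents(image)            # shape (1, 4, 1, 64, 64)
xT = first_frame_latents.repeat(1, 1, video_length, 1, 1)   # tile across frames
result = pipe(prompt="horse galloping", xT=xT)               # instead of random latents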

Results (all parameters are left at the defaults of the original text-to-video):

Input image (text = "horse galloping"): horse_resized

Produced output:

https://user-images.githubusercontent.com/41448014/229351697-9c7f7dbf-546e-4628-8e1e-5dc92e510c85.mp4

Input image (first frame of the previously generated GIF, prompt = "a horse galloping on a street"): generated_horse

Produced output:

https://user-images.githubusercontent.com/41448014/229365472-e23c676b-37a9-4715-8c66-18e0dcdb5362.mp4

PaulOrasan commented 1 year ago

I've now realised that my mistake was that I didn't do forward steps on the first frame latents.

The code is working fine now after doing 1000 forward steps on the first frame latents :). I'll try to post some results later.
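For anyone curious, the forward pass I mean boils down to adding noise according to the diffusion schedule. A rough sketch using diffusers (not the exact code I'm running; the helper name and schedule values are assumptions):

import torch
from diffusers import DDPMScheduler

def ddpm_forward(first_frame_latents: torch.Tensor, timestep: int = 999) -> torch.Tensor:
    # Noise schedule matching Stable Diffusion's training setup.
    scheduler = DDPMScheduler(beta_start=0.00085, beta_end=0.012,
                              beta_schedule="scaled_linear", num_train_timesteps=1000)
    noise = torch.randn_like(first_frame_latents)
    t = torch.tensor([timestep], device=first_frame_latents.device)
    # q(x_t | x_0): one closed-form jump to timestep t, equivalent to running
    # the forward chain step by step up to that point.
    return scheduler.add_noise(first_frame_latents, noise, t)

The noised result can then be tiled across frames and passed as xT in place of the random initial latents.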

Do you have any suggestions on how to tune hyperparameters such as the motion field strength, t0, t1, etc.?

rob-hen commented 1 year ago

Hi @PaulOrasan,

The motion field strength controls the global motion (camera translation). The parameters t0 and t1 control how much object motion is perceived (e.g., making a walking step). The larger the gap between t0 and t1, the more variance you will observe in the object motion.
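A quick sketch of how these hyperparameters can be set through the repo's model.py interface (parameter names as used there; the values shown are only illustrative):

import torch
from model import Model  # Text2Video-Zero's model.py

model = Model(device="cuda", dtype=torch.float16)

# Larger motion_field_strength_* -> stronger global (camera) motion;
# a larger gap between t0 and t1 -> more variance in the object motion.
params = {
    "t0": 44,
    "t1": 47,
    "motion_field_strength_x": 12,
    "motion_field_strength_y": 12,
    "video_length": 8,
}
model.process_text2video("A horse galloping on a street",
                         fps=4, path="./text2video_horse.mp4", **params)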

Jaxkr commented 1 year ago

I've now realised that my mistake was that I didn't do forward steps on the first frame latents.

The code is working fine now after doing 1000 forward steps on the first frame latents :). I'll try to post some results later.

Hey @PaulOrasan, how did you do the 1000 forward steps on the first frame latents? I have a very similar setup to you but am using ControlNet to generate a walk animation.

I added a method to turn an image into latents

class StableDiffusionControlNetPipelineWithInitialImageSupport(
    StableDiffusionControlNetPipeline
):
    def encode_img_to_latents(self, image: PIL.Image.Image):
        img = np.array(image.convert("RGB"))
        img = img[None].transpose(0, 3, 1, 2)
        # Dividing by 127.5 (half of 255) maps the uint8 pixel values from
        # [0, 255] to [0, 2]; subtracting 1 shifts them to [-1, 1].
        img = torch.from_numpy(img).to(dtype=torch.float32) / 127.5 - 1.0
        image_latents = self.vae.encode(
            img.to(device=self.device, dtype=torch.float16)
        ).latent_dist.sample()

        # Magic number source: https://github.com/huggingface/diffusers/issues/437#issuecomment-1241827515
        return image_latents * 0.18215

Then, I consume it in model.py:

def process_controlnet_pose(self):  # other arguments omitted
    # ...
    image = PIL.Image.open('first.png')
    first_frame_latents = self.pipe.encode_img_to_latents(image)

    print(f"first frame latents size: {first_frame_latents.size()}")
    # Sample random latents for every frame, then overwrite frame 0
    # with the encoded first-frame latents.
    latents = torch.randn((1, 4, h//8, w//8), dtype=self.dtype,
                          device=self.device, generator=self.generator)
    latents = latents.repeat(f, 1, 1, 1)
    print(f"latents size: {latents.size()}")
    print(f"latents[0]: {latents[0].size()}")
    latents[0] = first_frame_latents

with shapes:

video shape: torch.Size([19, 3, 512, 512])
first frame latents size: torch.Size([1, 4, 64, 64])
latents size: torch.Size([19, 4, 64, 64])
latents[0]: torch.Size([4, 64, 64])

However, the first frame is very blurry (like the examples you posted) and the second frame drifts significantly from the first:

https://user-images.githubusercontent.com/1147244/230458760-19fcf297-94a4-4e86-aa86-6d3e67cd8b1a.mp4

@rob-hen: Do you have any recommendation for conditioning the generation on the first frame? I checked the paper and all the code and couldn't find anything.

rob-hen commented 1 year ago

The forward pass should be done using deterministic forward steps (not included in the current code) in order to recover the input image. For the text conditioning, some further adaptation must be done (like null-text inversion).

Jaxkr commented 1 year ago

The forward pass should be done using deterministic forward steps (not included in the current code) in order to recover the input image. For the text conditioning, some further adaptation must be done (like null-text inversion).

Thank you for the reply! Could you point me towards an implementation of deterministic forward steps? I have tried Googling extensively and also asking ChatGPT (but this stuff is too new for it).

Additionally, for null-text inversion, are you referring to https://arxiv.org/pdf/2211.09794.pdf?

rob-hen commented 1 year ago

@Jaxkr The paper you found is correct. It also shows the deterministic DDIM inversion (which I was referring to).

Jaxkr commented 1 year ago

I'm working with https://huggingface.co/docs/diffusers/api/schedulers/ddim_inverse. Will let you know how it goes.
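For reference, a simplified sketch of what that inversion loop can look like (assuming a standard Stable Diffusion pipeline object with unet, tokenizer, and text_encoder, and single-frame latents of shape (1, 4, 64, 64); this is not code from this repo):

import torch
from diffusers import DDIMInverseScheduler

@torch.no_grad()
def ddim_invert(pipe, latents, prompt, num_steps=50, device="cuda"):
    """Run the deterministic DDIM forward (inversion) pass on image latents."""
    inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)
    inverse_scheduler.set_timesteps(num_steps, device=device)

    # Encode the prompt; no classifier-free guidance keeps the inversion faithful.
    text_input = pipe.tokenizer(prompt, padding="max_length",
                                max_length=pipe.tokenizer.model_max_length,
                                truncation=True, return_tensors="pt")
    text_emb = pipe.text_encoder(text_input.input_ids.to(device))[0]

    trajectory = [latents]
    for t in inverse_scheduler.timesteps:  # runs from low noise to high noise
        noise_pred = pipe.unet(latents, t, encoder_hidden_states=text_emb).sample
        latents = inverse_scheduler.step(noise_pred, t, latents).prev_sample
        trajectory.append(latents)
    return latents, trajectory  # latents ~ xT; trajectory is useful for null-text inversion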

PaulOrasan commented 1 year ago

Hi @Jaxkr, sorry for replying late.

I did a DDPM forward pass because it gives good-quality results in terms of resolution (no blurriness). It does work for conditioning the video generation on the first frame, but, as expected, it's not strong guidance: the video does not preserve the exact original image.

Depending on how closely related the provided image is to SD training data, it could produce pretty reasonable representations of the original image or totally divergent outputs.

I was planning to incorporate the null-text inversion also mentioned by @rob-hen and see how it works out, but things are moving slowly on my end due to time constraints. Let me know if it works out for you! :D

zhouliang-yu commented 1 year ago

@PaulOrasan Hey, how's the performance after applying null-text inversion?

rob-hen commented 1 year ago

@PaulOrasan The DDPM forward process normally cannot be inverted, so you will not be able to reproduce the first frame. You should use DDIM forward and DDIM backward.

There should not be issues with quality or blurriness when using DDIM.

PaulOrasan commented 1 year ago

@zhouliang-yu Hi there, sorry for replying so late, I've been away from this GitHub account. It's still a WIP and full of hard-coded stuff; I haven't had much time to attend to it. It is able to reconstruct the original image with the null-text optimization, but the generated output loses some of the consistency.

@rob-hen Thanks for the tips. I only did the DDPM forward initially to get it to generate good-quality samples, which it does. I'm now using DDIM forward and the null-text optimization for the DDIM backward.
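Roughly, the null-text optimization step looks like this (a simplified sketch of the method from the paper linked above, not the exact code I'm running; pipe is a standard Stable Diffusion pipeline with a DDIMScheduler, and inverted_latents is the list of num_steps + 1 latents recorded during DDIM inversion, ordered from clean to fully noised):

import torch
import torch.nn.functional as F

def null_text_optimization(pipe, inverted_latents, text_emb, num_steps=50,
                           guidance_scale=7.5, inner_steps=10, device="cuda"):
    # Optimize the unconditional ("null") embedding at every timestep so that
    # DDIM sampling with classifier-free guidance follows the inversion trajectory.
    uncond_input = pipe.tokenizer("", padding="max_length",
                                  max_length=pipe.tokenizer.model_max_length,
                                  return_tensors="pt")
    uncond = pipe.text_encoder(uncond_input.input_ids.to(device))[0]
    pipe.scheduler.set_timesteps(num_steps, device=device)

    null_embeddings = []
    latents = inverted_latents[-1]  # start from the fully noised latents (~xT)
    for i, t in enumerate(pipe.scheduler.timesteps):
        uncond = uncond.detach().requires_grad_(True)
        optimizer = torch.optim.Adam([uncond], lr=1e-2)
        target = inverted_latents[-(i + 2)]  # inversion latent that is one step less noisy
        for _ in range(inner_steps):
            eps_uncond = pipe.unet(latents, t, encoder_hidden_states=uncond).sample
            with torch.no_grad():
                eps_text = pipe.unet(latents, t, encoder_hidden_states=text_emb).sample
            eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)
            prev = pipe.scheduler.step(eps, t, latents).prev_sample
            loss = F.mse_loss(prev, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        null_embeddings.append(uncond.detach())
        with torch.no_grad():  # take the actual DDIM step with the tuned embedding
            eps_uncond = pipe.unet(latents, t, encoder_hidden_states=uncond).sample
            eps_text = pipe.unet(latents, t, encoder_hidden_states=text_emb).sample
            eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)
            latents = pipe.scheduler.step(eps, t, latents).prev_sample
    return null_embeddings  # one optimized null embedding per timestep

At sampling time, the i-th optimized embedding replaces the regular null embedding at the i-th denoising step; running this in fp32 tends to be more stable than fp16.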

What I noticed is that the output loses quite a lot of consistency (foreground objects tend to become a bit deformed). Do you have any intuition as to what might cause this? I have a feeling that I somehow messed up the cross-frame attention.

adhityaswami commented 1 year ago

Sorry to re-open this. I'm trying to condition a video on an initial image, but I'm struggling to figure out how to do either the DDPM or the DDIM forward passes. I've spent a ton of time trying to understand this, but seem to have hit a dead end. Can I have some help figuring this out? I'm currently passing my own latents and getting a blurry, weird video similar to the ones posted above in this thread.

rob-hen commented 1 year ago

Have you tried doing the forward DDIM and backward DDIM with vanilla Stable Diffusion (on images)? Once you get correct results there, you should be able to use it for T2V-Zero.
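As a concrete sanity check along those lines, a sketch reusing the ddim_invert helper sketched earlier in this thread (model name, file names, and prompt are placeholders):

import numpy as np
import PIL.Image
import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# Encode the image into latents (same preprocessing as earlier in this thread).
image = PIL.Image.open("first.png").convert("RGB").resize((512, 512))
img = torch.from_numpy(np.array(image)[None].transpose(0, 3, 1, 2)).to(
    "cuda", torch.float16) / 127.5 - 1.0
latents = pipe.vae.encode(img).latent_dist.sample() * 0.18215

# Deterministic forward pass (DDIM inversion), then plain DDIM backward.
prompt = "a horse galloping on a street"
xT, _ = ddim_invert(pipe, latents, prompt, num_steps=50)
out = pipe(prompt, latents=xT, guidance_scale=1.0, num_inference_steps=50)
out.images[0].save("reconstruction.png")  # should roughly match first.png

With guidance_scale=1.0 the backward pass should approximately reconstruct the input; once that works on images, the same inverted latents can be used for the video pipeline.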

adhityaswami commented 1 year ago

Hey, thanks for the response! I managed to figure out how to make the DDPM forward approach work like @PaulOrasan mentioned, and it's decent-ish. My next step is to implement a DDIM forward function to use instead of the DDPM forward pass that currently works. I'll try this out, but if there are any code snippets/samples or links that could help me implement the DDIM forward pass in the meantime, that would be super helpful.

By the way, this repo works wonderfully. Thanks a lot, @rob-hen!

adhityaswami commented 1 year ago

@rob-hen I was able to make this work. The reconstruction is pretty close, but not 100%, even when using SD-generated images. Thanks a lot for your help!

rob-hen commented 1 year ago

@adhityaswami here you see what you can expect from DDIM inversion.