Open PaulOrasan opened 1 year ago
I've now realised that my mistake was that I didn't do forward steps on the first frame latents.
The code is working fine now after doing 1000 forward steps on the first frame latents :). I'll try to post some results later.
Do you have any suggestions on how to tune hyperparameters such as motion field strength, t0 and t1 etc?
Hi @PaulOrasan,
motion field strength controls the global motion (camera translation). The parameters t0 and t1 control the perceived object motion (e.g., making a walking step): the larger the gap between t0 and t1, the more variance you will observe in the object motion.
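To make the "global motion" concrete, here is a toy sketch (my own illustration, not the repo's code) of translating a latent map by a per-frame offset with `torch.nn.functional.grid_sample`. In Text2Video-Zero-style warping, frame k would receive a shift roughly proportional to the motion field strength times k, so a larger strength means a faster camera pan:

```python
import torch
import torch.nn.functional as F

def warp_latents(latents: torch.Tensor, dx: float, dy: float) -> torch.Tensor:
    """Translate a latent map by (dx, dy) latent pixels via bilinear resampling.
    Toy illustration of a global-motion warp; not the repository's exact code."""
    b, c, h, w = latents.shape
    # Base sampling grid covering the image in [-1, 1] coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).repeat(b, 1, 1, 1)
    # Shifting the sampling grid left by dx moves the content right by dx.
    grid[..., 0] -= 2.0 * dx / (w - 1)
    grid[..., 1] -= 2.0 * dy / (h - 1)
    return F.grid_sample(latents, grid, align_corners=True, padding_mode="border")

# Hypothetical per-frame shift: strength * frame index.
strength = 1.0
x0 = torch.randn(1, 4, 64, 64)
frames = [warp_latents(x0, dx=strength * k, dy=0.0) for k in range(8)]
```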
Hey @PaulOrasan, how did you do the 1000 forward steps on the first frame latents? I have a very similar setup to you but am using ControlNet to generate a walk animation.
I added a method to turn an image into latents:

```python
import numpy as np
import PIL.Image
import torch
from diffusers import StableDiffusionControlNetPipeline


class StableDiffusionControlNetPipelineWithInitialImageSupport(
    StableDiffusionControlNetPipeline
):
    def encode_img_to_latents(self, image: PIL.Image.Image):
        img = np.array(image.convert("RGB"))
        img = img[None].transpose(0, 3, 1, 2)  # HWC -> NCHW with batch dim
        # Dividing by 127.5 (half of 255) maps pixel values to [0, 2];
        # subtracting 1 shifts them to [-1, 1], the range the VAE expects.
        img = torch.from_numpy(img).to(dtype=torch.float32) / 127.5 - 1.0
        latents = self.vae.encode(
            img.to(device=self.device, dtype=torch.float16)
        ).latent_dist.sample()
        # Magic number source: https://github.com/huggingface/diffusers/issues/437#issuecomment-1241827515
        return latents * 0.18215
```
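As a sanity check on that normalization, the [0, 255] → [-1, 1] mapping and its inverse round-trip exactly. A standalone sketch, independent of the VAE (function names are my own):

```python
import numpy as np
import torch

def preprocess(image_rgb: np.ndarray) -> torch.Tensor:
    """HWC uint8 in [0, 255] -> NCHW float32 in [-1, 1]."""
    img = image_rgb[None].transpose(0, 3, 1, 2)
    return torch.from_numpy(img).to(torch.float32) / 127.5 - 1.0

def postprocess(tensor: torch.Tensor) -> np.ndarray:
    """Inverse mapping: NCHW float32 in [-1, 1] -> HWC uint8."""
    img = ((tensor + 1.0) * 127.5).round().clamp(0, 255).to(torch.uint8)
    return img[0].permute(1, 2, 0).numpy()

rgb = np.random.randint(0, 256, size=(8, 8, 3), dtype=np.uint8)
assert np.array_equal(postprocess(preprocess(rgb)), rgb)
```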
Then, I consume it in model.py:

```python
def process_controlnet_pose(self):
    # ...
    image = PIL.Image.open('first.png')
    first_frame_latents = self.pipe.encode_img_to_latents(image)
    print(f"first frame latents size: {first_frame_latents.size()}")
    latents = torch.randn((1, 4, h // 8, w // 8), dtype=self.dtype,
                          device=self.device, generator=self.generator)
    latents = latents.repeat(f, 1, 1, 1)  # one latent per frame, f frames
    print(f"latents size: {latents.size()}")
    print(f"latents[0]: {latents[0].size()}")
    latents[0] = first_frame_latents  # overwrite frame 0 with the encoded image
```
with shapes:

```
video shape: torch.Size([19, 3, 512, 512])
first frame latents size: torch.Size([1, 4, 64, 64])
latents size: torch.Size([19, 4, 64, 64])
latents[0]: torch.Size([4, 64, 64])
```
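One likely cause of the blur described below is that `latents[0]` is a clean VAE encoding, while the sampler expects a fully noised latent at the starting timestep; that matches the mistake @PaulOrasan described fixing with forward steps. The DDPM forward process has a closed form, so the "1000 forward steps" can be done in one shot. A sketch with a generic linear beta schedule (my own stand-in, not the repo's scheduler):

```python
import torch

def ddpm_forward(x0: torch.Tensor, t: int, betas: torch.Tensor) -> torch.Tensor:
    """Closed-form DDPM forward: x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I),
    where abar_t is the cumulative product of (1 - beta). Equivalent in
    distribution to applying t sequential noising steps."""
    abar = torch.cumprod(1.0 - betas, dim=0)[t]
    noise = torch.randn_like(x0)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise

# Hypothetical 1000-step linear schedule (Stable Diffusion actually uses a
# "scaled linear" schedule; this is only for illustration).
betas = torch.linspace(1e-4, 0.02, 1000)
first_frame_latents = torch.randn(1, 4, 64, 64)  # stand-in for the VAE output
noised = ddpm_forward(first_frame_latents, t=999, betas=betas)
```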
However, the first frame is very blurry (like the examples you posted) and the second frame has a significant drift from the first:
https://user-images.githubusercontent.com/1147244/230458760-19fcf297-94a4-4e86-aa86-6d3e67cd8b1a.mp4
@rob-hen : Is there any recommendation for conditioning generation from the first frame? Checked the paper and all the code and couldn't find anything.
The forward step should be done using deterministic forward steps (not included in the current code), in order to recover the input image. For the text conditioning, some further adaption must be done (like null text inversion).
Thank you for the reply! Could you point me towards an implementation of deterministic forward steps? I have tried Googling extensively and also asking ChatGPT (but this stuff is too new for it).
Additionally, for null text conditioning are you referring to https://arxiv.org/pdf/2211.09794.pdf?
@Jaxkr The paper you found is correct. It also shows the deterministic DDIM inversion (which I was referring to).
https://huggingface.co/docs/diffusers/api/schedulers/ddim_inverse Working with this. Will let you know how it goes.
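`DDIMInverseScheduler` wraps this idea, but the core update is small enough to sketch by hand. Below is a toy pair of deterministic steps (my own sketch, not the repo's or diffusers' code): inversion predicts x0 from the current latent and the model's noise estimate, then re-noises to the *next* timestep; sampling does the same toward the *previous* one. With a real UNet the inversion is only approximate, since it reuses eps(x_t) where sampling would evaluate eps at the next latent:

```python
import torch

def ddim_invert_step(x_t, eps, abar_t, abar_next):
    """Deterministic DDIM forward (inversion) step: t -> t+1."""
    pred_x0 = (x_t - (1.0 - abar_t).sqrt() * eps) / abar_t.sqrt()
    return abar_next.sqrt() * pred_x0 + (1.0 - abar_next).sqrt() * eps

def ddim_sample_step(x_t, eps, abar_t, abar_prev):
    """Deterministic DDIM backward (sampling) step with eta = 0: t -> t-1."""
    pred_x0 = (x_t - (1.0 - abar_t).sqrt() * eps) / abar_t.sqrt()
    return abar_prev.sqrt() * pred_x0 + (1.0 - abar_prev).sqrt() * eps

abar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)
x0 = torch.randn(1, 4, 64, 64)
eps = torch.zeros_like(x0)  # stand-in for the UNet's noise prediction
x_mid = ddim_invert_step(x0, eps, abar[0], abar[500])
x_back = ddim_sample_step(x_mid, eps, abar[500], abar[0])
# With a fixed eps, inversion followed by sampling recovers x0 (up to float error).
```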
Hi @Jaxkr, sorry for replying late.
I did the DDPM forward because it produces good-quality results in terms of resolution (no blurriness). It does work for conditioning the video generation on the first frame, but, as expected, the guidance is weak: the video does not preserve the exact original image.
Depending on how closely related the provided image is to SD training data, it could produce pretty reasonable representations of the original image or totally divergent outputs.
I was planning to incorporate the null text inversion also mentioned by @rob-hen and see how it works out, but things are moving slow on my end due to time constraints. Let me know if it works out for you! :D
@PaulOrasan Hey, how's the performance after applying null-text inversion?
@PaulOrasan DDPM forward normally cannot be inverted, so you will not be able to reproduce the first frame. You should use DDIM forward and DDIM backward.
There should be no issues with quality or blurriness when using DDIM.
@zhouliang-yu hi there, sorry for replying so late, I've been away from this GitHub account. It's still a WIP and full of hard-coded stuff; I haven't had much time to attend to it. It is able to reconstruct the original image with the null-text optimization, but the generated output loses some of its consistency.
@rob-hen thanks for the tips. I only did DDPM forward initially to get it to generate good-quality samples, and that works. I'm now using DDIM forward and the null-text optimization for the DDIM backward.
What I noticed is that the output loses quite a lot of consistency (foreground objects tend to become a bit deformed). Do you have any intuition as to what might cause this? I have a feeling that I somehow messed up the cross-frame attention.
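For what it's worth, here is a toy sketch of the null-text idea (my own illustration of the approach from the paper linked above, not anyone's actual code): at each timestep, the unconditional "null" embedding is optimized so that the DDIM backward step lands on the latent recorded during inversion. `eps_model` is a hypothetical stand-in for the UNet, and classifier-free guidance is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def optimize_null_embedding(eps_model, x_t, z_target, null_emb,
                            abar_t, abar_prev, iters=300, lr=0.05):
    """Tune the null embedding so one DDIM backward step from x_t lands on
    z_target, the latent recorded during inversion at the previous timestep."""
    emb = null_emb.clone().requires_grad_(True)
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(iters):
        eps = eps_model(x_t, emb)
        pred_x0 = (x_t - (1.0 - abar_t).sqrt() * eps) / abar_t.sqrt()
        x_prev = abar_prev.sqrt() * pred_x0 + (1.0 - abar_prev).sqrt() * eps
        loss = F.mse_loss(x_prev, z_target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return emb.detach()

# Toy check: the "model" just returns the embedding, so the target is exactly
# reachable and optimization should drive the step error toward zero.
abar_t, abar_prev = torch.tensor(0.5), torch.tensor(0.9)
x_t = torch.randn(1, 4, 8, 8)
eps_model = lambda x, e: e

def ddim_backward(e):
    pred_x0 = (x_t - (1.0 - abar_t).sqrt() * e) / abar_t.sqrt()
    return abar_prev.sqrt() * pred_x0 + (1.0 - abar_prev).sqrt() * e

z_target = ddim_backward(torch.randn_like(x_t))
emb = optimize_null_embedding(eps_model, x_t, z_target,
                              torch.zeros_like(x_t), abar_t, abar_prev)
```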
Sorry to re-open this. I'm trying to condition a video using an initial image, but I'm struggling to figure out how to do either the DDPM or the DDIM forward passes. I've spent a ton of time trying to understand this, but seem to have hit a dead end. Can I have some help figuring this out? I'm currently passing my own latents and getting a blurry, weird video similar to the ones posted above in this thread.
Have you tried to do the forward DDIM and backward DDIM with vanilla stable diffusion (on images) ? Once you get the correct results, you should be able to use it for T2V-Zero.
Hey, thanks for the response! I managed to figure out how to make the DDPM forward thing work like @PaulOrasan mentioned, and it's decent-ish. My next step is to implement a DDIM forward function to use instead of the DDPM forward pass that currently works. I'll try this out, but if there's any code snippets/samples or links to help me implement the DDIM forward pass in the meantime that would be super helpful.
By the way, this repo works wonderfully thanks a lot @rob-hen !
@rob-hen I was able to make this work. The reconstruction is pretty close, but not 100%, even when using SD generated images. Thanks a lot for your help!
Hi!
This is a great work with amazing results, good job!
I was hoping you could provide some guidance on the following issue. I'm trying to condition the video generation by providing the first frame for text to video generation.
I noticed that the inference scheme for `TextToVideoPipeline` accepts `xT` as a parameter, so I've made some modifications to the code and I'm providing the latent encoding of the first frame instead of sampling random latents. I'm using the VAE and the SD preprocessing scheme. I've tested that encoding the frame and decoding the latents produces (practically) the same image, and it works.
My issue is that the full generation produces low-quality results, with a diagonal camera movement from left to right, low resolution, and weird "filters". I assume it has to do with the backward diffusion steps on the first frame, but I'm kind of stuck on what to do next.
I would really appreciate your input on this one. Thanks!
Here are snippets of the code I use to encode:
Results (all parameters are left to default of original text to video): Input image (text = "horse galloping"):
Produced output:
https://user-images.githubusercontent.com/41448014/229351697-9c7f7dbf-546e-4628-8e1e-5dc92e510c85.mp4
Input image (first frame of the previously generated gif, "a horse galloping on a street"):

![generated_horse](https://user-images.githubusercontent.com/41448014/229365437-993ae1c8-feb4-47ba-9807-0c5a9027d208.png)
Produced output:
https://user-images.githubusercontent.com/41448014/229365472-e23c676b-37a9-4715-8c66-18e0dcdb5362.mp4