Open Abhinay1997 opened 2 months ago
There were a couple of issues:
Both of them stem from the tensor shapes differing between latte t2v and cogvideo, viz. dim 1 holds channels in latte but frames in cog.
There's still a bug, but the images now look like this:
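For reference, a minimal sketch of the axis swap this implies. The layout strings and helper names are my own for illustration, and the layouts are assumed from the description above (channels at dim 1 for latte, frames at dim 1 for cog); it only works on shape tuples, so it is a quick way to double check the permutation before applying it to real tensors.

```python
# Assumed layouts, taken from the description above (not read from either codebase):
LATTE_LAYOUT = "BCFHW"  # batch, channels, frames, height, width
COG_LAYOUT = "BFCHW"    # batch, frames, channels, height, width


def permutation(src: str, dst: str) -> tuple:
    """Return the axis order that converts a tensor from layout `src` to `dst`."""
    if sorted(src) != sorted(dst):
        raise ValueError(f"layouts {src!r} and {dst!r} name different axes")
    return tuple(src.index(axis) for axis in dst)


def permute_shape(shape: tuple, perm: tuple) -> tuple:
    """Apply an axis permutation to a shape tuple."""
    return tuple(shape[i] for i in perm)


perm = permutation(LATTE_LAYOUT, COG_LAYOUT)
latte_shape = (1, 4, 16, 32, 32)          # hypothetical latent shape
cog_shape = permute_shape(latte_shape, perm)
print(perm, cog_shape)
```

On a real tensor the same `perm` could be passed to `tensor.permute(*perm)`. Since swapping dims 1 and 2 is its own inverse, the same call converts back.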
Things to try:
Instead of broadcasting across all frames in the scheduler step, use a loop as in the original implementation and see if it makes a difference. I'm not sure whether the scheduler auto-increments the timestep when called in a loop, so this is mainly to remove that uncertainty.
Run it with the simplest configuration, i.e. one partition and no lookahead denoising, and match every step against the algorithm in the paper.
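To make the first worry concrete, here is a toy sketch of what goes wrong if a scheduler keeps an internal step counter and is called once per frame. `CountingScheduler` is entirely hypothetical and is not the scheduler used in this pipeline; whether the real one carries such state needs to be checked on the actual class.

```python
class CountingScheduler:
    """Hypothetical toy scheduler that advances an internal counter on every step()."""

    def __init__(self, sigmas):
        self.sigmas = list(sigmas)
        self.step_index = 0

    def step(self, model_output, sample):
        sigma = self.sigmas[self.step_index]
        sigma_next = self.sigmas[self.step_index + 1]
        self.step_index += 1  # hidden state: every call moves to the next noise level
        return sample + (sigma_next - sigma) * model_output


def step_frames_naive(scheduler, outputs, samples):
    """Call step() once per frame; the counter drifts by one per frame."""
    return [scheduler.step(o, s) for o, s in zip(outputs, samples)]


def step_frames_safe(scheduler, outputs, samples):
    """Reset the counter before each frame so all frames use the same noise level."""
    start = scheduler.step_index
    result = []
    for o, s in zip(outputs, samples):
        scheduler.step_index = start
        result.append(scheduler.step(o, s))
    return result


sigmas = [1.0, 0.5, 0.25, 0.125, 0.0]
outputs, samples = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]

naive_sched = CountingScheduler(sigmas)
naive = step_frames_naive(naive_sched, outputs, samples)

safe_sched = CountingScheduler(sigmas)
safe = step_frames_safe(safe_sched, outputs, samples)
print(naive, safe)
```

In the naive loop each frame silently uses a different noise level, while the safe loop gives identical results for identical inputs. If the real scheduler looks everything up from the timestep argument instead, the loop and the broadcast should agree and this concern goes away.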
The output from the model is just plain noise. Pretty close to:
Currently debugging why that's the case. Need to check whether something went wrong when changing the logic from applying the same timestep to all frames to giving each frame its own timestep embedding in the tensor. One simple test is to implement the original pipe call from diffusers using the modified transformer and scheduler.
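A smaller check along the same lines: when every frame is given the same timestep, the per-frame path should reproduce the shared-timestep path exactly. The sketch below uses a plain sinusoidal embedding as a stand-in; the function names and the embedding itself are assumptions for illustration, not the model's actual embedding code.

```python
import math


def timestep_embedding(t, dim=8, max_period=10000.0):
    """Stand-in sinusoidal embedding for a single scalar timestep."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]


def embed_shared(t, num_frames, dim=8):
    """Old behaviour: one timestep, its embedding repeated for every frame."""
    emb = timestep_embedding(t, dim)
    return [list(emb) for _ in range(num_frames)]


def embed_per_frame(timesteps, dim=8):
    """New behaviour: each frame carries its own timestep."""
    return [timestep_embedding(t, dim) for t in timesteps]


shared = embed_shared(500, num_frames=4)
per_frame_same = embed_per_frame([500, 500, 500, 500])
per_frame_mixed = embed_per_frame([500, 400, 300, 200])

# Regression check: identical timesteps must give identical embeddings.
assert per_frame_same == shared
# Sanity check: different timesteps must actually change the embedding.
assert per_frame_mixed[1] != shared[1]
```

If the equivalent comparison fails on the real transformer, the bug is in the per-frame embedding change rather than in the scheduler.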