deforum-art / deforum-stable-diffusion

https://deforum.github.io/

Attention is all you need. Consistent video animation (simplified version of RERENDER A VIDEO: ZERO-SHOT TEXT-GUIDED VIDEO-TO-VIDEO TRANSLATION) #269

Open recoilme opened 1 year ago

recoilme commented 1 year ago

https://anonymous-31415926.github.io/ - I looked at this paper (RERENDER A VIDEO: ZERO-SHOT TEXT-GUIDED VIDEO-TO-VIDEO TRANSLATION).

The bottom line is that each frame takes as input the previous frame plus a certain cross-frame carrying textures, colors and shapes. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors.

In short, I didn't really understand much of it, except that if you put in the effort, you can get an awesomely consistent video.

I set up Deforum with a bunch of ControlNet models and started experimenting. First I came up with the following combination of models:

Generation speed dropped from 6 iterations/sec to 1. However, the authors of the paper write that they also need 16 GB of video memory.

But the generations were still inconsistent. Then I threw a reference ControlNet onto the very first frame:

And then a miracle happened. The girl stopped mutating like crazy:

https://github.com/deforum-art/deforum-stable-diffusion/assets/417177/79428ba9-c9d0-4dcd-b3f7-402889ff3071

I decided to test the theory that it's not really the number of models that matters, but their quality. You have to give two slightly different reference signals so that the model understands what ties the frames together. I threw out OpenPose, SoftEdge and reference_adain and left only two ControlNet units, both in reference mode. The first takes the last generated frame; the second takes the very first frame. And that combination gave even more consistent results. And by the way, the speed doubled to 2 iterations per second, because it's 2 models instead of five:

https://github.com/deforum-art/deforum-stable-diffusion/assets/417177/b3d16ab9-1e89-4318-a05c-1a30292c2fe5
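To make the setup concrete, here is a rough plain-Python summary of those two units (a minimal sketch; the key names are illustrative, not the actual Deforum/ControlNet settings keys, so check the settings file attached below for the real ones):

```python
# Illustrative summary only: these dict keys are NOT real Deforum settings
# names, just a compact way to describe the two reference units.
controlnet_units = [
    {
        "preprocessor": "reference_only",        # reference mode
        "image_source": "last_generated_frame",  # updated on every frame
        "weight": 1.0,                           # assumed value, tune to taste
    },
    {
        "preprocessor": "reference_only",
        "image_source": "first_frame",           # fixed anchor for the whole clip
        "weight": 1.0,
    },
]
```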

Actually, I think it's even simpler than that. You need to give the model two frames, the last and the penultimate, so that it rolls the animation forward over them. That way you don't have to adjust each frame by hand, and it smooths things out more properly, because the difference between the first and the last frame grows with the distance the animation has covered. In general, you get a ZERO-SHOT VIDEO-TO-VIDEO built out of sticks and duct tape, just the way we like it. Deforum settings for those who want to repeat it are attached: deforum_settings_cntrl (1).txt
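Here is a minimal sketch of that loop in Python. `generate_frame` is a hypothetical stand-in for whatever backend runs the img2img step with the reference ControlNets (Deforum's render step, an A1111 API call, etc.); it is not an existing Deforum function.

```python
from PIL import Image


def generate_frame(prompt: str, init_image: Image.Image,
                   reference_images: list, denoising_strength: float) -> Image.Image:
    """Hypothetical wrapper: one img2img step guided by reference ControlNets."""
    raise NotImplementedError("plug in your own backend here")


def render_animation(prompt: str, first_frame: Image.Image,
                     n_frames: int, denoising_strength: float = 0.5) -> list:
    frames = [first_frame]
    for _ in range(1, n_frames):
        last = frames[-1]                                      # reference unit 1
        penultimate = frames[-2] if len(frames) > 1 else last  # reference unit 2
        new_frame = generate_frame(
            prompt=prompt,
            init_image=last,  # the usual Deforum-style loopback init
            reference_images=[last, penultimate],
            denoising_strength=denoising_strength,
        )
        frames.append(new_frame)
    return frames
```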

And to understand the difference, here is the version without the ControlNet, when the model starts to dance (everything knocked down to zero):

https://github.com/deforum-art/deforum-stable-diffusion/assets/417177/7489f9a3-4f44-4363-8d3e-ecc823687c6e

I haven't tested video2video with masks and so on, but I'm sure it should work. So, attention is all you need: just add two frames instead of one.

https://dump.video/i/B1PLxztF.mp4

recoilme commented 1 year ago

My experiment uses a dynamic strength for each frame (driven by the music amplitude). When I fix the strength, it glitches. I think we need more experiments. Ideally we should apply a big weight to the last frame and a smaller one to the previous ones.

For example, the current denoising strength is 0.84.

Maybe:

last frame: denoising strength 0.75
last frame -1: denoising strength 0.5
last frame -2: denoising strength 0.25
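A tiny sketch of what that falloff could look like (numbers taken straight from the lines above; just an idea, nothing in Deforum implements it yet):

```python
def reference_strengths(n_previous: int, base: float = 0.75,
                        falloff: float = 0.25) -> list:
    """Denoising strength per reference frame: newest frame first, older frames weaker."""
    return [max(base - falloff * i, 0.0) for i in range(n_previous)]


print(reference_strengths(3))  # [0.75, 0.5, 0.25]
```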

deforum commented 1 year ago

If you would like to contribute to the repo, we can try merging your changes.

Vendaciousness commented 1 year ago

Just wanted to mention that for this Loopback ControlNet to work in video2video, you need to drop the Strength Schedule to 0:(0) or near 0, or it all goes wrong. If your video looks like someone moving around behind the wallpaper, that's the reason. Here is even 0.15 strength schedule with CN Loopback on:

https://github.com/deforum-art/deforum-stable-diffusion/assets/127359965/6fd07521-6b35-4f5f-b105-93838eda090b
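For reference, the strength schedule in the Deforum settings is a keyframe string; keeping it at or near zero for this setup looks something like this (the 0.15 value is just the failure case from my clip above, not a recommendation):

```python
# Deforum keyframe schedules use "frame: (value)" syntax.
strength_schedule = "0: (0.0)"     # let the ControlNet loopback drive the look
# strength_schedule = "0: (0.15)"  # even this much already gives the "wallpaper" effect
```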

And by the way, if anyone knows how to get someone smoking, hit me up, because I could not do it, even with advanced prompt engineering and LoRA models of smoking.

-V

Vendaciousness commented 1 year ago

I think the results in that paper could be replicated with Loopback, but even better would be a "Use previous image" option, which would let you use the last generated frame as the ControlNet image. Think of all the ways you could blend frames if we could use the previous image as input... Loopback is a good preview of the possibilities, but, for example, we could use Tile at 0.75 the same way we use Strength Schedule, I think. Maybe this is a bit redundant with loopback available, but it would give us multiplicatively more ways to control how the previous frame affects the next one.
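One way the blending part of that idea could look, as a minimal sketch (assuming PIL images and a hypothetical per-frame hook; none of this exists in Deforum today):

```python
from PIL import Image


def controlnet_input(prev_output: Image.Image, video_frame: Image.Image,
                     blend: float = 0.5) -> Image.Image:
    """Blend the previous generated frame into the incoming video frame and use
    the result as the ControlNet image (e.g. for a Tile unit at ~0.75 weight)."""
    return Image.blend(video_frame.convert("RGB"), prev_output.convert("RGB"), blend)
```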