Trentonom0r3 / After-Diffusion

A CEP Extension for Adobe After Effects that allows for seamless integration of the Stable Diffusion Web-UI.

AHHA! I've figured out temporal Coherence Directly in AE... Kind of. #21

Open Trentonom0r3 opened 1 year ago

Trentonom0r3 commented 1 year ago

EBSynth makes things quick and easy, but it often fails with more complex motion and runs into incoherence due to the nature of keyframes. Even when generating as a grid, there are often still temporal coherence issues to a degree.

The following method is quite a bit more involved, but I believe that, if utilized properly, it could produce some incredible results! (I have some newer tests to share that I'll upload later.)

At its core, it feels like a variation of old-school roto-animation, but with a lot of newer bells and whistles.

Roto-animation is the process of drawing/painting over live-action frames to create an animated/drawn version of the original footage. (It's also used as a guide for motion, actions, etc.)

AE has a great rotoscoping tool and provides access to Mocha. Combined with Content-Aware Fill, these tools are highly valuable.

Here is what I call the Iterative Roto-Fill Process, or IRFP for short:

These steps can be done using EBSynth as well, but I find performing them in AE a bit more streamlined and integrated.

Essentially, you rotoscope all the important areas of your input and split it into sections.

For example: say you have an input of a person talking and moving their hands. You'd create, at bare minimum, a mask/roto for the head, the hands, the torso, and/or the legs.

For further refinement, you can mask out smaller areas of the face and break it into chunks: nose, eyes, mouth, forehead, etc.
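If you'd rather script the splitting step than duplicate layers by hand, here's a rough ExtendScript sketch of the idea. The region names and the placeholder rectangle are my own assumptions for illustration; you'd still draw or refine each mask manually (or with the Roto Brush):

```js
// Rough sketch: duplicate the footage layer once per region and add an
// empty placeholder mask to each copy, ready to be refined by hand.
var comp = app.project.activeItem; // assumes the active item is a comp
var source = comp.layer(1);        // assumes the footage is layer 1
var regions = ["head", "hands", "torso", "legs"];

app.beginUndoGroup("IRFP region split");
for (var i = 0; i < regions.length; i++) {
    var patch = source.duplicate();
    patch.name = source.name + " - " + regions[i];

    // Placeholder rectangular mask; replace its path with the real roto.
    var mask = patch.property("ADBE Mask Parade").addProperty("ADBE Mask Atom");
    mask.name = regions[i];
    var shape = new Shape();
    shape.vertices = [[0, 0], [0, 200], [200, 200], [200, 0]];
    shape.closed = true;
    mask.property("ADBE Mask Shape").setValue(shape);
}
app.endUndoGroup();
```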

For each smaller patch, you'll run Content-Aware Fill over that area.

For larger areas with minimal motion (such as the torso), you can use a single keyframe and get great results.

For areas with greater motion, you'd create 4-5 keyframes (I've found that 7-8 gives a great result) and run Content-Aware Fill.

You repeat this process for each divided section of your input. Then, on your completed fills, iterate through the areas where you find inconsistencies: if it's an area with larger motion, you probably need more keyframes; if it's an area such as the forehead or cheeks, where there's slight motion but nothing on the level of mouth or eye movement, you can use a keyframe or two to enhance the coherence.
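If it helps to see that rule of thumb written down, here's a tiny sketch of how I'd space those keyframes evenly across a patch's duration. The motion tiers and counts just mirror the numbers above; nothing here is enforced by AE or the extension:

```js
// Sketch: evenly spaced times for a patch's reference keyframes.
// Tier counts mirror the guidelines above (1 for near-static areas,
// ~5 for moderate motion, 7-8 for heavy motion).
function keyframeTimes(inPoint, outPoint, motion) {
    var counts = { low: 1, medium: 5, high: 8 };
    var n = counts[motion] || 5;
    var times = [];
    for (var i = 0; i < n; i++) {
        // Include both the first and last frame of the patch.
        times.push(n === 1 ? inPoint : inPoint + (outPoint - inPoint) * i / (n - 1));
    }
    return times;
}

// e.g. an 8-second head patch with heavy motion:
keyframeTimes(0, 8, "high"); // [0, 1.14, 2.29, ..., 8]
```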

By iterating through the patches like this, you have a lot more control over how the final output looks, and can more easily fix inconsistencies.

For further refinement, using Mocha AE to track and break up the patches more accurately can lead to an even more coherent result.

After a final pass, you can use the facial tracking data from your input to warp your stylized video even further.

This is still a workflow in progress, but each new method discovered is leading to better and better results.

I'll be posting an example I made using this method later tonight!

Originally posted by @Trentonom0r3 in https://github.com/Trentonom0r3/After-Diffusion/issues/1#issuecomment-1627489976

Using multiple keyframes and taking time to further refine masks and positioning will lead to a better output. Here's a simple example of this method I threw together in about 10-15 minutes: img2img, 0.8 denoising strength, ControlNets used: SoftEdge HED, Depth MiDaS, OpenPose (face only).

Head was done using 4 keyframes:

https://user-images.githubusercontent.com/130304830/252121533-709a8830-fbfc-4707-9c04-2e36baaaf82c.mp4
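For anyone driving this through the Web-UI's API rather than the UI, a single frame with settings like the above would look roughly like this. This is just a sketch against A1111's /sdapi/v1/img2img endpoint from the panel's JavaScript side; the URL, prompt, and ControlNet model names are placeholder assumptions:

```js
// Sketch: stylize one frame via the A1111 Web-UI API with the three
// ControlNet units mentioned above. Assumes the Web-UI was started
// with --api and that fetch is available (it is in CEP's Chromium).
async function stylizeFrame(frameBase64) {
    var payload = {
        init_images: [frameBase64],  // base64-encoded input frame
        prompt: "your style prompt", // placeholder
        denoising_strength: 0.8,
        alwayson_scripts: {
            controlnet: {
                args: [
                    { module: "softedge_hed",      model: "control_v11p_sd15_softedge" },
                    { module: "depth_midas",       model: "control_v11f1p_sd15_depth" },
                    { module: "openpose_faceonly", model: "control_v11p_sd15_openpose" }
                ]
            }
        }
    };
    var res = await fetch("http://127.0.0.1:7860/sdapi/v1/img2img", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(payload)
    });
    var data = await res.json();
    return data.images[0]; // base64 PNG of the stylized frame
}
```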

Trentonom0r3 commented 1 year ago

After a second pass:

https://github.com/Trentonom0r3/After-Diffusion/assets/130304830/3f9d0fd8-15a9-4c94-b246-659202660200

Trentonom0r3 commented 1 year ago

After a third pass, plus deflicker, interpolation, and upscaling:

https://github.com/Trentonom0r3/After-Diffusion/assets/130304830/78bc697c-ea65-436b-b5fd-0618a1f1f0c7