WARNING! Many animated GIFs, ≈9 Mb each.

It's not an issue for 16-20 frames, but anything longer often looks like it consists of two quite different parts. I enabled token padding as suggested, but it doesn't seem to improve the situation much (maybe it addresses a different issue). The best consistency improvers are a higher CFG (9+) and more steps (30), but for longer videos they're still not enough. A higher CFG (12-14) also often introduces light flashes and generally unstable lighting.
Settings:
I use a fine-tuned human-motion model based on mm_v15_v2. The same issues arise with the vanilla v15_v2.
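For reference, here's a minimal sketch of roughly these settings expressed with the diffusers AnimateDiffPipeline. It's not my exact setup: the checkpoint IDs, prompt and seed are placeholders, I'm actually running a fine-tuned motion module rather than the stock adapter, and the context/overlap scheduling discussed below is handled by the frontend, not by this vanilla call. It just shows the knobs I'm talking about (frames, CFG, steps) in code form.

```python
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# Stock v1-5-2 motion adapter as a stand-in for the fine-tuned human-motion module.
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", motion_adapter=adapter, torch_dtype=torch.float16
)
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, beta_schedule="linear", clip_sample=False
)
pipe.enable_vae_slicing()
pipe.to("cuda")

output = pipe(
    prompt="a person sitting on a park bench, photorealistic",  # placeholder prompt
    negative_prompt="low quality, deformed",
    num_frames=32,                 # anything past ~20 frames is where the drift shows up
    guidance_scale=9.0,            # higher CFG helps consistency; 12-14 starts flashing
    num_inference_steps=30,
    generator=torch.Generator("cpu").manual_seed(42),
)
export_to_gif(output.frames[0], "result.gif")
```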
Result for 20 frames:
Quite good and stable. Now 32 frames, overlap -1 (i.e. 4):
Everything is morphing, including the character, who sits differently in the first and second halves of the video.
Same 32 frames, overlap 6:
Slightly better, at least the background isn't as chaotic. The character is still not very stable.
Same 32 frames, overlap 8:
Same 32 frames, overlap 10:
Getting somewhere: the morphing is still there, but it isn't as bad as in the beginning.
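To make concrete what I mean by overlap, here's a rough sketch of how I understand the sliding-window scheme (hypothetical helper names and uniform blending weights, not the actual implementation): the clip is split into 16-frame context windows, each window is denoised by the motion module, and the results are averaged on the frames that neighbouring windows share. A bigger overlap means more frames are seen by more than one window, so the halves of the clip stay coupled.

```python
import numpy as np

def sliding_windows(num_frames: int, context: int = 16, overlap: int = 10):
    """Yield overlapping frame-index windows covering the whole clip."""
    stride = context - overlap
    start = 0
    while True:
        end = min(start + context, num_frames)
        yield list(range(end - context, end))  # keep every window full length, inside the clip
        if end == num_frames:
            break
        start += stride

def blend_windows(window_latents, windows, num_frames):
    """Average per-window latents on shared frames (uniform weights for simplicity)."""
    # window_latents: list of arrays shaped (context, C, H, W), one per window
    acc = np.zeros((num_frames,) + window_latents[0].shape[1:], dtype=np.float32)
    counts = np.zeros(num_frames, dtype=np.float32)
    for lat, idx in zip(window_latents, windows):
        acc[idx] += lat
        counts[idx] += 1
    return acc / counts[:, None, None, None]

# 32 frames, 16-frame context: compare how much the windows share at overlap 4 vs 10.
for ov in (4, 10):
    spans = [(w[0], w[-1]) for w in sliding_windows(32, overlap=ov)]
    print(f"overlap {ov}: {spans}")
```

With 32 frames and a 16-frame context, overlap 10 gives windows (0-15), (6-21), (12-27), (16-31), while overlap 4 couples the first two windows on only 4 frames, which lines up with what I'm seeing above. Presumably the real implementation blends noisy latents at every denoising step and may taper the weights instead of averaging uniformly.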
Anything else I'm missing? Is it possible to somehow enforce better context preservation for longer videos (twice the context size and more), or is that a fundamental limitation of the current tech? I'm currently not much interested in vid2vid, only in txt2vid, though I know that guiding inference with a video should yield much better results.