WARNING! Many animated GIFs, ≈9 Mb each.

It's not an issue for 16-20 frames, but anything longer often looks like it consists of two quite different parts. I enabled token padding as suggested, but it doesn't seem to improve the situation much (maybe it addresses a different issue). The best consistency improvers are a higher CFG (9+) and more steps (30), but for longer videos they're still not enough. A higher CFG (12-14) also often introduces light flashes and generally unstable lighting.
Settings:
I use a fine-tuned human-motion model based on mm_v15_v2. The same issues arise with the vanilla v15_v2.
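For reference, here's a minimal sketch of roughly these settings expressed with the diffusers AnimateDiffPipeline. It's not my exact setup: the checkpoint IDs, prompt and seed are placeholders, I'm actually running a fine-tuned motion module rather than the stock adapter, and the context/overlap scheduling discussed below is handled by the frontend, not by this vanilla call. It just shows the knobs I'm talking about (frames, CFG, steps) in code form.

```python
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# Stock v1-5-2 motion adapter as a stand-in for the fine-tuned human-motion module.
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", motion_adapter=adapter, torch_dtype=torch.float16
)
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, beta_schedule="linear", clip_sample=False
)
pipe.enable_vae_slicing()
pipe.to("cuda")

output = pipe(
    prompt="a person sitting on a park bench, photorealistic",  # placeholder prompt
    negative_prompt="low quality, deformed",
    num_frames=32,                 # anything past ~20 frames is where the drift shows up
    guidance_scale=9.0,            # higher CFG helps consistency; 12-14 starts flashing
    num_inference_steps=30,
    generator=torch.Generator("cpu").manual_seed(42),
)
export_to_gif(output.frames[0], "result.gif")
```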
Result for 20 frames:
Quite good and stable. Now 32 frames, overlap -1 (i.e. 4):
Everything is morphing, including the character, who sits differently in the first and second halves of the video.
Same 32 frames, overlap 6:
Slightly better, at least the background isn't as chaotic. The character is still not very stable.
Same 32 frames, overlap 8:
Same 32 frames, overlap 10:
Getting somewhere: the morphing is still there, but it isn't as bad as in the beginning.
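To make concrete what I mean by overlap, here's a rough sketch of how I understand the sliding-window scheme (hypothetical helper names and uniform blending weights, not the actual implementation): the clip is split into 16-frame context windows, each window is denoised by the motion module, and the results are averaged on the frames that neighbouring windows share. A bigger overlap means more frames are seen by more than one window, so the halves of the clip stay coupled.

```python
import numpy as np

def sliding_windows(num_frames: int, context: int = 16, overlap: int = 10):
    """Yield overlapping frame-index windows covering the whole clip."""
    stride = context - overlap
    start = 0
    while True:
        end = min(start + context, num_frames)
        yield list(range(end - context, end))  # keep every window full length, inside the clip
        if end == num_frames:
            break
        start += stride

def blend_windows(window_latents, windows, num_frames):
    """Average per-window latents on shared frames (uniform weights for simplicity)."""
    # window_latents: list of arrays shaped (context, C, H, W), one per window
    acc = np.zeros((num_frames,) + window_latents[0].shape[1:], dtype=np.float32)
    counts = np.zeros(num_frames, dtype=np.float32)
    for lat, idx in zip(window_latents, windows):
        acc[idx] += lat
        counts[idx] += 1
    return acc / counts[:, None, None, None]

# 32 frames, 16-frame context: compare how much the windows share at overlap 4 vs 10.
for ov in (4, 10):
    spans = [(w[0], w[-1]) for w in sliding_windows(32, overlap=ov)]
    print(f"overlap {ov}: {spans}")
```

With 32 frames and a 16-frame context, overlap 10 gives windows (0-15), (6-21), (12-27), (16-31), while overlap 4 couples the first two windows on only 4 frames, which lines up with what I'm seeing above. Presumably the real implementation blends noisy latents at every denoising step and may taper the weights instead of averaging uniformly.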
Anything else I'm missing? Is it possible to somehow enforce better context preservation for longer videos (twice the context size and more), or is that a fundamental limitation of the current tech? I'm currently not much interested in vid2vid, only in txt2vid, though I know that guiding inference with a video should yield much better results.