Open G-force78 opened 5 days ago
I am also looking for the same information
Please check https://github.com/kijai/ComfyUI-MimicMotionWrapper (14 GB VRAM).
Not sure it will fit given how tight the VRAM budget is, but there are still a number of optimisation opportunities outstanding. These are the files I have optimised so far. How can I share them? I tried to attach them here without success.
One way to reduce VRAM usage is to run CFG in separate batches. Currently, the implementation processes CFG as a batch of 2 samples, which leads to higher peak GPU memory consumption. https://github.com/Tencent/MimicMotion/blob/0af6dab3e2f816717fa56815f1d7a1ba22375050/mimicmotion/pipelines/pipeline_mimicmotion.py#L608 To address this, you can use two forward passes: one conditional and one without guidance.
For the 72 frame model, the GPU usage in this case is ~21.1G.
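The two-pass idea can be sketched as follows. `unet_fn` here is a hypothetical stand-in for the pipeline's actual UNet call (the real MimicMotion interface takes more arguments); the point is only that the two passes run sequentially, so the activations of the first are freed before the second starts:

```python
import torch

def cfg_two_pass(unet_fn, latents, guidance_scale):
    """Classifier-free guidance as two sequential forward passes instead of
    one batch of 2, roughly halving peak activation memory.

    unet_fn: callable (latents, conditioned: bool) -> noise prediction tensor.
    """
    # Unconditional pass first; its activations are released before the next call.
    noise_uncond = unet_fn(latents, conditioned=False)
    # Conditional pass.
    noise_cond = unet_fn(latents, conditioned=True)
    # Standard CFG combination.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```

The trade-off is two UNet calls per step instead of one batched call, so it is slower, but peak memory no longer scales with the CFG batch of 2.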
Thank you for the suggestion regarding separate batches for classifier-free guidance. I've implemented this optimization along with a few others, and it has indeed helped reduce VRAM usage during the main processing phase. However, I'm still encountering some challenges:
- **Memory spike at the end:** Even with these optimizations, I'm observing a dramatic increase in memory usage towards the end of the process, particularly during the decoding phase.
- **Video length constraints:** I initially attempted a 35-second video, which resulted in an out-of-memory error during decoding. Reducing to 5 seconds at default settings brought memory usage down to about 14 GB, but it still spikes near the end, barely avoiding another OOM error.
- **Quality concerns:** With the default settings, the output video quality appears to be significantly lower than expected, especially compared to the demo results.
- **Balancing quality and stability:** I'm experimenting with higher step counts to improve quality, but this increases the risk of running out of memory at the final stages.
The ability to replicate the demo quality while maintaining stability would be a game-changer for this project. Do you have any insights on how to address the memory spike during decoding or suggestions for settings that might better balance quality and memory usage? I'm particularly interested in understanding how the demo results were achieved and if there are any specific optimizations used for longer, higher-quality outputs.
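A common mitigation for a decode-stage spike is to decode the latents a few frames at a time rather than all at once, so the decoder's peak activation memory scales with the chunk size instead of the total frame count. This is only a sketch under the assumption that the VAE exposes a simple `decode(latents)` call; the pipeline's actual VAE interface may differ:

```python
import torch

def decode_in_chunks(vae, latents, chunk_size=8):
    """Decode video latents chunk_size frames at a time (frames on dim 0).

    Peak decoder memory is bounded by one chunk rather than the whole clip.
    """
    frames = []
    for i in range(0, latents.shape[0], chunk_size):
        with torch.no_grad():
            frames.append(vae.decode(latents[i:i + chunk_size]))
        # Return the chunk's intermediate buffers to the allocator
        # (no-op when CUDA is not initialized).
        torch.cuda.empty_cache()
    return torch.cat(frames, dim=0)
```

Smaller `chunk_size` lowers the peak at the cost of more decode calls; the decoded frames can also be moved to CPU inside the loop if even the output tensor is too large to keep on the GPU.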
For the decoding stage, the VAE may consume a lot of memory. The decoded 35 s output video alone may use 1024 × 576 × 15 (fps) × 35 (s) × 3 (RGB) × 4 (bytes) ≈ 3.7 GB. The pose latents alone will use 128 × 72 × 15 × 35 × 320 × 4 × 2 (cfg) ≈ 12.4 GB of VRAM.
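The arithmetic checks out (fp32, 4 bytes per element, decimal GB = 10⁹ bytes; the 128 × 72 × 320 latent shape is taken from the figures above):

```python
# Back-of-envelope VRAM estimates for the decode stage.
decode_bytes = 1024 * 576 * 15 * 35 * 3 * 4      # decoded RGB frames, 35 s @ 15 fps
pose_bytes = 128 * 72 * 15 * 35 * 320 * 4 * 2    # pose latents with a CFG batch of 2

print(f"decoded frames: {decode_bytes / 1e9:.1f} GB")  # ~3.7 GB
print(f"pose latents:   {pose_bytes / 1e9:.1f} GB")    # ~12.4 GB
```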
I have several thoughts on how to further reduce VRAM. One is to free the pose latents as soon as they are no longer needed:

```python
del pose_latents
with torch.cuda.device(device):
    torch.cuda.empty_cache()
```
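A minimal, self-contained illustration of that pattern; the tensor shape here is a made-up stand-in, not the pipeline's actual pose-latent shape. Note that `del` only removes the current reference, so it frees memory only once no other reference to the tensor remains:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical large intermediate (stand-in shape for illustration only).
pose_latents = torch.zeros(2, 320, 72, 16, 16, device=device)

before = torch.cuda.memory_allocated() if device == "cuda" else 0
del pose_latents                      # drop the last reference to the tensor
if device == "cuda":
    with torch.cuda.device(device):
        torch.cuda.empty_cache()      # hand cached blocks back to the driver
after = torch.cuda.memory_allocated() if device == "cuda" else 0
```

Without `empty_cache()`, PyTorch's caching allocator keeps the freed blocks for reuse, which is usually fine; returning them matters mainly when another stage (like the VAE decode) needs one large contiguous allocation.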
Update: I successfully ran the 72-frame model (35 s video) on a 4060 Ti with 16 GB VRAM. I plan to submit a pull request shortly.
I assume shorter reference videos help, for one. How about smaller resolutions? Can it go below 576 and a 9:16 aspect ratio?