Tencent / MimicMotion

High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance
https://tencent.github.io/MimicMotion/

Any way to reduce VRAM? #21

Open G-force78 opened 5 days ago

G-force78 commented 5 days ago

I assume shorter reference videos are one way. What about a smaller resolution? Can it go below 576 and 9:16?

nitinmukesh commented 5 days ago

I am also looking for the same information

SlimeVRX commented 5 days ago

Please check https://github.com/kijai/ComfyUI-MimicMotionWrapper, which runs in about 14 GB of VRAM.

JuvenileLocksmith commented 1 day ago

[screenshot attached]

Not sure I will survive given how tight it is, but there are still a number of optimisation opportunities outstanding. These are the files I have optimised so far; not sure how to share them, I tried to attach without joy.

[screenshot attached]

zyayoung commented 1 day ago

One way to reduce VRAM usage is to run classifier-free guidance (cfg) in separate batches. Currently, the implementation processes cfg as a single batch of 2 samples, which can lead to higher peak GPU memory consumption: https://github.com/Tencent/MimicMotion/blob/0af6dab3e2f816717fa56815f1d7a1ba22375050/mimicmotion/pipelines/pipeline_mimicmotion.py#L608 To address this, you can use two forward passes, one with the guidance condition and one without, as in the sketch below.
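A minimal sketch of that change, assuming a diffusers-style SVD denoising loop; every name here (`unet`, `cond_emb`, `uncond_emb`, `added_time_ids`, `guidance_scale`) is an illustrative stand-in for the corresponding object in `pipeline_mimicmotion.py`, not the actual code:

```python
import torch

def cfg_two_pass(unet, latents, t, cond_emb, uncond_emb, added_time_ids, guidance_scale):
    """Run classifier-free guidance as two sequential UNet calls instead of a
    single batched call of size 2, roughly halving peak activation memory.

    All argument names are illustrative stand-ins for the objects used inside
    pipeline_mimicmotion.py; the real UNet call signature may differ.
    """
    with torch.no_grad():
        # Pass 1: unconditional (guidance inputs dropped).
        noise_uncond = unet(
            latents, t,
            encoder_hidden_states=uncond_emb,
            added_time_ids=added_time_ids,
        ).sample
        # Pass 2: conditional.
        noise_cond = unet(
            latents, t,
            encoder_hidden_states=cond_emb,
            added_time_ids=added_time_ids,
        ).sample
    # Standard CFG combination of the two predictions.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```

The trade-off is two UNet invocations per denoising step instead of one batched invocation, so each step is somewhat slower, but the peak activation footprint is roughly halved.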

For the 72-frame model, GPU usage in this case is ~21.1 GB.

JuvenileLocksmith commented 1 day ago

Thank you for the suggestion regarding separate batches for classifier-free guidance. I've implemented this optimization along with a few others, and it has indeed helped reduce VRAM usage during the main processing phase. However, I'm still encountering some challenges:

- Memory spike at the end: Even with these optimizations, I'm observing a dramatic increase in memory usage towards the end of the process, particularly during the decoding phase.
- Video length constraints: I initially attempted a 35-second video, which resulted in an out-of-memory error during decoding. Reducing to 5 seconds at default settings brought memory usage down to about 14 GB, but it still spikes near the end, barely avoiding another OOM error.
- Quality concerns: With the default settings, the output video quality appears to be significantly lower than expected, especially compared to the demo results.
- Balancing quality and stability: I'm experimenting with higher step counts to improve quality, but this increases the risk of running out of memory at the final stages.

The ability to replicate the demo quality while maintaining stability would be a game-changer for this project. Do you have any insights on how to address the memory spike during decoding or suggestions for settings that might better balance quality and memory usage? I'm particularly interested in understanding how the demo results were achieved and if there are any specific optimizations used for longer, higher-quality outputs.
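Not specific to this repo, but a generic PyTorch pattern for pinning down where the spike happens: reset the allocator's peak counter just before the decode and read it afterwards.

```python
import torch

# Reset the allocator's high-water mark just before the suspect stage...
torch.cuda.reset_peak_memory_stats()

# ...run the decoding step here...

# ...then read how much was actually allocated at the peak.
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak allocated during decode: {peak_gb:.2f} GB")
```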

[screenshots attached]
zyayoung commented 15 hours ago

For the decoding stage, the VAE may consume a lot of memory. The 35 s output video alone may use 1024 × 576 × 15 (fps) × 35 (s) × 3 (RGB) × 4 (bytes) ≈ 3.7 GB, and the pose latents alone will use 128 × 72 × 15 × 35 × 320 × 4 × 2 (cfg) ≈ 12.4 GB of VRAM.
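The same back-of-the-envelope arithmetic, written out as a quick check (decimal GB):

```python
# Decoded 35 s clip held as float32 RGB frames at 1024x576, 15 fps.
video_bytes = 1024 * 576 * 15 * 35 * 3 * 4
# Pose latents, mirroring the factors in the estimate above
# (128*72 latent grid, 15 fps * 35 s, 320, float32, x2 for cfg).
pose_latent_bytes = 128 * 72 * 15 * 35 * 320 * 4 * 2

print(video_bytes / 1e9)        # ~3.7
print(pose_latent_bytes / 1e9)  # ~12.4
```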

I have several thoughts to further reduce VRAM (a combined sketch follows below):

  1. Unload the pose latents at the end of the denoising stage: `del pose_latents`.
  2. Try `torch.cuda.empty_cache()` before the decoding stage:
     `with torch.cuda.device(device): torch.cuda.empty_cache()`
  3. If that still does not work, you may try decoding on the CPU, since this step is not very computationally demanding.
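
A combined sketch of the three ideas; the names (`pose_latents`, `latents`, `vae`, `device`, `num_frames`) mirror `pipeline_mimicmotion.py` but may not match it exactly:

```python
import torch

# Sketch of where these lines would sit inside the pipeline's __call__,
# right after the denoising loop; the variable names are stand-ins.

# 1. Drop the last reference to the pose latents once denoising is done.
del pose_latents

# 2. Hand cached allocator blocks back to the driver before VAE decoding.
with torch.cuda.device(device):
    torch.cuda.empty_cache()

# 3. Fallback: decode on the CPU if the VAE still runs out of GPU memory.
#    Slow, but this step is not compute-bound.
vae = vae.to("cpu")
latents = latents.to("cpu", dtype=torch.float32)
# Decode call shaped after diffusers' temporal-decoder VAE; the exact
# arguments used in this repo may differ.
frames = vae.decode(latents, num_frames=num_frames).sample
```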
zyayoung commented 12 hours ago

Update: I successfully ran the 72-frame model (35 s video) on a 4060 Ti with 16 GB of VRAM. I plan to submit a pull request shortly.

[screenshot attached]