Closed ChanHyeok-Choi closed 5 days ago
Surprising.
I use it all the time on a 4060 8 GB / Windows 11 and have never hit an OOM issue, processing 30-55 s of audio.
Can you share the config file? Did you modify the width/height or any other settings? What is the length of the audio?
Here is ./configs/prompts/animation.yaml:
## dependency models
pretrained_base_model_path: "./pretrained_weights/sd-image-variations-diffusers/"
pretrained_vae_path: "./pretrained_weights/sd-vae-ft-mse/"
audio_model_path: "./pretrained_weights/audio_processor/whisper_tiny.pt"
## echo mimic checkpoint
denoising_unet_path: "./pretrained_weights/denoising_unet.pth"
reference_unet_path: "./pretrained_weights/reference_unet.pth"
face_locator_path: "./pretrained_weights/face_locator.pth"
motion_module_path: "./pretrained_weights/motion_module.pth"
## deonise model configs
inference_config: "./configs/inference/inference_v2.yaml"
weight_dtype: 'fp16'
## test cases
test_cases:
  "./assets/test_imgs/a.png":
    - "./assets/test_audios/echomimic_en.wav"
  # "./assets/test_imgs/b.png":
  #   - "./assets/test_audios/echomimic_en_girl.wav"
  # "./assets/test_imgs/c.png":
  #   - "./assets/test_audios/echomimic_en_girl.wav"
  # "./assets/test_imgs/d.png":
  #   - "./assets/test_audios/echomimic_en_girl.wav"
  # "./assets/test_imgs/e.png":
  #   - "./assets/test_audios/echomimic_en.wav"
I did not change any other settings. The audio file echomimic_en.wav is 5 seconds long.
Can you please try the accelerated version: python -u infer_audio2vid_acc.py
Just remove the extra audios and process only one audio. https://github.com/BadToBest/EchoMimic/blob/main/configs/prompts/animation_acc.yaml
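For anyone who wants to script the "only one audio" step, here is a minimal sketch. It assumes the test_cases section of animation_acc.yaml has the same image-to-audio-list layout as the animation.yaml shown above and that it has already been loaded into a plain dict. The helper name keep_single_case is hypothetical and not part of EchoMimic.

```python
def keep_single_case(test_cases):
    """Return a copy of test_cases with only the first image and its first audio."""
    for image_path, audio_paths in test_cases.items():
        # Keep the first image entry and truncate its audio list to one item.
        return {image_path: list(audio_paths)[:1]}
    return {}  # nothing to keep if the mapping is empty


# Hypothetical in-memory form of the test_cases section (assumed layout).
test_cases = {
    "./assets/test_imgs/a.png": ["./assets/test_audios/echomimic_en.wav"],
    "./assets/test_imgs/b.png": ["./assets/test_audios/echomimic_en_girl.wav"],
}

trimmed = keep_single_case(test_cases)
print(trimmed)
```

Writing the trimmed mapping back to the YAML file is left out on purpose, since it depends on which YAML loader you use.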
It works! Why does the accelerated version script work? Thank you:)
(echomimic2) python -u infer_audio2vid_acc.py
Adding FFMPEG_PATH to PATH
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel:
['down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_q.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_out.0.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_out.0.bias, down_blocks.0.attentions.0.transformer_blocks.0.norm2.weight, down_blocks.0.attentions.0.transformer_blocks.0.norm2.bias, down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_q.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_k.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_v.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_out.0.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_out.0.bias, down_blocks.0.attentions.1.transformer_blocks.0.norm2.weight, down_blocks.0.attentions.1.transformer_blocks.0.norm2.bias, down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_q.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_k.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_v.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0.bias, down_blocks.1.attentions.0.transformer_blocks.0.norm2.weight, down_blocks.1.attentions.0.transformer_blocks.0.norm2.bias, down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_q.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_k.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_v.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0.bias, down_blocks.1.attentions.1.transformer_blocks.0.norm2.weight, down_blocks.1.attentions.1.transformer_blocks.0.norm2.bias, down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_q.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_k.weight, 
down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_v.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0.bias, down_blocks.2.attentions.0.transformer_blocks.0.norm2.weight, down_blocks.2.attentions.0.transformer_blocks.0.norm2.bias, down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_q.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_k.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_v.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0.bias, down_blocks.2.attentions.1.transformer_blocks.0.norm2.weight, down_blocks.2.attentions.1.transformer_blocks.0.norm2.bias, up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_q.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_k.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_v.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.1.attentions.0.transformer_blocks.0.norm2.weight, up_blocks.1.attentions.0.transformer_blocks.0.norm2.bias, up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_q.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_k.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_v.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.1.attentions.1.transformer_blocks.0.norm2.weight, up_blocks.1.attentions.1.transformer_blocks.0.norm2.bias, up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_q.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_k.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_v.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_out.0.weight, 
up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.1.attentions.2.transformer_blocks.0.norm2.weight, up_blocks.1.attentions.2.transformer_blocks.0.norm2.bias, up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_q.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_k.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_v.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.2.attentions.0.transformer_blocks.0.norm2.weight, up_blocks.2.attentions.0.transformer_blocks.0.norm2.bias, up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_q.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_k.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_v.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.2.attentions.1.transformer_blocks.0.norm2.weight, up_blocks.2.attentions.1.transformer_blocks.0.norm2.bias, up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_q.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_k.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_v.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.2.attentions.2.transformer_blocks.0.norm2.weight, up_blocks.2.attentions.2.transformer_blocks.0.norm2.bias, up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_q.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_k.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_v.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.3.attentions.0.transformer_blocks.0.norm2.weight, up_blocks.3.attentions.0.transformer_blocks.0.norm2.bias, 
up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_q.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_k.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_v.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.3.attentions.1.transformer_blocks.0.norm2.weight, up_blocks.3.attentions.1.transformer_blocks.0.norm2.bias, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_k.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.3.attentions.2.transformer_blocks.0.norm2.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm2.bias, mid_block.attentions.0.transformer_blocks.0.attn2.to_q.weight, mid_block.attentions.0.transformer_blocks.0.attn2.to_k.weight, mid_block.attentions.0.transformer_blocks.0.attn2.to_v.weight, mid_block.attentions.0.transformer_blocks.0.attn2.to_out.0.weight, mid_block.attentions.0.transformer_blocks.0.attn2.to_out.0.bias, mid_block.attentions.0.transformer_blocks.0.norm2.weight, mid_block.attentions.0.transformer_blocks.0.norm2.bias, conv_norm_out.weight, conv_norm_out.bias, conv_out.weight, conv_out.bias']
[0, 0, 1342, 1342]
video in 24 FPS, audio idx in 50FPS
whisper_chunks: (127, 50, 384)
audio_fea_final: torch.Size([1, 127, 50, 384])
ref_image_latents shape: torch.Size([1, 4, 64, 64])
face_mask_tensor shape: torch.Size([1, 1, 1, 512, 512])
face_locator_tensor shape: torch.Size([2, 320, 1, 64, 64])
100%|██████████| 6/6 [00:47<00:00, 7.97s/it]
100%|██████████| 127/127 [00:08<00:00, 14.99it/s]
Moviepy - Building video output/20240907/2150--seed_420-512x512/a_echomimic_en_512x512_1_2150_withaudio.mp4.
MoviePy - Writing audio in a_echomimic_en_512x512_1_2150_withaudioTEMP_MPY_wvf_snd.mp4
MoviePy - Done.
Moviepy - Writing video output/20240907/2150--seed_420-512x512/a_echomimic_en_512x512_1_2150_withaudio.mp4
Moviepy - Done !
Moviepy - video ready output/20240907/2150--seed_420-512x512/a_echomimic_en_512x512_1_2150_withaudio.mp4
output/20240907/2150--seed_420-512x512/a_echomimic_en_512x512_1_2150_withaudio.mp4
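As a quick sanity check on the log above, the numbers it prints are consistent with the roughly 5-second clip. This is only a sketch of the arithmetic. It reads "video in 24 FPS, audio idx in 50FPS" as 24 video frames and 50 audio feature steps per second, and the first dimension of whisper_chunks (127) as the number of video frames, which is an interpretation of the log and not something confirmed from the code.

```python
def clip_seconds(num_frames, video_fps):
    """Duration in seconds of a clip with num_frames frames at video_fps."""
    return num_frames / video_fps


def audio_steps_per_frame(audio_fps, video_fps):
    """How many audio feature steps fall within one video frame."""
    return audio_fps / video_fps


# Values copied from the log above (interpretation is an assumption).
num_frames = 127   # first dimension of whisper_chunks
video_fps = 24
audio_fps = 50

duration = clip_seconds(num_frames, video_fps)
ratio = audio_steps_per_frame(audio_fps, video_fps)
print(f"{duration:.2f} s of video, {ratio:.2f} audio steps per frame")
```

127 frames at 24 FPS is a little over 5 seconds, which matches the stated length of echomimic_en.wav.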
Why does the accelerated version script work?
To be very frank, I don't know. Keep in mind that the accelerated version's quality is 20% of the non-accelerated version's, and eye blinking also has issues in the accelerated version.
I think only devs can help you with the issue.
Unfortunately, I tried running the accelerated script again and got the following error (I have not changed any settings):
Traceback (most recent call last):
File "infer_audio2vid_acc.py", line 290, in
Why does this issue occur?
See if the last comment in this thread helps: https://github.com/Zulko/moviepy/issues/1986
In my case, the following commands (reinstall moviepy) have solved my issue:
pip uninstall moviepy decorator
pip install moviepy
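After reinstalling, it can help to confirm which versions actually ended up in the environment. A small sketch using only the standard library (importlib.metadata, available from Python 3.8); the helper name installed_version is made up for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The two packages touched by the reinstall commands above.
for name in ("moviepy", "decorator"):
    print(name, installed_version(name))
```

If either line prints None, the package is not installed in the environment that this Python interpreter is using.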
Thanks a lot!
I ran the command
python -u infer_audio2vid.py
on my GPU, but I got the following CUDA OOM error:
Traceback (most recent call last):
  File "infer_audio2vid.py", line 259, in <module>
    main()
  File "infer_audio2vid.py", line 227, in main
    video = pipe(
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/gpfs/home/p122g24/chanhyuk/EchoMimic/src/pipelines/pipeline_echo_mimic.py", line 507, in __call__
    pred = self.denoising_unet(
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/home/p122g24/chanhyuk/EchoMimic/src/models/unet_3d_echo.py", line 503, in forward
    sample, res_samples = downsample_block(
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/home/p122g24/chanhyuk/EchoMimic/src/models/unet_3d_blocks.py", line 446, in forward
    hidden_states = attn(
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/home/p122g24/chanhyuk/EchoMimic/src/models/transformer_3d.py", line 145, in forward
    hidden_states = block(
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/home/p122g24/chanhyuk/EchoMimic/src/models/mutual_self_attention.py", line 156, in hacked_basic_transformer_inner_forward
    self.attn1(
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/diffusers/models/attention_processor.py", line 522, in forward
    return self.processor(
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/diffusers/models/attention_processor.py", line 743, in __call__
    attention_probs = attn.get_attention_scores(query, key, attention_mask)
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/diffusers/models/attention_processor.py", line 598, in get_attention_scores
    attention_scores = torch.baddbmm(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 GiB (GPU 0; 23.70 GiB total capacity; 17.09 GiB already allocated; 4.54 GiB free; 17.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
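As a rough reading of the numbers in this message, the sketch below pulls the GiB figures out of the text and compares them. It only parses the string quoted above and does not query the GPU; the helper name parse_oom is made up for this example.

```python
import re

# The memory figures from the error message above, copied verbatim.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 12.00 GiB "
    "(GPU 0; 23.70 GiB total capacity; 17.09 GiB already allocated; "
    "4.54 GiB free; 17.46 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Extract the GiB figures from a PyTorch CUDA OOM message of this form."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    return {key: float(re.search(p, message).group(1)) for key, p in patterns.items()}


stats = parse_oom(OOM_MESSAGE)
shortfall = stats["requested"] - stats["free"]          # how far the request overshoots
cached_unused = stats["reserved"] - stats["allocated"]  # reserved but not in use
print(f"shortfall: {shortfall:.2f} GiB, cached but unused: {cached_unused:.2f} GiB")
```

Here reserved memory is only slightly above allocated memory, so the max_split_size_mb hint about fragmentation probably does not apply. The single 12 GiB request is simply larger than what is free.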
My remote server spec is the following:
OS: CentOS Linux release 7.9.2009 (Core)
GPU: NVIDIA GeForce RTX 3090 24 GB
I installed most of the packages in requirements.txt, but because of the CUDA version installed on the GPU machine, I changed the torch and opencv-python versions:
torch 1.13.1+cu116
torchaudio 0.13.1+cu116
torchvision 0.14.1+cu116
opencv-python-headless 4.10.0.84
Also, I only used "./assets/test_imgs/a.png" and "./assets/test_audios/echomimic_en.wav". I have seen that running EchoMimic on an RTX 3090 10gb would be fine, so I do not understand why the CUDA OOM error occurs. Please help me out! Thank you for your help in advance.
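One way to see why a single allocation can be this large: the traceback above fails in torch.baddbmm inside get_attention_scores, which materializes the full attention score matrix. Its size grows with (batch x heads) x query length x key length. The sketch below is just that arithmetic. The 64 x 64 latent size comes from the ref_image_latents shape in the log. The frame count, the doubling of the batch, the number of heads, and the doubled key length are assumptions picked for illustration and are not values read from the EchoMimic code.

```python
def attention_scores_gib(batch_times_heads, query_len, key_len, bytes_per_element=2):
    """Size in GiB of a dense attention score tensor (fp16 by default)."""
    return batch_times_heads * query_len * key_len * bytes_per_element / 2**30


latent_tokens = 64 * 64  # 4096 tokens, from ref_image_latents [1, 4, 64, 64] in the log

# ASSUMPTIONS for illustration only (not taken from the EchoMimic source):
frames = 12                  # frames processed together in one window
batch_copies = 2             # e.g. conditional + unconditional passes
heads = 8                    # attention heads in the first block
key_len = 2 * latent_tokens  # keys from the frame plus a reference image

size = attention_scores_gib(batch_copies * frames * heads, latent_tokens, key_len)
print(f"{size:.2f} GiB")  # 12.00 GiB with these assumed values
```

With these assumed values the score matrix alone comes to 12 GiB, the same size as the failed request. That is suggestive only, since other combinations of shapes give the same number. Attention implementations that avoid building the full score matrix (memory-efficient or sliced attention) would need far less memory at this step.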