CUDA out of memory on RTX 3090 24 GB

I ran the command python -u infer_audio2vid.py on my GPU, but I got the following CUDA OOM error:

Traceback (most recent call last):
  File "infer_audio2vid.py", line 259, in <module>
    main()
  File "infer_audio2vid.py", line 227, in main
    video = pipe(
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/gpfs/home/p122g24/chanhyuk/EchoMimic/src/pipelines/pipeline_echo_mimic.py", line 507, in __call__
    pred = self.denoising_unet(
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/home/p122g24/chanhyuk/EchoMimic/src/models/unet_3d_echo.py", line 503, in forward
    sample, res_samples = downsample_block(
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/home/p122g24/chanhyuk/EchoMimic/src/models/unet_3d_blocks.py", line 446, in forward
    hidden_states = attn(
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/home/p122g24/chanhyuk/EchoMimic/src/models/transformer_3d.py", line 145, in forward
    hidden_states = block(
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfs/home/p122g24/chanhyuk/EchoMimic/src/models/mutual_self_attention.py", line 156, in hacked_basic_transformer_inner_forward
    self.attn1(
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/diffusers/models/attention_processor.py", line 522, in forward
    return self.processor(
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/diffusers/models/attention_processor.py", line 743, in __call__
    attention_probs = attn.get_attention_scores(query, key, attention_mask)
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/diffusers/models/attention_processor.py", line 598, in get_attention_scores
    attention_scores = torch.baddbmm(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 GiB (GPU 0; 23.70 GiB total capacity; 17.09 GiB already allocated; 4.54 GiB free; 17.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

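The figures in the final error line can be compared directly. The sketch below is only an illustration (the helper name parse_oom and its regular expressions are made up here, not part of PyTorch): it extracts the numbers and shows that the request is far larger than the free memory, while reserved memory barely exceeds allocated memory, so fragmentation is unlikely to be the main cause.

```python
import re

# Final line of the traceback above, trimmed to the part with numbers.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 12.00 GiB "
    "(GPU 0; 23.70 GiB total capacity; 17.09 GiB already allocated; "
    "4.54 GiB free; 17.46 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Return the GiB figures from a PyTorch 1.x style OOM message (hypothetical helper)."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    return {key: float(re.search(pat, message).group(1)) for key, pat in patterns.items()}


stats = parse_oom(OOM_MESSAGE)
shortfall = stats["requested"] - stats["free"]    # how much the request exceeds free memory
slack = stats["reserved"] - stats["allocated"]    # cached by PyTorch but not in use
print(f"shortfall: {shortfall:.2f} GiB, reserved but unused: {slack:.2f} GiB")
```

With a slack this small, tuning max_split_size_mb would probably not help much; the single 12 GiB request is the thing to reduce.
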
My remote server specs are as follows:

OS: CentOS Linux release 7.9.2009 (Core)
GPU: NVIDIA GeForce RTX 3090 24 GB

I installed most of the packages in requirements.txt, but because of the CUDA version installed on the server, I changed the torch and opencv-python versions:

torch 1.13.1+cu116
torchaudio 0.13.1+cu116
torchvision 0.14.1+cu116
opencv-python-headless 4.10.0.84

Also, I used only "./assets/test_imgs/a.png" and "./assets/test_audios/echomimic_en.wav". I have seen that running EchoMimic on an RTX 3090 10 GB would be fine, so I do not understand why the CUDA OOM error occurs. Please help me out!
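One difference worth checking is the attention path. The traceback ends in get_attention_scores and torch.baddbmm, which build the full attention matrix in memory. To my knowledge, diffusers only switches to the fused torch.nn.functional.scaled_dot_product_attention path when that function exists, and it was introduced in PyTorch 2.0, so torch 1.13.1 would fall back to the plain processor unless xformers is enabled. This is a guess at the cause, not a confirmed diagnosis. A small sketch (the helper name has_fused_sdpa and the second version string are illustrative) that checks a version string:

```python
import re


def has_fused_sdpa(torch_version):
    """True if the PyTorch version string is 2.0 or newer (hypothetical helper)."""
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"unrecognised version: {torch_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (2, 0)


# First string is the version from this report; the second is only an example.
for version in ("1.13.1+cu116", "2.0.1+cu118"):
    print(version, "-> fused attention available:", has_fused_sdpa(version))
```

In a live environment, the equivalent check is hasattr(torch.nn.functional, "scaled_dot_product_attention").
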

Thank you for your help in advance.

Surprising.

I use it all the time on a 4060 8 GB / Windows 11 and never face an OOM issue, processing 30 s - 55 s of audio.

Can you share the config file? Did you modify the width/height or any other settings? What is the length of the audio?

Here is ./configs/prompts/animation.yaml:

## dependency models
pretrained_base_model_path: "./pretrained_weights/sd-image-variations-diffusers/"
pretrained_vae_path: "./pretrained_weights/sd-vae-ft-mse/"
audio_model_path: "./pretrained_weights/audio_processor/whisper_tiny.pt"

## echo mimic checkpoint
denoising_unet_path: "./pretrained_weights/denoising_unet.pth"
reference_unet_path: "./pretrained_weights/reference_unet.pth"
face_locator_path: "./pretrained_weights/face_locator.pth"
motion_module_path: "./pretrained_weights/motion_module.pth"

## deonise model configs
inference_config: "./configs/inference/inference_v2.yaml"
weight_dtype: 'fp16'

## test cases
test_cases:
  "./assets/test_imgs/a.png":
    - "./assets/test_audios/echomimic_en.wav"
  # "./assets/test_imgs/b.png":
  #   - "./assets/test_audios/echomimic_en_girl.wav"
  # "./assets/test_imgs/c.png":
  #   - "./assets/test_audios/echomimic_en_girl.wav"
  # "./assets/test_imgs/d.png":
  #   - "./assets/test_audios/echomimic_en_girl.wav"
  # "./assets/test_imgs/e.png":
  #   - "./assets/test_audios/echomimic_en.wav"

I did not change any other settings. The length of the audio echomimic_en.wav is 5 sec.

Can you please try the accelerated version?

python -u infer_audio2vid_acc.py

Just remove the extra audios and process only 1 audio. https://github.com/BadToBest/EchoMimic/blob/main/configs/prompts/animation_acc.yaml

It works! Why does the accelerated version script work? Thank you:)

(echomimic2) python -u infer_audio2vid_acc.py Adding FFMPEG_PATH to PATH Some weights of the model checkpoint were not used when initializing UNet2DConditionModel: ['down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_q.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_out.0.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_out.0.bias, down_blocks.0.attentions.0.transformer_blocks.0.norm2.weight, down_blocks.0.attentions.0.transformer_blocks.0.norm2.bias, down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_q.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_k.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_v.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_out.0.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_out.0.bias, down_blocks.0.attentions.1.transformer_blocks.0.norm2.weight, down_blocks.0.attentions.1.transformer_blocks.0.norm2.bias, down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_q.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_k.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_v.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0.bias, down_blocks.1.attentions.0.transformer_blocks.0.norm2.weight, down_blocks.1.attentions.0.transformer_blocks.0.norm2.bias, down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_q.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_k.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_v.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0.bias, down_blocks.1.attentions.1.transformer_blocks.0.norm2.weight, 
down_blocks.1.attentions.1.transformer_blocks.0.norm2.bias, down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_q.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_k.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_v.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0.bias, down_blocks.2.attentions.0.transformer_blocks.0.norm2.weight, down_blocks.2.attentions.0.transformer_blocks.0.norm2.bias, down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_q.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_k.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_v.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0.bias, down_blocks.2.attentions.1.transformer_blocks.0.norm2.weight, down_blocks.2.attentions.1.transformer_blocks.0.norm2.bias, up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_q.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_k.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_v.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.1.attentions.0.transformer_blocks.0.norm2.weight, up_blocks.1.attentions.0.transformer_blocks.0.norm2.bias, up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_q.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_k.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_v.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.1.attentions.1.transformer_blocks.0.norm2.weight, up_blocks.1.attentions.1.transformer_blocks.0.norm2.bias, up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_q.weight, 
up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_k.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_v.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.1.attentions.2.transformer_blocks.0.norm2.weight, up_blocks.1.attentions.2.transformer_blocks.0.norm2.bias, up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_q.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_k.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_v.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.2.attentions.0.transformer_blocks.0.norm2.weight, up_blocks.2.attentions.0.transformer_blocks.0.norm2.bias, up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_q.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_k.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_v.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.2.attentions.1.transformer_blocks.0.norm2.weight, up_blocks.2.attentions.1.transformer_blocks.0.norm2.bias, up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_q.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_k.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_v.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.2.attentions.2.transformer_blocks.0.norm2.weight, up_blocks.2.attentions.2.transformer_blocks.0.norm2.bias, up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_q.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_k.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_v.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_out.0.weight, 
up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.3.attentions.0.transformer_blocks.0.norm2.weight, up_blocks.3.attentions.0.transformer_blocks.0.norm2.bias, up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_q.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_k.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_v.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.3.attentions.1.transformer_blocks.0.norm2.weight, up_blocks.3.attentions.1.transformer_blocks.0.norm2.bias, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_k.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.3.attentions.2.transformer_blocks.0.norm2.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm2.bias, mid_block.attentions.0.transformer_blocks.0.attn2.to_q.weight, mid_block.attentions.0.transformer_blocks.0.attn2.to_k.weight, mid_block.attentions.0.transformer_blocks.0.attn2.to_v.weight, mid_block.attentions.0.transformer_blocks.0.attn2.to_out.0.weight, mid_block.attentions.0.transformer_blocks.0.attn2.to_out.0.bias, mid_block.attentions.0.transformer_blocks.0.norm2.weight, mid_block.attentions.0.transformer_blocks.0.norm2.bias, conv_norm_out.weight, conv_norm_out.bias, conv_out.weight, conv_out.bias']
[0, 0, 1342, 1342]
video in 24 FPS, audio idx in 50FPS
whisper_chunks: (127, 50, 384)
audio_fea_final: torch.Size([1, 127, 50, 384])
ref_image_latents shape: torch.Size([1, 4, 64, 64])
face_mask_tensor shape: torch.Size([1, 1, 1, 512, 512])
face_locator_tensor shape: torch.Size([2, 320, 1, 64, 64])
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:47<00:00, 7.97s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 127/127 [00:08<00:00, 14.99it/s]
Moviepy - Building video output/20240907/2150--seed_420-512x512/a_echomimic_en_512x512_1_2150_withaudio.mp4.
MoviePy - Writing audio in a_echomimic_en_512x512_1_2150_withaudioTEMP_MPY_wvf_snd.mp4
MoviePy - Done.
Moviepy - Writing video output/20240907/2150--seed_420-512x512/a_echomimic_en_512x512_1_2150_withaudio.mp4

Moviepy - Done !
Moviepy - video ready output/20240907/2150--seed_420-512x512/a_echomimic_en_512x512_1_2150_withaudio.mp4
output/20240907/2150--seed_420-512x512/a_echomimic_en_512x512_1_2150_withaudio.mp4

Why does the accelerated version script work?

To be very frank, I don't know. Remember that the accelerated version's quality is 20% of the non-accelerated version's, and eye blinking also has issues with the accelerated version.

I think only devs can help you with the issue.
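One possible explanation, offered as a rough estimate rather than a confirmed diagnosis: the failing allocation is the attention-score tensor, whose size scales with the batch size. The accelerated run's output name contains _1_ in the position where the script formats int(args.cfg), which suggests a guidance scale of 1, and pipelines commonly skip the doubled classifier-free-guidance (CFG) batch in that case. The sketch below assumes 12 frames per denoising window, 8 attention heads, 64 x 64 latent positions as query tokens, and keys twice as long because reference features are concatenated. All of these are assumptions, not values read from the EchoMimic source.

```python
def attention_scores_gib(frames, heads, query_tokens, key_tokens,
                         cfg_enabled, bytes_per_element=2):
    """Size in GiB of a (batch * heads, query, key) score tensor; fp16 by default."""
    batch = frames * (2 if cfg_enabled else 1)  # CFG runs conditional + unconditional
    n_bytes = batch * heads * query_tokens * key_tokens * bytes_per_element
    return n_bytes / 2**30


latent_tokens = 64 * 64  # the log reports ref_image_latents of 64 x 64

# Assumed values: 12 frames, 8 heads, keys = own tokens + reference tokens.
with_cfg = attention_scores_gib(12, 8, latent_tokens, 2 * latent_tokens, cfg_enabled=True)
without_cfg = attention_scores_gib(12, 8, latent_tokens, 2 * latent_tokens, cfg_enabled=False)
print(f"CFG on: {with_cfg:.2f} GiB, CFG off: {without_cfg:.2f} GiB")
```

Under these assumptions the estimate matches the 12.00 GiB request in the error message and halves without CFG. That is consistent with the accelerated script fitting in memory, but it does not prove it.
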

Unfortunately, I tried running the accelerated script again and got the following error (I have not changed any settings):

Traceback (most recent call last):
  File "infer_audio2vid_acc.py", line 290, in <module>
    main()
  File "infer_audio2vid_acc.py", line 285, in main
    video_clip.write_videofile(f"{save_dir}/{refname}{audioname}{args.H}x{args.W}{int(args.cfg)}{time_str}_withaudio.mp4", codec="libx264", audio_codec="aac")
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/moviepy/decorators.py", line 54, in requires_duration
    return f(clip, *a, **k)
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/moviepy/decorators.py", line 135, in use_clip_fps_by_default
    return f(clip, *new_a, **new_kw)
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/moviepy/decorators.py", line 22, in convert_masks_to_RGB
    return f(clip, *a, **k)
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/moviepy/video/VideoClip.py", line 300, in write_videofile
    ffmpeg_write_video(self, filename, fps, codec,
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/moviepy/video/io/ffmpeg_writer.py", line 213, in ffmpeg_write_video
    with FFMPEG_VideoWriter(filename, clip.size, fps, codec = codec,
  File "/home/p122g24/.conda/envs/echomimic2/lib/python3.8/site-packages/moviepy/video/io/ffmpeg_writer.py", line 88, in __init__
    '-r', '%.02f' % fps,
TypeError: must be real number, not NoneType

Why does this issue occur?

See if the last comment in this thread helps: https://github.com/Zulko/moviepy/issues/1986

In my case, reinstalling moviepy with the following commands solved the issue:

pip uninstall moviepy decorator
pip install moviepy
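For context, the traceback ends at '%.02f' % fps inside moviepy's writer, which means fps arrived there as None. The sketch below reproduces only that formatting failure in plain Python and shows a defensive fallback. The helper resolve_fps and its default of 24 are illustrative (24 FPS is the rate the inference log reports) and are not part of moviepy.

```python
def resolve_fps(clip_fps, default=24):
    """Return a usable frame rate, falling back when the clip has none (hypothetical helper)."""
    return default if clip_fps is None else clip_fps


error_text = ""
try:
    "%.02f" % None  # same formatting expression that fails in ffmpeg_writer.py
except TypeError as exc:
    error_text = str(exc)

print("formatting None fails with:", error_text)
print("with a fallback:", "%.02f" % resolve_fps(None))
```

Passing fps explicitly to write_videofile is another way to avoid relying on the clip's own attribute, though the reinstall above is what resolved it in this case.
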

Thanks a lot!
