fudan-generative-vision / hallo2

Hallo2: Long-Duration and High-Resolution Audio-driven Portrait Image Animation
https://fudan-generative-vision.github.io/hallo2
MIT License
1.56k stars 219 forks

How much time does it need for inference for 10s audio #15

Open nitinmukesh opened 1 day ago

nitinmukesh commented 1 day ago

This has been running for 2:30 hours and is showing another 2:30 remaining.

save path:  ./output_long/debug/women
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CUDAExecutionProvider': {'device_id': '0', 'has_user_compute_stream': '0', 'cudnn_conv1d_pad_to_nc1d': '0', 'user_compute_stream': '0', 'gpu_external_alloc': '0', 'gpu_mem_limit': '18446744073709551615', 'enable_cuda_graph': '0', 'gpu_external_free': '0', 'gpu_external_empty_cache': '0', 'arena_extend_strategy': 'kNextPowerOfTwo', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'do_copy_in_default_stream': '1', 'cudnn_conv_use_max_workspace': '1', 'tunable_op_enable': '0', 'tunable_op_tuning_enable': '0', 'tunable_op_max_tuning_duration_ms': '0', 'enable_skip_layer_norm_strict_mode': '0', 'prefer_nhwc': '0', 'use_ep_level_unified_stream': '0', 'use_tf32': '1'}, 'CPUExecutionProvider': {}}
find model: ./pretrained_models/face_analysis\models\1k3d68.onnx landmark_3d_68 ['None', 3, 192, 192] 0.0 1.0
find model: ./pretrained_models/face_analysis\models\2d106det.onnx landmark_2d_106 ['None', 3, 192, 192] 0.0 1.0
find model: ./pretrained_models/face_analysis\models\genderage.onnx genderage ['None', 3, 96, 96] 0.0 1.0
find model: ./pretrained_models/face_analysis\models\glintr100.onnx recognition ['None', 3, 112, 112] 127.5 127.5
find model: ./pretrained_models/face_analysis\models\scrfd_10g_bnkps.onnx detection [1, 3, '?', '?'] 127.5 128.0
set det-size: (640, 640)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1729314279.187445    4552 face_landmarker_graph.cc:174] Sets FaceBlendshapesGraph acceleration to xnnpack by default.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
W0000 00:00:1729314279.211866   19844 inference_feedback_manager.cc:114] Feedback manager requires a model with a single signature inference. Disabling support for feedback tensors.
W0000 00:00:1729314279.225367   19844 inference_feedback_manager.cc:114] Feedback manager requires a model with a single signature inference. Disabling support for feedback tensors.
Processed and saved: ./output_long/debug/women\women_sep_background.png
Processed and saved: ./output_long/debug/women\women_sep_face.png
Some weights of Wav2VecModel were not initialized from the model checkpoint at ./pretrained_models/wav2vec/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2024-10-19 10:34:40,113 - INFO - separator - Separator version 0.17.2 instantiating with output_dir: ./output_long/debug/women\audio_preprocess, output_format: WAV
2024-10-19 10:34:40,113 - INFO - separator - Operating System: Windows 10.0.22631
2024-10-19 10:34:40,113 - INFO - separator - System: Windows Node: nits Release: 10 Machine: AMD64 Proc: Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
2024-10-19 10:34:40,113 - INFO - separator - Python Version: 3.10.6
2024-10-19 10:34:40,123 - INFO - separator - PyTorch Version: 2.2.2+cu118
2024-10-19 10:34:40,192 - INFO - separator - FFmpeg installed: ffmpeg version 6.1-full_build-www.gyan.dev Copyright (c) 2000-2023 the FFmpeg developers
2024-10-19 10:34:40,194 - INFO - separator - ONNX Runtime GPU package installed with version: 1.18.0
2024-10-19 10:34:40,194 - INFO - separator - CUDA is available in Torch, setting Torch device to CUDA
2024-10-19 10:34:40,194 - INFO - separator - ONNXruntime has CUDAExecutionProvider available, enabling acceleration
2024-10-19 10:34:40,194 - INFO - separator - Loading model Kim_Vocal_2.onnx...
2024-10-19 10:34:41,098 - INFO - separator - Load model duration: 00:00:00
2024-10-19 10:34:41,098 - INFO - separator - Starting separation process for audio_file_path: ./output_long/debug/women\seg-long-audio/segment_1.wav
100%|███████████████████████████████████████████████████████████████████████████████████| 3/3 [00:06<00:00,  2.23s/it]
100%|███████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  9.36it/s]
2024-10-19 10:34:49,824 - INFO - mdx_separator - Saving Vocals stem to segment_1_(Vocals)_Kim_Vocal_2.wav...
2024-10-19 10:34:50,024 - INFO - common_separator - Clearing input audio file paths, sources and stems...
2024-10-19 10:34:50,034 - INFO - separator - Separation duration: 00:00:08
The config attributes {'center_input_sample': False, 'out_channels': 4} were passed to UNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel:
 ['conv_norm_out.bias, conv_norm_out.weight, conv_out.bias, conv_out.weight']
The config attributes {'center_input_sample': False} were passed to UNet3DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Load motion module params from pretrained_models\motion_module\mm_sd_v15_v2.ckpt
loaded weight from  pretrained_models/hallo2\net.pth
ic| audio_emb.shape: torch.Size([288, 5, 12, 768])
ic| audio_length: 282
[1/18]
100%|██████████████████████████████████████████████████████████████████████████████| 40/40 [2:20:04<00:00, 210.11s/it]
100%|█████████████████████████████████████████████████████████████████████████████████| 16/16 [01:32<00:00,  5.76s/it]
ic| pipeline_output.videos.shape: torch.Size([1, 3, 16, 512, 512])
[2/18]
 15%|███████████▊                                                                   | 6/40 [21:12<1:59:07, 210.22s/it]
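Reading the figures printed in the log above (an assumption about how the pipeline chunks the audio, inferred from the shapes it prints): 282 audio-aligned frames generated 16 frames per pass gives the 18 segments in the `[1/18]` counter, and at the observed ~210 s per denoising step each 40-step segment takes roughly 2 h 20 m.

```python
import math

# Figures printed in the log above
audio_length = 282      # audio-aligned frames reported by ic()
frames_per_pass = 16    # frames per diffusion pass (see pipeline_output.videos.shape)
steps = 40              # denoising steps per pass
sec_per_step = 210.11   # observed seconds per step on this machine

segments = math.ceil(audio_length / frames_per_pass)
print(segments)                                        # 18, matching the [1/18] counter
print(f"{steps * sec_per_step / 3600:.2f} h/segment")  # 2.33 h, matching the 2:20:04 bar
```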
cuijh26 commented 1 day ago

Could you tell me what kind of GPU you use to run this? It seems a little strange.

nitinmukesh commented 1 day ago

[screenshot: GPU memory usage]

In long.yaml I have modified:

source_image: ./examples/women.png
driving_audio: ./examples/audio.wav
fps: 30

Mikerhinos commented 1 day ago

You'd need a card with 16GB of VRAM. As you can see, it's using 13GB and you only have 8GB available, so the overflow goes to shared system RAM, which is far slower and turns minutes into hours. On my 16GB RTX 4070 it takes around 1 minute per second of audio, so 10s of audio takes about 10 minutes.
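A quick way to check how much dedicated VRAM a card actually has before starting a run might look like this (a sketch that degrades gracefully when torch or CUDA is absent; the function name is illustrative):

```python
import importlib.util

def gpu_memory_report() -> str:
    """Report total dedicated VRAM per CUDA device, if torch + CUDA are available."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    if not torch.cuda.is_available():
        return "CUDA not available"
    lines = []
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        lines.append(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB")
    return "\n".join(lines)

print(gpu_memory_report())
```

If the reported total is well below what the log's peak usage shows (13GB here), the excess is being paged through shared system memory.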

nitinmukesh commented 1 day ago

Surprising. I use many AI tools and almost all of them need shared RAM. I can understand 1 minute of processing turning into 10 minutes, but 45 hours for 10s of audio?

[Edit] I tried with default settings and still the same issue.
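The ~45-hour figure is roughly consistent with the per-step time in the log, assuming all 18 segments run at the same speed as the first:

```python
# Figures from the log: 18 segments, 40 denoising steps each, ~210 s/step
segments, steps, sec_per_step = 18, 40, 210.11
total_hours = segments * steps * sec_per_step / 3600
print(f"{total_hours:.1f} h")  # 42.0 h of denoising alone, before the 16-step passes
```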

nitinmukesh commented 1 day ago

@cuijh26

Is there anything you can suggest, or is it just not supposed to work with shared RAM?