fudan-generative-vision / hallo

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation
https://fudan-generative-vision.github.io/hallo/
MIT License

CUDNN failure 1: CUDNN_STATUS_NOT_INITIALIZED #181

Open Nyquist0 opened 1 month ago

Nyquist0 commented 1 month ago

Dear Sir or Madam,

I ran into the following error, which keeps interrupting my training process. It occurs after 1000 steps and is raised as an ONNXRuntimeError. Could you help check whether anything is wrong?

Environment:

commands: CUDA_VISIBLE_DEVICES=1 accelerate launch -m --config_file accelerate_config.yaml --machine_rank 0 --main_process_ip 0.0.0.0 --main_process_port 20055 --num_machines 1 --num_processes 1 scripts.train_stage1 --config ./configs/train/stage1.yaml

error:

...
[2024-08-13 21:32:16,393] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint pytorch_model is ready now!                                                                     
INFO:accelerate.accelerator:DeepSpeed Model and Optimizer saved to output dir ./exp_output/stage1/checkpoints/checkpoint-4000/pytorch_model                                                
INFO:accelerate.checkpointing:Scheduler state saved in exp_output/stage1/checkpoints/checkpoint-4000/scheduler.bin                                                                         
INFO:accelerate.checkpointing:Sampler state for dataloader 0 saved in exp_output/stage1/checkpoints/checkpoint-4000/sampler.bin                                                            
INFO:accelerate.checkpointing:Random states saved in exp_output/stage1/checkpoints/checkpoint-4000/random_states_0.pkl                                                                     
3 checkpoints already exist, removing 1 checkpoints                                                                                                                                        
Removing checkpoints: reference_unet-2500.pth                                                                                                                                              
Checkpoint saved at ./exp_output/stage1/modules/reference_unet-4000.pth                                                                                                                    
3 checkpoints already exist, removing 1 checkpoints                                                                                                                                        
Removing checkpoints: imageproj-2500.pth                                                                                                                                                   
Checkpoint saved at ./exp_output/stage1/modules/imageproj-4000.pth                                                                                                                         
3 checkpoints already exist, removing 1 checkpoints                                                                                                                                        
Removing checkpoints: denoising_unet-2500.pth                                                                                                                                              
Checkpoint saved at ./exp_output/stage1/modules/denoising_unet-4000.pth                                                                                                                    
3 checkpoints already exist, removing 1 checkpoints                                                                                                                                        
Removing checkpoints: face_locator-2500.pth                                                                                                                                                
Checkpoint saved at ./exp_output/stage1/modules/face_locator-4000.pth                                                                                                                      
INFO:__main__:Running validation...                                                                                                                                                        
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'prefer_nhwc': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_enable': '0', 'enable_cuda_graph': '0', 'tunable_op_max_tuning_duration_ms': '0', 'tunable_op_tuning_enable': '0', 'cudnn_conv_use_max_workspace': '1', 'use_tf32': '1', 'cudnn_conv1d_pad_to_nc1d': '0', 'do_copy_in_default_stream': '1', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'gpu_external_empty_cache': '0', 'gpu_external_free': '0', 'gpu_external_alloc': '0', 'gpu_mem_limit': '18446744073709551615', 'arena_extend_strategy': 'kNextPowerOfTwo', 'user_compute_stream': '0', 'has_user_compute_stream': '0', 'use_ep_level_unified_stream': '0', 'device_id': '0'}}
find model: ./pretrained_models/face_analysis/models/1k3d68.onnx landmark_3d_68 ['None', 3, 192, 192] 0.0 1.0                                                                              
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'prefer_nhwc': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_enable': '0', 'enable_cuda_graph': '0', 'tunable_op_max_tuning_duration_ms': '0', 'tunable_op_tuning_enable': '0', 'cudnn_conv_use_max_workspace': '1', 'use_tf32': '1', 'cudnn_conv1d_pad_to_nc1d': '0', 'do_copy_in_default_stream': '1', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'gpu_external_empty_cache': '0', 'gpu_external_free': '0', 'gpu_external_alloc': '0', 'gpu_mem_limit': '18446744073709551615', 'arena_extend_strategy': 'kNextPowerOfTwo', 'user_compute_stream': '0', 'has_user_compute_stream': '0', 'use_ep_level_unified_stream': '0', 'device_id': '0'}}
find model: ./pretrained_models/face_analysis/models/2d106det.onnx landmark_2d_106 ['None', 3, 192, 192] 0.0 1.0                                                                           
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'prefer_nhwc': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_enable': '0', 'enable_cuda_graph': '0', 'tunable_op_max_tuning_duration_ms': '0', 'tunable_op_tuning_enable': '0', 'cudnn_conv_use_max_workspace': '1', 'use_tf32': '1', 'cudnn_conv1d_pad_to_nc1d': '0', 'do_copy_in_default_stream': '1', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'gpu_external_empty_cache': '0', 'gpu_external_free': '0', 'gpu_external_alloc': '0', 'gpu_mem_limit': '18446744073709551615', 'arena_extend_strategy': 'kNextPowerOfTwo', 'user_compute_stream': '0', 'has_user_compute_stream': '0', 'use_ep_level_unified_stream': '0', 'device_id': '0'}}
find model: ./pretrained_models/face_analysis/models/genderage.onnx genderage ['None', 3, 96, 96] 0.0 1.0
Applied providers: ['CUDAExecutionProvider', 'CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}, 'CUDAExecutionProvider': {'prefer_nhwc': '0', 'enable_skip_layer_norm_strict_mode': '0', 'tunable_op_enable': '0', 'enable_cuda_graph': '0', 'tunable_op_max_tuning_duration_ms': '0', 'tunable_op_tuning_enable': '0', 'cudnn_conv_use_max_workspace': '1', 'use_tf32': '1', 'cudnn_conv1d_pad_to_nc1d': '0', 'do_copy_in_default_stream': '1', 'cudnn_conv_algo_search': 'EXHAUSTIVE', 'gpu_external_empty_cache': '0', 'gpu_external_free': '0', 'gpu_external_alloc': '0', 'gpu_mem_limit': '18446744073709551615', 'arena_extend_strategy': 'kNextPowerOfTwo', 'user_compute_stream': '0', 'has_user_compute_stream': '0', 'use_ep_level_unified_stream': '0', 'device_id': '0'}}
find model: ./pretrained_models/face_analysis/models/glintr100.onnx recognition ['None', 3, 112, 112] 127.5 127.5

2024-08-13 21:33:53.948346489 [E:onnxruntime:, inference_session.cc:2045 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDNN failure 1: CUDNN_STATUS_NOT_INITIALIZED ; GPU=0 ; hostname=lancel-server ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=181 ; expr=cudnnCreate(&cudnn_handle_);

ERROR:root:Failed to execute the training process: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDNN failure 1: CUDNN_STATUS_NOT_INITIALIZED ; GPU=0 ; hostname=lancel-server ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=181 ; expr=cudnnCreate(&cudnn_handle_);

Looking forward to your reply. Thanks.

xumingw commented 1 month ago

Please check your onnx version; the inference step needs onnxruntime. Does the inference script work?
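
One way to run that version check is sketched below. It is only an illustration and does not come from the repository: the helper name `describe_onnxruntime` is hypothetical. It relies on the standard `importlib.metadata` module and on `onnxruntime.get_available_providers()`, and it returns `None` for the providers when onnxruntime cannot be imported.

```python
import importlib.metadata
import importlib.util


def describe_onnxruntime():
    """Report installed onnxruntime distributions and the providers they expose.

    Hypothetical helper for illustration; not part of the hallo codebase.
    """
    report = {"installed": {}, "providers": None}

    # The CPU and GPU builds are published as separate distributions.
    for dist in ("onnxruntime", "onnxruntime-gpu"):
        try:
            report["installed"][dist] = importlib.metadata.version(dist)
        except importlib.metadata.PackageNotFoundError:
            pass  # this distribution is not installed

    # Only query providers if the module can actually be imported.
    if importlib.util.find_spec("onnxruntime") is not None:
        try:
            import onnxruntime as ort

            report["providers"] = ort.get_available_providers()
        except Exception:
            report["providers"] = None  # import failed, e.g. a broken install

    return report


if __name__ == "__main__":
    print(describe_onnxruntime())
```

Comparing the reported versions with the pins in requirements.txt, and checking whether `CUDAExecutionProvider` appears in the provider list, gives a concrete starting point for the discussion.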

Nyquist0 commented 1 month ago

I am running training directly, so let me check the inference script. My onnx version matches the one in your requirements.txt exactly.

Nyquist0 commented 1 month ago

Hi @xumingw, inference works well. Is it possible that the onnx version you provided is compatible with the Ampere architecture but not with the Ada architecture? Any suggestions?
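
To make the architecture question concrete, the sketch below reports which architecture each visible GPU belongs to. It is an illustration under assumptions rather than anything from the repository: `arch_name` and `local_gpu_report` are hypothetical helpers, the mapping covers only compute capabilities 8.x and 9.x, and the report is empty when PyTorch or CUDA is unavailable.

```python
def arch_name(major, minor):
    """Map a CUDA compute capability to an NVIDIA architecture name (partial table)."""
    if (major, minor) == (8, 9):
        return "Ada Lovelace"
    if major == 8:
        return "Ampere"  # 8.0, 8.6, 8.7
    if major == 9:
        return "Hopper"
    return f"unknown (sm_{major}{minor})"


def local_gpu_report():
    """Return (name, capability, architecture) per visible GPU; [] without torch or CUDA."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []

    report = []
    # Indices here are relative to CUDA_VISIBLE_DEVICES, not physical GPU ids.
    for index in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(index)
        name = torch.cuda.get_device_name(index)
        report.append((name, (major, minor), arch_name(major, minor)))
    return report


if __name__ == "__main__":
    for entry in local_gpu_report():
        print(entry)
```

Posting this output alongside the onnxruntime version would show which card the failing run actually sees.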