cvlab-kaist / GaussianTalker

Official implementation of “GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting” by Kyusun Cho, Joungbin Lee, Heeji Yoon, Yeobin Hong, Jaehoon Ko, Sangjun Ahn and Seungryong Kim

Very bad lip sync for audio, especially in Korean #30

Open justin4ai opened 4 months ago

justin4ai commented 4 months ago

Hello, thanks for providing such a great project.

The FPS of your work is amazing, but the rendered outputs seem to have poor lip sync quality. In particular, the lip sync for Korean audio is almost nonexistent.

https://github.com/KU-CVLAB/GaussianTalker/assets/63603383/5174ea92-1ecc-42c3-beac-558dbadba7dc

At least I don't think it's because of the length of the training video - it is over 4 minutes long, which should be enough for other person-dependent models.

Why is that? Or do you have any hypothesis about the problem?

Best, Junyeong Ahn

joungbinlee commented 4 months ago

Thank you for using our model!

https://github.com/KU-CVLAB/GaussianTalker/assets/87278950/ae1a880c-a849-40ad-b94a-438a7fde5be6

I extracted the audio from the video you uploaded and tried running it. It appears there may have been a mix-up with the command you used. To correctly perform the OOD process, please use the following command:

python render.py -s ${YOUR_DATASET_DIR}/${DATASET_NAME} --model_path ${YOUR_MODEL_DIR} --configs arguments/64_dim_1_transformer.py --iteration 10000 --batch 128 --custom_aud <custom_aud>.npy --custom_wav <custom_aud>.wav --skip_train --skip_test

Also, please place the .wav and .npy files in the following directory: ${YOUR_DATASET_DIR}/${DATASET_NAME}.
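
If it helps, here is a minimal sanity check you can run first. This is not part of the repo; dataset_dir and custom_aud below are hypothetical placeholders standing in for ${YOUR_DATASET_DIR}/${DATASET_NAME} and <custom_aud>:

    import os
    import sys

    # Hypothetical placeholders: substitute your actual dataset directory
    # (${YOUR_DATASET_DIR}/${DATASET_NAME}) and audio name (<custom_aud>).
    dataset_dir = os.path.join("data", "May")
    custom_aud = "my_korean_audio"

    # render.py is invoked with both <custom_aud>.npy and <custom_aud>.wav,
    # which should sit inside the dataset directory.
    missing = [name for name in (custom_aud + ".npy", custom_aud + ".wav")
               if not os.path.isfile(os.path.join(dataset_dir, name))]
    if missing:
        sys.exit("Missing in " + dataset_dir + ": " + ", ".join(missing))
    print("Both audio files found; safe to run render.py with "
          "--custom_aud %s.npy --custom_wav %s.wav" % (custom_aud, custom_aud))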

Could you please check if this resolves the issue?

Thank you! :)

justin4ai commented 4 months ago

The command line you suggested is exactly the same as the one I used, so I still have no clue about this situation. By the way, is your model trained on the same May video as mine, which is 4 minutes and 3 seconds long?

joungbinlee commented 4 months ago

Right. We used the May data, which is 4 minutes and 3 seconds long. We used the first 10/11 of the total data for training and the remaining 1/11 for testing. We extracted audio features using DeepSpeech and followed the code available on GitHub for training. Please check your data preprocessing and training process, and let us know if there are any issues.
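
For reference, here is a minimal sketch of that 10/11 : 1/11 frame split. The frame count assumes 25 fps and is purely illustrative; the repo's actual dataloader may slice frames differently:

    # Illustrative only: reproduces the 10/11 train / 1/11 test split
    # described above, assuming 4 min 3 s of video at 25 fps.
    num_frames = 243 * 25                    # 6075 frames in total
    split = num_frames * 10 // 11            # first 10/11 for training
    train_ids = range(split)                 # frames 0 .. 5521
    test_ids = range(split, num_frames)      # frames 5522 .. 6074
    print(len(train_ids), len(test_ids))     # -> 5522 553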

If you need any more adjustments or details, feel free to ask!

justin4ai commented 4 months ago

Okay, my training/inference process still looks identical to yours, but I'll give it another try.

Appreciate your kind and quick response! ◡̈