cvlab-kaist / GaussianTalker

Official implementation of “GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting” by Kyusun Cho, Joungbin Lee, Heeji Yoon, Yeobin Hong, Jaehoon Ko, Sangjun Ahn and Seungryong Kim

some flickering artifacts in the neck region #18

Open anujsinha72094 opened 6 months ago

anujsinha72094 commented 6 months ago

This is a very good repo, but I am facing an issue. The Gaussian rasterizer takes the torso + background image as input, and that torso comes from the training set. So when performing inference with new audio, there are artifacts in the neck region during rendering, because the torso + background image is from the training set. What could be a solution for this?

ankit-gahlawat-007 commented 6 months ago

Hi Anuj,

By "inference with new audio" do you mean inferencing directly with audio_novel.wav which is kept separate during dataset preparation or inferencing with absolutely different audio? If it's the latter case and you don't mind, could you tell what code changes did you do for inferencing with new audio?

Thanks

anujsinha72094 commented 6 months ago

@ankit-gahlawat-007 I am running inference with completely new audio. I added audio_novel.wav, audio_train.wav, and the aud.npy generated from the new audio, placed them in the folder the model was trained on, and then ran python render.py -s data/__ --model_path __ --configs arguments/64_dim_1_transformer.py --iteration 10000 --batch 1. But there is flickering in the neck region. Could you suggest a possible solution for this?
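One quick sanity check before rendering: the swapped-in features must match what the model was trained on. A minimal sketch, assuming aud.npy holds per-frame audio features with shape [num_frames, feature_dim] (as in ER-NeRF-style preprocessing); the path is illustrative:

```python
import numpy as np

# feature_dim must equal the training features' dimensionality;
# num_frames determines how many video frames the new audio should drive.
feats = np.load("data/obama/aud.npy")  # illustrative path
print("audio feature array shape:", feats.shape)
```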

nikhilchh commented 6 months ago

@anujsinha72094 Did you try with the Obama video? Did the flickering happen only when you used new audio?

ankit-gahlawat-007 commented 6 months ago

@anujsinha72094 @nikhilchh I tested for Obama and May with new audio.

In the train rendering, I get good lip sync for new audio with no artifacts in the neck.

But in the test rendering, the mouth does not move at all (and there are no artifacts either). I guess it's an issue with the code rather than the model, since it produced good outputs for the training set anyway.

https://github.com/KU-CVLAB/GaussianTalker/assets/139246379/2e6f73cc-cad2-401b-9ca8-1a0b89edcb35

https://github.com/KU-CVLAB/GaussianTalker/assets/139246379/d81e02b3-2aea-489d-9a0b-19a502cf1a87

nikhilchh commented 6 months ago

@anujsinha72094

Can you please tell me how to get/create the following files:

track_params.pt, transforms_train.json, transforms_val.json

When I try to train the model, I get an error:

[Errno 2] No such file or directory: 'GaussianTalker/data/obama/transforms_train.json'

ankit-gahlawat-007 commented 6 months ago

@nikhilchh

Did you run process.py? The functions there generate these files.

nikhilchh commented 6 months ago

So I found the issue: process.py fails at the face_tracking step due to torch.cuda.OutOfMemoryError.

I have a 10 GB GPU.

How can I make it work? Is there any way to reduce its memory consumption?

ankit-gahlawat-007 commented 6 months ago

I am not sure, but you could try reducing the batch_size.

In data_utils/face_tracking/face_tracker.py, the batch_size is set to 32. You can try 16 or even less.
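The idea is simply to bound peak GPU memory by fitting fewer frames at a time. A minimal sketch of the principle, with illustrative names rather than face_tracker.py's actual code:

```python
import torch

def process_in_chunks(frames: torch.Tensor, model, batch_size: int = 8):
    """Run `model` over `frames` in small batches to cap peak GPU memory."""
    outputs = []
    for start in range(0, frames.shape[0], batch_size):
        chunk = frames[start:start + batch_size].cuda()
        with torch.no_grad():
            outputs.append(model(chunk).cpu())  # move results off-GPU immediately
        del chunk
        torch.cuda.empty_cache()  # release cached blocks between chunks
    return torch.cat(outputs, dim=0)
```

Smaller batches trade speed for memory, so tracking will run slower but should fit on a 10 GB card.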

nikhilchh commented 6 months ago

Thanks for the suggestion. That worked for me.

nikhilchh commented 6 months ago

I added all the audio files from a new video into the obama folder.

Ran the render code, and it led to the same problem as @anujsinha72094.

PS: I cannot run train rendering as it crashes due to RAM; my 32 GB of RAM does not seem to be enough.

Observations:

1. The duration of the rendered video is longer than the audio. It somehow still assumes a 30-second video should be generated (the length of the original Obama novel audio).
2. The mouth doesn't move at all. (I only tried test rendering.)

https://github.com/KU-CVLAB/GaussianTalker/assets/26286970/6ad39625-e2a8-40e9-845b-f89845eaf104
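For observation 1, the rendered frame count can be checked against the new audio's length. A hedged sketch, assuming 25 fps video (the rate ER-NeRF-style preprocessing typically uses) and the soundfile package:

```python
import soundfile as sf

FPS = 25  # assumed video frame rate from preprocessing

audio, sr = sf.read("data/obama/aud_novel.wav")  # illustrative path
duration_s = len(audio) / sr
expected_frames = round(duration_s * FPS)
print(f"{duration_s:.2f} s of audio -> expect ~{expected_frames} frames")
# If render.py emits noticeably more frames than this, it is likely iterating
# over the original test split's poses instead of the new audio's length.
```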

joungbinlee commented 6 months ago

Thank you very much for using our work. Since we generate faces from audio and use the GT background and torso, there can be slight mismatches in the jaw and neck areas when generating with custom audio. To resolve this, we believe that generating the torso using the 2D neural field method from the original ER-NeRF approach, and then generating our face on top of it with the GT background, will easily address the issue. Additionally, we have added a command to easily run novel (custom) audio! We would appreciate it if you could check it out. Thanks! :)
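For intuition, the layered generation described here amounts to standard back-to-front alpha compositing: background, then torso, then the rendered face. A minimal sketch with illustrative tensor names, not the repo's actual pipeline:

```python
import torch

def composite(face_rgb, face_alpha, torso_rgb, torso_alpha, background):
    """Over-composite background <- torso <- face.

    face_rgb/torso_rgb/background: (3, H, W) in [0, 1]; *_alpha: (1, H, W).
    """
    out = torso_rgb * torso_alpha + background * (1.0 - torso_alpha)
    out = face_rgb * face_alpha + out * (1.0 - face_alpha)
    return out.clamp(0.0, 1.0)
```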

joungbinlee commented 5 months ago

Recently, it has become possible to generate the torso as well using Gaussian splatting, in the same space as the face. As a result, flickering issues at the neck region have decreased significantly for OOD audio!

https://github.com/KU-CVLAB/GaussianTalker/assets/87278950/634284cd-679e-42be-bbe1-9ba6bc2935cf

NghiaLeMartec commented 4 months ago

Hi @joungbinlee, do you have plans to release the code for this?