YuanxunLu / LiveSpeechPortraits

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)
MIT License

Process video with fps not 60 #59

Open Nuyoah13 opened 2 years ago

Nuyoah13 commented 2 years ago

Hi, thanks for your excellent work. I have read your paper and downloaded your training videos, and I found that these videos have frame rates of 25, 30 fps, etc., while the model in the paper is designed for a 60 fps setting. So I wonder how you handle videos that are not 60 fps: do you convert them to 60 fps in the data preprocessing procedure? If I use another fps such as 25, which parameters should I change in the design? Thanks.

YuanxunLu commented 2 years ago

I extracted the video frames at 60 FPS; you can do this simply using FFmpeg.

If you change the fps setting, you should consider changing the design of the audio feature extractor and the audio2mouth mapping network.
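
For reference, a minimal sketch of resampling a clip to 60 FPS with FFmpeg before preprocessing (the file names are hypothetical placeholders; this is not the authors' exact pipeline):

```python
# Minimal sketch: resample a source clip to 60 FPS with FFmpeg.
# File names are hypothetical placeholders, not from the repo.
import subprocess

def resample_to_60fps(src="input.mp4", dst="input_60fps.mp4", fps=60):
    # The fps filter duplicates/drops frames to reach the target frame rate.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", f"fps={fps}", dst],
        check=True,
    )

resample_to_60fps()
```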

torphix commented 2 years ago

Hi, with regard to changing the video FPS: would the APC audio feature extractor need to be retrained for the new frame rate, since when extracting the mel spectrogram the frames are split into chunks that depend on the FPS? Also, when computing the mel spectrogram, I noticed that parameters such as winlen and winstep are winlen=1/60 and winstep=0.5/60, respectively. Are the denominators supposed to match the FPS, or are they coincidentally the same?

I.e., if I change the video FPS and thereby the frame size, which other parameters would also need to be changed to reflect this: hop_length=int(16000/120), win_length=int(16000/60), winlen=1/60, winstep=0.5/60? Perhaps these must be changed?

Thank you kindly

YuanxunLu commented 2 years ago

Whether you need to retrain the audio feature extraction network depends on how you use it. The proposed setting was designed for my experiments. Of course, you can use it in a different way; just make sure it fits your setting.

BTW, the APC features are one kind of deep speech feature, and you can try newer & better deep features.
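
For illustration, here is a small sketch of keeping the window parameters consistent with the video fps, assuming 16 kHz audio and the convention implied by the values quoted above (window = one video frame of audio, hop = half a frame); this is an assumption about how those numbers were derived, not a confirmed derivation:

```python
# Sketch: deriving fps-consistent window parameters, assuming 16 kHz audio and
# the convention implied above (window = one video frame, hop = half a frame).
SAMPLE_RATE = 16000

def window_params(fps):
    winlen = 1.0 / fps                          # window length in seconds
    winstep = 0.5 / fps                         # hop in seconds
    win_length = int(SAMPLE_RATE / fps)         # window length in samples
    hop_length = int(SAMPLE_RATE / (2 * fps))   # hop in samples
    return winlen, winstep, win_length, hop_length

print(window_params(60))  # matches the values quoted above: ..., 266, 133
print(window_params(25))  # (0.04, 0.02, 640, 320) for a 25 fps video
```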

Nuyoah13 commented 2 years ago

Sorry to bother you again; I still have a question about how to acquire the internal parameters of the cameras. Can intrinsic parameters like the focal length be set randomly? Also, could you open-source the implementation of computing the focal length 𝑓 in Sec. 4.1 of your paper? That would help with reimplementing the paper.

YuanxunLu commented 2 years ago

The focal length, of course, should not be set randomly; it should be consistent with your tracked 3D face as well as your crop and scaling parameters.

You don't need to use exactly the same camera as mine; the camera model is part of the 3D tracking algorithm, and you can try other camera models as well.
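
As a generic illustration (not the Sec. 4.1 computation from the paper), a pinhole projection shows why the focal length has to be consistent with the tracked 3D face and the image crop/scale:

```python
# Generic pinhole projection sketch; not the paper's focal length estimation.
import numpy as np

def project(points_3d, f, cx, cy):
    """points_3d: (N, 3) in camera space -> (N, 2) pixel coordinates."""
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([f * x / z + cx, f * y / z + cy], axis=1)

# If the images are cropped and rescaled by a factor s, the focal length and
# principal point must be scaled by s as well; otherwise the tracked 3D face
# no longer reprojects onto the training frames.
```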

Nuyoah13 commented 2 years ago

Thanks for the focal length explanation. I trained the audio2feature model, but the result is weird. I use a 5-minute video of a specific person, obtain the APC_feat_database from the pre-trained APC model (e.g., (36000, 512)), and obtain the corresponding landmarks of each frame as target displacements (18000, 68, 3). Then I train the audio2feature model by selecting random consecutive clips of length 240, i.e., input (bs, 480, 512) fed into the LSTM network, and compute the MSE loss against the target displacements (bs, 240, 20*3). Is this pipeline right? I trained the model and found that the eval loss does not decrease, and I wonder what the reason is. So I want to confirm some details with you:

1. The target displacements are calculated by subtracting the mean landmark and normalizing to [-1, 1], so the target displacements are small, around 1e-3, with a small range?
2. You said in a previous issue that the LSTM receives h0, h1, ..., hn and generates y0, y1, ..., yn, and that the loss can simply compare y17, y18, ..., yn with the corresponding ground truth. How does frame_future=18 work in the network?
3. I do not split the videos as you do; I just randomly select a frame index and its 240 following frames. Is there any problem with this setting?

I'm confused about the training result and hope you could explain these details for me; maybe that will help me find the reason. Thanks.

YuanxunLu commented 2 years ago

Training the audio2feature model is not hard, I believe, as long as you set up the input data and ground truth correctly. Your description seems alright. The input landmarks should lie in a normalized space (they don't need to be in [-1, 1], but they should follow the same definition). Make sure the 3D landmarks are head-pose disentangled, otherwise it won't work.

Applying frame future is a useful way to improve performance; just train it the same way inference works. Whether you cut the video doesn't matter, as long as each training clip is continuous.

If you did all of the above correctly, check the generation results on the training audio; maybe you will find the problem.
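
To make the frame_future handling concrete, here is a minimal training-step sketch following the shapes discussed above (concatenating two audio feature frames per video frame and shifting the loss by frame_future); this is an assumption-laden illustration, and the released code may organize it differently:

```python
# Sketch of an audio2feature-style training step, following the shapes in this
# thread (APC features -> mouth landmark displacements). The frame_future
# indexing mirrors the explanation quoted above; the released code may differ.
import torch
import torch.nn as nn

class Audio2Feature(nn.Module):
    def __init__(self, apc_dim=512, hidden=256, n_points=20, frame_future=18):
        super().__init__()
        self.frame_future = frame_future
        self.lstm = nn.LSTM(apc_dim * 2, hidden, num_layers=3, batch_first=True)
        self.head = nn.Linear(hidden, n_points * 3)

    def forward(self, apc_feats):
        # apc_feats: (bs, 2*T, 512) -- two audio feature frames per video
        # frame, concatenated so each LSTM step sees one video frame's audio.
        bs, n, c = apc_feats.shape
        x = apc_feats.reshape(bs, n // 2, 2 * c)
        h, _ = self.lstm(x)          # (bs, T, hidden)
        return self.head(h)          # (bs, T, n_points * 3)

model = Audio2Feature()
apc = torch.randn(4, 480, 512)       # 240 video frames of audio features
target = torch.randn(4, 240, 60)     # flattened mouth displacements
pred = model(apc)
ff = model.frame_future
# Drop the first frame_future predictions and align the rest with the earlier
# groundtruth frames, so each prediction can use frame_future frames of
# "future" audio, as described in the thread.
loss = nn.functional.mse_loss(pred[:, ff:], target[:, :target.shape[1] - ff])
loss.backward()
```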

torphix commented 2 years ago

Hi, may I ask about the disentanglement: should the mouth landmarks be extracted, aligned to be front-facing, and standardised by aligning the mouth edge landmarks to the mean? That way only the lip movements need to be learned by the model, and the mouth position can then be projected to match the head pose in a later step?

Thank you kindly


YuanxunLu commented 2 years ago

You need to remove the head-pose influence from the 3D landmarks; your disentangled training landmarks should look as if only the mouth moves while everything else is motionless and the head pose is fixed. Otherwise, your audio2feature model will try to learn both mouth movements and head poses, which is bad, and it will fail.
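
For example, here is a rough sketch of removing the head pose given a per-frame rigid pose from a face tracker; the actual tracker and parameterization used by the authors may differ:

```python
# Sketch: removing head pose from tracked 3D landmarks so only non-rigid
# (mouth) motion remains. Assumes the tracker gives a per-frame rotation
# R (3x3) and translation t (3,) mapping canonical -> camera space.
import numpy as np

def disentangle_head_pose(landmarks, R, t):
    """landmarks: (68, 3) in camera space -> (68, 3) in the canonical frame."""
    return (landmarks - t) @ R       # equivalent to R.T @ (p - t) per point

def to_displacements(canonical_seq):
    """canonical_seq: (n_frames, 68, 3) -> displacements from the mean shape,
    which serve as the audio2feature training targets."""
    mean_shape = canonical_seq.mean(axis=0, keepdims=True)   # (1, 68, 3)
    return canonical_seq - mean_shape
```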