YuanxunLu / LiveSpeechPortraits

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)
MIT License

about face tracking #49

Open mostafa610 opened 2 years ago

mostafa610 commented 2 years ago

First of all, thank you so much for your marvelous work. Second, regarding face tracking: why do you use it? Why don't you just extract the landmarks from every frame with a landmark detector? Thanks in advance.

mostafa610 commented 2 years ago

Also, how was the file mean_pts3d.npy created? Is it the mean of these points across the dataset?

mostafa610 commented 2 years ago

It is the average of the landmarks over all the video frames of the target person.

YuanxunLu commented 2 years ago
  1. Using 3D landmarks obtained by face tracking has several advantages over directly using detected 2D landmarks. It disentangles the camera parameters, head pose, and facial movements, which allows explicit control over each of them, something 2D landmarks can't offer. Besides, it is much easier for networks to learn normalized facial movements (in 3D space) than entangled landmarks, which produces more accurate results.
  2. 'mean_pts3d.npy' is the mean of the 3D landmarks of the target person over the training set. The network learns displacements from this mean instead of absolute locations (see the sketch below).
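
For intuition, a minimal sketch of how such a mean-landmark file and the displacement targets could be built (the file name mean_pts3d.npy is from this thread; the input file name and array shapes are assumptions):

```python
import numpy as np

# Hypothetical input: per-frame 3D landmarks from face tracking,
# shape (num_frames, num_landmarks, 3).
pts3d = np.load('tracked_pts3d.npy')

# mean_pts3d.npy: average landmark positions over all training frames.
mean_pts3d = pts3d.mean(axis=0)              # (num_landmarks, 3)
np.save('mean_pts3d.npy', mean_pts3d)

# Training targets: per-frame displacements from the mean,
# not absolute landmark positions.
displacements = pts3d - mean_pts3d[None]     # (num_frames, num_landmarks, 3)
```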

Hope the above helps.

mostafa610 commented 2 years ago

thank you so much for your reply it really helped

I have another question about the sequence length. In the code you define the sequence length to be 240:

parser.add_argument('--sequence_length', type=int, default=240, help='length of training frames in each iteration')

But in the paper, 240 reads like the batch size ("T = 240 represents the number of consecutive frames sent to the model at each iteration", and the number of samples in an iteration is the batch size). Does that mean it is the batch size?

If 240 is the sequence length, then the audio features sent during training would be [b, T, ndim] = [32 (batch size), 240 (seq_length), 512 (ndim)].

Or do you mean that you send batches where each sample in a batch is one sequence of length 240?

thanks in advance

YuanxunLu commented 2 years ago

The latter, of course. An LSTM is a kind of RNN, and it takes sequential data as input. 240 frames equal 4 seconds at the 60 FPS setting.

batch_size means how many sequences (each 240 frames of data) are sent in one forward pass.
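
To make the shapes concrete, here is a minimal sketch of the tensor layout described above (the values 32, 240, and 512 come from this thread; the LSTM hyperparameters are placeholders, not the actual model's):

```python
import torch
import torch.nn as nn

# batch_size sequences, each seq_len consecutive frames of ndim-dim audio features.
batch_size, seq_len, ndim = 32, 240, 512
audio_feats = torch.randn(batch_size, seq_len, ndim)

# Placeholder LSTM; the hidden size is illustrative only.
lstm = nn.LSTM(input_size=ndim, hidden_size=256, batch_first=True)
out, _ = lstm(audio_feats)
print(out.shape)  # torch.Size([32, 240, 256]): one output per input frame
```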

mostafa610 commented 2 years ago

Thank you so much, I don't know how to thank you, you really helped me!

mostafa610 commented 2 years ago

I have another question

Regarding the training: I understand that every sequence of 240 frames (4 s) outputs a vector of size (25, 3), and this vector represents the displacement between the landmarks of the last frame and the mean landmark positions. Is that right?

If so, do you then walk through the data with a sliding window, i.e., frame 0 to frame 240, frame 1 to frame 241, ..., frame 39 to frame 279, and that would be the first batch, for example? Is that right?

And here:

A2Lsamples = self.audio_features[file_index][current_frame * 2 : (current_frame + self.seq_len) * 2]

I don't get why the * 2. Thanks in advance.

YuanxunLu commented 2 years ago

First, an LSTM takes sequential data as input and its output is also sequential, so T input frames produce T output frames. Please check the definition of LSTM networks carefully. During training we use 4 seconds as the sequence length, while at test time there is no length limitation.

Secondly, the audio2mouth network learns the displacements.

Thirdly, the frame * 2 is because the APC feature frame interval is half of 1/60 s, i.e., the audio features come at twice the 60 FPS video rate, so each video frame corresponds to two audio-feature frames. Please check the paper for details.
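
To illustrate the indexing above, a minimal sketch (the names current_frame, seq_len, and A2Lsamples come from the quoted dataset code; the stand-in arrays and their shapes are assumptions):

```python
import numpy as np

# Stand-ins for the real dataset arrays (shapes are assumptions):
num_frames = 1000                                      # video frames at 60 FPS
audio_features = np.random.randn(num_frames * 2, 512)  # APC features at 2x the video rate
displacements = np.random.randn(num_frames, 25, 3)     # per-frame landmark targets

seq_len = 240  # 4 s of video at 60 FPS

for current_frame in range(num_frames - seq_len + 1):
    # 240 video frames map to 2 * 240 = 480 audio-feature frames,
    # hence the factor of 2 in the audio indices.
    A2Lsamples = audio_features[current_frame * 2 : (current_frame + seq_len) * 2]
    # Matching per-frame displacement targets for the same window.
    targets = displacements[current_frame : current_frame + seq_len]
```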

mostafa610 commented 2 years ago

Thank you for your replies. I am wondering, do you know of any open-source algorithm for face tracking? I can't find one that produces the same output as your paper. Thanks in advance.

YuanxunLu commented 2 years ago

Any parametric monocular face reconstruction method would be an alternative, like FaceScape, DECA, 3DDFA_v2, etc.
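
For context, a generic sketch of what any such parametric method provides (this is the standard 3DMM-style decomposition, illustrative only, not the API of any library named above): each frame is explained by expression coefficients plus a rigid head pose, which is exactly the disentanglement discussed earlier in this thread.

```python
import numpy as np

def reconstruct_landmarks(shape_mean, exp_basis, w_exp, R, t):
    """Generic 3DMM-style forward model (illustrative only).

    shape_mean: (N, 3) neutral landmark positions
    exp_basis:  (N, 3, K) expression blendshape basis
    w_exp:      (K,) expression coefficients (facial movement)
    R, t:       rigid head pose (3x3 rotation, length-3 translation)
    """
    verts = shape_mean + exp_basis @ w_exp  # non-rigid facial movement
    return verts @ R.T + t                  # rigid head pose applied last

# Tiny usage example with random placeholders (73 landmarks is a placeholder count):
N, K = 73, 10
pts = reconstruct_landmarks(
    shape_mean=np.zeros((N, 3)),
    exp_basis=np.random.randn(N, 3, K),
    w_exp=np.random.randn(K),
    R=np.eye(3),
    t=np.zeros(3),
)
print(pts.shape)  # (73, 3)
```

Fitting such a model to each video frame is what yields the disentangled 3D landmarks (and, averaged over frames, a file like mean_pts3d.npy) discussed above.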

yuxx0218 commented 1 year ago

> Any parametric monocular face reconstruction method would be an alternative, like FaceScape, DECA, 3DDFA_v2, etc.

Which method did you use? Could you please upload the code?