YuanxunLu / LiveSpeechPortraits

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)

Questions about training audio2feature model #27

Closed TimmmYang closed 2 years ago

TimmmYang commented 2 years ago

Hello, I am trying to reconstruct the training code and I have several questions:

  1. From what I saw in audio2feature_model.py, in the forward module the size of self.audio_feats is [b, 1, nfeats, nwins], while in audio2feature.py the dimension of audio_features is [b, T, ndim]. From my understanding (correct me if I am wrong), for batch_size=32, T=240*2, ndim=512 (the APC feature dimension), the input batch for the Audio2Feature model should be [32, 480, 512] (480 because mel_frame is n_frames * 2) and the output size should be [32, 240, 75]. Is that right?

  2. Furthermore, in Section 3.2 of your paper a delay d=18 is added during training, but this is not reflected in the code. How does that work in training? For example, is m0 inferred from h0, h1, ..., h18?

  3. In audiovisual_dataset.py, you seem to clip the audio into many pieces and extract APC features for each piece. How many clips would there be for a given dataset, e.g. a 4-minute 60 fps video?

These might be naive questions, as I am not very familiar with the audio processing field; please just correct me if I am mistaken. Thanks!

YuanxunLu commented 2 years ago
  1. You're right. Maybe the function comments misled you; the code has been iterated many times (changing the network structure, hyper-parameters, etc.), and I forgot to correct them. Please just follow the shapes in the running code (you can print them during inference).
  2. The 18-frame delay can be found in the function 'generate_sequences', controlled by the parameter "frame_future". Yes, the LSTM receives h0, h1, ..., hn and generates y0, y1, ..., yn. You can simply compare y17, y18, ..., yn with the corresponding ground truth (a small illustrative sketch of this delayed comparison follows at the end of this reply).
  3. It is not fixed. As in 1, the code was iterated many times. In an early version I did some experiments using several clips (some data are sentence by sentence, you know). If you have a consecutive audio clip with ground truth, there is no need to cut it.

Hope the above helps.
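To make the shapes and the delay concrete, here is a minimal PyTorch sketch (not the repository's actual code). It assumes the shapes discussed above ([B, 2T, 512] APC features in, [B, T, 75] mouth features out); the pairing of audio frames by concatenation, the network sizes, and the way the frame_future-style delay enters the loss are placeholder choices for illustration only.

```python
import torch
import torch.nn as nn

B, T, APC_DIM, MOUTH_DIM, FRAME_FUTURE = 32, 240, 512, 75, 18


class Audio2MouthSketch(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        # Assumed design: fuse the two audio frames per video frame by
        # concatenation, then run an LSTM over the T video-frame steps.
        self.lstm = nn.LSTM(2 * APC_DIM, hidden, num_layers=3, batch_first=True)
        self.fc = nn.Linear(hidden, MOUTH_DIM)

    def forward(self, audio_feats):                                    # [B, 2*T, 512]
        x = audio_feats.reshape(audio_feats.size(0), -1, 2 * APC_DIM)  # [B, T, 1024]
        h, _ = self.lstm(x)                                            # [B, T, hidden]
        return self.fc(h)                                              # [B, T, 75]


model = Audio2MouthSketch()
audio = torch.randn(B, 2 * T, APC_DIM)      # dummy APC features
target = torch.randn(B, T, MOUTH_DIM)       # dummy mouth ground truth
pred = model(audio)                         # [32, 240, 75]

# The delay: the prediction at step t is supervised by the ground truth from
# FRAME_FUTURE frames earlier, so the network may look ~18 frames ahead.
loss = nn.functional.mse_loss(pred[:, FRAME_FUTURE:], target[:, :T - FRAME_FUTURE])
print(pred.shape, loss.item())
```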

TimmmYang commented 2 years ago

Thank you! This helps me a lot.

TimmmYang commented 2 years ago

Hi, I still have a couple of questions about training the audio2feature model:

  1. How do you decide which landmarks are mouth-related? For the 68-point facial landmark layout, there are fewer than 20 mouth points, while in the paper you use 25 points. Are any other points included, such as eye or nose landmarks?

  2. I use 3DDFA to extract landmarks from the training videos, and the landmark values are just pixel locations. You take differences and normalize before feeding the input into the networks, right? For example, Delta v1 = v1 - v0, Delta v2 = v2 - v0, ..., where v0 is mean_pts3d.

  3. How do you determine mean_pts3d? Should I just choose a neutral-expression frame of the target person and take its 3D landmark points, or use the mean over the whole dataset? Also, for frame_jump_stride=4, is this the frame increment per item? For example, with batch size 32, the input tensor is [32, 240*2, 512], where T=240 covers frames 0-240 for item 1, 4-244 for item 2, ..., and 124-364 for item 32?

YuanxunLu commented 2 years ago
  1. I use a 73-landmark detector, which is different from the commonly used 68 landmarks. The semantic definition of the mouth shape contains only the mouth. For 68 points, using just the mouth landmarks is fine. Of course, you can add more points if you find they work better in your experiments.
  2. I did 3D face tracking on each video and extracted these landmarks in 3D object space. Yes, the network learns the delta positions instead of the absolute positions (see the sketch after this list).
  3. The mean_pts3d should be fixed for one target, I think. Either way of choosing the mean landmarks is fine (I tested both, and the results are similar). Frame jump is just an option to accelerate training; it decides how many frames, and how frequently, the network sees them in one epoch. I don't think this hyper-parameter affects performance much, but you can try it yourself.
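As a small illustration of the delta-landmark preprocessing mentioned in point 2, here is a minimal sketch; the array names and shapes are illustrative assumptions, not code from this repository.

```python
import numpy as np

def to_delta(pts3d_seq, mean_pts3d):
    """pts3d_seq: [N_frames, N_points, 3] landmarks in object space.
    mean_pts3d: [N_points, 3] mean (or neutral-expression) landmarks of the
    same target person. Returns per-frame displacements from the mean."""
    return pts3d_seq - mean_pts3d[None, :, :]

# toy usage
pts = np.random.rand(240, 25, 3)            # e.g. 240 frames of 25 mouth points
mean_pts3d = pts.mean(axis=0)               # one option: the dataset mean
deltas = to_delta(pts, mean_pts3d)          # the network learns these deltas
flat = deltas.reshape(deltas.shape[0], -1)  # [240, 75] as the training target
print(flat.shape)
```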
TimmmYang commented 2 years ago

Okay, thanks a lot!

TimmmYang commented 2 years ago

Hello, I am still confused about the coordinate change between 2D and 3D.

I now have a frame_dataset with 3D points and head poses, but the 3D points range from 0 to 512, which I believe are image coordinates. How should I process the current dataset to train a new model?

YuanxunLu commented 2 years ago

It depends on your camera model, which is a perspective (pinhole) camera model in our settings. I guess you are currently using a scaled orthographic model (also called weak perspective in some papers), since your 3D points range from 0 to 512. The camera intrinsics and scale parameters should change according to your camera model. If you are not familiar with camera-related knowledge, I recommend checking Sec. 4.1 of the paper "3D Morphable Face Models - Past, Present and Future" or any other 3D-face-related paper.
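To make the difference between the two camera models concrete, here is a minimal sketch of both projections; the intrinsics, pose, and scale values are made-up placeholders, not settings from this project.

```python
import numpy as np

def perspective_project(X, K, R, t):
    """Pinhole model: x ~ K (R X + t). X: [N, 3] points, K: [3, 3] intrinsics."""
    Xc = X @ R.T + t                 # world -> camera
    x = Xc @ K.T                     # camera -> image (homogeneous)
    return x[:, :2] / x[:, 2:3]      # perspective divide

def weak_perspective_project(X, s, R, t2d):
    """Scaled-orthographic model: x = s * (R X)[:, :2] + t2d."""
    return s * (X @ R.T)[:, :2] + t2d

# toy usage with placeholder values
X = np.random.rand(25, 3) + np.array([0.0, 0.0, 2.0])   # points in front of the camera
K = np.array([[1000.0, 0.0, 256.0],
              [0.0, 1000.0, 256.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
print(perspective_project(X, K, R, t)[:2])
print(weak_perspective_project(X, s=256.0, R=R, t2d=np.array([256.0, 256.0]))[:2])
```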

TimmmYang commented 2 years ago

I read the papers you recommended, and I understand that I should set up a camera to transform the 3D landmarks from world coordinates to camera coordinates, and then use the camera intrinsics to compute the 2D landmarks in image coordinates. But how do I place the camera in world coordinates? Could it be done automatically?

Also, in Section 4.1 of your paper you say that for camera calibration you use binary search to compute the focal length. Are there any open-source tools for this process? I read the reference paper (Cao 2013) and found no quick implementation.

In my case, all the 3D landmark values range from 0 to 512 because I got these points from a cropped video. I noticed that your camera's rotation and translation are all zero. Can I just place the camera at the center of the image (which might not be accurate)? For example, for a point (x, y, z), the transformed point would be ((x-256)/256, (y-256)/256, (z-256)/256), since it might be good to map all values to [-1, 1] for training. I know I should also apply the same process to the head pose and shoulder points, but I am not sure whether that works.

Thanks!

YuanxunLu commented 2 years ago

Check the tool you used (from its doc or paper) to find out which camera/projection model it uses. Once you have the camera model, you know how to project the detected 3D points onto 2D images (just follow the formula in the tool's paper or doc). Since your 3D landmarks range from 0 to 512, I guess you may be able to simply drop the z-coordinates.

Binary search is used to compute the perspective camera's focal length f. I don't know of any open-source tools for it. Again, if you use a scaled orthographic camera, this step is not necessary. (A rough sketch of such a search follows at the end of this reply.)

I think any transform over the landmarks for training is fine as long as it improves the experimental results.
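For reference, here is a rough sketch of how a binary search over the focal length could be structured; fit_pose_and_project is a hypothetical callback for your own pose solver (for example a wrapper around cv2.solvePnP plus reprojection), and the principal point of 256 and the search range are placeholder values, not the paper's settings.

```python
import numpy as np

def reprojection_error(f, pts3d, pts2d, fit_pose_and_project):
    """Build intrinsics from focal length f, fit the pose, and measure the
    mean 2D reprojection error against the detected landmarks."""
    K = np.array([[f, 0.0, 256.0],
                  [0.0, f, 256.0],
                  [0.0, 0.0, 1.0]])
    projected = fit_pose_and_project(pts3d, pts2d, K)   # returns [N, 2] points
    return float(np.mean(np.linalg.norm(projected - pts2d, axis=1)))

def search_focal(pts3d, pts2d, fit_pose_and_project,
                 lo=100.0, hi=10000.0, iters=30, eps=1.0):
    """Binary search over f, assuming the error is roughly unimodal in f:
    at each step, move towards the side on which the error is decreasing."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        err_mid = reprojection_error(mid, pts3d, pts2d, fit_pose_and_project)
        err_up = reprojection_error(mid + eps, pts3d, pts2d, fit_pose_and_project)
        if err_up < err_mid:
            lo = mid          # error still decreasing -> best f lies above mid
        else:
            hi = mid          # error increasing -> best f lies below mid
    return 0.5 * (lo + hi)
```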

TimmmYang commented 2 years ago

Hi, did you use the 18-frame latency both during training and inference, or did you train the model with zero latency and only apply the delay during inference?

I tried training the audio2feature model these days, and the problem is that the validation loss is much higher than the training loss. I just use the default 80%/20% train/val split, but the validation loss is nearly 20 times larger than the training loss, as you can see from the attached screenshot of the loss curves.

YuanxunLu commented 2 years ago

Adding an n-frame latency is an effective scheme for generating better mouth shapes; it should be used both in training and testing.

I don't know your experiment settings (so it is hard to compare with mine; my validation loss is not as large as yours), but a higher validation loss is common. If you use only a few minutes of audio data, the model tends to overfit the training data. That is not a bad thing, because you actually want to learn the distribution of the training data; the key problem lies in learning the mapping from input audio into that training distribution. More training data is better but hard to acquire. The SynthesizeObama (SIGGRAPH 2017) paper did such an ablation study on training-corpus size: with more data, the validation loss gets closer to the training loss.

Back to the plot: it is clear that when the training loss drops, the validation loss also drops, and this tendency is good. Usually you can choose the model with the best validation loss for your testing.

You can try to decrease the validation loss by using more data, different training targets, or other methods.
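As a tiny illustration of keeping the checkpoint with the best validation loss (not code from this repository; train_one_epoch and validate are dummy stand-ins for your own training loop):

```python
import random
import torch
import torch.nn as nn

model = nn.Linear(512, 75)                 # stand-in for the audio2feature net

def train_one_epoch(m):
    pass                                   # real training step would go here

def validate(m):
    return random.random()                 # real validation loss would go here

best_val = float("inf")
for epoch in range(10):
    train_one_epoch(model)
    val_loss = validate(model)
    if val_loss < best_val:                # keep the best checkpoint so far
        best_val = val_loss
        torch.save(model.state_dict(), "best_audio2feature.pth")
        print(f"epoch {epoch}: new best validation loss {val_loss:.4f}")
```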

TimmmYang commented 2 years ago

Thanks for your reply! I use similar experiment settings (network structure, optimizer, learning rate, frame jump, n-frame latency, etc.) to yours, except that I use 20*3 values as the mouth shape (from a 68-point face alignment model), and since I didn't see any camera information from that model, I simply normalized the keypoint values by (x - 256)/256 so that the data ranges from -1 to 1 (it might also just be a simple coordinate change). I am not sure whether that is appropriate.

Maybe I should do some tests on my trained model to see how it performs.

foocker commented 2 years ago


Was the final result good?