h0, h1, ..., hn and generate y0, y1, ..., yn. You can simply compare y17, y18, ..., yn with the corresponding ground truth. Hope the above helps.
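In pseudo-code, one way to realize this comparison (a sketch with hypothetical tensors, not the repository's training loop) is:

```python
import torch
import torch.nn.functional as F

# A sketch of comparing delayed predictions with ground truth (d = 18).
# `pred` and `target` are hypothetical [batch, T, n_coords] tensors; the first d
# predictions are discarded and the rest are aligned with the ground-truth frames.
d = 18
pred = torch.randn(32, 240, 75)    # y_0 ... y_{T-1} from the network
target = torch.randn(32, 240, 75)  # ground-truth mouth shapes m_0 ... m_{T-1}

loss = F.l1_loss(pred[:, d:], target[:, :pred.shape[1] - d])
```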
Thank you! This helps me a lot.
Hi, there are still a couple of questions about training the audio2feature model:
How do you decide the mouth-related landmarks? I found that in the 68-point facial landmark scheme there are fewer than 20 mouth points, while in the paper you use 25 points. Are any other points included, like eye or nose landmarks?
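For context on the numbers above: in the standard 68-point (iBUG) annotation the mouth occupies indices 48-67, i.e. 20 points (outer lip 48-59, inner lip 60-67). A small sketch of slicing them from a hypothetical landmark array:

```python
import numpy as np

# Hypothetical per-frame 3D landmarks in the 68-point iBUG layout.
landmarks_68 = np.zeros((68, 3), dtype=np.float32)

# Mouth points in that scheme: indices 48-67 (20 points in total).
mouth_pts = landmarks_68[48:68]
print(mouth_pts.shape)  # (20, 3)
```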
I use 3DDFA to extract landmarks for the training videos, and the landmark values are just pixel locations. You take differences and normalize before feeding the input into the network, right? For example, Δv1 = v1 - v0, Δv2 = v2 - v0, ..., where v0 is the mean_pts3d.
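In code, the differencing I mean would look roughly like this (a sketch with hypothetical arrays, not the repository's preprocessing):

```python
import numpy as np

# Hypothetical per-frame landmarks: (n_frames, n_points, 3), in pixel units.
pts3d = np.random.rand(1000, 25, 3).astype(np.float32) * 512

# v0: either the mean over the whole sequence or a chosen neutral-expression frame.
mean_pts3d = pts3d.mean(axis=0)

# Δv_t = v_t - v0, used as the training target after any further normalization.
delta = pts3d - mean_pts3d
```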
How should mean_pts3d be determined? Should I just choose a neutral-expression frame of the target person and take its 3D landmark points, or use the mean over the whole dataset? Also, for frame_jump_stride=4, is this the frame increment for each item? For example, with batch size 32 the input tensor is [32, 240*2, 512], where T=240, so item 1 covers frames 0-240, item 2 covers 4-244, ..., and item 32 covers frames 124-364?
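In other words, my interpretation of the stride is something like the following (purely illustrative, not taken from the repository):

```python
# Hypothetical indexing: item i starts at i * frame_jump_stride and spans T frames.
frame_jump_stride = 4
T = 240

def window_range(item_idx):
    start = item_idx * frame_jump_stride
    return start, start + T

print(window_range(0))   # (0, 240)
print(window_range(1))   # (4, 244)
print(window_range(31))  # (124, 364)
```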
Okay, thanks a lot!
Hello, I am still confused about the coordinate change between 2D and 3D. In project_landmarks you use camera_intrinsic and scale to convert 3d_pts to 2D landmarks. But how are camera_intrinsic and scale determined if I train on another dataset? Right now I have a frame dataset with 3D points and head poses, but the 3D points range from 0 to 512, which I believe are image coordinates. How should I process the current dataset to train a new model?
It depends on your camera model, which is a perspective (pinhole) camera model in our settings. I guess you currently use a scaled orthographic model (also called weak perspective in some papers), since your 3D points range from 0 to 512. The camera intrinsic and scale params should change with your camera model. If you are not familiar with camera-related knowledge, I recommend you check Sec. 4.1 of the paper "3D Morphable Face Models - Past, Present and Future" or any other 3D-face-related paper.
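To make the distinction concrete, here is a rough sketch of the two projection models with generic symbols (fx, fy, cx, cy for the pinhole intrinsics, s for the weak-perspective scale); this is not code from the repository:

```python
import numpy as np

def project_perspective(pts3d, fx, fy, cx, cy):
    """Pinhole model: u = fx * X / Z + cx, v = fy * Y / Z + cy."""
    X, Y, Z = pts3d[:, 0], pts3d[:, 1], pts3d[:, 2]
    return np.stack([fx * X / Z + cx, fy * Y / Z + cy], axis=1)

def project_weak_perspective(pts3d, s, cx, cy):
    """Scaled orthographic model: depth is dropped, only a global scale s is applied."""
    return s * pts3d[:, :2] + np.array([cx, cy])
```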
I read the papers you recommended, and I understand that I should set up a camera to transform the 3D landmarks from world coordinates to camera coordinates, then use the camera intrinsics to compute 2D landmarks in image coordinates. But how do I place the camera in world coordinates? Could it be done automatically?
Also, in Section 4.1 of your paper you said that for camera calibration you use binary search to compute the focal length. Are there any open-source tools for this process? I read the reference paper (Cao 2013) and found no quick implementations.
In my case, all the 3D landmark values range from 0 to 512 because I got these points from a cropped video. I noticed that your camera's rotation and translation are all 0. Can I just place the camera in the middle of the image (which might not be accurate)? That is, for a point (x, y, z), the transformed point would be ((x-256)/256, (y-256)/256, (z-256)/256), since it might be good to transform all values to [-1, 1] for training. I know I should also consider the head pose and shoulder points and apply the same process, but I am not sure if it works.
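Concretely, the normalization I have in mind is something like this (just a sketch, assuming a 512x512 crop; whether z should be divided the same way depends on how the landmark tool defines its z axis):

```python
import numpy as np

def normalize_landmarks(pts, half_size=256.0):
    """Map pixel-range coordinates [0, 512] to roughly [-1, 1]."""
    return (pts - half_size) / half_size

pts = np.random.uniform(0, 512, size=(25, 3))   # hypothetical landmarks in pixels
pts_norm = normalize_landmarks(pts)             # values roughly in [-1, 1]
```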
Thanks!
Check the tool you used (from its doc or its paper) to find out which camera model/projection model is used. Once you have the camera model, you know how to project the detected 3D points to 2D images (just follow the formula from the tool's paper/doc). I guess you may just drop the z-coords, since your 3D landmarks range from 0 to 512.
Binary search is used to compute the perspective camera's focal length f. I don't know of any open-source tools. Again, if you use a scaled orthographic camera, this step is not necessary.
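If it helps, the idea can be sketched as a simple search over f that minimizes the 2D reprojection error; the sketch below uses a ternary search over an assumed unimodal error, which is the same spirit as the binary search in the paper but not our actual implementation:

```python
import numpy as np

def reprojection_error(f, pts3d, pts2d, cx, cy):
    """Mean pixel error of a pinhole projection with focal length f."""
    proj = np.stack([f * pts3d[:, 0] / pts3d[:, 2] + cx,
                     f * pts3d[:, 1] / pts3d[:, 2] + cy], axis=1)
    return np.mean(np.linalg.norm(proj - pts2d, axis=1))

def search_focal(pts3d, pts2d, cx, cy, lo=100.0, hi=10000.0, iters=60):
    """Narrow down f by comparing the error at two interior points each step."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if reprojection_error(m1, pts3d, pts2d, cx, cy) < reprojection_error(m2, pts3d, pts2d, cx, cy):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)
```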
I think any transform over the landmarks for training is fine as long as it improves the experimental results.
Hi, did you use the 18-frame latency both during training and inference, or did you train the model with 0 latency and only apply this delay during inference?
I have been training the audio2feature model these days, and the problem is that the validation loss is much higher than the training loss. I just use the default 80%/20% train/val split, but the validation loss is nearly 20 times larger than the training loss, as you can see from the plot:
Adding an n-frame latency is an effective scheme to generate better mouth shapes; it should be used both in training and testing.
I don't know your experiment setting (so it is hard to compare with mine; my validation loss is not as large as yours), but a higher validation loss is common. If you use only a few minutes of audio data, the model tends to overfit the training data. That is not a bad thing, because you actually want to learn the distribution of the training data. The key problem lies in how to learn the mapping from the input audio to the training targets. More training data is better, but it is hard to acquire; the SynthesizeObama-SIG17 paper did such an ablation study on training corpus size: with more data, the validation loss is closer to the training loss.
Back to the plot: it is clear that when the training loss drops, the validation loss also drops, and this tendency is good. Usually, you can choose the model with the best validation loss to use for your testing.
You can try to decrease the validation loss by using more data, different training targets, or any other methods.
Thanks for your reply! I use similar experiment settings to yours (network structure, optimizer, learning rate, frame jump, n-frame latency, etc.), except that I use 20*3 values as the mouth shape (from a 68-point face alignment model). Because I didn't see any camera information from that model, I simply normalized the keypoint values by (x - 256)/256 so that the data ranges from -1 to 1 (it might also just be a simple coordinate change). Not sure if it is appropriate or not.
Maybe I should run some tests on my trained model to see how it performs.
Is the last result good?
Hello, I am trying to reconstruct the training code and there are several questions I have:
1. From what I saw in audio2feature_model.py, in the forward module the size of self.audio_feats is [b, 1, nfeats, nwins], while in audio2feature.py the dimension of audio_features is [b, T, ndim]. From my understanding (correct me if I am wrong), for batch_size=32, T=240*2, ndim=512 (the APC feature dimension), the input batch for the Audio2Feature model should be [32, 480, 512] (480 because mel_frame is n_frames * 2) and the output size is [32, 240, 75]. Is that right? (See also the shape sketch after this list.)
2. Furthermore, from Section 3.2 of your paper, a delay d=18 is added during training but is not reflected in the code. How does that work in training? For example, is m0 inferred from h0, h1, ..., h18?
3. In audiovisual_dataset.py, you seem to clip the audio into many pieces and extract an APC feature for each clip. What is the number of clips for a given dataset, e.g. a 4-minute 60 fps video?
There might be some stupid questions here, as I am not very familiar with the audio processing field; just correct me if I made mistakes, thanks!