JimWest / MeFaMo

MIT License
480 stars · 103 forks

rotate face to neutral pose first #3

Open Neleac opened 2 years ago

Neleac commented 2 years ago

I noticed that the results vary with different face rotations / head tilts, since the values used are tuned to a neutral, upright head pose. I think you should first rotate the landmarks into a neutral pose before doing the calculations, so that the results are rotation invariant. Are there plans to add this feature?

Neleac commented 2 years ago

Actually a simpler solution than rotating the landmarks would be to project the points onto a plane defined by some local axes.
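For illustration, a minimal numpy sketch of that idea (not the project's actual code): build a head-local frame from a few reference landmarks and express all points in it, so downstream measurements become roughly rotation invariant. The landmark indices below are placeholders.

```python
import numpy as np

# Placeholder MediaPipe face-mesh indices; the real ones depend on the mesh topology.
LEFT_EYE_OUTER = 33
RIGHT_EYE_OUTER = 263
NOSE_TIP = 1

def to_local_frame(landmarks: np.ndarray) -> np.ndarray:
    """Rotate (N, 3) landmarks into a head-local frame so that the
    blendshape calculations become (roughly) rotation invariant."""
    # x-axis: from the left to the right outer eye corner
    x_axis = landmarks[RIGHT_EYE_OUTER] - landmarks[LEFT_EYE_OUTER]
    x_axis /= np.linalg.norm(x_axis)

    # rough "down" direction: from the eye midpoint towards the nose tip
    eye_mid = 0.5 * (landmarks[LEFT_EYE_OUTER] + landmarks[RIGHT_EYE_OUTER])
    down = landmarks[NOSE_TIP] - eye_mid

    # z-axis: perpendicular to the face plane
    z_axis = np.cross(x_axis, down)
    z_axis /= np.linalg.norm(z_axis)

    # y-axis completes the right-handed frame
    y_axis = np.cross(z_axis, x_axis)

    R = np.stack([x_axis, y_axis, z_axis])  # rows are the local axes
    centered = landmarks - eye_mid
    return centered @ R.T                   # landmarks expressed in the local frame
```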

JimWest commented 2 years ago

I actually get those points in a better way already, but haven't had the time to implement it properly yet. If you activate the --show_3d parameter and look at the image (the 3D points projected onto 2D), that's pretty much as stable and normalized as you can get with the current MediaPipe model.

qhanson commented 2 years ago

Currently, the code uses metric landmarks or normalized landmarks (image pixel space) to calculate the blendshape values. I tried both methods and the results look awful.

However, both ways ignore face identity: different people have differently shaped faces. I even tried a rigid transformation to map my metric landmarks onto the canonical face provided by MediaPipe, but even the neutral faces in the transformed (canonical) space look different. Do you have any suggestions? I am also working on a data-driven blendshape solver (deep learning, by collecting enough MetaHuman faces and their blendshape values).
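For reference, a minimal numpy sketch of the kind of rigid (Kabsch-style) alignment described above, not the exact procedure used in this thread; `canonical_landmarks` stands in for the canonical face mesh shipped with MediaPipe.

```python
import numpy as np

def rigid_align(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Rigidly align (N, 3) source landmarks to (N, 3) target landmarks
    (Kabsch algorithm: rotation + translation, no scaling)."""
    src_mean = source.mean(axis=0)
    tgt_mean = target.mean(axis=0)
    src_c = source - src_mean
    tgt_c = target - tgt_mean

    # optimal rotation via SVD of the covariance matrix
    H = src_c.T @ tgt_c
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])            # guard against reflections
    R = Vt.T @ D @ U.T

    return (R @ src_c.T).T + tgt_mean     # source expressed in the canonical frame

# usage sketch: aligned = rigid_align(metric_landmarks, canonical_landmarks)
```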

xuguozhi commented 2 years ago

> Currently, the code uses metric landmarks or normalized landmarks (image pixel space) to calculate the blendshape values. […] Do you have any suggestions?

A deep-learning-based approach seems OK, but it requires a lot of paired data for training.

qhanson commented 2 years ago

Yes. It needs massive amounts of paired data, such as hundreds of faces. Luckily, MetaHuman is realistic enough to substitute for collecting real human faces. I am working on writing a MetaHuman project that receives blendshape values and saves the results as images.

xuguozhi commented 2 years ago

> Yes. It needs massive amounts of paired data, such as hundreds of faces. […] I am working on writing a MetaHuman project that receives blendshape values and saves the results as images.

I am no longer at NetEase, but the image-blendshape pairs from MetaHuman can be acquired easily if you are familiar with UE.

qhanson commented 2 years ago

Some updates: Datasets: send sets of 52 blendshape values to a MetaHuman and capture the corresponding MetaHuman face. Personally, I obtained 30k images covering 40 expressions of 59 MetaHumans.

Method: training a neural network that maps the synthesized MetaHuman faces to the 52 blendshape values.

Result: the neural network converged well on the synthesized dataset, and testing on the synthesized dataset worked well. However, it does not generalize to real human faces.

Neleac commented 2 years ago

@qhanson I suggest training the model to directly use MediaPipe landmarks to predict blendshape values. To generate the ground-truth blendshape values for the dataset, you'll have to use something like LiveLinkFace, mentioned in the README. This MediaPipe -> blendshape model is the missing piece for replacing LiveLinkFace.
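For illustration only, a rough sketch of how such training pairs might be assembled, assuming per-frame MediaPipe landmarks saved as a NumPy array and a LiveLinkFace recording exported as a CSV with one blendshape per column; the actual CSV layout, column names, and frame-alignment strategy may differ.

```python
import csv
import numpy as np

def load_pairs(landmarks_npy: str, livelink_csv: str):
    """Pair per-frame MediaPipe landmarks with per-frame blendshape values.

    Assumes landmarks were saved as an (n_frames, 468, 3) array and the
    LiveLinkFace recording was exported as a CSV with one blendshape per
    column; "Timecode" / "BlendshapeCount" are assumed metadata columns.
    """
    landmarks = np.load(landmarks_npy)
    with open(livelink_csv, newline="") as f:
        rows = list(csv.DictReader(f))

    n = min(len(landmarks), len(rows))   # naive frame-index alignment
    bs_names = [k for k in rows[0] if k not in ("Timecode", "BlendshapeCount")]
    targets = np.array([[float(r[k]) for k in bs_names] for r in rows[:n]])
    return landmarks[:n], targets
```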

iPsych commented 2 years ago

@Neleac @qhanson It seems I am facing the same problem. I am looking for a better solution to go from an already recorded video to MetaHuman-applicable blendshape output. Currently wrestling with the MediaPipe attention mesh.

qhanson commented 2 years ago

> I suggest training the model to directly use MediaPipe landmarks to predict blendshape values. […] This MediaPipe -> blendshape model is the missing piece for replacing LiveLinkFace.

In my experiment, directly learning the mapping (468*3 -> 52) with a 4-layer MLP does not work well. With an L1 loss, the output stays the same; with an L2 loss, the mouth can open and close, while the eyes stay open all the time. This reminds me of the mesh classification problem: passing the rendered mesh or the point cloud of the 468 landmarks might work, but that way we cannot exploit MediaPipe's pretrained weights. I also do not know the minimum number of paired image-to-blendshape samples needed. Note: I have not tested this approach.
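For context, roughly what such a baseline looks like as a PyTorch sketch (not qhanson's actual code); the layer widths and the sigmoid output are assumptions, the latter chosen because ARKit blendshape values lie in [0, 1].

```python
import torch
import torch.nn as nn

class Landmarks2Blendshapes(nn.Module):
    """4-layer MLP mapping 468 3D landmarks to 52 ARKit blendshape values."""
    def __init__(self, n_landmarks: int = 468, n_blendshapes: int = 52):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_landmarks * 3, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_blendshapes),
            nn.Sigmoid(),  # blendshape values are in [0, 1]
        )

    def forward(self, landmarks: torch.Tensor) -> torch.Tensor:
        # landmarks: (batch, 468, 3), flattened before the MLP
        return self.net(landmarks.flatten(start_dim=1))

model = Landmarks2Blendshapes()
loss_fn = nn.L1Loss()  # or nn.MSELoss(); qhanson reports problems with both
```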

JimWest commented 2 years ago

I would try to use a smaller input; you don't need all 468 keypoints. I would start with the ones I'm using in my config file and slowly add more (by looking at the ones that really matter for facial motion). With that you will need far less training data (and training time).
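A small sketch of that suggestion, with a placeholder index list; the real subset would come from MeFaMo's config file.

```python
import numpy as np

# Placeholder subset; the real indices would come from MeFaMo's config
# (eye, brow and mouth landmarks that actually drive the blendshapes).
SELECTED_LANDMARKS = [33, 133, 263, 362, 61, 291, 13, 14, 70, 300]

def select_features(landmarks: np.ndarray) -> np.ndarray:
    """Reduce a (468, 3) landmark array to the few keypoints that matter,
    flattened into a single feature vector for the network."""
    return landmarks[SELECTED_LANDMARKS].reshape(-1)
```

The network input then shrinks from 468*3 to len(SELECTED_LANDMARKS)*3, which should need correspondingly less training data.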

zk2ly commented 2 years ago

> Some updates: Datasets: send sets of 52 blendshape values to a MetaHuman and capture the corresponding MetaHuman face. Personally, I obtained 30k images covering 40 expressions of 59 MetaHumans. Method: train a neural network that maps the synthesized MetaHuman faces to the 52 blendshape values. Result: the network converged well on the synthesized dataset, but it does not generalize to real human faces.

Can you share your data? I want to use it to train a mediapipe2blendshape network; if it works well, I will share the network with you.

qhanson commented 2 years ago

> Can you share your data? I want to use it to train a mediapipe2blendshape network; if it works well, I will share the network with you.

For simple experiments, you do not need these datasets to train a model. You can try https://github.com/yeemachine/kalidokit

sylyt62 commented 1 year ago

> Some updates: Datasets: send sets of 52 blendshape values to a MetaHuman and capture the corresponding MetaHuman face. […] Result: the network converged well on the synthesized dataset but does not generalize to real human faces.

What loss function did you use to train this network?

There's another morphable head model named FLAME, which offers a tool to generate a 3D mesh from its 100 expression parameters (something like blendshapes). With this tool we could build loss functions by mapping the mesh back into image space (3D -> 2D) and comparing the projected landmarks of the face.

But it seems that ARKit lacks this kind of tool to do the mapping. If you use a plain L1 loss or something similar, it will only focus on the similarity of the numbers, not the similarity of the actual expressions. I guess that's why your model is not generalizing well.
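To make the distinction concrete, a hedged PyTorch sketch of such a geometry-based loss, assuming a hypothetical linear blendshape basis and a simple orthographic projection (neither comes from ARKit or this repo); it compares the faces the coefficients produce rather than the coefficients themselves.

```python
import torch

def landmark_loss(pred_bs, gt_bs, neutral, basis, proj):
    """Compare expressions in (projected) landmark space instead of
    coefficient space.

    pred_bs, gt_bs : (batch, 52) blendshape weights
    neutral        : (V, 3) neutral face vertices (hypothetical)
    basis          : (52, V, 3) linear blendshape deltas (hypothetical)
    proj           : (3, 2) simple orthographic projection matrix
    """
    # reconstruct 3D geometry from the coefficients
    pred_verts = neutral + torch.einsum('bk,kvc->bvc', pred_bs, basis)
    gt_verts = neutral + torch.einsum('bk,kvc->bvc', gt_bs, basis)

    # project to 2D and compare where the expression is actually visible
    pred_2d = pred_verts @ proj
    gt_2d = gt_verts @ proj
    return torch.mean(torch.abs(pred_2d - gt_2d))
```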