emilianavt / OpenSeeFace

Robust realtime face and facial landmark tracking on CPU with Unity integration
BSD 2-Clause "Simplified" License

Training codes #1

Open ghost opened 4 years ago

ghost commented 4 years ago

Great work. Is the code in model.py used for training the ONNX inference models? Any chance of releasing the training code?

emilianavt commented 4 years ago

Thank you. The code in model.py was used for training, but I still need to update it for the current version of the models. Once it is updated, I will update this issue.

I do not plan to release the full training code at this point.

sasanasadiabadi commented 4 years ago

Are all the models (face detection, landmarking and gaze detection) based on MobileNetV3?

emilianavt commented 4 years ago

Yes. They are all (except for the optional, pretrained RetinaFace model) essentially heatmap regression with a MobileNetV3 backbone. The gaze tracking model works just like the landmark one, only for a single landmark.

Face detection is a bit special. That model outputs a heatmap, a radius map and a max-pooled version of the heatmap that is used for decoding the output.

Because the landmarking is quite robust with respect to face size and orientation, the face detection model can get away with outputting only very rough bounding boxes.
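
To make the decoding idea concrete, here is a rough sketch rather than the actual tracker code; the map layout, names and threshold are made up:

```python
import numpy as np

def decode_detections(heatmap, maxpooled, radii, threshold=0.6):
    # Illustrative decoder: a cell is a peak if max pooling left it
    # unchanged, i.e. it is a local maximum of the heatmap.
    peaks = (heatmap == maxpooled) & (heatmap > threshold)
    boxes = []
    for y, x in zip(*np.nonzero(peaks)):
        r = radii[y, x]
        # Very rough box centered on the peak, sized by the radius map.
        boxes.append((x - r, y - r, 2 * r, 2 * r, heatmap[y, x]))
    return boxes
```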

sasanasadiabadi commented 4 years ago

Great, thanks. I was wondering if it's possible to share the PyTorch pre-trained weights as well. I'm trying to run the code in OpenCV's dnn module instead of onnxruntime. The current ONNX model doesn't seem to be compatible with the dnn module.

emilianavt commented 4 years ago

I have now updated the model definitions in model.py to match the currently used models.

I have also uploaded the PyTorch weights here. My previous attempts at getting the models to work with OpenCV's dnn module weren't successful, but if you manage to get them running, I would be very interested in hearing about it!

sasanasadiabadi commented 4 years ago

Thanks for sharing the files. I was able to convert the models and run them in OpenCV's dnn module. The FPS seems to be quite similar between onnxruntime and the dnn module. I will update you when the inference code is complete.

emilianavt commented 4 years ago

Thank you for the update!

I'm curious to know which format you converted the models to for use with the dnn module.

sasanasadiabadi commented 4 years ago

I converted the PyTorch weights to ONNX, and using cv2.dnn.readNetFromONNX() with OpenCV 4.3 I could run inference (no other change to your original code). However, the outputs of the dnn module and onnxruntime are slightly different for the same preprocessed input.
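
For reference, the dnn side looked roughly like this (file names and input size are placeholders):

```python
import cv2

# Rough sketch of running the converted ONNX model through OpenCV's dnn
# module; the file names and input size are placeholders.
net = cv2.dnn.readNetFromONNX("lm_model_converted.onnx")
img = cv2.imread("face_crop.jpg")
blob = cv2.dnn.blobFromImage(img, scalefactor=1.0 / 255.0,
                             size=(224, 224), swapRB=True)
net.setInput(blob)
out = net.forward()  # same output layout as the onnxruntime session
```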

I have uploaded the converted weights here.

emilianavt commented 4 years ago

Thank you, I will give it a try using your converted models.

My first guess about the difference is that it might have something to do with the Upsample layers. The way I use them is apparently only fully supported with ONNX opset 11, which many inference engines do not seem to support yet.

sasanasadiabadi commented 4 years ago

Yeah, the problem was caused by align_corners=True in the nn.Upsample layer. The dnn module somehow doesn't support it yet, so it needs to be set to False for inference with the dnn module. I will try to find a fix.

sasanasadiabadi commented 4 years ago

Finally solved. Thanks for pointing out the problem. With align_corners=True, converting from PyTorch with ONNX opset 11 and rebuilding OpenCV master (4.3.0-dev), the dnn module returns predictions similar to onnxruntime's. I will re-upload the weights.
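
For anyone hitting the same issue, the export step was roughly as follows; the tiny module below only stands in for the real network from model.py:

```python
import torch
import torch.nn as nn

# Minimal stand-in for the real landmark network: the point is only that
# bilinear Upsample with align_corners=True needs ONNX opset 11.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True),
)
model.eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "upsample_opset11.onnx", opset_version=11)
```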

emilianavt commented 4 years ago

Nice, thank you for the updates! I tried using the models you previously posted with the dnn module, but got an error. I assume those already needed a more recent version than 4.2.0.

emilianavt commented 4 years ago

With the current OpenCV master branch, I also succeeded in loading the opset 11 models (prior to optimization using onnxruntime). For the full landmark model, I get a pure inference time of around 13 ms using onnxruntime with full optimization enabled and around 20 ms with OpenCV's DNN module. The results are practically identical.
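
For reference, "full optimization" here means setting up the onnxruntime session roughly like this (the model file name and input shape are placeholders):

```python
import numpy as np
import onnxruntime

# Sketch of an onnxruntime session with graph optimization fully enabled;
# model file name and input shape are placeholders.
opts = onnxruntime.SessionOptions()
opts.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
session = onnxruntime.InferenceSession("lm_model.onnx", sess_options=opts)
dummy = np.zeros((1, 3, 224, 224), dtype=np.float32)
outputs = session.run(None, {session.get_inputs()[0].name: dummy})
```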

sasanasadiabadi commented 4 years ago

Great! While the landmarking looks very robust to various kinds of occlusion and illumination, I think the pupil detection can be improved. As you are not planning to release the training code, could you share some references on the data preparation for the pupil model? You mentioned that the landmarking and pupil networks are basically the same, but it seems the data pre-processing is quite different.

emilianavt commented 4 years ago

For pupil detection, the biggest challenge was finding training data with accurate annotations and variance in pose. Most datasets I looked at had a significant number of annotations that were noticeably off. MPIIGaze was the best I could find, but it still had many issues. That's why I ended up training on essentially only synthetic data generated with UnityEyes, but that has its own issues.

Another challenge was keeping the gaze model fast, so it could be run in addition to the face landmark model without significantly impacting the frame rate for avatar animation. This led me to select a very small model that is run at a low resolution.

To compensate for that, I forwent training the model in a way that lets it adapt to different poses, and instead align the eyes in a consistent way. This led to another issue: the eye corner points from the landmark model may not match the corner points (if any are given) in the gaze dataset, and most gaze datasets do not include full face images, so it is not possible to run the face landmark model first to align the eyes consistently. In the end, I calculated eye corner points and pupil centers from the JSON generated by UnityEyes and aligned the eyes with those. The pupil center was then used as a single landmark to be detected by the model. When I was working on this part, I wasn't aware of skimage.transform.SimilarityTransform, so I did things manually, but I would most likely change this if I were to rework the gaze tracking.
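
With SimilarityTransform, the alignment step could look roughly like the sketch below; the target positions and crop size are made up, not the values used in the tracker:

```python
import numpy as np
from skimage.transform import SimilarityTransform, warp

def align_eye(image, corners, size=32):
    # Map the two eye corner points onto fixed positions in a small
    # square crop; the target coordinates below are illustrative only.
    src = np.asarray(corners, dtype=np.float64)            # [[x1, y1], [x2, y2]]
    dst = np.array([[4.0, size / 2.0], [size - 4.0, size / 2.0]])
    tform = SimilarityTransform()
    tform.estimate(src, dst)
    # warp() takes the inverse mapping (output -> input coordinates).
    return warp(image, tform.inverse, output_shape=(size, size))
```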

In the end, this alignment didn't quite match the one produced when aligning by the eye corner points, so there is some number fudging in the tracker to get better results.

During training, I also augmented the training data with rather strong blur, noise and color shifts to make up for the synthetic nature of the data. In addition, I overlaid random bright rectangles to imitate reflections on glasses.
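
In the same spirit, an augmentation step could look roughly like the sketch below; all probabilities and ranges are made up rather than the values from the actual training run:

```python
import cv2
import numpy as np

def augment_eye(img, rng=np.random):
    # Illustrative augmentation: blur, noise, colour shift and a random
    # bright rectangle standing in for a glasses reflection.
    out = img.astype(np.float32)
    if rng.rand() < 0.5:
        k = int(rng.choice([3, 5, 7]))
        out = cv2.GaussianBlur(out, (k, k), 0)
    out += rng.normal(0.0, rng.uniform(0.0, 12.0), out.shape)
    out *= rng.uniform(0.8, 1.2, size=(1, 1, 3))
    if rng.rand() < 0.3:
        h, w = out.shape[:2]
        x, y = rng.randint(0, w - 8), rng.randint(0, h - 8)
        out[y:y + 8, x:x + 8] += rng.uniform(80.0, 160.0)
    return np.clip(out, 0, 255).astype(np.uint8)
```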

While working on it, I posted some intermediate results on Twitter. The white dot is the model's prediction, the black dot is the target. The big picture is the red channel with the black and white dots overlaid. On the side, the first column shows the predicted landmark map, the two predicted offset layers and the adaptive wing loss mask. The second, rightmost column shows the ground truth landmark map, the adaptive wing loss mask (repeated) and the two ground truth offset layers.

Overall, considering the speed of the model, I think it's working decently well, but any improvement would be welcome of course! You can find the UnityEyes preprocessing script here.

sasanasadiabadi commented 4 years ago

No doubt about its decent performance. I just found poor performance in some challenging cases, such as extreme glasses reflections and sunny outdoor environments, which is mainly due to the limited training set and may not be a concern for your project. A first improvement could be a post-processing stabilization scheme on the pupils to reduce their jittery behavior when glasses are present, with no change to the model. After I finish this part, I can update you on whether stabilization makes any improvement.

And thanks for the detailed explanations. I set up the training and could get comparable results.

emilianavt commented 4 years ago

It's good to hear that you could get comparable results. Another thing I thought of, but haven't tried yet, is to train a bigger, slower model which would hopefully give more reliable results and use that to annotate a more diverse training set to train another smaller model.

About stabilizing the pupils: I do a lot of filtering and stabilizing in the code I use to actually animate avatars.

sasanasadiabadi commented 4 years ago

Actually I tried an HG (hourglass) network with 2 stacks to train the pupil detector on the UnityEyes set (with lots of augmentation), but it didn't improve much on my test set. I'm now training with 4 stacks.

Oh, I wasn't aware of that stabilization part. Thanks.

emilianavt commented 4 years ago

That's very interesting! I'm curious to hear about your further results.

emilianavt commented 4 years ago

Since I already posted the previous PyTorch weights, here are the weights for the new 56x56 30-point model.

sasanasadiabadi commented 4 years ago

Thank you for sharing the newly trained weights. Interestingly, in OpenCV DNN the inference time of the new model is higher than that of the previous lightest model (6.5 ms vs 5.5 ms). In onnxruntime, however, the inference time was reduced from 5.5 ms to 1.7 ms. I'm trying to figure out why the DNN module is behaving like that!

emilianavt commented 4 years ago

That's an interesting difference. The new model is pretty much the full size model going by layer and channel count, but the resolution of the channels is lower. Maybe that has something to do with OpenCV DNN behaving differently.

emilianavt commented 4 years ago

One note regarding the inference=True code in model.py: some users reported that the landmarks were noisier with it than with the Python landmark decoding function here:

https://github.com/emilianavt/OpenSeeFace/blob/e805aa24e35afa3af627bab5f4ede8b9ae4149dd/tracker.py#L718-L747
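
That function boils down to roughly the following simplified sketch; the real code differs in details such as offset scaling and confidence handling, so treat this only as an illustration:

```python
import numpy as np

def decode_landmarks(output, res=28, crop=224):
    # Simplified restatement of the idea: per-landmark heatmap argmax,
    # refined by the X/Y offset maps, then scaled back to crop pixels.
    n = output.shape[0] // 3
    heat, off_x, off_y = output[:n], output[n:2 * n], output[2 * n:]
    points = np.zeros((n, 3), dtype=np.float32)
    for i in range(n):
        y, x = np.unravel_index(np.argmax(heat[i]), heat[i].shape)
        points[i, 0] = (x + off_x[i, y, x]) * crop / res
        points[i, 1] = (y + off_y[i, y, x]) * crop / res
        points[i, 2] = heat[i, y, x]          # confidence
    return points
```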

Guocode commented 4 years ago

Your landmark model is very robust in most cases, like large poses and exaggerated expressions. I trained my own model on 300W-LP, but it often fails to detect. Can you share how you processed the data, e.g. data augmentation or training tricks?

emilianavt commented 4 years ago

I merged multiple datasets, partially reannotating them with FAN and older versions of the same model for some features, fixed some eye point annotations in various ways and filtered out samples where different annotations didn't agree within some threshold. I also used very strong augmentation with noise, blur, downscaling, rectangle overlays, strong rotation and random margins at the sides of faces. You can look at the sample images in the results part of the readme to see what the training data looks like.
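
The agreement filter was conceptually something like this (purely illustrative; the normalization and threshold are assumptions):

```python
import numpy as np

def annotations_agree(lms_a, lms_b, face_size, threshold=0.05):
    # Keep a sample only if two annotation sources agree: mean landmark
    # distance, normalized by face size, must stay below a threshold.
    err = np.linalg.norm(np.asarray(lms_a) - np.asarray(lms_b), axis=1)
    return float(np.mean(err)) / face_size < threshold
```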

Guocode commented 4 years ago

What do you think about regression-based versus heatmap-based methods? I use a regression-based method and added strong data augmentation as you mentioned, but when the face box is not good, the results get very bad. I tried your heatmap-based model, and even if the box is very strange, e.g. much larger than the actual face, the results are still very stable. Does the robustness come from the model structure or from something else? In addition, I found that the mouth points in 300W and WFLW are very different, so I gave up on merging WFLW into the training data. In the readme you did merge them, so how do you deal with the gap?

emilianavt commented 4 years ago

I can imagine that heatmap-based methods lend themselves more to robustness, but I can't give a theoretical reason why. In this case, I think it is a combination of model structure and augmentation.

I don't remember the mouth points causing me issues, as they at least have the same number of points. I deleted the center eye points in WFLW, but that changes the shape of the eye. You can do two-step training: first train on a bigger dataset and, once it has converged, train some more on an adjusted WFLW to take advantage of its higher quality annotations.

GitZinger commented 2 years ago
  1. Are the face detection and landmark networks independent?
  2. For the landmark model with inference==False, what is the output? The shape is something like (?, 198, 28, 28); how do you train on it? If inference==True, the network outputs come with confidences, but I still have no clue how to train. During training, are the labels just 66 pairs of x, y?
  3. Is there any PyTorch inference code for using the PyTorch network to interpret the landmarks, especially the lips? Thanks a lot.
emilianavt commented 2 years ago
  1. Yes.
  2. Heatmaps, X offset maps and Y offset maps for each landmark, with the map types grouped together. During training, the labels were turned into these maps. Setting inference to True just bakes the landmark decoding into the model itself.
  3. Please refer to the landmarks function shown in my previous comment here on how to turn the maps into landmark locations. It's not PyTorch, though. If you mean interpreting the landmarks in the form of blendshapes, please refer to OpenSeeVRMDriver.cs in the examples folder.
GitZinger commented 2 years ago

Thank you for explaining to me.

  1. How do I convert the ground truth coordinate labels to heatmaps during training for the loss function? I have no idea how to train right now. The PyTorch dataset/dataloader gives coordinates; is there a customized dataset or a conversion from coordinates to maps?
    Or is your AdapWingLoss able to take the landmark network's heatmap output and the 66 ground truth coordinate points to calculate the loss? There are also many magic numbers in AdapWingLoss; is there any documentation explaining them? What if I want to reduce the number of landmarks? Which part can I change?

  2. And how should the heatmaps be interpreted? Do they cover the whole face from forehead to jaw, or could they be anything, so that it doesn't matter whether the whole face is covered?

I really appreciate it. @emilianavt

emilianavt commented 2 years ago

Please refer to the landmarks function shown in my previous comment in this thread on how to calculate landmarks from the heat and offset maps. This will also help you understand what your training data should look like. Visualizing the maps will help too.

is there a customized dataset or a conversion coordinates to map?

You can create maps from coordinates, but the dataset I used to train is very customized.
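
As a rough illustration of turning coordinates into maps (one Gaussian heatmap plus X/Y offset maps per landmark; sigma and the exact offset encoding are assumptions, not the original training setup):

```python
import numpy as np

def coords_to_maps(landmarks, res=28, crop=224, sigma=1.5):
    # Illustrative label generation: for each landmark, a Gaussian peak
    # at its map-space position plus dense X/Y offset maps encoding the
    # sub-pixel correction.
    n = len(landmarks)
    ys, xs = np.mgrid[0:res, 0:res].astype(np.float32)
    heat = np.zeros((n, res, res), dtype=np.float32)
    off_x = np.zeros_like(heat)
    off_y = np.zeros_like(heat)
    for i, (px, py) in enumerate(landmarks):
        gx, gy = px * res / crop, py * res / crop
        heat[i] = np.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2 * sigma ** 2))
        off_x[i] = gx - xs
        off_y[i] = gy - ys
    return np.concatenate([heat, off_x, off_y], axis=0)
```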

or is your AdapWingLoss able to take the landmark network heat map results and the ground true 66 coordinate points to calculate the loss?

No.

and there are many magic number in the AdapWingLoss, is there any documentation to explain it?

There isn't. The numbers are mainly for weighting different landmarks.

what if I want to reduce the landmark to a lower amount? which part I can change?

Please carefully review the code to understand how everything works. It's not a completely trivial change.

emilianavt commented 5 months ago

The geffnet commit I'm on is: c450c12ae6ffb1757f62dde3c2765da3c10f6def