If I use YouTube-Faces, do I need to adjust the face bounding boxes somehow, or just use the labels as-is? You mentioned earlier that you filter out low-resolution videos: which resolution did you consider low?
@pokidyshev For the first comment, we use fixed cropping because that face dominates the video, so no face detector is needed to provide the bounding box. For your own videos, you can use any pre-trained face detector, including cv2/dlib. However, this raises a problem: your detector might differ from the one used on the labeled images, which may cause some discrepancy because your training data was not processed by the same detector.
For the second comment, you can directly use the face bounding boxes provided by YouTube-Faces. A face smaller than 256×256 is considered low-resolution.
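In case it helps later readers, here is a minimal sketch (not from this repo) of getting a per-frame face box with dlib's frontal face detector via OpenCV; the detector choice, helper name, and frame file name are assumptions on my part, and the detector-mismatch caveat above still applies:

```python
import cv2
import dlib

# HOG-based frontal face detector shipped with dlib (an assumed choice; any detector works).
detector = dlib.get_frontal_face_detector()

def detect_face_box(frame_bgr):
    """Return (x1, y1, x2, y2) of the largest detected face, or None if nothing is found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    rects = detector(gray, 1)  # upsample once to help with small faces
    if len(rects) == 0:
        return None
    rect = max(rects, key=lambda r: r.width() * r.height())
    return rect.left(), rect.top(), rect.right(), rect.bottom()

# Hypothetical usage on a frame extracted from your video.
frame = cv2.imread("frame_000001.jpg")
print(detect_face_box(frame))
```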
@D-X-Y Thanks for the reply! What should the loss value be after training SBR for 50 epochs on 300W+300VW+YouTube-Faces?
Since this work was done almost two years ago, I cannot remember the exact loss value. Sorry about that. A simple way to monitor whether your training works well is to check whether the detection loss on the labeled data decreases smoothly (or stays around a similar value).
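As a hedged illustration of that check (not code from the repo; the per-epoch values and the tolerance below are hypothetical), you could collect the average detection loss per epoch from your training log and verify it never jumps up sharply:

```python
# Hypothetical per-epoch detection losses on the labeled data, parsed from your log.
det_losses = [0.42, 0.31, 0.27, 0.26, 0.25]

def looks_stable(losses, tolerance=0.05):
    """Return True if the loss never increases by more than `tolerance` (relative)
    between consecutive epochs, i.e. it decreases smoothly or stays flat."""
    return all(cur <= prev * (1.0 + tolerance) for prev, cur in zip(losses, losses[1:]))

print("detection loss looks stable:", looks_stable(det_losses))
```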
Ok, thanks so much!
How long did it take you to train 50 epochs? And which hardware did you use?
UPD: found in another issue: "I remember it takes several days to train on 4 Titan V GPUs"
Yes, on NVIDIA Tesla V100 32GB GPUs.
I started training on YouTube-Faces and stumbled upon a strange thing: the NME rises after the 1st epoch and then drops continuously. But as of the 4th epoch, it is still not as good as it was before SBR. Is that OK?
Here is a part of my log:
Compute NME and AUC for 689 images with 68 points :: [(NME): mean=3.884, std=2.461], auc@0.07=48.442, auc@0.08=54.057, acc@0.07=91.872, acc@0.08=94.630
Compute NME and AUC for 689 images with 68 points :: [(NME): mean=4.356, std=2.930], auc@0.07=43.590, auc@0.08=49.573, acc@0.07=89.550, acc@0.08=93.324
Compute NME and AUC for 689 images with 68 points :: [(NME): mean=4.110, std=2.746], auc@0.07=46.166, auc@0.08=51.947, acc@0.07=91.001, acc@0.08=93.469
Compute NME and AUC for 689 images with 68 points :: [(NME): mean=4.059, std=2.928], auc@0.07=47.113, auc@0.08=52.759, acc@0.07=90.566, acc@0.08=93.904
Compute NME and AUC for 689 images with 68 points :: [(NME): mean=4.038, std=2.862], auc@0.07=47.243, auc@0.08=52.995, acc@0.07=91.872, acc@0.08=94.485
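For context on what these numbers measure: NME is the mean per-landmark error normalized by a reference distance. Below is a minimal sketch of the metric for a single face; normalizing by the inter-ocular distance (outer eye corners of the 68-point markup) is a common choice for 300-W style evaluation, but the exact normalizer used by the repo may differ:

```python
import numpy as np

def compute_nme(pred, gt, left_eye_idx=36, right_eye_idx=45):
    """Normalized Mean Error for one face.

    pred, gt: (68, 2) arrays of predicted / ground-truth landmark coordinates.
    Normalization by the inter-ocular distance is an assumption here.
    """
    per_point = np.linalg.norm(pred - gt, axis=1)                      # (68,)
    inter_ocular = np.linalg.norm(gt[left_eye_idx] - gt[right_eye_idx])
    return per_point.mean() / inter_ocular

# Hypothetical usage with random coordinates, just to show the expected shapes.
gt = np.random.rand(68, 2) * 256
pred = gt + np.random.randn(68, 2)
print(f"NME = {compute_nme(pred, gt):.4f}")
```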
Yes, "the NME raises at the first several epochs" is normal. As long as it will finally drop to a lower value than the initial NME, it will be fine. The SBR requires much more epochs than 4, in my memory, I trained it for more than 100 epochs.
Ok, thanks so much!
Hi, @D-X-Y
Thanks for sharing your awesome results.
I was playing a bit with your code and found out that you use a hardcoded bounding box for the SBR stage:
gap, x1, y1, x2, y2 = 5, 5, 5, 450, 680
So, basically, you are using the whole image, with 5 px of padding cropped from the sides.
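For illustration only, here is a sketch of what that fixed box selects from a frame, assuming the frame is loaded with OpenCV (the file names are hypothetical; the repo itself consumes the box through its own data pipeline):

```python
import cv2

# Values copied from the hardcoded box above; `gap` is kept only for completeness.
gap, x1, y1, x2, y2 = 5, 5, 5, 450, 680

frame = cv2.imread("frame_000001.jpg")  # hypothetical video frame
h, w = frame.shape[:2]
# Clamp to the frame size in case the video is smaller than the hardcoded box.
crop = frame[max(0, y1):min(h, y2), max(0, x1):min(w, x2)]
cv2.imwrite("crop_000001.jpg", crop)
```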
I was going to run SBR on my own videos and got stuck on a question: where do I get face bounding boxes for my unlabeled videos? Can I use cv2/dlib/etc. to detect faces? Do I need to widen/adjust their predictions?
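On the widening question, one common approach (just a sketch, not something confirmed in this thread) is to symmetrically enlarge the detector's box by a fixed ratio so it is closer to the looser boxes many landmark datasets use; the ratio below is an arbitrary example worth tuning against whichever detector you pick:

```python
def expand_box(x1, y1, x2, y2, img_w, img_h, ratio=0.2):
    """Enlarge a face box by `ratio` of its size on each side, clamped to the image."""
    dx, dy = (x2 - x1) * ratio, (y2 - y1) * ratio
    return (max(0.0, x1 - dx), max(0.0, y1 - dy),
            min(float(img_w), x2 + dx), min(float(img_h), y2 + dy))

# Hypothetical usage with a detector box on a 1280x720 frame.
print(expand_box(300, 200, 500, 450, 1280, 720, ratio=0.2))
```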