xkortex closed this issue 5 years ago
Well, actually, if I understand correctly, the visualizer is just used both to annotate the frames with landmarks and to visualize the output with OpenCV's display features. Wouldn't it make sense architecturally to short-circuit this block of code when visualization is off? Something like...
char Visualizer::ShowObservation()
{
	if (quiet) { return 0; }  // headless mode: skip all drawing/display work
	bool observation_shown = false;
	if (vis_align && !aligned_face_image.empty())
	{
		cv::imshow("sim_warp", aligned_face_image);
		observation_shown = true;
	}
	// ... rest of the method unchanged ...
Also, semantically, it is a little confusing that -verbose means to show visualization, when typically this flag refers to logging output. This threw me for a loop the first time. Personally, I would propose something like --visualize to indicate generating the tracks and --gui to mean "use the GTK display output" and write to a file otherwise, but that's not my call :P.
Thanks for the feedback. To answer your concerns and questions:
- FeatureExtraction and FaceLandmarkVidMulti run in headless mode, so -q is unnecessary and was instead replaced by -verbose for visualization purposes. These two executables are the main workhorses of OpenFace.
- FaceLandmarkVid is just an example runner of face tracking and is not meant to be run headless, as it doesn't really produce any output, just a live visualization. (The names of the executables are a bit confusing but were kept for backwards compatibility's sake; I think it might be time to rename them.)
- I agree about -verbose being confusing, and I like your suggestion to rename it. I probably won't do it for the release that's coming soon, but the one after.
- There is a -tracked flag to still output the tracks, but without showing any visualization (headless).

Does that answer your questions? Let me know if I missed anything.
Yeah I think so. Is there any difference in the output format between the following commands?
FeatureExtraction -pose -2Dfp -3Dfp -aus -pdmparams
FaceLandmarkVidMulti -pose -2Dfp -3Dfp -aus -pdmparams
I see both output a 386-vec CSV, though the one from Multi is longer and distinguishes multiple FaceID values. For videos with only a single face, the lengths are pretty similar, too. Am I correct in thinking these outputs are pretty comparable? For my application, I only care about AUs, pose, confidence, and some of the spatial coordinates. Thanks.
The former (FeatureExtraction) will only detect/track a single face in a video stream; it will try to pick the biggest/closest face when tracking. It will also perform person-specific adaptation for action unit detection.
The latter (FaceLandmarkVidMulti) will attempt to detect/track multiple faces in a video stream (up to 4 faces); however, it will not perform person-specific adaptation for action unit detection (which makes it a bit less accurate).
I've taken to running my pipeline as a two-stage process. The first pass uses FaceLandmarkVidMulti with -simalign -simsize 512 to generate face chips from the original video. I then run face recognition to get a mapping from face_id to actual identity. I run some Python to aggregate these into contiguous runs of the same person, then I separate the aligned faces into folders for each contiguous block. Finally I run FeatureExtraction with -fdir for the person I am interested in (for each video, there is only one individual I care about).
This seems to work for the most part, but can you think of any gotchas to this approach? I am not yet sure whether the AU output from running on the simalign chips is comparable to running on the raw video. It's definitely more convenient than having to mask the input video in such a way that FeatEx can focus only on my person of interest.
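For reference, the "aggregate into contiguous runs" step above boils down to a groupby over per-frame identity labels. A minimal sketch in Python; the function name and the (frame, identity) pair format are my own conventions for illustration, not anything OpenFace produces:

```python
import itertools

def contiguous_runs(labels):
    """Collapse a per-frame (frame, identity) sequence, sorted by frame,
    into (identity, first_frame, last_frame) runs."""
    runs = []
    # groupby merges consecutive entries sharing the same identity
    for identity, group in itertools.groupby(labels, key=lambda pair: pair[1]):
        frames = [frame for frame, _ in group]
        runs.append((identity, frames[0], frames[-1]))
    return runs

# Toy example: the tracker flips identities mid-video
seq = [(1, "A"), (2, "A"), (3, "B"), (4, "B"), (5, "A")]
print(contiguous_runs(seq))
# [('A', 1, 2), ('B', 3, 4), ('A', 5, 5)]
```

Each run then becomes one folder of aligned chips for the later FeatureExtraction pass.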
That's a really clever way to get around the face re-identification issue.
Two potential gotchas:
- Use the -nomask flag to reduce artifacts introduced by masking the image twice.

How does face_id get assigned? I initially thought each distinct face detection is given a unique face_id, but some identities will map to multiple face_ids. Then I thought it was just the index of the detection within a frame, but there are instances where there are IDs 1, 2, and 3, but no 0. So does face_detector_HAAR have some sort of statefulness to it?
Edit: oops, didn't mean to close the thread just yet.
I think I misunderstood your initial explanation; I was under the assumption that you were doing some face recognition on the output faces to sort them into identity "bins" before arranging them.
Unfortunately, there is no statefulness in the face detection of OpenFace. The ID is arbitrary and can change from person to person if the tracking fails; there's no logic behind it. Face re-identification is surprisingly tricky once you start considering scenarios such as people leaving the scene, new people entering the scene, people passing each other, etc. However, the ID should be somewhat temporally coherent while the tracker is working successfully, but has a risk of flipping when the tracker fails.
Yeah, my initial approach was exactly that. I ran face recognition on the simalign output chips, parsing frame_det_XX_YYYYYY into id=XX and frame=YY, and binning based on id. This works alright for videos with 1-2 faces appearing at a time, e.g. a political debate. But when continuity breaks too much, the assumption that id 0 refers to the same individual over the whole context no longer holds, putting me back at square one.
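The binning step above is just filename parsing. A rough Python sketch; the regex assumes chip names of the form frame_det_00_000123.bmp, so double-check the pattern against your own simalign output:

```python
import re
from collections import defaultdict

# Assumed naming scheme for aligned chips: frame_det_<id>_<frame>.<ext>
CHIP_RE = re.compile(r"frame_det_(\d+)_(\d+)\.\w+$")

def bin_chips_by_id(filenames):
    """Group chip filenames into {face_id: [(frame, filename), ...]}, sorted by frame."""
    bins = defaultdict(list)
    for name in filenames:
        m = CHIP_RE.search(name)
        if not m:
            continue  # ignore anything that isn't a chip
        bins[int(m.group(1))].append((int(m.group(2)), name))
    for chips in bins.values():
        chips.sort()
    return dict(bins)

names = ["frame_det_00_000002.bmp", "frame_det_00_000001.bmp", "frame_det_01_000001.bmp"]
print(bin_chips_by_id(names))
# {0: [(1, 'frame_det_00_000001.bmp'), (2, 'frame_det_00_000002.bmp')], 1: [(1, 'frame_det_01_000001.bmp')]}
```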
I ended up having to adapt my strategy slightly in order to deal with the tracking issue in the more general case. Basically, what I ultimately want is a "spotlight" on a given person, i.e. a frame sequence extracted from the original video, cropped around my person of interest, that I can feed to FeatureExtraction for precision vectorization. This is... not trivial, to say the least, so now I am taking a more top-down approach: first running a really fast YOLO-based detector to generate a descriptor JSON of coarse, low-threshold face detections for the whole video. This goes through some logic to generate discrete spatiotemporal tracks. Each track is sampled a few times for facial recognition. Then comes more logic I still have to write, and then ideally I get a nice, clean subsampled clip that I can feed to FeatureExtraction without many hiccups.
I was initially hoping to be able to use the simalign images for this purpose, but there's too much edge-case space in real-world videos for this to work robustly enough for the application. It's easier to start from a rough {[frame, X, Y]} descriptor and whittle down.
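Conceptually, the track-building step is just greedy IoU linking of detection boxes from frame to frame. A naive sketch of the idea; the function names and threshold are mine, and a real version needs gap tolerance and a proper assignment step rather than greedy matching:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def link_tracks(frames, iou_thresh=0.3):
    """Greedily chain per-frame detections into spatiotemporal tracks.

    `frames` is a list of per-frame box lists. Each detection joins the
    track whose most recent box (from the previous frame) overlaps it
    best above the threshold; otherwise it starts a new track.
    """
    tracks = []  # each track is a list of (frame_index, box)
    for t, dets in enumerate(frames):
        for box in dets:
            best, best_iou = None, iou_thresh
            for tr in tracks:
                last_t, last_box = tr[-1]
                if last_t == t - 1:  # only extend tracks alive in the previous frame
                    o = iou(last_box, box)
                    if o > best_iou:
                        best, best_iou = tr, o
            if best is not None:
                best.append((t, box))
            else:
                tracks.append([(t, box)])
    return tracks

# A slowly moving face stays in one track; a one-frame dropout splits a
# track (which is exactly why a real version needs gap tolerance).
frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 0, 11, 10)],
    [(2, 0, 12, 10), (50, 50, 60, 60)],
]
print([len(tr) for tr in link_tracks(frames)])
# [3, 1, 1]
```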
Describe the bug
-q seems to have no effect. In fact, neither my IDE search nor grepping for it shows anything. I am gonna be doing a lot of headless processing, so this would be super helpful.

To Reproduce
I get

Expected behavior
Successful output without launching the visualizer.

Desktop (please complete the following information):

Additional context
Visualizer is hardcoded: https://github.com/TadasBaltrusaitis/OpenFace/blob/master/exe/FaceLandmarkVid/FaceLandmarkVid.cpp#L108
If you want, I could probably PR this; looks like an easy fix. Looks like it should be:

EDIT: actually it looks like the problem is only isolated to FaceLandmarkVid.cpp.