xkortex closed this issue 5 years ago
Well, actually, if I understand correctly, the visualizer is just used both to annotate the frames with landmarks and to visualize the output with OpenCV's display features. Wouldn't it make sense architecturally to short-circuit this block of code when visualization is off? Something like...
char Visualizer::ShowObservation()
{
	if (quiet) { return 0; }  // headless mode: skip all drawing/display work
	bool observation_shown = false;
	if (vis_align && !aligned_face_image.empty())
	{
		cv::imshow("sim_warp", aligned_face_image);
		observation_shown = true;
	}
	// ... rest of the method unchanged ...
Also, semantically, it is a little confusing that -verbose means to show visualization, when typically this flag refers to logging output. This threw me for a loop the first time. Personally, I would propose something like --visualize to indicate generating the tracks and --gui to mean "use the GTK display output" and write to a file otherwise, but that's not my call :P.
Thanks for the feedback. To answer your concerns and questions:
- FeatureExtraction and FaceLandmarkVidMulti run in headless mode, so -q is unnecessary and was instead replaced by -verbose for visualization purposes. These two executables are the main workhorses of OpenFace.
- FaceLandmarkVid is just an example runner of face tracking and is not meant to be run headless, as it doesn't really produce any output, just a live visualization. (The names of the executables are a bit confusing but were kept for backwards compatibility's sake; I think it might be time to rename them.)
- I agree about -verbose being confusing, and I like your suggestion to rename it. I probably won't do it for the release that's coming soon, but the one after.
- There is a -tracked flag to still output the tracks, but without showing any visualization (headless).

Does that answer your questions? Let me know if I missed anything.
Yeah I think so. Is there any difference in the output format between the following commands?
FeatureExtraction -pose -2Dfp -3Dfp -aus -pdmparams
FaceLandmarkVidMulti -pose -2Dfp -3Dfp -aus -pdmparams
I see both output a 386-vec CSV, though the one from Multi is longer and distinguishes multiple FaceID values. For videos with only a single face, the lengths are pretty similar, too. Am I correct in thinking these outputs are pretty comparable? For my application, I only care about AUs, pose, confidence, and some of the spatial coordinates. Thanks.
The former (FeatureExtraction) will only detect/track a single face in a video stream; it will try to pick the biggest/closest face when tracking. It will also perform person-specific adaptation for action unit detection.
The latter (FaceLandmarkVidMulti) will attempt to detect/track multiple faces in a video stream (up to 4 faces); however, it will not perform person-specific adaptation for action unit detection (which makes it a bit less accurate).
I've taken to running my pipeline as a two-stage process. The first pass uses FaceLandmarkVidMulti with -simalign -simsize 512 to generate face chips from the original video. I then run face recognition to get a mapping from face_id to actual identity. I run some Python to aggregate these into contiguous runs of the same person, then I separate the aligned faces into folders for each contiguous block. Finally I run FeatureExtraction with -fdir for the person I am interested in (for each video, there is only one individual I care about).
This seems to work for the most part, but can you think of any gotchas to this approach? I am not yet sure whether the AU output from running on the simalign chips is comparable to running on the raw video. It's definitely more convenient than having to mask the input video in such a way that FeatEx can focus only on my person of interest.
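For reference, the "aggregate into contiguous runs" step above boils down to a groupby over per-frame identity labels. A minimal sketch in Python; the function name and the (frame, identity) pair format are my own conventions for illustration, not anything OpenFace produces:

```python
import itertools

def contiguous_runs(labels):
    """Collapse a per-frame (frame, identity) sequence, sorted by frame,
    into (identity, first_frame, last_frame) runs."""
    runs = []
    # groupby merges consecutive entries sharing the same identity
    for identity, group in itertools.groupby(labels, key=lambda pair: pair[1]):
        frames = [frame for frame, _ in group]
        runs.append((identity, frames[0], frames[-1]))
    return runs

# Toy example: the tracker flips identities mid-video
seq = [(1, "A"), (2, "A"), (3, "B"), (4, "B"), (5, "A")]
print(contiguous_runs(seq))
# [('A', 1, 2), ('B', 3, 4), ('A', 5, 5)]
```

Each run then becomes one folder of aligned chips for the later FeatureExtraction pass.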
That's a really clever way to get around the face re-identification issue.
Two potential gotchas:
- Use the -nomask flag to reduce artifacts introduced by masking the image twice.

How does face_id get assigned? I initially thought each distinct face detection is given a unique face_id, but some identities will map to multiple face_ids. Then I thought it was just the index of the detection within a frame, but there are instances where there are IDs 1, 2, and 3, but no 0. So does face_detector_HAAR have some sort of statefulness to it?
Edit: oops, didn't mean to close the thread just yet.
I think I misunderstood your initial explanation; I was under the assumption that you were doing some face recognition on the output faces to sort them into identity "bins" before arranging them.
Unfortunately, there is no statefulness in the face detection of OpenFace. The ID is arbitrary and can change from person to person if the tracking fails; there's no logic behind it. Face re-identification is surprisingly tricky once you start considering scenarios such as people leaving the scene, new people entering the scene, people passing each other, etc. However, the ID should be somewhat temporally coherent while the tracker is working successfully, but has a risk of flipping when the tracker fails.
Yeah, my initial approach was exactly that. I ran face recognition on the simalign output chips, parsing frame_det_XX_YYYYYY into id=XX and frame=YY, and binning based on id. This works alright for videos with 1-2 faces appearing at a time, e.g. a political debate. But when continuity breaks too much, the assumption that id 0 refers to the same individual over the whole context no longer holds, putting me back at square one.
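The binning step above is just filename parsing. A rough Python sketch; the regex assumes chip names of the form frame_det_00_000123.bmp, so double-check the pattern against your own simalign output:

```python
import re
from collections import defaultdict

# Assumed naming scheme for aligned chips: frame_det_<id>_<frame>.<ext>
CHIP_RE = re.compile(r"frame_det_(\d+)_(\d+)\.\w+$")

def bin_chips_by_id(filenames):
    """Group chip filenames into {face_id: [(frame, filename), ...]}, sorted by frame."""
    bins = defaultdict(list)
    for name in filenames:
        m = CHIP_RE.search(name)
        if not m:
            continue  # ignore anything that isn't a chip
        bins[int(m.group(1))].append((int(m.group(2)), name))
    for chips in bins.values():
        chips.sort()
    return dict(bins)

names = ["frame_det_00_000002.bmp", "frame_det_00_000001.bmp", "frame_det_01_000001.bmp"]
print(bin_chips_by_id(names))
# {0: [(1, 'frame_det_00_000001.bmp'), (2, 'frame_det_00_000002.bmp')], 1: [(1, 'frame_det_01_000001.bmp')]}
```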
I ended up having to adapt my strategy slightly in order to deal with the tracking issue in the more general case. Basically, what I ultimately want is a "spotlight" on a given person, i.e. a frame sequence extracted from the original video, cropped around my person of interest, that I can feed to FeatureExtraction for precision vectorization. This is... not trivial, to say the least, so now I am taking a more top-down approach: first running a really fast YOLO-based detector to generate a descriptor JSON of coarse, low-threshold face detections for the whole video. This goes through some logic to generate discrete spatiotemporal tracks. Each track is sampled a few times for facial recognition. Then comes more logic I still have to write, and then ideally I get a nice, clean subsampled clip that I can feed to FeatureExtraction without many hiccups.
I was initially hoping to be able to use the simalign images for this purpose, but there's too much edge-case space in real-world videos for this to work robustly enough for the application. It's easier to start from a rough {[frame, X, Y]} descriptor and whittle down.
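Conceptually, the track-building step is just greedy IoU linking of detection boxes from frame to frame. A naive sketch of the idea; the function names and threshold are mine, and a real version needs gap tolerance and a proper assignment step rather than greedy matching:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def link_tracks(frames, iou_thresh=0.3):
    """Greedily chain per-frame detections into spatiotemporal tracks.

    `frames` is a list of per-frame box lists. Each detection joins the
    track whose most recent box (from the previous frame) overlaps it
    best above the threshold; otherwise it starts a new track.
    """
    tracks = []  # each track is a list of (frame_index, box)
    for t, dets in enumerate(frames):
        for box in dets:
            best, best_iou = None, iou_thresh
            for tr in tracks:
                last_t, last_box = tr[-1]
                if last_t == t - 1:  # only extend tracks alive in the previous frame
                    o = iou(last_box, box)
                    if o > best_iou:
                        best, best_iou = tr, o
            if best is not None:
                best.append((t, box))
            else:
                tracks.append([(t, box)])
    return tracks

# A slowly moving face stays in one track; a one-frame dropout splits a
# track (which is exactly why a real version needs gap tolerance).
frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 0, 11, 10)],
    [(2, 0, 12, 10), (50, 50, 60, 60)],
]
print([len(tr) for tr in link_tracks(frames)])
# [3, 1, 1]
```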
Describe the bug
-q seems to have no effect. In fact, neither my IDE search nor grepping for it shows anything. I am gonna be doing a lot of headless processing, so this would be super helpful.

To Reproduce
I get

Expected behavior
Successful output without launching the visualizer.

Desktop (please complete the following information):

Additional context
Visualizer is hardcoded: https://github.com/TadasBaltrusaitis/OpenFace/blob/master/exe/FaceLandmarkVid/FaceLandmarkVid.cpp#L108
If you want, I could probably PR this; looks like an easy fix. Looks like it should be:

EDIT: actually it looks like the problem is only isolated to FaceLandmarkVid.cpp.