floe / backscrub

Virtual Video Device for Background Replacement with Deep Semantic Segmentation
Apache License 2.0
734 stars 85 forks

Does this work in Windows? #92

Open tdbe opened 3 years ago

tdbe commented 3 years ago

Never heard of v4l2loopback. Is it hard to make this thing work on Windows? Because I want to be able to just go to a computer, any computer, which 99% of the time is Windows, and run this :)

BenBE commented 3 years ago

v4l2loopback is a kernel-mode driver for Linux; there is no chance of porting that driver to Windows. The part that may run on Windows is the actual image-processing part inside libbackscrub, because that is independent of the driver interface. But the library is only the processing; the virtual cam part will need a completely different approach on Windows.

phlash commented 3 years ago

Hopefully someone will get time to port the core library to Windows, along with my OBS plugin that uses it; that is the reason we have split it out :smile:

BenBE commented 3 years ago

@phlash The core library part should mostly run on Windows as is (with maybe minor fixes).

Where I currently see the biggest open question is the camera interface. Once we have image data for processing, we should be fine even on Windows, but pushing the processed image into some camera device may be tricky.

OmarJay1 commented 3 years ago

I've done single-image replacement on Windows, although there seem to be issues with the models, probably from my mucking with the code to make it build on Windows.

I'm trying to build the complete package now in Windows. I thought I had an issue with pthreads, but the issue is with the package pthreadpool. A different, now edited version of this post was incorrect on that matter.

I think there may be a versioning issue with TFLite: pthreadpool builds against a different version I have, but the CMake pull in this project fetches a version that doesn't build with Visual Studio on Windows.

phlash commented 3 years ago

Thanks for digging in @OmarJay1! I presume you are attempting to build the experimental branch, including the underlying TFLite with CMake and MSVC (from Visual Studio)?

This ticket (https://github.com/tensorflow/tensorflow/issues/47166) seems to indicate TFLite builds work for x64 binaries from the v2.4.1 tagged Tensorflow source.

I might have a go with a cross-compiler (mingw64) locally myself..

[edit] cross-compiling is a non-starter, since there is no sane support for std::thread in current stable mingw-based tooling and chunks of dependent code use this (as well as explicitly using pthreads). I'll fire up a Win VM and see what happens there with community edition MSVC (via VSCode I think)...

[edit2] attempting to build with CMake 3.20, MSVC v19.16 (from VS2017) in a Win7 VM produces an internal compiler error while trying to build the farmhash component... giving up and going sailing!

[edit 3] switching to VS2019 Build Tools (MSVC v19.38, CMake 3.20 now included from Microsoft) plus the Win10 SDK no longer crashes the compiler (yay!) but now fails with an error somewhere in the flatbuffers component as a result of with statements. Attempting to force C++11/14/17 compliance via CMAKE_CXX_STANDARD=<number> doesn't fix this - disappointing :disappointed: - since Google claim this should build out-of-the-box.

OmarJay1 commented 3 years ago

I got something to build and with the default settings it seems to work. It's very messy at this point. I'll try to clean it up a bit.

Build-wise the main issues were pthreadpool, which TFLite needs, and pthreads which deepseg.cc needs. I had to get the latest version of pthreadpool and build against that.

For pthreads, I hacked in overrides to the functions using std::thread and std::mutex.

I'm not sure of the best way to present a PR. Right now it's a messy bunch of #if !_WINDOWS / #else / #endif. pthreadpool would need to be changed in the CMake file.

Here's what I have right now. Sorry for the formatting problems; I keep forgetting how to format in Markdown. I used the "insert code" feature, but it still seems to be hiding the #include lines.

#if !_WINDOWS
#include <unistd.h>

#else
// C:\temp\pthreadpool\include
// pthreadpool\Debug\pthreadpool.lib

#include <chrono>
#include <thread>
#include <io.h>
#include <mutex>

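// Minimal POSIX-compat shims for the Windows build: map the usleep() and
// pthread_mutex_* calls used below onto their <thread>/<mutex> equivalents.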
void usleep(int usec)
{
    std::this_thread::sleep_for(std::chrono::microseconds(usec));
}

void pthread_mutex_lock(std::mutex *m)
{
    m->lock();
}

void pthread_mutex_unlock(std::mutex* m)
{
    m->unlock();
}

#define pthread_mutex_t std::mutex
#define pthread_t std::thread
#endif

#include <cstdio>
#include <cstring>   // strstr()
#include <cctype>    // ::toupper()
#include <cmath>     // expf()
#include <chrono>
#include <string>
#include <array>     // std::array in fourCcFromString()
#include <vector>
#include <algorithm> // std::find
#include <memory>    // std::unique_ptr

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"
#include "tensorflow/lite/optional_debug_tools.h"

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/types_c.h>
#include <opencv2/videoio/videoio_c.h>

#if !_WINDOWS
#include "loopback.h"
#endif

#include "transpose_conv_bias.h"

int fourCcFromString(const std::string& in)
{
    if (in.empty())
        return 0;

    if (in.size() <= 4)
    {
        // fourcc codes are up to 4 bytes long, right-space-padded and upper-case
        // c.f. http://ffmpeg.org/doxygen/trunk/isom_8c-source.html and
        // c.f. https://www.fourcc.org/codecs.php
        std::array<uint8_t, 4> a = {' ', ' ', ' ', ' '};
        for (size_t i = 0; i < in.size(); ++i)
            a[i] = ::toupper(in[i]);
        return cv::VideoWriter::fourcc(a[0], a[1], a[2], a[3]);
    }
    else if (in.size() == 8)
    {
        // Most people seem to agree on 0x47504A4D being the fourcc code of "MJPG", not the literal translation
        // 0x4D4A5047. This is also what ffmpeg expects.
        return std::stoi(in, nullptr, 16);
    }
    return 0;
}

// OpenCV helper functions
cv::Mat convert_rgb_to_yuyv( cv::Mat input ) {
    cv::Mat tmp;
    cv::cvtColor(input,tmp,CV_RGB2YUV);
    std::vector<cv::Mat> yuv;
    cv::split(tmp,yuv);
    cv::Mat yuyv(tmp.rows, tmp.cols, CV_8UC2);
    uint8_t* outdata = (uint8_t*)yuyv.data;
    uint8_t* ydata = (uint8_t*)yuv[0].data;
    uint8_t* udata = (uint8_t*)yuv[1].data;
    uint8_t* vdata = (uint8_t*)yuv[2].data;
    for (unsigned int i = 0; i < yuyv.total(); i += 2) {
        uint8_t u = (uint8_t)(((int)udata[i]+(int)udata[i+1])/2);
        uint8_t v = (uint8_t)(((int)vdata[i]+(int)vdata[i+1])/2);
        outdata[2*i+0] = ydata[i+0];
        outdata[2*i+1] = v;
        outdata[2*i+2] = ydata[i+1];
        outdata[2*i+3] = u;
    }
    return yuyv;
}

// Tensorflow Lite helper functions
using namespace tflite;

#define TFLITE_MINIMAL_CHECK(x)                              \
  if (!(x)) {                                                \
    fprintf(stderr, "Error at %s:%d\n", __FILE__, __LINE__); \
    exit(1);                                                 \
  }

std::unique_ptr<Interpreter> interpreter;

cv::Mat getTensorMat(int tnum, int debug) {

    TfLiteType t_type = interpreter->tensor(tnum)->type;
    TFLITE_MINIMAL_CHECK(t_type == kTfLiteFloat32);

    TfLiteIntArray* dims = interpreter->tensor(tnum)->dims;
    if (debug) for (int i = 0; i < dims->size; i++) printf("tensor #%d: %d\n",tnum,dims->data[i]);
    TFLITE_MINIMAL_CHECK(dims->data[0] == 1);

    int h = dims->data[1];
    int w = dims->data[2];
    int c = dims->data[3];

    float* p_data = interpreter->typed_tensor<float>(tnum);
    TFLITE_MINIMAL_CHECK(p_data != nullptr);

    return cv::Mat(h,w,CV_32FC(c),p_data);
}

// deeplabv3 classes
const std::vector<std::string> labels = { "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "dining table", "dog", "horse", "motorbike", "person", "potted plant", "sheep", "sofa", "train", "tv" };
const size_t cnum = labels.size();
// label number of "person" for DeepLab v3+ model
const size_t pers = std::distance(labels.begin(), std::find(labels.begin(),labels.end(),"person"));

// timing helpers
typedef std::chrono::high_resolution_clock::time_point timestamp_t;
typedef struct {
    timestamp_t bootns;
    timestamp_t lastns;
    timestamp_t waitns;
    timestamp_t lockns;
    timestamp_t copyns;
    timestamp_t openns;
    timestamp_t tfltns;
    timestamp_t maskns;
    timestamp_t postns;
    timestamp_t v4l2ns;
    // these are already converted to ns
    long grabns;
    long retrns;
} timinginfo_t;

timestamp_t timestamp() {
    return std::chrono::high_resolution_clock::now();
}
long diffnanosecs(timestamp_t t1, timestamp_t t2) {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1-t2).count();
}

// threaded capture shared state
typedef struct {
    cv::VideoCapture *cap;
    cv::Mat *grab;
    cv::Mat *raw;
    int64 cnt;
    timinginfo_t *pti;
    pthread_mutex_t lock;
} capinfo_t;

enum class modeltype_t {
    Unknown,
    BodyPix,
    DeepLab,
    GoogleMeetSegmentation,
    MLKitSelfie,
};

struct normalization_t {
    float scaling;
    float offset;
};

typedef struct {
    const char *modelname;
    modeltype_t modeltype;
    normalization_t norm;
    size_t threads;
    size_t width;
    size_t height;
    int debug;
    std::unique_ptr<tflite::FlatBufferModel> model;
    cv::Mat input;
    cv::Mat output;
    cv::Rect roidim;
    cv::Mat mask;
    cv::Mat mroi;
    cv::Mat raw;
    cv::Mat ofinal;
    cv::Mat element;
    float ratio;
} calcinfo_t;

// capture thread function
void *grab_thread(void *arg) {
    capinfo_t *ci = (capinfo_t *)arg;
    bool done = false;
    // while we have a grab frame.. grab frames
    while (!done) {
        timestamp_t ts = timestamp();
        ci->cap->grab();
        long ns = diffnanosecs(timestamp(),ts);
        pthread_mutex_lock(&ci->lock);
        ci->pti->grabns = ns;
        if (ci->grab!=NULL) {
            ts = timestamp();
            ci->cap->retrieve(*ci->grab);
            ci->pti->retrns = diffnanosecs(timestamp(),ts);
        } else {
            done = true;
        }
        ci->cnt++;
        pthread_mutex_unlock(&ci->lock);
    }
    return NULL;
}

modeltype_t get_modeltype(const char* modelname) {
    if (strstr(modelname, "body-pix")) {
        return modeltype_t::BodyPix;
    }
    else if (strstr(modelname, "deeplab")) {
        return modeltype_t::DeepLab;
    }
    else if (strstr(modelname, "segm_")) {
        return modeltype_t::GoogleMeetSegmentation;
    }
    else if (strstr(modelname, "selfie")) {
        return modeltype_t::MLKitSelfie;
    }
    return modeltype_t::Unknown;
}

normalization_t get_normalization(modeltype_t type) {
    // TODO: This should be read out from actual model metadata instead
    switch (type) {
        case modeltype_t::DeepLab:
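            // Designated initializers (.scaling = ...) are only accepted by MSVC
            // with /std:c++20, hence the separate branch for the Windows build.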
#if !_WINDOWS
            return normalization_t{.scaling = 1/127.5, .offset = -1};
#else
            {
                normalization_t norm;
                norm.scaling = 1 / 127.5;
                norm.offset = -1;
                return norm;
            }

#endif
        case modeltype_t::BodyPix:
        case modeltype_t::GoogleMeetSegmentation:
        case modeltype_t::MLKitSelfie:
        case modeltype_t::Unknown:
        default:
#if !_WINDOWS
            return normalization_t{.scaling = 1/255.0, .offset = 0};
#else
            {
                normalization_t norm;
                norm.scaling = 1 / 255.0;
                norm.offset = 0;
                return norm;
            }
#endif

    }
}

void init_tensorflow(calcinfo_t &info) {
    // Load model
    info.model = tflite::FlatBufferModel::BuildFromFile(info.modelname);
    TFLITE_MINIMAL_CHECK(info.model != nullptr);

    // Build the interpreter
    tflite::ops::builtin::BuiltinOpResolver resolver;
    // custom op for Google Meet network
    resolver.AddCustom("Convolution2DTransposeBias", mediapipe::tflite_operations::RegisterConvolution2DTransposeBias());
    InterpreterBuilder builder(*info.model, resolver);
    builder(&interpreter);
    TFLITE_MINIMAL_CHECK(interpreter != nullptr);

    // Allocate tensor buffers.
    TFLITE_MINIMAL_CHECK(interpreter->AllocateTensors() == kTfLiteOk);

    // set interpreter params
    interpreter->SetNumThreads(info.threads);
    interpreter->SetAllowFp16PrecisionForFp32(true);

    // get input and output tensor as cv::Mat
    info.input = getTensorMat(interpreter->inputs ()[0],info.debug);
    info.output = getTensorMat(interpreter->outputs()[0],info.debug);
    info.ratio = (float)info.input.cols/(float) info.input.rows;

    // initialize mask and square ROI in center
    info.roidim = cv::Rect((info.width-info.height/info.ratio)/2,0,info.height/info.ratio,info.height);
    info.mask = cv::Mat::ones(info.height,info.width,CV_8UC1);
    info.mroi = info.mask(info.roidim);

    // erosion/dilation element
    info.element = cv::getStructuringElement( cv::MORPH_RECT, cv::Size(5,5) );

    // create Mat for small mask
    info.ofinal = cv::Mat(info.output.rows,info.output.cols,CV_8UC1);
}

void calc_mask(calcinfo_t &info, timinginfo_t &ti) {
    // map ROI
    cv::Mat roi = info.raw(info.roidim);

    // resize ROI to input size
    cv::Mat in_u8_bgr, in_u8_rgb;
    cv::resize(roi,in_u8_bgr,cv::Size(info.input.cols,info.input.rows));
    cv::cvtColor(in_u8_bgr,in_u8_rgb,CV_BGR2RGB);
    // TODO: can convert directly to float?

    // bilateral filter to reduce noise
    if (1) {
        cv::Mat filtered;
        cv::bilateralFilter(in_u8_rgb,filtered,5,100.0,100.0);
        in_u8_rgb = filtered;
    }

    // convert to float and normalize to values expected by model
    in_u8_rgb.convertTo(info.input,CV_32FC3,info.norm.scaling,info.norm.offset);
    ti.openns=timestamp();

    // Run inference
    TFLITE_MINIMAL_CHECK(interpreter->Invoke() == kTfLiteOk);
    ti.tfltns=timestamp();

    float* tmp = (float*)info.output.data;
    uint8_t* out = (uint8_t*)info.ofinal.data;

    switch (info.modeltype) {
        case modeltype_t::DeepLab:
            // find class with maximum probability
            for (unsigned int n = 0; n < info.output.total(); n++) {
                float maxval = -10000; size_t maxpos = 0;
                for (size_t i = 0; i < cnum; i++) {
                    if (tmp[n*cnum+i] > maxval) {
                        maxval = tmp[n*cnum+i];
                        maxpos = i;
                    }
                }
                // set mask to 0 where class == person
                uint8_t val = (maxpos==pers ? 0 : 255);
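                // blend with the previous frame's value kept in ofinal: top 3 bits
                // of the new value plus the old value >> 3 (temporal smoothing)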
                out[n] = (val & 0xE0) | (out[n] >> 3);
            }
            break;
        case modeltype_t::BodyPix:
        case modeltype_t::MLKitSelfie:
            // threshold probability
            for (unsigned int n = 0; n < info.output.total(); n++) {
                // FIXME: hardcoded threshold
                uint8_t val = (tmp[n] > 0.65 ? 0 : 255);
                out[n] = (val & 0xE0) | (out[n] >> 3);
            }
            break;
        case modeltype_t::GoogleMeetSegmentation:
            /* 256 x 144 x 2 tensor for the full model or 160 x 96 x 2
             * tensor for the light model with masks for background
             * (channel 0) and person (channel 1) where values are in
             * range [MIN_FLOAT, MAX_FLOAT] and user has to apply
             * softmax across both channels to yield foreground
             * probability in [0.0, 1.0]. */
            for (unsigned int n = 0; n < info.output.total(); n++) {
                float exp0 = expf(tmp[2*n  ]);
                float exp1 = expf(tmp[2*n+1]);
                float p0 = exp0 / (exp0+exp1);
                float p1 = exp1 / (exp0+exp1);
                uint8_t val = (p0 < p1 ? 0 : 255);
                out[n] = (val & 0xE0) | (out[n] >> 3);
            }
            break;
        case modeltype_t::Unknown:
            fprintf(stderr, "Unknown model type\n");
            break;
    }
    ti.maskns=timestamp();

    // denoise
    cv::Mat tmpbuf;
    cv::dilate(info.ofinal,tmpbuf,info.element);
    cv::erode(tmpbuf,info.ofinal,info.element);

    // scale up into full-sized mask
    cv::resize(info.ofinal,info.mroi,cv::Size(info.raw.rows/info.ratio,info.raw.rows));
}

int main(int argc, char* argv[]) {

    printf("deepseg v0.2.0\n");
    printf("(c) 2021 by floe@butterbrot.org\n");
    printf("https://github.com/floe/deepbacksub\n");
    timinginfo_t ti;
    ti.bootns = timestamp();
    int debug  = 0;
    bool showProgress = false;
    size_t threads= 2;
    size_t width  = 640;
    size_t height = 480;
    const char *back = nullptr; // "images/background.png";
    const char *vcam = "/dev/video0";
    const char *ccam = "/dev/video1";
    bool flipHorizontal = false;
    bool flipVertical   = false;
    int fourcc = 0;

#if !_WINDOWS
    const char* modelname = "models/segm_full_v679.tflite";
#else
    const char* modelname = "../models/segm_full_v679.tflite";
#endif

    bool showUsage = false;
    for (int arg=1; arg<argc; arg++) {
        bool hasArgument = arg+1 < argc;
        if (strncmp(argv[arg], "-?", 2)==0) {
            showUsage = true;
        } else if (strncmp(argv[arg], "-d", 2)==0) {
            ++debug;
        } else if (strncmp(argv[arg], "-p", 2)==0) {
            showProgress = true;
        } else if (strncmp(argv[arg], "-H", 2)==0) {
            flipHorizontal = !flipHorizontal;
        } else if (strncmp(argv[arg], "-V", 2)==0) {
            flipVertical = !flipVertical;
        } else if (strncmp(argv[arg], "-v", 2)==0) {
            if (hasArgument) {
                vcam = argv[++arg];
            } else {
                showUsage = true;
            }
        } else if (strncmp(argv[arg], "-c", 2)==0) {
            if (hasArgument) {
                ccam = argv[++arg];
            } else {
                showUsage = true;
            }
        } else if (strncmp(argv[arg], "-b", 2)==0) {
            if (hasArgument) {
                back = argv[++arg];
            } else {
                showUsage = true;
            }
        } else if (strncmp(argv[arg], "-m", 2)==0) {
            if (hasArgument) {
                modelname = argv[++arg];
            } else {
                showUsage = true;
            }
        } else if (strncmp(argv[arg], "-w", 2)==0) {
            if (hasArgument && sscanf(argv[++arg], "%zu", &width)) {
                if (!width) {
                    showUsage = true;
                }
            } else {
                showUsage = true;
            }
        } else if (strncmp(argv[arg], "-h", 2)==0) {
            if (hasArgument && sscanf(argv[++arg], "%zu", &height)) {
                if (!height) {
                    showUsage = true;
                }
            } else {
                showUsage = true;
            }
        } else if (strncmp(argv[arg], "-f", 2)==0) {
            if (hasArgument) {
                fourcc = fourCcFromString(argv[++arg]);
                if (!fourcc) {
                    showUsage = true;
                }
            } else {
                showUsage = true;
            }
        } else if (strncmp(argv[arg], "-t", 2)==0) {
            if (hasArgument && sscanf(argv[++arg], "%zu", &threads)) {
                if (!threads) {
                    showUsage = true;
                }
            } else {
                showUsage = true;
            }
        }
    }

    if (showUsage) {
        fprintf(stderr, "\n");
        fprintf(stderr, "usage:\n");
        fprintf(stderr, "  deepseg [-?] [-d] [-p] [-c <capture>] [-v <virtual>] [-w <width>] [-h <height>]\n");
        fprintf(stderr, "    [-t <threads>] [-b <background>] [-m <model>]\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "-?            Display this usage information\n");
        fprintf(stderr, "-d            Increase debug level\n");
        fprintf(stderr, "-p            Show progress bar\n");
        fprintf(stderr, "-c            Specify the video source (capture) device\n");
        fprintf(stderr, "-v            Specify the video target (sink) device\n");
        fprintf(stderr, "-w            Specify the video stream width\n");
        fprintf(stderr, "-h            Specify the video stream height\n");
        fprintf(stderr, "-f            Specify the camera video format, e.g. MJPG or 47504A4D.\n");
        fprintf(stderr, "-t            Specify the number of threads used for processing\n");
        fprintf(stderr, "-b            Specify the background image\n");
        fprintf(stderr, "-m            Specify the TFLite model used for segmentation\n");
        fprintf(stderr, "-H            Mirror the output horizontally\n");
        fprintf(stderr, "-V            Mirror the output vertically\n");
        exit(1);
    }

    printf("debug:  %d\n", debug);
    printf("ccam:   %s\n", ccam);
    printf("vcam:   %s\n", vcam);
    printf("width:  %zu\n", width);
    printf("height: %zu\n", height);
    printf("flip_h: %s\n", flipHorizontal ? "yes" : "no");
    printf("flip_v: %s\n", flipVertical ? "yes" : "no");
    printf("threads:%zu\n", threads);
    printf("back:   %s\n", back ? back : "(none)");
    printf("model:  %s\n\n", modelname);

    cv::Mat bg;
    if (back) {
        bg = cv::imread(back);
    }
    if (bg.empty()) {
        if (back) {
            printf("Warning: could not load background image, defaulting to green\n");
        }
        bg = cv::Mat(height,width,CV_8UC3,cv::Scalar(0,255,0));
    }
    cv::resize(bg,bg,cv::Size(width,height));

#if !_WINDOWS
    int lbfd = loopback_init(vcam,width,height,debug);
    if(lbfd < 0) {
        fprintf(stderr, "Failed to initialize vcam device.\n");
        exit(1);
    }
#endif
#if !_WINDOWS

    cv::VideoCapture cap(ccam, CV_CAP_V4L2);

#else

    cv::VideoCapture cap;
    int deviceID = 0;             // 0 = open default camera
    int apiID = cv::CAP_ANY;      // 0 = autodetect default API
    cap.open(deviceID, apiID);

#endif

    TFLITE_MINIMAL_CHECK(cap.isOpened());

    cap.set(CV_CAP_PROP_FRAME_WIDTH,  width);
    cap.set(CV_CAP_PROP_FRAME_HEIGHT, height);
    if (fourcc)
        cap.set(CV_CAP_PROP_FOURCC, fourcc);
    cap.set(CV_CAP_PROP_CONVERT_RGB, true);

    auto modeltype = get_modeltype(modelname);
    auto norm = get_normalization(modeltype);
    if (modeltype_t::Unknown == modeltype) {
        fprintf(stderr, "Unknown model type '%s'.\n", modelname);
        exit(1);
    }
    calcinfo_t calcinfo = { modelname, modeltype, norm, threads, width, height, debug };
    init_tensorflow(calcinfo);

    // kick off separate grabber thread to keep OpenCV/FFMpeg happy (or it lags badly)
#if !_WINDOWS
    pthread_t grabber;
    cv::Mat buf1;
    cv::Mat buf2;
    int64 oldcnt = 0;
    capinfo_t capinfo = { &cap, &buf1, &buf2, 0, &ti, PTHREAD_MUTEX_INITIALIZER };
    if (pthread_create(&grabber, NULL, grab_thread, &capinfo)) {
        perror("creating grabber thread");
        exit(1);
    }
#else
    cv::Mat buf1;
    cv::Mat buf2;
    int64 oldcnt = 0;

    capinfo_t capinfo = { &cap, &buf1, &buf2, 0, &ti};

    std::thread grabber(grab_thread, &capinfo);

#endif

    ti.lastns = timestamp();
    printf("Startup: %ldns\n", diffnanosecs(ti.lastns,ti.bootns));

    bool filterActive = true;

    // mainloop
    for(bool running = true; running; ) {
        // wait for next frame
        while (capinfo.cnt == oldcnt) usleep(10000);
        oldcnt = capinfo.cnt;
        int e1 = cv::getTickCount();
        ti.waitns=timestamp();

        // switch buffer pointers in capture thread
        pthread_mutex_lock(&capinfo.lock);
        ti.lockns=timestamp();
        cv::Mat *tmat = capinfo.grab;
        capinfo.grab = capinfo.raw;
        capinfo.raw = tmat;
        pthread_mutex_unlock(&capinfo.lock);
        // we can now guarantee capinfo.raw will remain unchanged while we process it..
        calcinfo.raw = *capinfo.raw;
        ti.copyns=timestamp();
        if (calcinfo.raw.rows == 0 || calcinfo.raw.cols == 0) continue; // sanity check

        if (filterActive) {
            // do background detection magic
            calc_mask(calcinfo, ti);

            // copy background over raw cam image using mask
            bg.copyTo(calcinfo.raw,calcinfo.mask);
        } // filterActive

        if (flipHorizontal && flipVertical) {
            cv::flip(calcinfo.raw,calcinfo.raw,-1);
        } else if (flipHorizontal) {
            cv::flip(calcinfo.raw,calcinfo.raw,1);
        } else if (flipVertical) {
            cv::flip(calcinfo.raw,calcinfo.raw,0);
        }
        ti.postns=timestamp();

#if !_WINDOWS
        // write frame to v4l2loopback as YUYV
        calcinfo.raw = convert_rgb_to_yuyv(calcinfo.raw);
        int framesize = calcinfo.raw.step[0]*calcinfo.raw.rows;
        while (framesize > 0) {
            int ret = write(lbfd,calcinfo.raw.data,framesize);
            TFLITE_MINIMAL_CHECK(ret > 0);
            framesize -= ret;
        }
#else
        cv::imshow("Live", calcinfo.raw);
        if (cv::waitKey(5) >= 0)
            break;
#endif

        ti.v4l2ns=timestamp();

        if (!debug) {
            if (showProgress) {
                printf(".");
                fflush(stdout);
            }
            continue;
        }

        // timing details..
        printf("wait:%9ld lock:%9ld [grab:%9ld retr:%9ld] copy:%9ld open:%9ld tflt:%9ld mask:%9ld post:%9ld v4l2:%9ld ",
            diffnanosecs(ti.waitns,ti.lastns),
            diffnanosecs(ti.lockns,ti.waitns),
            ti.grabns,
            ti.retrns,
            diffnanosecs(ti.copyns,ti.lockns),
            diffnanosecs(ti.openns,ti.copyns),
            diffnanosecs(ti.tfltns,ti.openns),
            diffnanosecs(ti.maskns,ti.tfltns),
            diffnanosecs(ti.postns,ti.maskns),
            diffnanosecs(ti.v4l2ns,ti.postns));

        int e2 = cv::getTickCount();
        float t = (e2-e1)/cv::getTickFrequency();
        printf("FPS: %5.2f\e[K\r",1.0/t);
        fflush(stdout);
        ti.lastns = timestamp();
        if (debug < 2) continue;

        cv::Mat test;
        cv::cvtColor(calcinfo.raw,test,CV_YUV2BGR_YUYV);
        cv::imshow("output.png",test);

        auto keyPress = cv::waitKey(1);
        switch(keyPress) {
            case 'q':
                running = false;
                break;
            case 's':
                filterActive = !filterActive;
                break;
            case 'h':
                flipHorizontal = !flipHorizontal;
                break;
            case 'v':
                flipVertical = !flipVertical;
                break;
        }
    }

    pthread_mutex_lock(&capinfo.lock);
    capinfo.grab = NULL;
    pthread_mutex_unlock(&capinfo.lock);
#if _WINDOWS
    // the std::thread must be joined before it is destroyed,
    // otherwise its destructor calls std::terminate()
    grabber.join();
#endif

    printf("\n");
    return 0;
}
BenBE commented 3 years ago

Looking at your code changes, it looks like you are compiling with MSVC in pre-C++11 mode, as

#if !_WINDOWS
            return normalization_t{.scaling = 1/255.0, .offset = 0};
#else
            {
                normalization_t norm;
                norm.scaling = 1 / 255.0;
                norm.offset = 0;
                return norm;
            }
#endif

uses the old-style member assignments. Alternatively, return normalization_t{1.0f/255.0f, 0}; is equivalent to both variants, but skips the nice field-initializer naming that modern C++ allows. The source code itself assumes it is compiled with a fully C++11-compliant compiler, with some parts probably using some C++14'isms here and there.
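For illustration only, here is a minimal sketch of how the #if split could collapse into a single portable body, assuming normalization_t stays a plain aggregate; positional brace initialization is valid C++11 on MSVC, GCC and Clang alike, while the designated initializers need C++20 (or compiler extensions):

normalization_t get_normalization(modeltype_t type) {
    // sketch: positional aggregate initialization, no platform-specific branches
    switch (type) {
        case modeltype_t::DeepLab:
            return normalization_t{1 / 127.5f, -1.0f}; // scaling, offset
        default:
            return normalization_t{1 / 255.0f, 0.0f};  // scaling, offset
    }
}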

Apart from that, you may want to split out the platform-dependent stuff into its own implementation files, with one common header providing the interface for these functions.

As you mentioned problems with pthreads: there is some work on getting the whole source code up to C++11 and thus also using the STL thread library everywhere. If you haven't had a look at the experimental branch yet, this is a good time to check. Another advantage is that the new branch uses CMake, which, in conjunction with Ninja, may provide a much better UX when compiling, as the build system is likely better aware of the oddities of compiling on Windows and can thus better handle conditionally compiling only certain files depending on the platform.

NB: Knowing the OpenSSL source code I got to loathe negative #if conditions …

OmarJay1 commented 3 years ago

Thanks. I'll look into enabling full C++11 compliance in Visual Studio. The only thing is that if it has to be done at the application level as opposed to the project level, it could cause confusion for users, who would have to change settings in their Visual Studio setup to get it to build.

I guess I should have used the experimental branch in the first place. I'll do a Windows build for that and see how it works. At first glance it seems that, besides the pthreadpool version, the only thing necessary would be a separate Windows viewer. I think the xwindows stuff wouldn't work on Windows, although it looks like you're using OpenGL to render and that's possible on Windows.

One thing that confuses me is I don't understand the necessity of converting to and from YUV when displaying frames.

BenBE commented 3 years ago

One thing that confuses me is I don't understand the necessity of converting to and from YUV when displaying frames.

The default pixel format for the virtual camera device is YUV, but some parts of the code operate on RGB data.

phlash commented 3 years ago

..the only thing necessary would be a separate Windows viewer. I think the xwindows stuff wouldn't work on Windows, although it looks like you're using OpenGL to render and that's possible on Windows.

Almost correct :smiley_cat: We are using OpenCV to render a monitor/debug video stream on screen, which would work on Windows; however, applications (e.g. Zoom, Teams, etc.) are expected to consume our processed video via a virtual camera. The work required to produce a virtual camera on Windows is non-trivial (and gross); it's much easier for us to plug into someone who has done the hard work, hence the separation into a library and wrapper app in the experimental branch, and my OBS Studio plug-in that uses the library as a demo (plus thinking about GStreamer and PipeWire plugins, and whatever macOS has). FYI, here's the OBS Studio code (which uses DirectShow to create a virtual camera) and the relevant SO thread :smile:

https://github.com/obsproject/obs-studio/tree/master/plugins/win-dshow https://stackoverflow.com/questions/33693131/how-to-create-virtual-webcam-in-windows-10

..compared to the code I needed to write for OBS Studio (233 LoC, one file): https://github.com/phlash/obs-backscrub.

The default pixel format for the virtual camera device is YUV

Yep, after some experimentation by @floe to find out what works in most consumer applications (turns out, they don't like RGB at all). The v4l2loopback module will transport almost any format as long as it can identify frame boundaries. [edit] to note that OBS Studio uses NV12 video format for the Windows virtual camera (likely for similar reasons).

OmarJay1 commented 3 years ago

So you would want obs-backscrub to build on Windows as well? The Stack Overflow article says a virtual camera on Windows is a kernel-mode driver, which, beyond building and testing, can be complex to install. Maybe I misread it.

If it's being targeted to OBS users who already have it installed, then it's probably useful to them.

I'm not saying it's not otherwise worthwhile, just trying to clarify what needs to be done.

Thanks.

phlash commented 3 years ago

AIUI, OBS Studio's code and the SO post's more detailed answer use DirectShow to create a virtual camera on Windows without a kernel driver, but there is still a lot of code to create a COM object, register it, etc. Other operating systems will differ again, hence we thought it wise to avoid repeating the work others have done (in OBS Studio and other media processing frameworks) and concentrate on the unique / valuable aspect here - using a TFLite model to scrub off the background - allowing others to connect that into their chosen video processing workflow / tools, while providing one implementation for Linux via a v4l2loopback virtual camera, as that is easy. Thus the separation into libbackscrub.a and deepseg in the experimental branch. As a demo I wrote an OBS Studio plugin using libbackscrub.a, which would be nice to have for other operating systems once we have libbackscrub.a available :smile:

Regarding targeting: I chose OBS Studio as earlier commenters mentioned how popular it is amongst the streaming community, and it lacks the feature we have here. If you have a different use case, then by all means address that itch first!

OmarJay1 commented 3 years ago

I'm going to have to play around with some of the virtual camera examples to see what's possible.

This https://github.com/Fenrirthviti/obs-virtual-cam apparently works by making the output of OBS a DirectShow virtual device. Since you already have an OBS plugin, that means people could use OBS + your plugin implemented on Windows to have a virtual camera.

If you want to have your viewer also output as a virtual camera with OBS, that's more complex.

On a slightly different topic, have you done any tests with, say, 1920x1080 video to see what it looks like? I'm curious as to how a 244x160 (? or whatever it is) mask performs and what kind of filters might make it look better with broadcast-quality video. If there's going to be a generic lib, that may be an issue.

Thanks.

phlash commented 3 years ago

I'm going to have to play around with some of the virtual camera examples to see what's possible.

Have fun! :smile: - just before you hop down that rabbit hole, would you mind sharing your current build environment info that compiles TFLite, as I have not been able to build anything so far with VS2019 build tools and Microsoft supplied CMake?

This https://github.com/Fenrirthviti/obs-virtual-cam apparently works by making the output of OBS a DirectShow virtual device. Since you already have an OBS plugin, that means people could use OBS + your plugin implemented on Windows to have a virtual camera.

Yep, this was my expectation - I think the later OBS (26+) pulled the parent fork (https://github.com/CatxFish/obs-virtual-cam) into their distribution (the code I referenced above looks very similar, and is in the core repo). There is also mention of a MacOS virtual camera in the README for the parent fork.

If you want to have your viewer also output as a virtual camera with OBS, that's more complex.

I don't think this is necessary, although I did wonder if we could load/reuse the OBS virtual camera plugin DLL for ourselves? That does seem like an odd use case though - if someone has installed OBS they probably want to use all of it, not just have us steal a bit of it...

On a slightly different topic, have you done any tests with, say, 1920x1080 video to see what it looks like? I'm curious as to how a 244x160 (? or whatever it is) mask performs and what kind of filters might make it look better with broadcast-quality video. If there's going to be a generic lib, that may be an issue.

This is a good question - I haven't tried it myself (only having a cheapo webcam!); @floe might have some thoughts on HD+ video processing? There are some non-real-time HD+ projects that get a mention in the 'other code bases' thread #58 too.

BenBE commented 3 years ago

On a slightly different topic, have you done any tests with, say, 1920x1080 video to see what it looks like? I'm curious as to how a 244x160 (? or whatever it is) mask performs and what kind of filters might make it look better with broadcast-quality video. If there's going to be a generic lib, that may be an issue.

This is a good question - I haven't tried it myself (only having a cheapo webcam!); @floe might have some thoughts on HD+ video processing? There are some non-real-time HD+ projects that get a mention in the 'other code bases' thread #58 too.

I have two PCs, each with an HD webcam, that I successfully managed to use as a 1280x720 MJPG video source for backscrub. The result is okay-ish, but you notice some blocky artifacts at the border of the mask. There are also issues on this subject, cf. #72 for smoothing and #65 for multiple segmentation passes per image …

OmarJay1 commented 3 years ago

Thanks. I'm trying to build the experimental branch and I'm having some trouble with CMake. The main branch downloaded Tensorflow + lite for me, and I had to fix a few issues, but the experimental branch gives me an error:

CMake Error at CMakeLists.txt:17 (add_subdirectory): add_subdirectory given source "tensorflow/tensorflow/lite" which is not an existing directory.

I'm completely clueless about CMake. Is there somewhere in CMakeLists.txt where I should look to fix this problem?

Thank you.

phlash commented 3 years ago

...but the experimental branch gives me an error: CMake Error at CMakeLists.txt:17 (add_subdirectory): add_subdirectory given source "tensorflow/tensorflow/lite" which is not an existing directory.

I'm completely clueless about CMake. Is there somewhere in CMakeLists.txt where I should look to fix this problem?

That looks like the Tensorflow source tree is not present? I usually do the following:

% cd backscrub
% git submodule update --init --recursive
% cd tensorflow
% git log

to ensure I have Tensorflow source at the expected version, before running any cmake commands.

Note that the GNU make build will do this for you, but our CMake build will not, hence my manual workaround above. There is probably a 'correct' way to use CMake to check out a submodule from git, but we haven't found it (yet!)

OmarJay1 commented 3 years ago

That worked, thanks.

I mentioned an issue on obs-backscrub concerning libobs.

The experimental branch of backscrub builds on Windows, except for pthreads issues with deepseg.cc. Are pthreads going to be taken out in favor of std::thread and std::mutex?

Also, I'm still a bit unclear on the DirectShow implementation and displaying video in deepseg.cc. Displaying the results of webcam input with background replacement in an OpenCV video window is trivial on Windows, without any need for the loopback mechanism.

If there's going to be any DirectShow virtual camera implementation in conjunction with the deepseg.cc sample app, it will definitely complicate the project significantly.

BenBE commented 3 years ago

Yes. The goal is to replace pthreads with their C++ counterparts std::thread and std::mutex.
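
Roughly, and only as a sketch of the direction (names like capinfo_std_t and grab_thread_std are placeholders, and the timing bookkeeping is omitted), the pthread-based grabber in deepseg.cc would then become something like:

#include <mutex>
#include <thread>

// capinfo_t as in deepseg.cc, but with a std::mutex instead of pthread_mutex_t
struct capinfo_std_t {
    cv::VideoCapture *cap;
    cv::Mat *grab;
    cv::Mat *raw;
    int64 cnt;
    timinginfo_t *pti;
    std::mutex lock;
};

// replaces the pthread-based grab_thread()
void grab_thread_std(capinfo_std_t *ci) {
    bool done = false;
    while (!done) {
        ci->cap->grab();
        // scoped lock replaces the pthread_mutex_lock/unlock pair
        std::lock_guard<std::mutex> hold(ci->lock);
        if (ci->grab != nullptr)
            ci->cap->retrieve(*ci->grab);
        else
            done = true;
        ci->cnt++;
    }
}

// in main():
//   std::thread grabber(grab_thread_std, &capinfo);  // replaces pthread_create()
//   ...
//   grabber.join();  // join before the thread object is destroyed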

phlash commented 3 years ago

The experimental branch of backscrub builds on Windows, except for pthreads issues with deepseg.cc. Are pthreads going to be taken out in favor of std::thread and std::mutex?

I can't see any direct references to pthread left in the experimental branch? Of course pthreads underlie the Linux implementation of std::thread, but on Windows it appears to use the native OS threads,^ so I'm not sure where you are seeing pthread issues?

^ https://github.com/microsoft/STL/blob/main/stl/src/cthread.cpp

OmarJay1 commented 3 years ago

My apologies. I was cloning the wrong branch.

I think what I'll do is modify the code to what I think will work on Windows and post as pull requests.

One issue that will take some experimenting is getting CMake to add a compiler option for Visual Studio.

https://stackoverflow.com/questions/64889383/how-to-enable-stdclatest-in-cmake

phlash commented 3 years ago

OK, so I've spent some time with a fresh Azure development VM: starting with the standard Microsoft template: Windows Server 2019 plus VS2019 Community image. Here's what works:

^ https://github.com/phlash/backscrub/tree/windows-build
https://github.com/phlash/obs-backscrub/tree/windows-build

ValentinePeltier commented 3 years ago

Hi, thanks a lot for this code. I have a problem which doesn't really concern deepseg but TensorFlow Lite. I have a WebRTC application where I want to remove the background; I used deepseg at first to evaluate TFLite and some models. The problem is that deepseg is far more efficient on the "invoke", and the main difference between my application and deepseg is the library used: in deepseg it is a static lib, and I use a DLL.

So I tried to use the static lib instead, but my application is compiled with /MT and TFLite is compiled with /MD, and I can't compile it with /MT.

Did some of you try to do this already and could help me? (I posted a message on TensorFlow's GitHub, but no answer since.)

phlash commented 3 years ago

@JVpltr - I didn't look very deeply into the TFLite Windows build, as it 'just worked'. From here it seems CMake defaults to /MD, as you have discovered, and there are ways to change that, which you might have tried already :smile: ?

I'm surprised there is any material performance difference between a DLL and a static library version of TFLite, unless your application is loading TFLite each time you invoke it (unlikely?), or reconfiguring TFLite on each invoke?

[edited to add] Does your DLL build of TFLite have XNNPACK enabled? This was a huge performance improvement (2x) for us.

ValentinePeltier commented 3 years ago

Hi,

Yes, I have tried to force /MT compilation in the TFLite CMake file (and all the dependencies), and even if the compilation is (it seems) done with /MT, I get some errors with dependency libraries like:

Error LNK2001 Unresolved external symbol "void __cdecl ruy::KernelFloatAvx(struct ruy::KernelParamsFloat<8,8> const &)" (?KernelFloatAvx@ruy@@YAXAEBU?$KernelParamsFloat@$07$07@1@@z) peerconnection_client .......\tensorflow-lite.lib(fully_connected.obj)

I configure my interpreter only once and invoke is called for each frame, the same (more or less) as in deepseg. It surprised me too, so I checked by putting the DLL library I compiled into the deepseg program, and the invoke also takes more time there.

ValentinePeltier commented 3 years ago

I've got some news.

So the /MT compilation works fine, but I had to link the TFLite static lib and all its dependencies in my project (which wasn't obvious for me, because I thought a static lib was supposed to contain all the symbols).

phlash commented 3 years ago

Yay - well done :smile: As you have discovered, static libraries don't get merged together as you build a tree of them, so you need all the transitive dependencies when linking the final binary. I wrote a macro in https://github.com/floe/backscrub/blob/experimental/CMakeLists.txt to collect all these and export them for callers to link with; feel free to borrow that one!

OmarJay1 commented 3 years ago

Sorry for not following this thread. Are there any remaining Windows issues that need to be dealt with?

Thanks.

phlash commented 3 years ago

@OmarJay1 - Aside from the compatibility fixes I had to make in my fork (to avoid having to pass /std:c++latest), everything appears to build and run (albeit with massive binaries!). I'm going to wait until we have merged experimental back into main before applying any changes to build on Windows, as they can then apply across the board.

phlash commented 3 years ago

Right! We have now merged everything together, so this is next :smile: