floe / backscrub

Virtual Video Device for Background Replacement with Deep Semantic Segmentation
Apache License 2.0
734 stars 85 forks source link

restart of backscrub not possible on ubuntu 21.10 #133

Closed wilkej closed 2 years ago

wilkej commented 2 years ago

Hello,

I can start backscrub based on the readme provided and it is working as wanted. That is great! When I stop backscrub via STRG+C I can't restart it. These are my devices after the first start and stop

VirtualCam (platform:v4l2loopback-000):
    /dev/video10

HD Pro Webcam C920 (usb-0000:00:14.0-4.2):
    /dev/video2
    /dev/video3

And this is the error

./backscrub -d -d -c /dev/video2 -v /dev/video10 -b ~/Pictures/image.jpg
./backscrub version heads/main-0-g92d868d
(c) 2021 by floe@butterbrot.org & contributors
https://github.com/floe/backscrub
debug:  2
ccam:   /dev/video2
vcam:   /dev/video10
width:  640
height: 480
flip_h: no
flip_v: no
threads:2
back:   /home/user/Pictures/image.jpg => /home/user/Pictures/image.jpg
model:  selfiesegmentation_mlkit-256x256-2021_01_19-v1215.f16.tflite => /home/user/github/backscrub/models/selfiesegmentation_mlkit-256x256-2021_01_19-v1215.f16.tflite

background properties:
    vid: no
    fcc: 65706a6d (mjpe)
    fps: 25.000000
    cnt: 1

/home/user/github/backscrub/videoio/loopback.cc:69(loopback_init): Failed to set device video format: Invalid argument
Failed to initialize vcam device.

If I want to restart it, I need to restart my PC. The command bash v4l2-ctl --list-devices doesn't show the video10 device after restart.

Any ideas? Thank you

BenBE commented 2 years ago

Does the situation change when you exit backscrub in the debug window by pressing q?

wilkej commented 2 years ago

Unfortunately not.

I did a small test.

After exectuing the command sudo modprobe v4l2loopback devices=1 max_buffers=2 exclusive_caps=1 card_label="VirtualCam" video_nr=10 I checked my video capture devices and the video10 device exists.

I can start backscrub the first time, but not the second. So I guess it has nothing to do with the setup of the virtual device (the modprobe command) I think it is something backscrube doesn't "clean" when it is stopped.

BenBE commented 2 years ago

Can you check if https://github.com/floe/backscrub/commit/b2b1f698ff3311a6252bbcfa35746738cd01a7ee brings any change?

Which version of v4l2loopback are you using? I read about some people having issues with restarting the video device with other applications too. Can you e.g. check if OBS studio streaming to this virtual device shows the same behavior?

Does unloading/reloading with rmmod/modprobe resolve the problem?

Can you check the exit code from the first instance of backscrub? Should be 0 on normal exit.

Lastly: On top of the above patch, please apply the following additional change:

diff --git a/app/deepseg.cc b/app/deepseg.cc
index 643e8392..18028f2e 100644
--- a/app/deepseg.cc
+++ b/app/deepseg.cc
@@ -8,6 +8,7 @@
 #include <string>
 #include <thread>
 #include <mutex>
+#include <functional>
 #include <fstream>
 #include <istream>
 #include <regex>
@@ -35,6 +36,20 @@

 #define DEBUG_WIN_NAME "Backscrub " _STR(DEEPSEG_VERSION) " ('?' for help)"

+class on_scope_exit {
+private:
+       const std::function<void()> dtor;
+public:
+       explicit on_scope_exit(const std::function<void()>& f) : dtor(f) {}
+       on_scope_exit() = delete;
+       on_scope_exit(const on_scope_exit&) = delete;
+       ~on_scope_exit() {
+               if(dtor) {
+                       dtor();
+               }
+       }
+};
+
 int fourCcFromString(const std::string& in)
 {
        if (in.empty())
@@ -518,6 +533,9 @@ int main(int argc, char* argv[]) try {
                fprintf(stderr, "Failed to initialize vcam device.\n");
                exit(1);
        }
+       on_scope_exit lbfd_closer([lbfd]() {
+               close(lbfd);
+       });

        cv::VideoCapture cap(s_ccam.c_str(), cv::CAP_V4L2);
        if(!cap.isOpened()) {
wilkej commented 2 years ago

Thank you for the quick response, here is my feedback

Which version of v4l2loopback are you using?

v4l2loopback-dkms is already the newest version (0.12.5-1ubuntu3)

Can you check the exit code from the first instance of backscrub? Should be 0 on normal exit.

No exit code, that is the last line when I press q in the command prompt main [grab: 27771295 retr: 240663 copy: 97951 prep: 2090215 mask: 4840798 post: 100 v4l2: 3296183 FPS: 26.08] ai: [wait: 27015185 prep: 1719438 tflt: 6621171 mask: 479488 FPS: 34.83]

Does unloading/reloading with rmmod/modprobe resolve the problem? Yes, I did

sudo rmmod v4l2loopback
sudo modprobe v4l2loopback devices=1 max_buffers=2 exclusive_caps=1 card_label="VirtualCam" video_nr=10

and was able to restart backscrub.

Regarding the code changes. Just to clearify. I'll do the changes and then I need to complie (make) the application again?

BenBE commented 2 years ago

Thank you for the quick response, here is my feedback

Which version of v4l2loopback are you using?

v4l2loopback-dkms is already the newest version (0.12.5-1ubuntu3)

Running v4l2loopback version 0.12.5-1 on Lubuntu 20.04 LTS here right now and not noticing any issues. Will have to check, if there are any changes in the packaged version, though likely nothing major.

Can you check the exit code from the first instance of backscrub? Should be 0 on normal exit.

No exit code, that is the last line when I press q in the command prompt main [grab: 27771295 retr: 240663 copy: 97951 prep: 2090215 mask: 4840798 post: 100 v4l2: 3296183 FPS: 26.08] ai: [wait: 27015185 prep: 1719438 tflt: 6621171 mask: 479488 FPS: 34.83]

Okay, that looks normal. At least means we're not getting any exception thrown.

Does unloading/reloading with rmmod/modprobe resolve the problem? Yes, I did

sudo rmmod v4l2loopback
sudo modprobe v4l2loopback devices=1 max_buffers=2 exclusive_caps=1 card_label="VirtualCam" video_nr=10

and was able to restart backscrub.

Okay, looks like there are some references not getting closed. I wonder if lsof reports anything (though I doubt).

Regarding the code changes. Just to clearify. I'll do the changes and then I need to complie (make) the application again?

Yes, exactly. Suffices to just run make in the build directory; no need to clean everything and rebuild all of TFLite from scratch. make should report rebuilding the modified files and libraries (only a few).

Regarding open file handles with v4l2loopback this issue may be related on first glance: https://github.com/umlaeute/v4l2loopback/issues/457

BenBE commented 2 years ago

@wilkej Any updates?

wilkej commented 2 years ago

@BenBE I did the code changes, first on the loopback.cc and ran a make. Build was ok I think but didn't change anything.

Here is the build output

make
[  0%] Built target pthreadpool
Scanning dependencies of target videoio
[  0%] Building CXX object CMakeFiles/videoio.dir/videoio/loopback.cc.o
[  1%] Linking CXX static library libvideoio.a
[  1%] Built target videoio
[  1%] Built target absl_exponential_biased
[  1%] Built target absl_spinlock_wait
[  1%] Built target absl_log_severity
[  1%] Built target absl_raw_logging_internal
[  1%] Built target absl_throw_delegate
[  1%] Built target absl_base
[  1%] Built target absl_malloc_internal
[  1%] Built target absl_civil_time
[  1%] Built target absl_int128
[  1%] Built target absl_strings_internal
[  1%] Built target absl_strings
[  1%] Built target absl_time_zone
[  1%] Built target absl_time
[  2%] Built target absl_debugging_internal
[  2%] Built target absl_stacktrace
[  2%] Built target absl_demangle_internal
[  3%] Built target absl_symbolize
[  3%] Built target absl_graphcycles_internal
[  3%] Built target absl_synchronization
[  3%] Built target absl_hashtablez_sampler
[  3%] Built target absl_bad_optional_access
[  3%] Built target absl_raw_hash_set
[  3%] Built target absl_flags_program_name
[  3%] Built target absl_bad_variant_access
[  3%] Built target absl_flags_config
[  3%] Built target absl_flags_commandlineflag_internal
[  3%] Built target absl_flags_commandlineflag
[  3%] Built target absl_flags_private_handle_accessor
[  3%] Built target absl_wyhash
[  3%] Built target absl_city
[  3%] Built target absl_hash
[  3%] Built target absl_cord
[  4%] Built target absl_flags_reflection
[  4%] Built target absl_str_format_internal
[  4%] Built target absl_flags_marshalling
[  4%] Built target absl_flags_internal
[  4%] Built target absl_flags
[  5%] Built target absl_status
[  5%] Built target farmhash
[  5%] Built target fft2d_fftsg
[  5%] Built target fft2d_fftsg2d
[  5%] Built target flatbuffers
[  7%] Built target ruy
[  7%] Built target clog
[  8%] Built target cpuinfo
[ 93%] Built target XNNPACK
[ 98%] Built target tensorflow-lite
[100%] Built target backscrub
[100%] Linking CXX executable backscrub
[100%] Built target deepseg

Then I did the second changes. No changes :( Here is the build output. Just to be sure there was something wrong.

make                                                                                                   
[  0%] Built target pthreadpool
[  1%] Built target videoio
[  1%] Built target absl_exponential_biased
[  1%] Built target absl_spinlock_wait
[  1%] Built target absl_log_severity
[  1%] Built target absl_raw_logging_internal
[  1%] Built target absl_throw_delegate
[  1%] Built target absl_base
[  1%] Built target absl_malloc_internal
[  1%] Built target absl_civil_time
[  1%] Built target absl_int128
[  1%] Built target absl_strings_internal
[  1%] Built target absl_strings
[  1%] Built target absl_time_zone
[  1%] Built target absl_time
[  2%] Built target absl_debugging_internal
[  2%] Built target absl_stacktrace
[  2%] Built target absl_demangle_internal
[  3%] Built target absl_symbolize
[  3%] Built target absl_graphcycles_internal
[  3%] Built target absl_synchronization
[  3%] Built target absl_hashtablez_sampler
[  3%] Built target absl_bad_optional_access
[  3%] Built target absl_raw_hash_set
[  3%] Built target absl_flags_program_name
[  3%] Built target absl_bad_variant_access
[  3%] Built target absl_flags_config
[  3%] Built target absl_flags_commandlineflag_internal
[  3%] Built target absl_flags_commandlineflag
[  3%] Built target absl_flags_private_handle_accessor
[  3%] Built target absl_wyhash
[  3%] Built target absl_city
[  3%] Built target absl_hash
[  3%] Built target absl_cord
[  4%] Built target absl_flags_reflection
[  4%] Built target absl_str_format_internal
[  4%] Built target absl_flags_marshalling
[  4%] Built target absl_flags_internal
[  4%] Built target absl_flags
[  5%] Built target absl_status
[  5%] Built target farmhash
[  5%] Built target fft2d_fftsg
[  5%] Built target fft2d_fftsg2d
[  5%] Built target flatbuffers
[  7%] Built target ruy
[  7%] Built target clog
[  8%] Built target cpuinfo
[ 93%] Built target XNNPACK
[ 98%] Built target tensorflow-lite
[100%] Built target backscrub
Scanning dependencies of target deepseg
[100%] Building CXX object CMakeFiles/deepseg.dir/app/deepseg.cc.o
[100%] Linking CXX executable backscrub
[100%] Built target deepseg

I exported a lsof export to search for open files but I didn't really now what to look for. Any hints?

BenBE commented 2 years ago

The compilation output of both steps looks fine. In each case the affected object files (loopback.cc.o and deepseg.cc.o) were rebuilt and backscrub subsequently linked.

Regarding lsof: Look for references to /dev/videoX in its output.

wilkej commented 2 years ago

:( nothing found when searching for videoX

BenBE commented 2 years ago

But with those two patches the restart issue still persists?

wilkej commented 2 years ago

sorry, clicked wrong button.

unfortunately yes. I still have the problem. I'll double check again later or tomorrow.

dengste commented 2 years ago

I have the same issue on Ubuntu 21.10. Looking with strace, I can see that the problem is that the VIDIOC_QUERYCAP ioctl is missing the V4L2_CAP_VIDEO_OUTPUT capability on the second invocation, and subsequently the VIDIOC_G_FMT just plain fails.

In detail: On first startup I see

ioctl(3, VIDIOC_QUERYCAP, {driver="v4l2 loopback", card="Virtual Webcam", 
bus_info="platform:v4l2loopback-000", version=KERNEL_VERSION(5, 13, 19),
capabilities=V4L2_CAP_VIDEO_OUTPUT|V4L2_CAP_EXT_PIX_FORMAT|V4L2_CAP_READWRITE|V4L2_CAP_STREAMING|V4L2_CAP_DEVICE_CAPS,
device_caps=V4L2_CAP_VIDEO_OUTPUT|V4L2_CAP_EXT_PIX_FORMAT|V4L2_CAP_READWRITE|V4L2_CAP_STREAMING}) = 0

whereas the second invocation returns

ioctl(3, VIDIOC_QUERYCAP, {driver="v4l2 loopback", card="Virtual Webcam", 
bus_info="platform:v4l2loopback-000", version=KERNEL_VERSION(5, 13, 19), 
capabilities=V4L2_CAP_EXT_PIX_FORMAT|V4L2_CAP_READWRITE|V4L2_CAP_STREAMING|V4L2_CAP_DEVICE_CAPS, 
device_caps=V4L2_CAP_EXT_PIX_FORMAT|V4L2_CAP_READWRITE|V4L2_CAP_STREAMING}) = 0

I can confirm that lsof does not show any usage of /dev/videoX after the program ends. Removing and reloading the the v4l2loopback resets this state and backscrub works again.

dengste commented 2 years ago

Loading v4l2loopback without exclusive_caps also makes it work, but I guess this will lead to problems with some applications? Anyway, it still seems to work fine with Teams. So I guess it is some problem with the device not being properly released, which is why it's missing the VIDEO_OUTPUT capability on subsequent starts of the program?

BenBE commented 2 years ago

I recently pushed two patches:

  1. https://github.com/floe/backscrub/commit/b2b1f698ff3311a6252bbcfa35746738cd01a7ee / https://github.com/floe/backscrub/commit/56d94fde75b8bdcef304ca7ed06c3d51e5b065ae
  2. https://github.com/floe/backscrub/commit/e3d1c229c2345a82cc2b2b77ae9571bc96521743

Please check if your local version includes both of them.

Also: Locally I have Linux Kernel 5.13.0-30 here on Ubuntu 20.04.3 LTS with v4l2loopback-dkms 0.12.5-1 and things run as intended.

dengste commented 2 years ago

I now compiled the experimental branch, still no change. The on_scope_exit() will not be called when terminating with Ctrl+C, but it also does not work when I cleanly exit in debug mode with 'q', where I verified the device is closed correctly. Looking at v4l2loopback, it seems that somehow the 'ready_for_output' flag is not reset when the device is closed. I would have to compile it with some debug messages to figure out what's going on... I'm on 5.13.0-28 and v4l2loopback-dkms 0.12.5-1.

BenBE commented 2 years ago

Just tested again with 0643e2bbbcfab6c9cf920cbde52985c2b69347a5 with both quitting with q from the debug mode (backscrub -d -d) and hitting Ctrl+c on the console window. In both cases restarting backscrub right after closing works as expected.

The only difference I currently still see is the minor difference in the kernel version (which I can't test right now, because that version isn't available in the focal repos anymore. Though TBH I doubt it's the running kernel that's causing issues in itself, because 5.13.0-27 worked fine for me too.

One effect that you may look into is if there is any consumer still trying to access the vcam device. I sometimes have this effect with Skype that it keeps the device open even while not using it and thus causing the start of backscrub to fail until I close Skype. Not sure if something similar may happen with Teams too.

BenBE commented 2 years ago

k, recently updated my system to Lubuntu 22.04 LTS and can now reproduce the issue here. What's strange here is that after the first invocation I'm also seeing the drop of the capabilities:

Freshly loaded with all needed caps:

[ 6359.192995] v4l2loopback driver version 0.12.5 loaded
[ 6372.823287] video0: VIDIOC_QUERYCAP: driver=v4l2 loopback, card=MyCam, bus=platform:v4l2loopback-000, version=0x00051101, capabilities=0x85200002, device_caps=0x05200002
[ 6372.823697] video0: VIDIOC_S_FMT: type=vid-out, width=1280, height=720, pixelformat=YUYV little-endian (0x56595559), field=none, bytesperline=2560, sizeimage=1843200, colorspace=8, flags=0x0, ycbcr_enc=0, quantization=0, xfer_func=0

After restarting the program:

[ 6412.626947] video0: VIDIOC_QUERYCAP: driver=v4l2 loopback, card=MyCam, bus=platform:v4l2loopback-000, version=0x00051101, capabilities=0x85200000, device_caps=0x05200000
[ 6412.626956] video0: VIDIOC_S_FMT: error -22: type=vid-out, width=1280, height=720, pixelformat=YUYV little-endian (0x56595559), field=none, bytesperline=2560, sizeimage=1843200, colorspace=8, flags=0x0, ycbcr_enc=0, quantization=0, xfer_func=0

To reproduce the kernel log messages:

echo 255 > /sys/devices/virtual/video4linux/video0/dev_debug

Still studying the v4l2loopback source if I can find anything I might be missing to return the device back to allow output to be written.

NB: While further looking at the code of the v4l2loopback kernel module I'm more and more convinced there's a bug with the exclusive_caps option, as even after closing the last writer on normal program termination I see the device loosing both CAPTURE and OUTPUT capabilities. Only workaround so far is unloading and reloading the kernel module, which gets kinda annoying fast. Interestingly when running close manually right after loopback_init and re-opening the device afterwards, I see things succeed. Something is absolutely strange with this!

BenBE commented 2 years ago

I may have a solution for the issue. Will file a bug report upstream though, because the device behavior is absolutely inconsistent and badly documented.

Anyways, @wilkej, could you please test the following patch (based on current experimental)?

diff --git a/app/deepseg.cc b/app/deepseg.cc
index d77e53c6..b2d80e45 100644
--- a/app/deepseg.cc
+++ b/app/deepseg.cc
@@ -256,7 +256,7 @@ int main(int argc, char* argv[]) try {
        }

        on_scope_exit lbfd_closer([lbfd]() {
-               close(lbfd);
+               loopback_free(lbfd);
        });

        cv::VideoCapture cap(ccam.c_str(), cv::CAP_V4L2);
diff --git a/videoio/loopback.cc b/videoio/loopback.cc
index e30c4c59..9b90bac3 100644
--- a/videoio/loopback.cc
+++ b/videoio/loopback.cc
@@ -37,7 +37,7 @@ int loopback_init(const std::string& device, int w, int h, int debug) {

        int ret_code;

-       int fdwr = open(device.c_str(), O_RDWR);
+       int fdwr = open(device.c_str(), O_RDWR|O_CLOEXEC);

        if(fdwr < 0) {
                fprintf(stderr, "%s:%d(%s): Failed to open video device: %s\n", __FILE__, __LINE__, __func__, strerror(errno));
@@ -53,21 +53,6 @@ int loopback_init(const std::string& device, int w, int h, int debug) {
        }

        memset(&vid_format, 0, sizeof(vid_format));
-       vid_format.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
-
-       ret_code = ioctl(fdwr, VIDIOC_G_FMT, &vid_format);
-       if(ret_code < 0) {
-               fprintf(stderr, "%s:%d(%s): Failed to get device video format: %s\n", __FILE__, __LINE__, __func__, strerror(errno));
-               close(fdwr);
-               return -1;
-       }
-
-       if(ret_code < 0) {
-               fprintf(stderr, "%s:%d(%s): Failed to get device video format: %s\n", __FILE__, __LINE__, __func__, strerror(errno));
-               close(fdwr);
-               return -1;
-       }
-
        vid_format.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
        vid_format.fmt.pix.width = w;
        vid_format.fmt.pix.height = h;
@@ -85,12 +70,40 @@ int loopback_init(const std::string& device, int w, int h, int debug) {
                return -1;
        }

+       ret_code = ioctl(fdwr, VIDIOC_STREAMON, &vid_format.type);
+
+       if(ret_code < 0) {
+               fprintf(stderr, "%s:%d(%s): Failed to start streaming: %s\n", __FILE__, __LINE__, __func__, strerror(errno));
+               close(fdwr);
+               return -1;
+       }
+
        if (debug)
                print_format(&vid_format);

        return fdwr;
 }

+int loopback_free(int fdwr) {
+
+       const v4l2_buf_type type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
+
+       int ret_code = ioctl(fdwr, VIDIOC_STREAMOFF, &type);
+
+       if(ret_code < 0) {
+               fprintf(stderr, "%s:%d(%s): Failed to stop streaming: %s\n", __FILE__, __LINE__, __func__, strerror(errno));
+               return -1;
+       }
+
+       ret_code = close(fdwr);
+       if(ret_code < 0) {
+               fprintf(stderr, "%s:%d(%s): Failed to close video device: %s\n", __FILE__, __LINE__, __func__, strerror(errno));
+               return -1;
+       }
+
+       return 0;
+}
+
 #ifdef standalone

 #define FRAME_WIDTH 640
diff --git a/videoio/loopback.h b/videoio/loopback.h
index d5edfbca..2362854e 100644
--- a/videoio/loopback.h
+++ b/videoio/loopback.h
@@ -7,5 +7,6 @@
 #include <string>

 int loopback_init(const std::string& device, int w, int h, int debug);
+int loopback_free(int fdwr);

 #endif // _LOOPBACK_H_

To explain: There's only one place in the v4l2loopback kernel module where the opener->type field is set as WRITER, which is when enabling the video streaming by calling ioctl with operation VIDIOC_STREAMON and parameter V4L2_BUF_TYPE_VIDEO_OUTPUT. Not running this ioctl causes the opener role to stay undefined and thus failing to reset the V4L2_CAP_VIDEO_OUTPUT by restoring the dev->ready_for_output marker in the v4l2_loopback_close handler.

phlash commented 2 years ago

@wilkej FYI this is now in main as a bug fix if you want to check that rather than experimental. Many thanks to @BenBE for the digging :smile:

wilkej commented 2 years ago

@phlash and @BenBE thank you for your work and apologies for the late feedback. I'm able to restart backscrub after I closed it.