elan-ev / opencast-studio

Web-based recording studio for Opencast
https://studio.opencast.org
MIT License

Feature Request: Blur background of presenter's camera #740

Open JFD23 opened 3 years ago

JFD23 commented 3 years ago

Like on MS Teams or Zoom, users want to blur their camera background (for privacy reasons when they record lessons at home).

LukasKalbertodt commented 3 years ago

That's an interesting idea! However, implementing this is not trivial at all. It also likely makes Studio use even more resources (as in CPU & GPU), which is already a problem for some less powerful notebooks.

We will definitely keep this in mind, but due to time constraints, funding and other reasons, we won't be able to implement this anytime soon.

JFD23 commented 3 years ago

That is what I guessed.

Thanks for keeping it in mind for future development.

mwuttke commented 2 years ago

A successful example of a possible implementation of virtual camera backgrounds or the blur effect is BigBlueButton, I guess.

lkiesow commented 1 year ago

For building something like this, see "Edit live video background with WebRTC and TensorFlow.js".

This probably also means that we would need to record the camera differently, though. We can no longer just record the media source.
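
To illustrate what "recording the camera differently" could mean: instead of handing the getUserMedia stream straight to the recorder, we would record a stream derived from whatever surface the filtered frames are drawn to. A minimal sketch (not Studio's actual recording code; the drawing loop is omitted):

```ts
// Minimal sketch: record the processed output instead of the raw camera stream.
// The blur/drawing loop that paints camera frames onto `canvas` is omitted here.
const camera = await navigator.mediaDevices.getUserMedia({ video: true });

const canvas = document.createElement("canvas");
// ... continuously draw (blurred) frames from `camera` onto `canvas` ...

const processed = canvas.captureStream(30); // 30 FPS stream from the canvas
const recorder = new MediaRecorder(processed, { mimeType: "video/webm" });
recorder.ondataavailable = (event) => {
  // collect event.data (Blob chunks) for the final recording
};
recorder.start();
```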

JulianKniephoff commented 9 months ago

OK so I looked into this a bit now. Wall of text incoming.

First the good news: There is now a prototype of this feature deployed here (branch (permalink); commit c5203de8481e117f3e170af9d374f9f8636b9c23). This uses @shiguredo/virtual-background of @shiguredo/media-processors fame. I say "fame," but really this is a rather obscure library, and we probably don't want to rely on it for this. It's fine for the prototype, since we just wanted to see how a client-side blur would perform at all.

Which brings me to the bad news: The package ecosystem around this stuff is still rather barren, unfortunately. As far as sustainable, open-source, turnkey solutions go, I could really only find two: apart from the Shiguredo stuff, there is also skyway-video-processors. Both are of Japanese origin, interestingly, and sadly their documentation is mostly in Japanese as well. They both come from Japanese video conferencing companies (or similar services; my Japanese isn't perfect), and neither is compatible with all modern browsers. Shiguredo specifically also has some questionable notes regarding contributions in their docs.

Apart from that, there are smaller projects by individual developers that are either unstable or long dead (the projects, not the devs (or so I hope)). The other thing you find when you research the relevant keyword ("virtual background") is SDKs that integrate with third-party services like Twilio or EffectsSDK. Notably, these use server-side processing, as far as I could glean from looking at their stuff.

So to me that basically means we would have to do some of the required work ourselves, and if you look at existing solutions, like the mentioned BBB, that's what people generally do, apparently. So the question becomes "how?" I researched that a bit as well and want to collect my findings here, but while doing so I also got an increasing sense of the question actually being "whether:"

The reason why the SkyWay and Shiguredo stuff doesn't work in all browsers is that they use (different) new but not widely supported web APIs, namely MediaStreamTrackProcessor and HTMLVideoElement#requestVideoFrameCallback, respectively. (Before someone asks: yes, I have tried to polyfill the latter with rvfc-polyfill; it didn't work.) Digging around a bit further in that space shows that there is movement in this area: https://developer.chrome.com/blog/background-blur/.

The direction the web standards are moving in rather suggests that blur and other camera effects become something the browsers support natively, and the platform just exposes APIs to read out and maybe control those settings, like we can do for resolution, aspect ratio, etc. right now. And if you look at macOS, OS vendors are already implementing these things, which browsers could then integrate with. This really looks like the ideal future to me, where we would "just" need to add some UI to control these features and be done with it, instead of burdening the app with the considerable complexity that I can hopefully convey below. So yeah, I think it's a fair question to ask whether we want to get our hands dirty now when something like this is on the horizon anyway. 🤷‍♀️
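For illustration, here is roughly what "reading out" such a native setting looks like per that Chrome article. The `backgroundBlur` capability is experimental and not in any stable standard or the TypeScript DOM types yet (hence the casts), so treat this as a sketch of the direction, not something we can rely on today:

```ts
// Sketch based on the experimental `backgroundBlur` track capability described
// in the Chrome article above; property names may still change.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const [track] = stream.getVideoTracks();

const capabilities = track.getCapabilities() as Record<string, unknown>;
if ("backgroundBlur" in capabilities) {
  // The OS/browser can blur natively; we would only need UI to reflect
  // (and maybe control) this, instead of implementing the blur ourselves.
  const settings = track.getSettings() as Record<string, unknown>;
  console.log("native background blur is", settings.backgroundBlur ? "on" : "off");

  // Fires when the user toggles the effect in the OS (e.g. macOS video effects).
  track.addEventListener("configurationchange", () => {
    const s = track.getSettings() as Record<string, unknown>;
    console.log("native background blur changed:", s.backgroundBlur);
  });
} else {
  // No native support: fall back to (or hide) a client-side implementation.
}
```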

Anyhow, if we were to build it ourselves, here's roughly what it would take: The problem basically has two parts: a) Image segmentation, i.e. recognizing what's in the foreground vs. the background and creating a mask from that, and b) blurring the camera stream according to that mask.

Problem a) is a hard computer vision problem that we just won't solve ourselves from scratch. State-of-the-art algorithms employ "AI" (read: machine learning, specifically convolutional neural networks, AFAIK), and luckily there are pretrained models for it and even JavaScript packages that make it work in the browser (given a fast enough CPU and/or GPU). The most used solution here is Google's MediaPipe, which offers several image segmentation models, most notably/notoriously the so-called selfie segmentation. MediaPipe is a rather high-level framework for applying ML stuff to media; a bit more low-level, you find TensorFlow.js, which "just" gives you the generic ML framework TensorFlow in the browser. With this, a few more models are (more or less easily) accessible. One that you also find a lot of references to is BodyPix (which seems to be superseded by this, though).
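
To make that a bit more concrete, here is roughly what driving the (legacy) MediaPipe selfie segmentation package looks like in the browser, going by its documentation; purely a sketch, nothing of this exists in Studio:

```ts
import { SelfieSegmentation } from "@mediapipe/selfie_segmentation";

// Sketch following the package's documented usage; not wired into Studio.
const videoElement = document.querySelector("video")!; // element playing the camera stream

const selfieSegmentation = new SelfieSegmentation({
  // The model files (WASM + weights) are loaded separately, e.g. from a CDN.
  locateFile: (file) =>
    `https://cdn.jsdelivr.net/npm/@mediapipe/selfie_segmentation/${file}`,
});
selfieSegmentation.setOptions({ modelSelection: 1 }); // 0 = general, 1 = landscape model

selfieSegmentation.onResults((results) => {
  // results.image is the input frame, results.segmentationMask the per-pixel
  // foreground mask; compositing the two is problem b) below.
});

// Feed frames to the model, e.g. once per animation frame:
await selfieSegmentation.send({ image: videoElement });
```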

Now, while these solutions solve the hard CV problem, they still leave a lot of engineering to us: These frameworks can run using different backends (on the client, not server backends), like the CPU vs. the GPU, and there are WASM versions, etc., so integrating and tuning them properly is still a bit of work.
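
As an example of what that backend tuning means in practice with TensorFlow.js (a sketch; which backend actually wins differs per device):

```ts
import * as tf from "@tensorflow/tfjs";
// The WASM backend is a separate package and registers itself on import.
import { setWasmPaths } from "@tensorflow/tfjs-backend-wasm";

// Sketch: prefer the WebGL (GPU) backend and fall back to WASM on the CPU.
// setBackend resolves to false if the requested backend cannot be initialized.
setWasmPaths("https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-backend-wasm/dist/");
if (!(await tf.setBackend("webgl"))) {
  await tf.setBackend("wasm");
}
await tf.ready();
console.log("Using TensorFlow.js backend:", tf.getBackend());
```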

Problem b), on the other hand, is a rather basic computer graphics problem. (Well, you can certainly go down the rabbit hole of fast, high-quality blur algorithms, but a masked Gaussian blur on a 60 FPS video is something that most consumer-grade devices should be able to handle these days, and it looks just fine.) There aren't really any libraries specifically for it because it is such a basic task. Higher-level frameworks might have helpers for it, but I wouldn't want to pull in a 3D engine to blur a video frame. The devil is again in the details, though: There is some research to be done as to what's the best way to a) extract the individual frames from the browser media APIs, probably rendering them to some canvas, b) apply the actual blur calculation, hopefully in real time, and then c) create a media stream from said canvas again. The most interesting step is b), where you again have the choice of using WebGL and/or WASM, or even built-in browser features for blurring and composition if they are supported.
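
As a sketch of what this could look like with only built-in browser features (reusing the `results` object from the MediaPipe callback above; `ctx.filter` support varies by browser):

```ts
// Sketch of the masked blur with a plain 2D canvas, consuming the
// image/segmentationMask pair from a segmentation callback like the one above.
const canvas = document.createElement("canvas");
const ctx = canvas.getContext("2d")!;

function renderBlurredFrame(results: {
  image: CanvasImageSource;
  segmentationMask: CanvasImageSource;
}) {
  ctx.save();
  ctx.clearRect(0, 0, canvas.width, canvas.height);

  // Draw the mask, then keep only the foreground (person) pixels of the frame.
  ctx.drawImage(results.segmentationMask, 0, 0, canvas.width, canvas.height);
  ctx.globalCompositeOperation = "source-in";
  ctx.drawImage(results.image, 0, 0, canvas.width, canvas.height);

  // Paint a blurred copy of the same frame behind the person.
  ctx.globalCompositeOperation = "destination-over";
  ctx.filter = "blur(12px)"; // built-in canvas blur; WebGL/WASM are the fancier routes
  ctx.drawImage(results.image, 0, 0, canvas.width, canvas.height);

  ctx.restore();
}

// Turn the canvas back into a MediaStream that the recorder can consume.
const blurredStream = canvas.captureStream(30);
```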

I'll leave it at that for now, because our current funding for this sadly ran out anyway. Maybe this helps us down the line when people want to invest more in this; at the very least it gives us a basis to gauge how much further funding it would even take, and, again, whether or not it is even worth it. 🤷‍♀️