MetaProvide / talked

Call recording for Nextcloud Talk
GNU Affero General Public License v3.0

Good start, but.... #34

Open ASerbinski opened 3 years ago

ASerbinski commented 3 years ago

The first line in the readme: "Talked works by launching a Firefox instance in a virtual framebuffer, joining the Talk call, and then recording whatever is on screen using FFmpeg."
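For readers unfamiliar with this kind of setup, the approach the readme describes can be sketched roughly as below. This is only an illustration, not Talked's actual code: the display name `:99`, the resolution, the PulseAudio source `default`, and the helper names `xvfb_command` and `capture_command` are all assumptions. The sketch only builds the command lines and does not launch anything.

```python
from typing import List


def xvfb_command(display: str, width: int, height: int) -> List[str]:
    """Build the command that starts a virtual framebuffer (assumed display name)."""
    return ["Xvfb", display, "-screen", "0", f"{width}x{height}x24"]


def capture_command(display: str, width: int, height: int,
                    fps: int, output: str) -> List[str]:
    """Build an FFmpeg command that grabs the X11 display plus PulseAudio audio."""
    return [
        "ffmpeg",
        "-f", "x11grab",                      # screen capture input device
        "-video_size", f"{width}x{height}",
        "-framerate", str(fps),
        "-i", display,                        # e.g. ":99", the Xvfb display
        "-f", "pulse", "-i", "default",       # assumed audio source name
        output,
    ]


if __name__ == "__main__":
    print(" ".join(xvfb_command(":99", 1280, 720)))
    print(" ".join(capture_command(":99", 1280, 720, 30, "call.mkv")))
```

A browser would then be started with `DISPLAY=:99` so that it renders into the virtual framebuffer being recorded.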

This seems like a very high-overhead approach. It really shouldn't be using Firefox (or any kind of web browser) at all. This is especially important since most SERVERS won't have Firefox or any of the thousand GUI-related dependencies installed. Instead, it should hook directly into the streams being sent from the server and feed them into a multi-track MKV file. There is no reason for any transcoding, since the streams are already compressed for transfer over the network; just capture each video and audio track straight to disk. Media players like VLC are capable of presenting MKV files with any number of video and audio tracks.
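To make the "no transcoding" idea concrete, here is a minimal sketch of the final muxing step only. It assumes each participant's stream has already been saved to its own file (the file names are hypothetical) and that FFmpeg is used for muxing; receiving the streams from a Talk call in the first place needs signaling and WebRTC handling that is not shown here. The helper name `build_mux_command` is made up for this example.

```python
from typing import List


def build_mux_command(track_files: List[str], output: str) -> List[str]:
    """Build an FFmpeg command that copies every input stream into one container."""
    if not track_files:
        raise ValueError("at least one track file is required")
    cmd = ["ffmpeg"]
    for path in track_files:
        cmd += ["-i", path]
    # "-map N" selects all streams of input N, so every track is kept.
    for index in range(len(track_files)):
        cmd += ["-map", str(index)]
    # "-c copy" is stream copy: packets are remuxed, not re-encoded.
    cmd += ["-c", "copy", output]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_mux_command(["alice.webm", "bob.webm"], "call.mkv")))
```

With two inputs this produces `ffmpeg -i alice.webm -i bob.webm -map 0 -map 1 -c copy call.mkv`.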

You can check the implementation in the mobile applications to see how to connect to a call and receive the streams. As this would be a recording-only client, you obviously don't need to implement the sending part.

https://nextcloud-talk.readthedocs.io/en/latest/

Also look up WebRTC. Here's somewhere to look at what NC does internally: https://help.nextcloud.com/t/how-do-i-connect-webrtc-for-talk-manually/108167

Kixunil commented 3 years ago

I believe everyone knows this. The way it currently works is supposedly easier to implement, but I would like to see you disprove that by providing some code. :)

ASerbinski commented 3 years ago

Nothing I've said is meant for you @Kixunil -- move along and troll someone else.

Kixunil commented 3 years ago

I know. I just explained that the author is well aware of what you're saying, and expressed interest in a proof that this code is not actually (significantly) simpler than recording directly. I'd rather use direct recording, but if there's no such code, indirect recording is a good temporary solution.

I have no idea what caused you to think I'm trolling given my conversations in this project and various other Open Source contributions. I'd love to avoid being misjudged in the future so if you can tell me what to change that'd be appreciated.

mwalbeck commented 3 years ago

It's important to remember that there are many different use cases for recording Talk conversations, and many different kinds of people who are going to use this functionality.

It's true, as you say, that the method used for Talked has a lot of overhead compared to just downloading each video stream and saving them in a mkv container. Furthermore, it's also true that doing as you suggest would be a better solution in terms of dealing with the video and audio feeds, but it comes with its own challenges, and as @Kixunil noted, it makes the implementation a lot more complex.

Firstly, I'm personally fine with a multi-track MKV file, but the users I'm supporting don't even know what an MKV file is. They don't want to fiddle with playback settings, they just want to click on the video and have it play. They also want to be able to watch the video directly in Nextcloud, but MKV isn't supported by all browsers, and I doubt, though I haven't tested it, that the viewer in Nextcloud can handle showing multiple tracks at once. So to make it work I would have to create logic for merging all the video tracks into one video feed after the recording finishes.

Secondly, but very much related to the first point, is the issue of handling screensharing and users expecting that watching a recording of a call will be a similar experience to participating in that same call. You first need to handle situations where a user starts a screenshare in the middle of a recording. You need to ensure continuity and also decide how to present it in the end. My users expect a screenshare to cover the whole screen, so that is one more thing that needs to be taken care of in the post-processing.
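As a rough illustration of the post-processing bookkeeping this implies, the sketch below splits a recording into time segments that should use either a participant grid or a full-screen screenshare layout. Everything here is hypothetical: the function name `layout_segments`, the layout labels, and the assumption that screenshare start and stop times (in seconds) have already been collected during the call.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]


def layout_segments(duration: float,
                    shares: List[Tuple[float, float]]) -> List[Segment]:
    """Split [0, duration] into "grid" and "screenshare" segments."""
    segments: List[Segment] = []
    cursor = 0.0
    for start, stop in sorted(shares):
        # Clamp each share to the part of the recording not yet covered.
        start = max(start, cursor)
        stop = min(stop, duration)
        if stop <= start:
            continue
        if start > cursor:
            segments.append((cursor, start, "grid"))
        segments.append((start, stop, "screenshare"))
        cursor = stop
    if cursor < duration:
        segments.append((cursor, duration, "grid"))
    return segments


if __name__ == "__main__":
    for segment in layout_segments(60.0, [(10.0, 25.0)]):
        print(segment)
```

Each segment would still have to be rendered and concatenated, which is where most of the real complexity lies.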

Thirdly, what about elements that are added on top of the video feeds and that can be useful to have in the recording, like a raised hand?

Lastly, this would most likely require a partial Talk client to be written to properly handle streams of users hopping in and out of calls, as well as turning video on and off, plus logic on top of that to make sure everything is neatly packed into a multi-track mkv file.

So it might be closer to ideal in terms of dealing with the video and audio feeds alone, but complexity-wise the two approaches aren't at all comparable.

I'll gladly be of assistance if someone wants to work on it, but it isn't something I have the time to implement myself. And I would rather spend the time I have on improving what is already here and working.

Also, @Kixunil is very much not a troll, but a very welcome contributor to this project.

Anyone is welcome to participate in the discussion as long as their input is relevant to the discussion at hand, and it's done in a polite manner.

MaxHillebrand commented 3 years ago

Good points @mwalbeck, thanks for the detailed writeup.

I can confirm that the feature set of Talked is "good enough" for me. But no clue how a direct recording might help for multi-audio-tracks, or audio-only recordings.

Kixunil commented 3 years ago

One thing I imagine could be cool, even though probably very laborious to write, is collecting metadata such as which person is talking, when a person joined/left, raised hands... and producing a project file for some video editor (e.g. Kdenlive, which I use). One could then open the project right away and get something close to what happened, with the ability to tweak things.
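A small sketch of the metadata side of this idea: turning a log of call events into per-participant presence intervals that a later converter could map onto editor tracks. The event tuple shape, the event names `joined` and `left`, and the function name `presence_intervals` are assumptions made for illustration; generating an actual Kdenlive project file is not attempted here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[float, str, str]  # (timestamp in seconds, participant, kind)


def presence_intervals(events: List[Event],
                       call_end: float) -> Dict[str, List[Tuple[float, float]]]:
    """Pair up joined/left events into (start, stop) intervals per participant."""
    open_since: Dict[str, float] = {}
    intervals = defaultdict(list)
    for timestamp, participant, kind in sorted(events):
        if kind == "joined" and participant not in open_since:
            open_since[participant] = timestamp
        elif kind == "left" and participant in open_since:
            intervals[participant].append((open_since.pop(participant), timestamp))
        # Other kinds (e.g. a raised hand) are ignored in this sketch.
    # Anyone still present when the recording stops is closed at call_end.
    for participant, start in open_since.items():
        intervals[participant].append((start, call_end))
    return dict(intervals)


if __name__ == "__main__":
    log = [(0.0, "alice", "joined"), (5.0, "bob", "joined"), (30.0, "bob", "left")]
    print(presence_intervals(log, call_end=60.0))
```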

MaxHillebrand commented 3 years ago

On top of that - a timestamped transcript would be great too...

mwalbeck commented 3 years ago

> I can confirm that the feature set of Talked is "good enough" for me. But no clue how a direct recording might help for multi-audio-tracks, or audio-only recordings.

Audio-only recordings are fairly simple to add with the current setup and I plan on adding the feature fairly soon.

Audio-only recordings in a direct recording setup are definitely easier than recordings that include video, as there are fewer things you have to take care of, but you would still need a Talk client that can handle the signaling and WebRTC, and one would still need to ensure continuity. You would, though, get multi-track audio recordings more or less for free with a direct recording implementation. But from an audio production standpoint, I think local recording would be better, since you can skip the lossy encoding entirely. From my perspective, local recording of audio would be more useful, and most likely also easier to implement, than direct recording.

> One thing I imagine could be cool, even though probably very laborious to write, is collecting metadata such as which person is talking, when a person joined/left, raised hands... and producing a project file for some video editor (e.g. Kdenlive, which I use). One could then open the project right away and get something close to what happened, with the ability to tweak things.

Yeah, this would definitely require a Talk client to be written, but if that is implemented, direct recording makes a lot of sense. It would definitely be cool from a video production standpoint, where the intention is to publish the recording afterwards.

I think the ideal solution would be to support indirect, direct and local recording, as they all have their pros and cons, and which one is best depends on the use case.

> On top of that - a timestamped transcript would be great too...

That would be very useful, and maybe something that is slowly being worked on already albeit as a separate tool. 😉