ReSpeak / tsclientlib

A TeamSpeak3 protocol implementation for usage in clients and bots
Apache License 2.0

Report number of written frames with AudioHandler::fill_buffer #15

Closed tyranron closed 4 years ago

tyranron commented 4 years ago

I'm trying to wrap Connection + AudioHandler into a tokio::AsyncRead which is then fed into FFmpeg's STDIN. In the implementation I have an intermediate buffer, which is filled with AudioHandler::fill_buffer and then written as BE bytes to the given buf. The problem is that after calling AudioHandler::fill_buffer I don't know exactly how much of the buffer was filled, so I don't know how much data should be read from that intermediate buffer.
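
For illustration, here is a minimal, self-contained sketch of the bookkeeping this is about. `Mixer` and `read_mixed_be` are made-up stand-ins rather than tsclientlib API, and the `-> usize` return value is the behavior this issue asks for, not what fill_buffer currently provides:

```rust
// Hypothetical stand-in for AudioHandler; the `-> usize` return value is what this
// issue requests, so the caller can tell how much of the buffer is valid.
trait Mixer {
    /// Mixes queued audio into `buf` and returns the number of samples written.
    fn fill_buffer(&mut self, buf: &mut [f32]) -> usize;
}

/// Converts freshly mixed samples to big-endian bytes, e.g. for FFmpeg's STDIN.
fn read_mixed_be(mixer: &mut impl Mixer, scratch: &mut [f32], out: &mut Vec<u8>) {
    let written = mixer.fill_buffer(scratch);
    // Only the first `written` samples are valid; without the returned count there
    // is no way to know where the mixed data in `scratch` ends.
    for sample in &scratch[..written] {
        out.extend_from_slice(&sample.to_be_bytes());
    }
}
```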

tyranron commented 4 years ago

@Flakebi

It seems I don't quite understand why AudioHandler works the way it does.

I've written a similar hand-rolled implementation to mix multiple talkers. The main difference from AudioHandler, which confuses me, is that my mixing doesn't write any data if there is an active talker whose sample queue is empty, because in that case we would corrupt the sample rate. However, AudioHandler writes as much data as it has, and I don't understand why that works.

My point is that, independently of the number of talkers, the resulting sample stream must have a constant sample rate. With one talker it's fine: audiopus does the job and we get a sample stream with the desired sample rate. But with 2 or more talkers, to preserve the sample rate and not produce more samples than expected, we should mix the concurrent samples of each talker into a single sample stream (AudioHandler does this), but only when we have samples for all concurrent talkers (AudioHandler doesn't do this). Why is this important?

Consider the situation:

  1. We have two active talkers: Alice and Bob.
  2. Alice's AudioQueue contains 5 samples.
  3. Bob's AudioQueue contains 0 samples.
  4. We call AudioHandler::fill_buffer and it mixes 5 samples from Alice's AudioQueue with 0 samples from Bob's AudioQueue and writes them.
  5. A new audio packet arrives from Bob and his AudioQueue now contains 7 samples.
  6. We call AudioHandler::fill_buffer and it mixes 0 samples from Alice's AudioQueue with 7 samples from Bob's AudioQueue and writes them.
  7. The resulting sample rate is corrupted: 5 + 7 samples were written instead of just 5.

How it should work:

  1. We have two active talkers: Alice and Bob.
  2. Alice's AudioQueue contains 5 samples.
  3. Bob's AudioQueue contains 0 samples.
  4. We call AudioHandler::fill_buffer and it writes nothing, as Bob's AudioQueue has no samples.
  5. A new audio packet arrives from Bob and his AudioQueue now contains 7 samples.
  6. We call AudioHandler::fill_buffer and it mixes 5 samples from Alice's AudioQueue with 5 samples from Bob's AudioQueue and writes them.
  7. The resulting sample rate is as expected: 5 mixed samples were produced (see the sketch below).
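
As a rough sketch of that rule (made-up types, not tsclientlib's actual AudioQueue or AudioHandler): only as many samples as every active talker can provide get mixed and written.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical mixer: mixes only as many samples as the emptiest active queue
/// holds, so the combined stream never runs ahead of any talker.
fn mix_available(queues: &mut HashMap<u64, VecDeque<f32>>, out: &mut Vec<f32>) {
    let n = queues.values().map(VecDeque::len).min().unwrap_or(0);
    for _ in 0..n {
        let mut mixed = 0.0;
        for queue in queues.values_mut() {
            mixed += queue.pop_front().unwrap_or(0.0);
        }
        // Clamp to avoid clipping when several talkers are loud at the same time.
        out.push(mixed.clamp(-1.0, 1.0));
    }
}
```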

I'm not a big expert in audio, but this way of mixing talkers' audio came about after a naive implementation didn't work well and produced corrupted audio with more than one talker, and it has been battle-tested for quite a while. So, am I missing something here?

Flakebi commented 4 years ago

Hi, audio mixing is a complicated topic and it took me a while to arrive at the current implementation. I built the AudioHandler with desktop clients in mind. That said, the AudioHandler should be useful for other cases too, but this use case may explain the way it works. The AudioHandler expects that a consumer polls for audio regularly, e.g. for playback. That means, for stereo audio and a 20 ms buffer time (the usual size of Opus packets), a buffer holds 48 000 / 50 * 2 = 1920 floats and a consumer should call AudioHandler::fill_buffer with this size every 20 ms. In other words, the AudioHandler does not do any timing; the consumer has to do this.
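
As a rough illustration of that contract (not tsclientlib code; the closure stands in for AudioHandler::fill_buffer, whose exact signature may differ):

```rust
use std::{thread, time::Duration};

/// Consumer-driven playback loop: the caller provides the timing and asks for one
/// 20 ms stereo frame per tick. `fill_buffer` stands in for AudioHandler::fill_buffer.
fn playback_loop(mut fill_buffer: impl FnMut(&mut [f32])) {
    // 48 kHz stereo, 20 ms per frame: 48_000 / 50 * 2 = 1920 floats.
    let mut buf = vec![0.0f32; 1920];
    loop {
        // Reset before mixing (whether fill_buffer overwrites or adds is an assumption).
        buf.fill(0.0);
        fill_buffer(&mut buf);
        // Hand `buf` to the sound backend here; on a desktop client the backend's
        // hardware clock would drive this loop instead of a plain sleep.
        thread::sleep(Duration::from_millis(20));
    }
}
```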

On desktop clients, the timing is usually given by the sound hardware (afaik, I don’t know the details here). SDL or gstreamer use the hardware clock and poll a function with the frequency dictated by the hardware. This function then fills the buffer with AudioHandler::fill_buffer.

I think mixing audio without a timer is hard, if not impossible. If a client like Bob has no samples available, it can be because he has stopped talking or because his next packet is simply delayed or lost, and without a timer there is no way to tell the difference.

I'm not sure how you want to use the TS audio streams, so my suggestions are:

  1. for a desktop client, the hardware clock should be used
  2. otherwise, an easy way is to just fill a buffer every 20 ms
  3. if the latency should be really low and you want to reduce the number of buffers to the bare minimum, I think the best way is to buffer only once, before the final mixing.

In that case, things to consider are packet loss (and FEC, forward error correction), out-of-order packets, and keeping the buffer big enough to cope with jitter but small enough to keep latency low.

Subjectively, the current implementation of AudioHandler works quite well for desktop client use. I only tested it on stable ethernet connections though (so ~20 ms ping, not much jitter and not much packet loss).

I hope this explanation was useful, take all this with a grain of salt, this is my first audio project :)

tyranron commented 4 years ago

@Flakebi thank you for such a detailed answer. Now I think I see the root of the problem: we've designed our code for quite different kinds of playback.

In your case, playback is quite slow: it reads data on a timer, as you've described.

In my case, playback reads data as fast as possible: tokio::io::copy drains the AudioHandler whenever it has any data at all, then feeds that data directly to FFmpeg's STDIN, which in turn collects it into its internal buffer and only then feeds it to the real playback with its own wall clock.

That's it. So, the problem I've described above is theoretically true for your case too, but in practice, because playback is slow enough, you just never hit it: by the time playback requests data, enough data has already arrived in all talkers' queues. As my playback takes data as fast as possible, I hit the problem inevitably.

If you don't mind, I can rewrite AudioHandler in a separate PR so it can work with fast, exhaustive playback. And I bet this will make no difference for your use case, where playback is slow enough, simply because by the time it requests data, there will be enough data queued for writing in both cases.

Flakebi commented 4 years ago

If you get something working, I’m happy to accept PRs.

I’m curious how you want to solve the problem without a timer; the case I’m thinking about is:

  1. Alice and Bob have empty queues
  2. The client receives 20 ms audio data from Alice. How much of this is sent to ffmpeg? 1 ms, 5 ms, 20 ms?
  3. 5 ms later, the client receives 20 ms audio data from Bob. Now the streams need to be mixed, depending on how much data from Alice was consumed before.
  4. Alice and Bob keep talking

The ideal playback would be: 5 ms of Alice alone, then 15 ms of Alice and Bob mixed, then the remaining 5 ms of Bob’s first packet mixed with Alice’s next packet.

tyranron commented 4 years ago

@Flakebi well, after playing with it for quite a while and spotting more subtle problems with my solution, I can only conclude that you were right.

I've followed your advice and implemented a wallclock-based AsyncRead, and it just works like a charm. Because it produces a stable, constant sample rate, it also simplified my FFmpeg pipeline.
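
For illustration, a rough sketch of such a wallclock-paced approach (not the actual code; the closure stands in for AudioHandler::fill_buffer, and writing straight to FFmpeg's STDIN replaces the AsyncRead wrapper for brevity):

```rust
use std::time::Duration;
use tokio::io::{AsyncWrite, AsyncWriteExt};

/// Wallclock-paced pump: one 20 ms frame per tick, converted to big-endian bytes
/// and written to e.g. FFmpeg's STDIN. `fill_buffer` stands in for AudioHandler::fill_buffer.
async fn pump<W>(mut fill_buffer: impl FnMut(&mut [f32]), mut sink: W) -> std::io::Result<()>
where
    W: AsyncWrite + Unpin,
{
    let mut frame = vec![0.0f32; 1920]; // 48 kHz stereo, 20 ms
    let mut bytes = Vec::with_capacity(frame.len() * 4);
    let mut ticker = tokio::time::interval(Duration::from_millis(20));
    loop {
        ticker.tick().await;
        frame.fill(0.0);
        fill_buffer(&mut frame);
        bytes.clear();
        for sample in &frame {
            bytes.extend_from_slice(&sample.to_be_bytes());
        }
        // Emitting exactly one frame per tick keeps the sample rate constant even
        // when some talkers' queues are momentarily empty.
        sink.write_all(&bytes).await?;
    }
}
```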

Thank you again for the explanation. It has helped me a lot! ❤️

I'm closing this issue, as it doesn't make much sense now.

Flakebi commented 4 years ago

Nice! Good to hear that you got it working.