tyranron closed this 4 years ago
@Flakebi
It seems that I don't quite understand why AudioHandler works the way it does.
I've done a somewhat similar hand-baked implementation to mix multiple talkers. The main difference from AudioHandler, which confuses me, is that my hand-baked mixing doesn't write any data if there is an active talker whose samples queue is empty, because in that case we would corrupt the sample rate. However, AudioHandler writes as much data as it has, and I don't understand why that works.
My point is that, independently of the number of talkers, the resulting samples stream must provide a constant sample rate. When there is one talker it's OK: audiopus does the job, and we have a samples stream with the desired sample rate. But when there are 2 or more talkers, to preserve the sample rate and not produce more samples than expected, we should mix the concurrent samples of each talker into a single samples stream (AudioHandler does this), but only when we have samples for all concurrent talkers (AudioHandler doesn't do this). Why is this important?
Consider the situation. How it works now:

1. There are two talkers: Alice and Bob.
2. Alice's AudioQueue contains 5 samples.
3. Bob's AudioQueue contains 0 samples.
4. AudioHandler::fill_buff is called and it mixes 5 samples from Alice's AudioQueue with 0 samples from Bob's AudioQueue and writes them.
5. New samples arrive from Bob and his AudioQueue now contains 7 samples.
6. AudioHandler::fill_buff is called and it mixes 0 samples from Alice's AudioQueue with 7 samples from Bob's AudioQueue and writes them.

How it should work:

1. There are two talkers: Alice and Bob.
2. Alice's AudioQueue contains 5 samples.
3. Bob's AudioQueue contains 0 samples.
4. AudioHandler::fill_buff is called and it writes nothing, as Bob's AudioQueue has no samples.
5. New samples arrive from Bob and his AudioQueue now contains 7 samples.
6. AudioHandler::fill_buff is called and it mixes 5 samples from Alice's AudioQueue with 5 samples from Bob's AudioQueue and writes them.

I'm not a big expert in audio, but this way of mixing talkers' audio came out of a naive implementation that hadn't worked well and produced corrupted audio with >1 talkers, and it has been battle-tested for quite a while. So, am I missing something here?
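To make the rule concrete, here is a minimal sketch of the mixing I'm describing (the types and names are hypothetical, not the actual AudioHandler internals): mix only as many samples as every active talker can currently provide.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical per-talker queue of decoded f32 samples (stand-in for AudioQueue).
type SampleQueue = VecDeque<f32>;

/// Mix only as many samples as *every* active talker currently has queued,
/// so the mixed stream never runs ahead of the slowest talker.
fn mix_min_available(talkers: &mut HashMap<u16, SampleQueue>, out: &mut Vec<f32>) {
    // How many samples all active talkers can provide right now.
    let ready = talkers.values().map(|q| q.len()).min().unwrap_or(0);
    if ready == 0 {
        return; // at least one active talker has an empty queue -> write nothing
    }
    for _ in 0..ready {
        let mut mixed = 0.0_f32;
        for queue in talkers.values_mut() {
            mixed += queue.pop_front().unwrap_or(0.0);
        }
        // Naive clipping; a real mixer would handle this more carefully.
        out.push(mixed.clamp(-1.0, 1.0));
    }
}
```

With the scenario above, the first call would write nothing (Bob has 0 samples) and the later call would write 5 mixed samples.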
Hi,
Audio mixing is a complicated topic and it took me a while to arrive at the current implementation.
I built the AudioHandler with desktop clients in mind. That said, the AudioHandler should be useful for other cases too, but this use case may explain the way it works.

The AudioHandler expects that a consumer polls for audio regularly, e.g. for playback. That means, for stereo audio and 20 ms buffer time (the usual size of opus packets), a buffer holds 48 000 / 50 * 2 = 1920 floats and a consumer should call AudioHandler::fill_buffer with this size every 20 ms. In other words, the AudioHandler does not do any timing; the consumer has to do this.

On desktop clients, the timing is usually given by the sound hardware (afaik, I don't know the details here). SDL or gstreamer use the hardware clock and poll a function with the frequency dictated by the hardware. This function then fills the buffer with AudioHandler::fill_buffer.
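To illustrate what I mean by the consumer doing the timing, a rough sketch (assuming fill_buffer takes a mutable f32 slice and fills it completely, padding with silence; the real signature may differ, and in a desktop client the loop below would be a hardware-driven callback instead of a sleep):

```rust
use std::{thread, time::Duration};

const SAMPLE_RATE: usize = 48_000;
const CHANNELS: usize = 2;
const FRAME_MS: usize = 20; // usual duration of an opus packet
// 48_000 / 50 * 2 = 1920 floats per 20 ms of stereo audio
const BUF_LEN: usize = SAMPLE_RATE / (1_000 / FRAME_MS) * CHANNELS;

// `AudioHandler` is the type discussed here (import omitted).
fn playback_loop(handler: &mut AudioHandler) {
    let mut buf = vec![0.0_f32; BUF_LEN];
    loop {
        // The timing lives in the consumer; SDL/gstreamer would call us from
        // the sound hardware's clock instead of this plain sleep.
        thread::sleep(Duration::from_millis(FRAME_MS as u64));
        handler.fill_buffer(&mut buf); // assumed signature
        // ... hand `buf` to the audio device for playback ...
    }
}
```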
I think mixing audio without a timer is hard, if not impossible. If a client like Bob has no samples available, it can be because

- Bob's packet is delayed and we should wait with mixing
- Bob finished speaking but he didn't tell us
- Bob lost his connection or he is lagging a lot

Not sure how you want to use the ts audio streams, so my suggestion is to drive the mixing with a timer as well. In that case, things to consider are packet loss (and fec/forward error correction), out-of-order packets, and keeping the buffer big enough to cope with jitter and small enough to reduce latency.
Subjectively, the current implementation of AudioHandler works quite well for desktop client use. I only tested it on stable ethernet connections though (so ~20 ms ping, not much jitter and not much packet loss).
I hope this explanation was useful, take all this with a grain of salt, this is my first audio project :)
@Flakebi thank you for such a detailed answer. Now I think I see the root of the problem: we've designed our stuff for quite different kinds of playback.
In your case, playback is quite slow. It reads data by a timer, as you've described.
In my case, playback reads data as fast as possible: tokio::io::copy drains the AudioHandler as soon as it has any data at all, then feeds that data directly to FFmpeg's STDIN, which, in turn, collects it into its internal buffer and after that feeds it to the real playback with its own wallclock.
That's it. So, the problem I've described above is theoretically true for your case too, but in practice, because playback is slow enough, you just never hit it: by the time playback requests data, all talkers' queues already have enough data. As my playback takes data as fast as possible, I hit the problem inevitably.
If you don't mind, I can rewrite AudioHandler in a separate PR so that it can also work with fast, exhaustive playbacks. And I bet this will make no difference for your use case, where playback is slow enough, simply because by the time it requests data, there will be enough data queued for writing in both cases.
If you get something working, I’m happy to accept PRs.
I'm curious how you want to solve the problem without a timer. The case I'm thinking about is:

1. Alice and Bob have empty queues.
2. A packet from Alice arrives. How much of this is sent to ffmpeg? 1 ms, 5 ms, 20 ms?
3. A packet from Bob arrives. Now the streams need to be mixed, depending on how much data from Alice was consumed before.
4. Alice and Bob keep talking.

The ideal playback would be: 5 ms Alice, 15 ms Alice and Bob mixed, 5 ms of the rest of Bob's first packet mixed with the next packet from Alice.
@Flakebi well, after playing with it for quite a while and spotting more subtle problems with my solution, I can only conclude that you were right.
I've followed your advice and implemented a wallclock-based AsyncRead, and that just works like a charm. Because it produces a stable, constant sample rate, it has also simplified my FFmpeg pipeline.
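For reference, the core of the wallclock pacing looks roughly like this (sketched as an interval-driven task rather than the actual AsyncRead implementation; fill_buffer is assumed to take a mutable f32 slice and leave unfilled samples as silence):

```rust
use std::time::{Duration, Instant};
use tokio::io::AsyncWriteExt as _;

const SAMPLE_RATE: u64 = 48_000;
const CHANNELS: u64 = 2;

/// Emit exactly as many samples as the wallclock says should exist by now,
/// no more and no less, so the output keeps a constant sample rate.
async fn paced_pump(
    handler: &mut AudioHandler, // from tsclientlib (import omitted)
    mut sink: impl tokio::io::AsyncWrite + Unpin,
) -> std::io::Result<()> {
    let start = Instant::now();
    let mut emitted: u64 = 0; // samples written so far (counting all channels)
    let mut tick = tokio::time::interval(Duration::from_millis(20));
    loop {
        tick.tick().await;
        // How many samples should have been produced by now, per the wallclock.
        let due = start.elapsed().as_millis() as u64 * SAMPLE_RATE * CHANNELS / 1_000;
        let missing = (due - emitted) as usize;
        let mut buf = vec![0.0_f32; missing]; // stays silent where queues are empty
        handler.fill_buffer(&mut buf); // assumed signature
        let bytes: Vec<u8> = buf.iter().flat_map(|s| s.to_be_bytes()).collect();
        sink.write_all(&bytes).await?;
        emitted += missing as u64;
    }
}
```

The real thing wraps the same pacing idea into an AsyncRead, but this is the gist.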
Thank you again for the explanation. It has helped me a lot! ❤️
I'm closing this PR, as it doesn't make much sense now.
Nice! Good to hear that you got it working.
I'm trying to wrap Connection + AudioHandler into a tokio::AsyncRead, which is then fed into FFmpeg's STDIN. In the implementation I have an in-between buffer, which is filled with AudioHandler::fill_buf and then written as BE bytes to the given buf. The problem is that after calling AudioHandler::fill_buf I don't know how much of the given buffer was filled exactly, so I don't know how much data should be read from that in-between buffer.
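To make the setup clearer, a rough sketch of what I mean (hypothetical names; fill_buf is assumed to take a mutable f32 slice; the comment marks exactly where the problem appears):

```rust
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};
use tokio::io::{AsyncRead, ReadBuf};

struct AudioReader {
    handler: AudioHandler, // wraps Connection + AudioHandler (import omitted)
    staging: Vec<u8>,      // the in-between buffer, already converted to BE bytes
    pos: usize,            // how much of `staging` was already handed out
}

impl AsyncRead for AudioReader {
    fn poll_read(
        self: Pin<&mut Self>,
        _cx: &mut Context<'_>,
        buf: &mut ReadBuf<'_>,
    ) -> Poll<io::Result<()>> {
        let this = self.get_mut(); // assuming all fields are Unpin
        if this.pos == this.staging.len() {
            let mut samples = vec![0.0_f32; 1920];
            this.handler.fill_buf(&mut samples); // assumed signature
            // Problem: fill_buf doesn't report how many samples it actually
            // wrote, so we can't tell how much of `samples` is valid here.
            this.staging = samples.iter().flat_map(|s| s.to_be_bytes()).collect();
            this.pos = 0;
        }
        let n = buf.remaining().min(this.staging.len() - this.pos);
        buf.put_slice(&this.staging[this.pos..this.pos + n]);
        this.pos += n;
        Poll::Ready(Ok(()))
    }
}
```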