Split the dual input/output PyAudio stream into two threads, one handling input and one handling output, each blocking where needed. This reduces the latency of MVL by nearly half.
The examples below assume a selected latency of 400ms (which would show as 800ms in MVL alpha).
MVL implementation with callback:
400ms passes before the first callback from PyAudio, which sends the audio to the model and outputs empty data.
Another 400ms passes (800ms total) before the second callback from PyAudio. It sends the new audio to the model, retrieves the output audio produced since the last callback, and sends it to the listener.
This approach means the performance of the model doesn't matter under normal operating conditions (your GPU is fast enough). It will always take 800ms to get audio, even if the model only needed 100ms to do the work.
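A minimal sketch of that callback pattern, assuming a 48kHz 16-bit mono stream and a hypothetical convert() function standing in for the model; none of these names come from MVL's actual code:

```python
import queue
import threading
import pyaudio

RATE = 48000
CHUNK = int(RATE * 0.4)              # 400ms worth of frames
SILENCE = b"\x00" * CHUNK * 2        # 400ms of 16-bit mono silence

to_model = queue.Queue()
from_model = queue.Queue()

def model_worker():
    while True:
        audio = to_model.get()
        from_model.put(convert(audio))   # hypothetical voice-conversion call

def callback(in_data, frame_count, time_info, status):
    to_model.put(in_data)                # hand the fresh 400ms chunk to the model
    try:
        out = from_model.get_nowait()    # conversion of the *previous* chunk, if ready
    except queue.Empty:
        out = SILENCE                    # first callback (or model too slow): play silence
    return (out, pyaudio.paContinue)

threading.Thread(target=model_worker, daemon=True).start()
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, output=True, frames_per_buffer=CHUNK,
                 stream_callback=callback)
```

Because converted audio can only leave through the next callback, the chunk recorded during the first 400ms isn't heard until the second callback fires, 800ms later, no matter how fast convert() finishes.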
Split PyAudio threads as implemented here:
400ms passes while the input thread blocks on PyAudio's stream.read(). Once it has data, it queues it for the model thread, then blocks again waiting for the next chunk.
The model thread blocks on the input queue and starts conversion as soon as the input thread puts a chunk on it. When it finishes, it puts the converted audio on the output queue.
The output thread blocks on the output queue. As soon as the model puts a chunk on it, the output thread takes the converted audio and plays it with a blocking call to PyAudio's stream.write(), then loops back to wait for the next chunk. It plays output as fast as the model produces it, but since each chunk from the model is 400ms worth of audio, it should never overflow itself: 400ms of audio is going in, 400ms of audio is coming out, and the delay between the two is determined by the model.
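A minimal sketch of the three threads, again assuming 48kHz 16-bit mono audio and the same hypothetical convert() function; the real implementation differs in details:

```python
import queue
import threading
import pyaudio

RATE = 48000
CHUNK = int(RATE * 0.4)                  # 400ms worth of frames

in_q = queue.Queue()
out_q = queue.Queue()

def input_thread(stream):
    while True:
        data = stream.read(CHUNK)        # blocks ~400ms while the chunk fills
        in_q.put(data)                   # hand it to the model thread, then read again

def model_thread():
    while True:
        audio = in_q.get()               # blocks until the input thread delivers a chunk
        out_q.put(convert(audio))        # hypothetical voice-conversion call

def output_thread(stream):
    while True:
        audio = out_q.get()              # blocks until the model finishes a chunk
        stream.write(audio)              # blocking write; plays as soon as it's ready

pa = pyaudio.PyAudio()
in_stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                    input=True, frames_per_buffer=CHUNK)
out_stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     output=True, frames_per_buffer=CHUNK)

threading.Thread(target=input_thread, args=(in_stream,), daemon=True).start()
threading.Thread(target=model_thread, daemon=True).start()
threading.Thread(target=output_thread, args=(out_stream,), daemon=True).start()
```

As soon as convert() returns, the output thread starts playing, so nothing forces the round trip to be a multiple of the chunk size.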
With this approach, the performance of the model matters! It takes 400ms to collect the audio, plus however many milliseconds the GPU needs to run inference, so the total round trip depends on your GPU: you could see 450ms, 550ms, 750ms, 900ms, etc. Input keeps arriving in 400ms chunks, and the model produces output as fast as it can, which is played in sequence.
If your GPU is fast, your latency is now improved, congrats! If your GPU is slow, or your latency is set too low, you may get choppiness. The lower the latency, the more often data is sent to the model, but the same amount of data is sent every time thanks to a rolling buffer of ~1.48 seconds, and each output chunk only represents a latency's worth of audio. So if you set the latency too low, the model won't produce audio fast enough for the output thread to create a seamless stream. (MVL alpha calls this frame dropping.)
What if your GPU is slow? In the original MVL, if the model was outputting too slowly, you would get 400ms dropouts: there would always be at least 400ms of silence, then the output produced in the meantime would play. With this approach, the silent period only lasts as long as the model takes to return more audio, so it should be a little smoother.
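A rough rule of thumb implied by the above (an assumption about steady-state behavior, not a measurement of MVL): the stream stays seamless only while the model converts each chunk in less time than the chunk's duration.

```python
# Rough rule of thumb, assuming steady-state operation: each chunk contains
# latency_ms of audio, so the model has roughly latency_ms of wall-clock time
# to convert it before the output thread runs out of audio to play.
def is_seamless(latency_ms, inference_ms):
    return inference_ms < latency_ms

print(is_seamless(400, 250))   # True: smooth output
print(is_seamless(200, 250))   # False: gaps between chunks (choppiness)
```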