Use new IAudioClient3 interface for low-latency audio in shared mode

adzm commented 3 years ago

The new IAudioClient3 interface supports lower-latency audio as described here, although I am uncertain how to apply this to pa_win_wasapi.c offhand. The IAudioClient3 interface is already used, though the functions InitializeSharedAudioStream and GetSharedModeEnginePeriod / GetCurrentSharedModeEnginePeriod are apparently not used.

dmitrykos commented 3 years ago

To my understanding, the additional API of IAudioClient3 is just a set of helper functions (a facade) that makes it easier to create a shared stream with a desired latency. The PA WASAPI implementation does its best to provide the lowest supported latency in Shared mode, and the usage of IAudioClient3::InitializeSharedAudioStream is not required to achieve it.

If you have a comparison of the lowest latency you could achieve with IAudioClient3::InitializeSharedAudioStream and with PA WASAPI, please provide it.

adzm commented 3 years ago

I was under the impression this could let us get around the usual minimum latency in shared mode; however I may indeed be misunderstanding. I'll try to give it a shot though. Thanks for the input.

rakosrudolf commented 3 years ago

I believe using IAudioClient3 could reduce WASAPI shared mode latency to <10ms which would be great for tools like FlexASIO. See https://github.com/dechamps/FlexASIO/issues/55 .

It looks like this might reduce latency by a few milliseconds as the system switches to small buffers for that endpoint. https://docs.microsoft.com/en-us/windows-hardware/drivers/audio/low-latency-audio#faq

... By default, all applications in Windows 10 will use 10ms buffers to render and capture audio. If an application needs to use small buffers, then it needs to use the new AudioGraph settings or the WASAPI IAudioClient3 interface, in order to do so. However, if one application in Windows 10 requests the usage of small buffers, then the Audio Engine will start transferring audio using that particular buffer size. In that case, all applications that use the same endpoint and mode will automatically switch to that small buffer size. When the low latency application exits, the Audio Engine will switch to 10ms buffers again.

dmitrykos commented 3 years ago

@rakosrudolf, thank you for referencing Microsoft docs regarding this issue. According to the documentation the promise about low-latency in Shared mode is not guaranteed by the platform:

it is driver dependent (e.g. if driver supports <10 ms in Shared mode)
there is a race condition: if another Shared stream is opened in non-low-latency mode, then low latency cannot be achieved by the stream initialized with the InitializeSharedAudioStream API

Anyway, taking into account that low-latency possibility might exist, this new API can be incorporated into PA WASAPI as previously proposed by @adzm to provide such possibility for PA WASAPI users.

RossBencina commented 2 years ago

We've set this to priority P3 (Normal) but @dmitrykos can change it to whatever he likes.

danryu commented 1 year ago

Was there any progress on this? It would be a very welcome enhancement if so.

dmitrykos commented 1 year ago

As I have mentioned earlier, if someone would develop a small test which would try to contrast IAudioClient3::Initialize vs IAudioClient3::InitializeSharedAudioStream showing that it is really possible to achieve a lower latency then it would make sense to add the support for IAudioClient3::InitializeSharedAudioStream as an additional option.

Also, IAudioClient3::InitializeSharedAudioStream may also introduce some uncertainty and bugs from WASAPI side, so adding it blindly wouldn't be a great idea having PA WASAPI backend in a fairly stable condition.

danryu commented 1 year ago

The best I can do that is within reasonable time-scope for me (I'm not familiar with either Windows audio or PortAudio codebase and I'm severely time-constrained) is a practical round-trip latency experiment. I measure round-trip latency with RTL Utility with IAudioClient2 using FlexASIO/PortAudio and IAudioClient3 using the built-in "Shared Low Latency" mode of RTL Utility.

Setup:

Lenovo Thinkpad L15 Gen2 with generic Realtek Audio chipset
Intel HD Audio driver installed (as per https://learn.microsoft.com/en-us/windows-hardware/drivers/audio/low-latency-audio#measurement-tools)
Headset/mic combo plugged into 3.5mm jack
Both input and output devices configured at 24-bit / 48 kHz
Lowest buffer size of 128 set (with IAudioClient3)
Where KoordASIO(FlexASIO) used, configured at lowest reasonable buffer size of 32

Software:

RTL Utility - round-trip latency tester (uses current JUCE internally)
KoordASIO (FlexASIO clone with configurator for WASAPI Shared and Exclusive modes)

IAudioClient2: KoordASIO/FlexASIO link against current PortAudio, and thus use IAudioClient2 (when configured with WASAPI)

IAudioClient3: RTL Utility uses JUCE, which since 2020 has supported IAudioClient3 (referred to as "Shared Low Latency" mode in JUCE/RTL Utility)

Test Results (All tests were repeated several times to ensure a consistent result was being delivered.)

The results are really interesting! Firstly: IAudioClient3 is impressive - both the low Exclusive Mode time, and the sub-20ms Shared Low Latency Mode result. This immediately suggests that IAudioClient3 has had a marked improvement on Shared Mode performance.

Then for the PA/IAudioClient2 results - firstly, the sub-10ms result for Exclusive Mode is absolutely phenomenal (if anywhere near accurate!). Incredibly unfortunately, when actually recording and playing back with this configuration, the output is full of a crackly distortion which makes it unusable (and doesn't disappear when varying buffer size). This is so frustrating as it shows how close we are to having usable sub-10ms round-trip latency with generic Windows hardware.

Then the Shared Mode result is as previously expected. Interestingly it is basically double the IAudioClient3 Shared Mode result.

CONCLUSION: Considering the available configurations tested above, and assuming the above-mentioned crackly distortion problem with PortAudio WASAPI Exclusive mode is a "WON'T FIX", being able to do 13ms Exclusive Mode or 17ms Shared Mode round-trip latency with generic Windows hardware would be immensely important and useful to countless real-time audio applications.

dechamps commented 1 year ago

I'm sceptical you can compare RTL Utility's built-in "Windows Audio" mode with KoordASIO/PA. The code paths are very different and in particular they likely have different internal buffer sizes which would act as confounding factors. If I understand your protocol correctly, you didn't even pick the same buffer sizes between the two. This makes it difficult to draw any meaningful conclusions.

What would be much more interesting is to compare RTL Utility's "Windows Audio" with "Windows Audio (Shared Low Latency Mode)". Presumably that's just switching between IAudioClient::Initialize() and IAudioClient3::Initialize() while keeping everything else the same, thus producing an apples-to-apples comparison. You didn't include "Windows Audio" in your results.

dmitrykos commented 1 year ago

@danryu it is great to see test results.

The Exclusive mode testing is not useful in relation to IAudioClient3::InitializeSharedAudioStream, as it applies only to Shared mode as per the docs. Therefore, the difference you see is the difference in implementation between two different apps.

On Windows 10 and higher PA is using IAudioClient3 for Shared and Exclusive modes.

According to the docs IAudioClient3::InitializeSharedAudioStream is simply a wrapper for IAudioClient3::Initialize which calculates hnsBufferDuration for IAudioClient3::Initialize internally: "Unlike IAudioClient3::Initialize, this method does not allow you to specify a buffer size. The buffer size is computed based on the periodicity requested with the PeriodInFrames parameter. It is the client app's responsibility to ensure that audio samples are transferred in and out of the buffer in a timely manner."
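
Since the buffer size follows from PeriodInFrames, the value passed in matters. Microsoft's documentation for InitializeSharedAudioStream requires PeriodInFrames to be an integral multiple of the fundamental period and to lie between the minimum and maximum periods reported by IAudioClient3::GetSharedModeEnginePeriod. A minimal sketch of that selection rule follows, in portable C with no WASAPI calls. The helper name pick_shared_period and its return convention (0 when no valid period exists) are assumptions made for illustration and are not part of PortAudio or WASAPI.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: choose a period (in frames) that satisfies the
   documented constraints on PeriodInFrames. Returns 0 if none exists. */
static uint32_t pick_shared_period(uint32_t requested,
                                   uint32_t fundamental,
                                   uint32_t min_period,
                                   uint32_t max_period)
{
    assert(fundamental > 0 && min_period > 0 && min_period <= max_period);

    /* Clamp the request into [min_period, max_period]. */
    uint32_t p = requested < min_period ? min_period : requested;
    if (p > max_period)
        p = max_period;

    /* Round up to the next multiple of the fundamental period. */
    uint32_t rem = p % fundamental;
    if (rem != 0)
        p += fundamental - rem;

    /* Rounding up may overshoot the maximum: step back one multiple. */
    if (p > max_period)
        p -= fundamental;

    /* After stepping back the value may fall below the minimum. */
    return (p >= min_period && p <= max_period) ? p : 0;
}
```

If a driver reports the same value for the fundamental, minimum and maximum periods (for instance 480 frames for all three), every request collapses to that one value, and no lower period can be selected through this API on that device.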

The difference in latency in Shared mode you observed, 34 vs 17, is due to double buffering used by the PA WASAPI implementation. I will check if double buffering can be safely omitted and, if yes, propose adding an additional PA WASAPI option to switch off double buffering, so that the user would be able to achieve the lowest possible latency in Shared mode, at the expense of some pops & clicks of course if the CPU of the machine gets loaded with other tasks.

danryu commented 1 year ago

@dechamps thanks for weighing in. I fully admit that the test was not very rigorous, and involved apples and oranges - it was simply intended as a quick-and-dirty indicator of the different configurations' potential. I am primarily interested in getting sub-10ms RTL on generic Windows hardware - so any route that can get me there is interesting. Hence why I just set to lowest practical buffers with whatever configuration I had available (128 in Windows Audio, 32 in FlexASIO). I actually couldn't get reliable results from RTL Utility with plain "Windows Audio" at less than 256 buffer size - I'm not sure why. At 256 "Windows Audio" delivered RTL of ~51ms, and "Windows Audio Shared Low Latency" gave ~36ms.

@dmitrykos Thanks for all the notes. I appreciate now why InitializeSharedAudioStream would not serve a purpose here. I'm glad the tests were in some way helpful.

"I will check if double buffering can be safely omitted and if yes, propose to add an additional PA WASAPI option to switch off double buffering, so that user would be able to achieve lowest possible latency in Shared mode at expense of some pops & clicks of course if CPU of the machine gets loaded with other tasks."

That would be very welcome - many thanks.

danryu commented 1 year ago

"I will check if double buffering can be safely omitted and if yes, propose to add an additional PA WASAPI option to switch off double buffering"

@dmitrykos Would it be useful if I opened a separate issue for this, for tracking purposes?

Also, I'm very happy to fork and do some quick hacks/tests. I was wondering if there was perhaps a simple hack to do in _RecalculateBuffersCount() which I could test out?

dmitrykos commented 1 year ago

@danryu I got a chance to debug the PA WASAPI implementation. In my tests I am not able to get a Shared Mode latency lower than 22 ms for a 48000 Hz input stream.

IAudioClient::Initialize() was called with period equal to 10000 that is 480 frames which were also reported by IAudioClient3::GetSharedModeEnginePeriod() in pFundamentalPeriodInFrames, pMinPeriodInFrames, and pMaxPeriodInFrames.

For experiment I also replaced IAudioClient::Initialize() with IAudioClient3::InitializeSharedAudioStream() with PeriodInFrames equaling 480.

I also tried polling or event mode (AUDCLNT_STREAMFLAGS_EVENTCALLBACK).

In all cases initialized audio client instance returns 1056 frames as max endpoint buffer via IAudioClient::GetBufferSize(). So basically on my PC I am not able to reach lower than 22 ms latency of Shared Mode stream for input or output (checked both). It seems internally WASAPI is using double-buffering approach.
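
The 22 ms figure follows directly from the reported buffer size: 1056 frames at 48000 Hz. A small sketch of that arithmetic is below; the helper name frames_to_us is hypothetical and is used here only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: duration of `frames` at `sample_rate` in
   microseconds, rounded to the nearest microsecond. */
static uint64_t frames_to_us(uint32_t frames, uint32_t sample_rate)
{
    assert(sample_rate > 0);
    return ((uint64_t)frames * 1000000u + sample_rate / 2) / sample_rate;
}
```

At 48000 Hz, frames_to_us(1056, 48000) gives 22000 us (22 ms) and frames_to_us(480, 48000) gives 10000 us (10 ms), so the endpoint buffer holds 2.2 engine periods of 480 frames each.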

If you are interested, could you modify the _GetFramesPerHostBuffer function as follows and check whether you are able to get lower latency, i.e. 10 ms, on your machine:

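static PaUint32 _GetFramesPerHostBuffer(PaUint32 userFramesPerBuffer, PaTime suggestedLatency, double sampleRate, PaUint32 TimerJitterMs)
{
    /* Host buffer = user buffer plus the suggested latency converted to frames. */
    PaUint32 frames = userFramesPerBuffer + max( 0, (PaUint32)(suggestedLatency * sampleRate) );
    /* Add headroom for timer jitter, converted from milliseconds to frames. */
    frames += (PaUint32)((sampleRate * 0.001) * TimerJitterMs);
    return frames;
}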
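
To see what the proposed function returns for concrete inputs, here is a standalone sketch of the same arithmetic. The typedefs are stand-ins for the PortAudio types, the function is renamed, and the max(0, ...) wrapper is dropped because an unsigned value is never below zero; all of these are assumptions made so the snippet compiles outside of pa_win_wasapi.c. The thread does not show the function this would replace, so the sketch makes no claim about how the two differ.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the PortAudio types, assumed so the sketch compiles alone. */
typedef uint32_t PaUint32;
typedef double PaTime;

/* Same arithmetic as the function proposed above, under a different name. */
static PaUint32 GetFramesPerHostBufferSketch(PaUint32 userFramesPerBuffer,
                                             PaTime suggestedLatency,
                                             double sampleRate,
                                             PaUint32 timerJitterMs)
{
    assert(sampleRate > 0.0 && suggestedLatency >= 0.0);

    /* User buffer plus the suggested latency converted to frames. */
    PaUint32 frames = userFramesPerBuffer
                    + (PaUint32)(suggestedLatency * sampleRate);

    /* Timer jitter allowance converted from milliseconds to frames. */
    frames += (PaUint32)((sampleRate * 0.001) * timerJitterMs);
    return frames;
}
```

With no user buffer size, a suggested latency of 10 ms, a 48000 Hz sample rate and no jitter allowance, the result is about 480 frames, that is, a single 10 ms period with no doubling applied.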

davidebeatrici commented 1 year ago

https://github.com/PortAudio/portaudio/blob/68e963a990da19bb013133dcbad59c2ed8ea0cf9/src/hostapi/wasapi/pa_win_wasapi.c#L3947-L3951

Please note that AUDCLNT_STREAMFLAGS_AUTOCONVERTPCM and AUDCLNT_STREAMFLAGS_SRC_DEFAULT_QUALITY are not accepted: MicrosoftDocs/sdk-api#1498

I encountered the issue while experimenting with IAudioClient3 in libcrossaudio.

mirh commented 1 month ago

FWIW, people here and here reported a minimum latency of 2.67 ms in shared mode, although the testing conditions weren't exactly clear (conversely, some of the best scholars in the world had to give up testing because of some kind of bug in this library)

PortAudio / portaudio

Use new IAudioClient3 interface for low-latency audio in shared mode #385