make audio data format conversion policy crystal clear

RossBencina commented 1 year ago

The task is to specify and document exactly when clients can expect PortAudio to convert between different audio sample data formats.

The question here is when format conversion should occur, not how it should be performed.

Background

Since before the version 19 API, PortAudio has provided facilities to automatically convert sample data to/from an appropriate format that is supported by the native audio API. For example, if the user supplies 32-bit integers, and the native device only accepts 16-bit integers, PortAudio performs an appropriate conversion. The principle here is that PortAudio clients can always pass data in any of the PortAudio formats to/from PortAudio.
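To make the 32-bit to 16-bit example concrete, here is a minimal sketch of one way such a narrowing could work: keep the 16 most significant bits and drop the rest, with no dither. It is an illustration only, not PortAudio's actual converter, and the name `int32_to_int16_truncate` is made up for this example.

```c
#include <stdint.h>

/* Narrow a 32-bit sample to 16 bits by keeping the 16 most significant bits.
   The division is floored by hand so the result matches an arithmetic right
   shift by 16 without relying on right-shifting a negative value, which is
   implementation-defined in C. No dither is applied. */
static int16_t int32_to_int16_truncate(int32_t sample)
{
    int32_t q = sample / 65536;          /* C division truncates toward zero */
    if (sample % 65536 < 0) {
        q -= 1;                          /* adjust to floor for negative inputs */
    }
    return (int16_t)q;
}
```

Full-scale inputs map to full-scale outputs (`INT32_MAX` to `INT16_MAX`, `INT32_MIN` to `INT16_MIN`), and the low 16 bits are simply discarded.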

Format conversions entail scaling and/or bit shifting, and may incorporate dither and/or clipping. The PortAudio API provides options for switching on/off dithering and clipping when performing these conversions. The choice of scaling and dithering algorithms is discussed in other tickets.

Floating-point to integer conversions have special status because most (all?) audio hardware uses (linear) integer sample formats. As discussed elsewhere, PortAudio chose the float-integer scaling factors to ensure that an amplitude +/- 1.0 sine is not clipped under float -> int conversion. This is (arguably) an important API contract. Historically, native audio APIs accepted only integer formats and passed them through to the driver/hardware; consequently, PortAudio always took responsibility for float-integer conversions, and clients could rely on PortAudio providing specific scaling behavior. More recently, some (not all) native APIs accept floats, and use floats internally for DSP and mixing operations. In such cases it may be desirable to pass floats through unaltered rather than converting to an integer format.
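To illustrate why the choice of scaling factor matters for the "+/- 1.0 is not clipped" contract, here is a minimal sketch for a 16-bit target. It assumes a scale of 32767 for the non-clipping case and compares it with 32768; the function name and the `clipped` out-parameter are hypothetical, and the exact factors PortAudio uses for each format are not restated here.

```c
#include <stdint.h>

/* Convert a float sample to int16 using the given scale factor.
   Values outside the int16 range are clipped, and *clipped is set to 1
   when that happens. NaN inputs are not handled in this sketch. */
static int16_t float_to_int16(float x, float scale, int *clipped)
{
    float v = x * scale;
    *clipped = 0;
    if (v > 32767.0f) {
        v = 32767.0f;
        *clipped = 1;
    } else if (v < -32768.0f) {
        v = -32768.0f;
        *clipped = 1;
    }
    return (int16_t)v;   /* truncates toward zero; no dither in this sketch */
}
```

With a scale of 32767, both +1.0 and -1.0 land inside the int16 range. With a scale of 32768, -1.0 fits exactly but +1.0 becomes 32768 and has to be clipped.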

Maintaining the integrity of audio data is very important to any audio API. The question in 2023 is to what degree PortAudio should attempt to impose its own conversions (i.e. to provide predictable, consistent conversion behavior across all platforms) and to what degree it should get out of the way and let the OS do the work (i.e. to provide native behavior, which might, for example, result in different float-integer conversions when using different host APIs but might also reduce the number of float-integer conversions). This is particularly important for float-integer conversions because the previously advertised policy of deterministic float-integer conversions on all platforms no longer looks like the correct choice for some common use cases where the OS is likely to convert integers back to floats again.

796
543
390
112
100
35

RossBencina commented 1 year ago

Some useful definitions:

native-API-required conversions are conversions that must be performed because the user supplies (or accepts) audio data in a format that is not accepted by the native audio API. Different native APIs impose different required conversions.
hardware-required conversions are conversions that must be performed because the hardware does not support the user format.
well-specified conversion/passthrough is the idea that the client gets guaranteed, well-specified, predictable conversion or passthrough on the audio data path between the PortAudio API and the audio ADC/DAC. There are two considerations here: (1) does PortAudio offer well-specified conversion/passthrough in some or all cases, and (2) does the native audio stack (including native API, native audio engine(s), drivers) offer well-specified conversion/passthrough. Native APIs range from fully well-specified (e.g. ASIO, ALSA HW) through to no-guarantees (e.g. Android, where the HAL can do whatever audio manipulations it likes).
delegated conversion is when the conversion is performed by the native audio API, OS or driver (even in cases where PortAudio could perform the conversion.) Since native APIs may or may not guarantee well-specified conversion, delegation may have the consequence of changing the "well-specifiedness" of a particular conversion. Another possibility is that native APIs may provide well-specified conversion but this may disagree with a well-specified conversion provided by a different native API, or by a required conversion performed by PortAudio.
guaranteed consistent cross-platform well-specified conversion is something that PortAudio may provide under certain circumstances, thus providing portable applications with strong guarantees about the integrity of the audio data path. For example: what exactly is the 0dBFS value for a particular sample format? If there is a choice, which conversion algorithm is used? Will this format reach the DACs with bit-perfect integrity?
lossless conversion is a conversion that does not alter the represented audio data values.
canonical lossless conversion is a lossless conversion for which there is only one reasonable mapping. Examples: byte-swapping for endianness conversion, and bit-shifting for left-right justification of 24-bit data in a 32-bit container. Upconverting 8-bit to 16-bit by bit shifting is lossless but not canonical (x -> x << 8 and x -> x + (x << 8) are two reasonable mappings).
bit-perfect passthrough is the idea that, by using a correctly scaled and/or shifted PCM data format (including a floating-point data format that losslessly represents integer samples), the PortAudio client can achieve an exact 1:1 mapping to ADC/DAC values or digital (SPDIF, MADI, Dante, etc.) sample values.
best-effort bit-resolution principle is the idea that within the limits of a particular native audio API, PortAudio will attempt to feed audio data through without degrading the bit resolution of the data. Failure to follow the best-effort bit-resolution principle is a bug. Such bugs exist because native APIs that used to only accept 16-bit data now accept wider data formats in some OS versions. Under this principle float32 and 24-bit integer data are equivalent (although they may not be equally lossless or well-specified).
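
The difference between canonical and non-canonical lossless conversions in the definitions above can be sketched in a few lines of C. This is an illustration under simplifying assumptions: samples are treated as unsigned bit patterns, and all function names are made up for this example.

```c
#include <stdint.h>

/* Canonical: byte-swapping a 16-bit sample for endianness conversion.
   There is only one sensible mapping. */
static uint16_t swap_bytes16(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

/* Canonical: moving 24-bit data between the right-justified (low 24 bits)
   and left-justified (high 24 bits) positions of a 32-bit container. */
static uint32_t left_justify24(uint32_t right_justified)
{
    return (right_justified & 0x00FFFFFFu) << 8;
}

static uint32_t right_justify24(uint32_t left_justified)
{
    return left_justified >> 8;
}

/* Lossless but NOT canonical: two reasonable 8-bit to 16-bit mappings. */
static uint16_t widen8_shift(uint8_t x)
{
    return (uint16_t)(x << 8);           /* x -> x << 8: 0xFF becomes 0xFF00 */
}

static uint16_t widen8_replicate(uint8_t x)
{
    return (uint16_t)(x + (x << 8));     /* x -> x + (x << 8): 0xFF becomes 0xFFFF */
}
```

Both widening functions can be inverted by taking the high byte, so neither loses information, but they disagree about what full scale means in the 16-bit domain. That disagreement is what makes the conversion non-canonical.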

Observation: In some sense, float-integer-float conversions are always hardware-required conversions. However some native APIs (e.g. CoreAudio, Android) use floats as the default, internal, or only data representation. Should PortAudio perform the conversion to provide guaranteed cross-platform well-specified conversion?
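
As a side note on float formats that losslessly represent integer samples: here is a minimal sketch that checks whether every 24-bit integer sample survives an integer-float-integer round trip. It assumes IEEE 754 single precision and a power-of-two scale of 2^23, which is an assumption of this sketch and not necessarily the factor PortAudio or any native API uses. All names are hypothetical.

```c
#include <stdint.h>

/* Assumed scale: 2^23, so a 24-bit sample s maps to s / 8388608.0f.
   Dividing or multiplying by a power of two only changes the exponent,
   so no rounding occurs for values that fit in the 24-bit significand. */
static float int24_to_float(int32_t s)
{
    return (float)s / 8388608.0f;
}

static int32_t float_to_int24(float x)
{
    return (int32_t)(x * 8388608.0f);
}

/* Returns 1 if every 24-bit value round-trips exactly, 0 otherwise. */
static int int24_roundtrips_through_float(void)
{
    for (int32_t s = -8388608; s <= 8388607; ++s) {
        if (float_to_int24(int24_to_float(s)) != s) {
            return 0;
        }
    }
    return 1;
}
```

A check like this only covers the scale factor it is written for; a different factor, or a different conversion somewhere in the native audio stack, would need its own check.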

dechamps commented 1 year ago

Thanks for the write-up. Here are my thoughts. (Note that to make this less verbose I am going to assume the playback direction, i.e. output to a DAC - obviously, everything below applies to recording from an ADC as well.)

The question in 2023 is to what degree PortAudio should attempt to impose its own conversions (i.e. to provide predictable, consistent conversion behavior across all-platforms) and to what degree it should get out of the way and let the OS do the work (i.e. to provide native behavior, which for example might result in different float-integer conversions when using different host APIs, but on the other hand, might reduce the number of float-integer conversions)

I've said this before, and I'll say it again: PortAudio should not surprise its users.

When PortAudio has the option of taking the audio data from the app and handing it off as-is to the OS, without any conversion, then that's what it should do, because that's what any reasonable person would expect to happen. It's simple, efficient, and it works. No reasonable person would expect PortAudio to just throw in extra unnecessary conversions out of the blue - that's just bizarre.

The only reason why we seem to be arguing about this is because you seem obsessed with some very niche use case along the lines of "clients could rely on PortAudio providing specific scaling behavior". I'd argue that very few users care about this. If I wanted to have this kind of complete control over sample bit patterns, then I would implement the conversion myself in my app - I would not use an audio I/O library such as PortAudio to do it, because that kind of extremely narrow use case doesn't fit the scope of a generalist audio I/O library.

Native APIs range from full well-specified (e.g. ASIO, ALSA HW) through to no-guarantees (e.g. Android, where the HAL can do whatever audio manipulations it likes).

For Windows you can add "WASAPI Exclusive, WDM-KS" to your "well-specified" list, and "MME, DS, WASAPI Shared" to your "no-guarantees" list.

More specifically, on Windows Vista+, when using the shared audio pipeline (i.e. WASAPI Shared, which MME and DS redirect to internally), the Windows audio engine will automatically do sample rate conversion, sample type conversion, downmixing/upmixing, mixing, software volume control (if required), audio limiting (CAudioLimiter), and then there are APOs where audio device manufacturers (and, to a lesser extent, third parties) can decide to do literally whatever they want to the audio signal as it passes through the pipeline.

Given the above, the whole idea of trying to guarantee "well-specified conversion" when using WASAPI Shared, MME or DS is absurd on its face. In the vast majority of cases, downstream processing in the Windows audio engine will destroy your "well-specified conversion" many times over before it reaches the DAC. The battle is over before it started. It doesn't make sense to try to introduce concepts such as "delegated conversion" in this setup - we're way past that already.

And, to be clear, this is perfectly fine. Reasonable users will not expect perfectly accurate operation when using the shared Windows audio pipeline, because they know that it's not designed for it. The Windows audio engine is designed for convenience (automatic conversions, system-wide effects, etc.), not accuracy. If a user is chasing perfect accuracy (i.e. bit-perfectness and the like), and they configure PortAudio to use WASAPI Shared, MME or DS, then they're doing it wrong, and the PA docs should say so. Consistent with that philosophy, it makes sense for PortAudio to do the simple, efficient, obvious thing and just pass the application's audio data through as-is to the OS. There is no point in doing anything else.

Users who are after accuracy should use OS audio facilities that are designed for that use case. For Windows, that means WASAPI Exclusive, WDM-KS, or ASIO. These 3 APIs come with a reasonable expectation that the client will be given exclusive, direct access to the DAC's audio buffers with no automatic conversions.

Currently, when using one of the aforementioned "bit-perfect" host APIs in PortAudio, PortAudio will determine which formats are supported by the hardware, and automatically convert as necessary. I'd argue this is doing the user a disservice, because if they are explicitly configuring PortAudio to use one of these specialist host APIs, then chances are they care deeply about their samples making it to the DAC untouched, and they do not want PortAudio to mess with them in any way, even if the conversion is lossless - they'd prefer PortAudio to return an error instead. Thankfully the WASAPI Host API provides a flag to that effect, but it really feels like that should just be a general PA frontend option, not a Host API-specific one.

So, in conclusion, here's how I would like PortAudio to behave in an ideal world:

Introduce a new PortAudio stream flag, let's call it paNeverConvertSamples, alongside paClipOff and paDitherOff. (This flag is really just a generalisation of paWinWasapiExplicitSampleFormat for all of PortAudio.)
The flag modifies the behaviour of PortAudio depending on whether the Host API in use can pass through the samples directly to the OS:
- If the OS accepts the user's format, then the flag has no effect. (Note this means that the flag has no effect when using WASAPI Shared, MME, DS on modern Windows, because these accept all the formats PortAudio currently supports.)
- If the OS rejects the user's format, and the flag is not set, then PortAudio does the conversion as it does today.
- If the OS rejects the user's format, and the flag is set, then PortAudio returns an error and it's up to the user to try a different format or give up.
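
The three cases above boil down to a small decision function. Here is a minimal sketch of that logic; the enum, the function name and both parameters are hypothetical and only model the proposal, since paNeverConvertSamples does not exist in PortAudio today.

```c
#include <stdbool.h>

/* Possible outcomes when a stream is opened with a given sample format. */
typedef enum {
    PASS_THROUGH,           /* hand the samples to the OS as-is */
    CONVERT_IN_PORTAUDIO,   /* PortAudio converts, as it does today */
    RETURN_ERROR            /* refuse to open; the caller picks another format */
} ConversionOutcome;

/* os_accepts_format: whether the host API/OS accepts the user's format.
   never_convert:     whether the proposed paNeverConvertSamples flag is set. */
static ConversionOutcome decide_conversion(bool os_accepts_format,
                                           bool never_convert)
{
    if (os_accepts_format) {
        return PASS_THROUGH;    /* the flag has no effect in this case */
    }
    return never_convert ? RETURN_ERROR : CONVERT_IN_PORTAUDIO;
}
```

Note that the flag only ever turns a conversion into an error; it never changes what happens when the OS already accepts the format.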

This would then map to the following use cases:

Typical "I just want to play some audio" use case, where the end user values convenience over perfect accuracy or bit-perfectness: use the "normal" Host API for the OS (on Windows, that means either WASAPI Shared, MME, or DS), and don't pass paNeverConvertSamples. The OS will do the conversion (or, in rare cases, PA might). This is the simplest, most efficient and least surprising approach, and is consistent with what a typical audio application would do.
"I'm after accuracy over all else" use case: explicitly pick a Host API with direct hardware access (on Windows, that means either WASAPI Exclusive, WDM-KS, or ASIO). Optionally, pass in paNeverConvertSamples if you don't even want PortAudio to convert samples for you, and handle errors accordingly.

PortAudio / portaudio

make audio data format conversion policy crystal clear #825
