Proposal for minimal timestamp API to allow for synchronising media with CPAL streams

mitchmindtree commented 4 years ago

This is a proposal to begin addressing #279 with the most minimal API necessary.

Background

The most seemingly accurate and thorough research I could come across on this topic is Ross Bencina's excellent paper PortAudio and Media Synchronisation - It's All in the Timing. It contains an overview of the media synchronisation problem with example scenarios, visual diagrams, etc that make it more intuitive.

http://www.portaudio.com/docs/portaudio_sync_acmc2003.pdf

The first few sections of the paper describe some hypothetical scenarios and different techniques for synchronising audio with some other kind of media. A MIDI clock is the primary example used in the paper, but the same techniques apply to presenting frames of graphics and other forms of media sync.

Section 6 describes the minimal set of information necessary in order to make these synchronisation techniques possible:

Sample rate. We already have this.
Buffer start times. We do not yet provide this. This refers to the most accurate form of monotonic clock time available on the system. It is also essential that users have access to the same source of time that provides this value in order to timestamp their media events. This means 1. describing the exact source for each host in the docs and possibly 2. providing a function for easily retrieving this value (PortAudio does so via a GetStreamTime(Stream* s) function).
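To make the role of these two values concrete, here is a minimal sketch of how a sample rate plus a buffer start time lets a callback place an event on a specific frame. The function names are hypothetical, and std::time::Duration stands in for whatever instant type CPAL ends up exposing:

```rust
use std::time::Duration;

/// Offset of frame `frame_index` from the start of its buffer,
/// using the nominal sample rate in Hz.
fn frame_offset(frame_index: u64, sample_rate_hz: u32) -> Duration {
    Duration::from_secs_f64(frame_index as f64 / sample_rate_hz as f64)
}

/// Frame within the current buffer at which an event timestamped
/// `event_time` should land, or `None` if the event falls outside it.
/// Both times must come from the same clock.
fn frame_for_event(
    buffer_start: Duration,
    event_time: Duration,
    sample_rate_hz: u32,
    buffer_frames: u64,
) -> Option<u64> {
    // `checked_sub` yields `None` when the event precedes the buffer.
    let offset = event_time.checked_sub(buffer_start)?;
    let frame = (offset.as_secs_f64() * sample_rate_hz as f64).round() as u64;
    if frame < buffer_frames {
        Some(frame)
    } else {
        None
    }
}
```

With a buffer starting at 2000 ms and an event stamped 2010 ms at 44100 Hz, the event lands on frame 441 of a 1024-frame buffer. This only works if the event timestamp and the buffer start time share a time-base, which is the point of the second requirement.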

PortAudio decided to provide this monotonic time in seconds using a double-precision floating-point data type:

The double data type was chosen after considerable deliberation because it provides sufficient resolution to represent time with high-precision, may be manipulated using numerical operators, and is a standard part of the C and C++ languages.

Section 7 also describes implementation issues. They can be roughly summed up as follows:

7.1 Sample rates: Subtle variations between the nominal sample rate and the observed sample rate occur between sound card / chipsets, resulting in subtle inaccuracies occurring within the aforementioned synchronisation techniques. PortAudio provides an actual sample rate via its stream info parameters. Calculating this requires a high-resolution system clock, though this isn't always available.
7.2 One Shared Time-base: The time source of timestamps provided via audio callbacks sometimes differs from the source used to provide timestamps for other media events (e.g. Windows' MIDI API). This is why PortAudio found it necessary to provide a function for easy access to the correct source (GetStreamTime).
7.3 Buffer playback times: Exact buffer playback times are often unprovided or inaccurate. PortAudio takes on the initiative of trying to calculate this for the user in the case that it isn't provided by the platform. ASIO buffer timestamps have a best-case resolution of 1ms, significantly worse than necessary for sample-level synchronisation.
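As a rough illustration of 7.1, an observed sample rate can be estimated from two (sample position, clock time) pairs. This is only a sketch with a hypothetical function name, not how PortAudio computes its value:

```rust
use std::time::Duration;

/// Estimate the observed sample rate in Hz from two observations of
/// (sample position, clock time). Returns `None` if the observations
/// are out of order or no time has elapsed between them.
fn observed_sample_rate(
    first_pos: u64,
    first_time: Duration,
    last_pos: u64,
    last_time: Duration,
) -> Option<f64> {
    let frames = last_pos.checked_sub(first_pos)?;
    let elapsed = last_time.checked_sub(first_time)?;
    if elapsed.is_zero() {
        return None;
    }
    Some(frames as f64 / elapsed.as_secs_f64())
}
```

Feeding it 5120 frames observed over 116 ms of a millisecond-resolution clock gives roughly 44138 Hz for a nominal 44100 Hz stream, which shows how a coarse clock skews the estimate over short windows.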

Proposal

I propose that we add the following:

A StreamInstant struct representing a monotonic instant in time retrieved from either 1. the stream's underlying audio data callback or 2. the same time source used to generate timestamps for a stream's underlying audio data callback. No guarantees are made about the duration that the value represents, only that it is monotonic and begins either before or equal to the moment the stream was started. Internally we could represent the instant in a similar manner to std::time::Duration, providing methods for easy access to more accessible representations e.g. .as_secs_f64(), etc.
The following timestamp structs:
- InputStreamTimestamp
- OutputStreamTimestamp
Both structs contain two fields of type StreamInstant:
  1. callback, indicating the instant at which the data callback was called.
  2. buffer_adc (input streams) or buffer_dac (output streams), representing the instant of capture from or playback by the audio device respectively.
An instance of the relevant struct would be provided to the user's data callback.
A fn now(&self) -> StreamInstant method for the Stream handle type, allowing users to produce an instant in time via the same source used to generate timestamps for the data callback, useful for media sync. It will be important to document exactly what system API is used for each host and to list any notable limitations (e.g. the 1ms best-case resolution on ASIO).
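A rough sketch of how these types might look, purely for discussion. The internal seconds/nanoseconds layout and the from_nanos and duration_since helpers are assumptions for illustration; only the type names, the field names and as_secs_f64 come from the proposal above:

```rust
use std::time::Duration;

/// Hypothetical layout: whole seconds plus sub-second nanoseconds,
/// similar to `std::time::Duration`. Field order matters for the
/// derived ordering (seconds are compared first).
#[derive(Copy, Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct StreamInstant {
    secs: u64,
    nanos: u32,
}

impl StreamInstant {
    /// Hypothetical constructor from a host clock reading in nanoseconds.
    pub fn from_nanos(nanos: u64) -> Self {
        StreamInstant {
            secs: nanos / 1_000_000_000,
            nanos: (nanos % 1_000_000_000) as u32,
        }
    }

    pub fn as_secs_f64(&self) -> f64 {
        self.secs as f64 + self.nanos as f64 * 1e-9
    }

    /// Time elapsed since `earlier`, or `None` if `earlier` is later.
    pub fn duration_since(&self, earlier: &StreamInstant) -> Option<Duration> {
        Duration::new(self.secs, self.nanos)
            .checked_sub(Duration::new(earlier.secs, earlier.nanos))
    }
}

pub struct InputStreamTimestamp {
    pub callback: StreamInstant,
    pub buffer_adc: StreamInstant,
}

pub struct OutputStreamTimestamp {
    pub callback: StreamInstant,
    pub buffer_dac: StreamInstant,
}
```

Inside an output callback, something like timestamp.buffer_dac.duration_since(&timestamp.callback) would then give the expected latency between the callback firing and the buffer reaching the DAC.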

I've been doing some research into the way that timing information is provided by each of the different hosts supported by CPAL. I'll add a follow-up comment soon with the relevant info for some more context for those interested and for myself to refer back to during implementation.

The transport API discussed within #279 has been intentionally omitted in the hope that it can be implemented on top of the proposed timestamp API. In the case that it cannot, this is likely best left to be addressed in a future PR either way.

mitchmindtree commented 4 years ago

CPAL Timing API Research

ASIO

Timing information is updated via the bufferSwitch() callback which is called by the ASIO driver implementation:

SystemTime describes the system time associated with the first sample of the callback. On Windows (the only OS on which ASIO is supported by CPAL), ASIO apparently retrieves this via the multimedia timer, timeGetTime(), which only provides a resolution of 1 ms.
SamplePosition seems to describe the sample position of the first sample of the callback since the stream began.

From the ASIO SDK 2.3 docs:

In order to provide proper media synchronization information to the host application a driver should fetch, at the occurrence of the bufferSwitch() or bufferSwitchTimeInfo() callback invocation event (interrupt or timed event), the current system time and sample position of the first sample of the audio buffer, which will be past to the callback. The host application retrieves this information during the bufferSwitch() callback with ASIOGetSamplePosition() or in the case of the bufferSwitchTimeInfo() callback this information is part or the parameters to the callback.

The following example is provided for a stream with a buffer size of 1024 samples, sample rate of 44100 Hz and a SystemTime start of 2000 ms:

Callback No:	0	1	2	3	4	5	6
BufferIndex:	0	1	0	1	0	1	0
SystemTime(ms):	2000	2000	2023	2046	2069	2092	2116
SamplePosition:	0	1024	2048	3072	4096	5120	6144

The docs also note that initially, the callback will be called multiple times at the same system time in order to prepare the buffers. This can also be seen in the table above. This is another interesting motivation for ensuring that applications consider the callback's provided system time for synchronisation rather than simply trying to count samples.
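The table's values can be reproduced with a few lines of arithmetic. This sketch assumes the first two callbacks share the start time (the buffer priming described above) and that the driver truncates to whole milliseconds; both assumptions are inferred from the table rather than stated by the SDK, and the function names are hypothetical:

```rust
/// System time (ms) reported to callback number `callback_no`.
/// Callbacks 0 and 1 both fire at `start_ms` while the two buffers
/// are primed, so one fewer buffer period has elapsed than the
/// callback number suggests.
fn callback_system_time_ms(
    callback_no: u64,
    start_ms: u64,
    buffer_frames: u64,
    sample_rate_hz: u64,
) -> u64 {
    let buffers_elapsed = callback_no.saturating_sub(1);
    // Integer division truncates, mirroring a 1 ms resolution clock.
    start_ms + buffers_elapsed * buffer_frames * 1000 / sample_rate_hz
}

/// Sample position of the first sample passed to callback `callback_no`.
fn callback_sample_position(callback_no: u64, buffer_frames: u64) -> u64 {
    callback_no * buffer_frames
}
```

Note that naive sample counting would place callback 1 about 23 ms after callback 0, whereas the reported system times for the two are identical.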

CoreAudio

Provides an AudioTimeStamp as an argument to the stream data callback.

https://developer.apple.com/documentation/coreaudiotypes/audiotimestamp

First, the AudioTimeStampFlags must be checked to determine which of the contained timestamp representations are actually valid. The following members seem most relevant to us:

mHostTime: "The host machine's time base (see CoreAudio/HostTime.h)." I could find no further docs, but found this SO answer: https://stackoverflow.com/questions/675626/coreaudio-audiotimestamp-mhosttime-clock-frequency From my understanding, this is retrieved via mach_absolute_time() which represents a number of ticks since startup. To convert from this ticks value to nanoseconds, the mach_timebase_info must be used. There's an old example here: https://shiftedbits.org/2008/10/01/mach_absolute_time-on-the-iphone/
mSampleTime: f64: The absolute sample frame time.
mRateScalar: "The ratio of actual host ticks per sample frame to the nominal host ticks per sample frame."
mWordClockTime: u64: The docs don't give any explanation, but according to some comment on HN this is a sample counter that "ticks" up each sample.
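Assuming mHostTime really is a mach_absolute_time() tick count, the conversion to nanoseconds is a multiplication by the numer/denom fraction reported by mach_timebase_info. A minimal sketch of just the arithmetic follows; the function name is hypothetical, and numer and denom are passed in as plain integers rather than fetched from the system:

```rust
use std::convert::TryFrom;

/// Convert host clock ticks to nanoseconds using the timebase fraction
/// `numer / denom`. The multiplication is done in 128 bits so that large
/// tick counts do not overflow. Returns `None` if `denom` is zero or the
/// result does not fit in a `u64`.
fn host_ticks_to_nanos(ticks: u64, numer: u32, denom: u32) -> Option<u64> {
    if denom == 0 {
        return None;
    }
    let nanos = ticks as u128 * numer as u128 / denom as u128;
    u64::try_from(nanos).ok()
}
```

On a machine whose timebase is 1/1 the ticks are already nanoseconds; with an example timebase of 125/3, 1000 ticks come out as 41666 ns.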

ALSA

https://www.kernel.org/doc/html/latest/sound/designs/timestamping.html

The ALSA API can provide two different system timestamps:

trigger_tstamp is the system time snapshot taken when the .trigger callback is invoked.
tstamp is the current system timestamp updated during the last event or application query. The difference (tstamp - trigger_tstamp) defines the elapsed time.

Also provides the following:

avail: how much data can be written in the ring buffer.
delay: the time it will take to hear a new sample after all queued samples have been played out. This could be useful for acquiring the "playback" instant.

These are provided along with a snapshot of system time. Options for snapshot are:

CLOCK_REALTIME - NTP corrections, may jump backwards.
CLOCK_MONOTONIC - NTP corrections but won't jump backwards.
CLOCK_MONOTONIC_RAW - No NTP corrections.

We definitely want one of the MONOTONIC options for CPAL, but it's unclear to me whether or not we want NTP corrections. It would be nice to clarify what kind of corrections are applied, e.g. are the corrections a subtle skewing of the rate? Can it jump forwards in time by large steps? Until we can answer these questions, I'm intuitively inclined to use the raw timestamp for potentially more consistent clock behaviour.

An audio_tstamp is also provided containing the timing of the different stages. Useful diagram:

--------------------------------------------------------------> time
  ^               ^              ^                ^           ^
  |               |              |                |           |
 analog         link            dma              app       FullBuffer
 time           time           time              time        time
  |               |              |                |           |
  |< codec delay >|<--hw delay-->|<queued samples>|<---avail->|
  |<----------------- delay---------------------->|           |
                                 |<----ring buffer length---->|
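
As a sketch of how delay could yield the "playback" instant mentioned earlier: assuming delay is reported as a frame count and tstamp is the matching system time snapshot, the playback time of the next written frame is tstamp plus delay converted to time. The function name is hypothetical and Duration stands in for the system timestamp:

```rust
use std::convert::TryFrom;
use std::time::Duration;

/// Estimated time at which a frame written now will be heard,
/// given the timestamp snapshot and the delay in frames.
/// Returns `None` on a zero sample rate or on overflow.
fn estimated_playback_time(
    tstamp: Duration,
    delay_frames: u64,
    sample_rate_hz: u32,
) -> Option<Duration> {
    if sample_rate_hz == 0 {
        return None;
    }
    // Convert frames to nanoseconds in 128 bits to avoid overflow.
    let nanos = delay_frames as u128 * 1_000_000_000 / sample_rate_hz as u128;
    let delay = Duration::from_nanos(u64::try_from(nanos).ok()?);
    tstamp.checked_add(delay)
}
```

For example, a delay of 4410 frames at 44100 Hz places playback 100 ms after the snapshot.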
