cta-wave / mezzanine

This repo contains scripts that will build annotated test content from specific source content, compatible with the WAVE device playback test suite.
BSD 3-Clause "New" or "Revised" License

Considerations for putting music over sync PN noise #39

Closed by cta-source 2 years ago

cta-source commented 3 years ago

This is a thread about the following question: Can an operator run this test without having to listen to (annoying) pure white noise? This raises the follow-up question: What is the required audio SNR for sync detection using the current (6/28/2021) audio mezzanine white noise sequence?

I was asked to post this as an Issue to keep it from getting lost. The following doesn't reach a conclusion or make a strong recommendation; it's information en route to a later solution. Discussion is encouraged; we're also expecting a proposal for a coded spectral band algorithm (which would perform better). (And apologies for the lengthy post; I'm trying to capture a lot.)

tl;dr: Simulation of the white noise sync mechanism produced by the current (6/28/2021) audiomezz.py suggests it might be reasonable to combine human audio (music, beeps) with the white noise timing sync. Like all simulations, this is more of a direction than a conclusive result, and the real test will be in the lab.

Music played over the sync-sequence PN noise would, depending on relative power, interfere with reliable detection (timing extraction would be "less robust"). We also expect some reduced robustness, unquantified as yet, if a speaker-out/microphone-in (non-wired) connection is used from the playout device (device under test, DUT) to the Observation Framework (OF).

The OF uses a Band-Limited Pseudo-random Noise signal ("BLPN signal"). The audiomezz.py code builds it as follows:

1. Generate a uniformly selected PN bit sequence at the sample rate, non-repeating for the duration of the test.
2. Band-limit the PN sequence to 7 kHz (to limit noise from the high-frequency replication that occurs with some audio codecs); this is the BLPN signal.
3. Write the BLPN signal to an audio file (.WAV format) for playout during a test.
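For illustration, here is a minimal sketch of that pipeline in Python. This is not the actual audiomezz.py code; the ±1 bit mapping, filter order, and file name are my assumptions.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, lfilter

FS = 48000        # audio sample rate, Hz
DURATION_S = 60   # test duration, seconds (illustrative)
CUTOFF_HZ = 7000  # band limit, per the steps above

def make_blpn(duration_s=DURATION_S, fs=FS, cutoff_hz=CUTOFF_HZ, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Uniformly selected PN bit sequence at the sample rate,
    #    mapped to +/-1 here so the signal is zero-mean (an assumption).
    pn = rng.integers(0, 2, size=duration_s * fs) * 2 - 1
    # 2) Band-limit to ~7 kHz to avoid artifacts from audio codecs.
    b, a = butter(6, cutoff_hz / (fs / 2), btype="low")
    blpn = lfilter(b, a, pn.astype(np.float64))
    return blpn / np.max(np.abs(blpn))  # normalize to full scale

# 3) Write the BLPN signal to a WAV file for playout.
blpn = make_blpn()
wavfile.write("blpn.wav", FS, (blpn * 32767).astype(np.int16))
```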

For timing detection, the proposed algorithm attempts to match the received PN audio against an offset copy of the original PN. By correlating the waveforms while adjusting (sliding) the offset, we can find a correlation peak that identifies the timing.
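As a sketch of that idea (the function name and the np.correlate-based approach are mine, not necessarily what the OF will use):

```python
import numpy as np

def find_offset(received, reference):
    """Slide the reference PN template across the received audio and
    return the offset (in samples) of the strongest correlation peak."""
    corr = np.correlate(received, reference, mode="valid")
    return int(np.argmax(np.abs(corr)))

# offset_samples / 48000.0 converts the result to seconds.
```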

As mentioned, this is the current concept for timing sync, although one member proposed coded spectral bands in the DPC Test meeting of 6/22.

I did a simple simulation of a cross-correlation detection method. The generated audio is the BLPN signal plus random noise, with the noise scaled to approximate the desired audio SNR. The resulting sample at any audio sample time k is BLPN[k] + factor * random[0..1), where 'factor' is chosen to set the desired SNR.

Ideally, the noise can be much louder than the PN, because then we can have music (the "noise") with the BLPN quietly in the background. For the simulation I used parameters etc. from the current WAVE content generation code. For cross-correlation I used a string of 100 consecutive samples of BLPN (longer is better; this is just what I used).
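A rough re-creation of that simulation follows. This is not the original simulation code: I substitute zero-mean Gaussian noise for the uniform random values described above, and the seed and offset are arbitrary.

```python
import numpy as np

def simulate_detection(blpn, snr_db, template_len=100, true_offset=12345, seed=1):
    """Add noise at a target SNR and check whether correlation still
    locates a known 100-sample BLPN template at the right offset."""
    rng = np.random.default_rng(seed)
    template = blpn[true_offset:true_offset + template_len]
    # Scale the noise so that 10*log10(P_signal / P_noise) = snr_db.
    p_signal = np.mean(blpn ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    received = blpn + rng.normal(0.0, np.sqrt(p_noise), size=blpn.shape)
    corr = np.correlate(received, template, mode="valid")
    return int(np.argmax(np.abs(corr))) == true_offset

# e.g., estimate a detection rate at 0 dB SNR over 100 noise draws:
# rate = np.mean([simulate_detection(blpn, 0.0, seed=s) for s in range(100)])
```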

Results:

From this simulation I would guess we might get acceptable performance at 0 dB SNR, meaning music and PN at the same average power. That is not as bad as I had thought. Note that this is not a definitive simulation; there are a few differences between my correlation and the one planned, but it should be similar to actual results.

Net: it seems reasonable to try combining human audio and the BLPN sequence. Audio blips and background music may not destroy programmatic sync from the BLPN. At least, from this rough simulation, it appears worth trying.

A final point is that audio sync detection is currently out of scope for the observation framework, so this is all for the future.

cta-source commented 2 years ago

Tl;dr: We should pursue a 2-phase approach for audio testing. Phase A would be based on wired audio with no added music, and would fit the DPC Specification's current audio test requirements, with a reasonably high degree of confidence. Phase B would seek to move to speaker output with added music, but leans more towards “research” and has more uncertainty as to feasibility. We should start work on Phase A proof-of-concept test jigs next, as resources are available. Some initial work on such a test jig is going on now to further inform this effort.
...

Based on additional study and helpful side discussions with Eurofins, Dolby and Thomas S., here are a few preliminary points that may help set expectations. To hit the key point first: using speakers with added music and extracting a PN sync code appears technically feasible if we work with 60-second audio samples. Unfortunately, the ~20 ms resolution implied by the DPC Specification makes this harder to implement than the wired approach, so these enhancements should remain "research" while we build a first working wired test system.

Approach: Rather than starting from an implementation or from research, we should start from the DPC specification. From its requirements we can derive the implementation and capabilities, and then consider enhancements.

Whether we have speakers and music, or just PN, is a matter of "added noise", which comes down to signal-to-noise ratio (SNR). We can derive the available SNR from the basics of the system (On-Off Keying modulation, signal bandwidth, observation period, etc.). And we need at least a minimum implementation while we work through these more "stretch goal" concepts.

So I suggest a two-phase approach:

• Phase A: wired connection, no music
• Phase B: attempt to move to speaker output with added music

In Phase A, there will be minimal-but-nonzero system noise and implementation loss. In Phase B, the "speaker out" enhancement will add room noise, room echoes and cross-channel interference; "added music" will add consequential interference power.

Deriving Requirements: Here is a summary of the DPC specification's audio observation requirements (using dpctf-s00001-v032-WAVE-DPC-v1.02, and ignoring human observations like "no clicks or pops").

DPC Specification Audio Observation Requirements (sections 8 & 9):

  1. Every audio sample renders.
  2. Audio samples render in increasing presentation time order.
  3. For random access (to a time or a fragment), no earlier audio sample shall be rendered.
  4. Audio playback duration matches the expected value, at audio-sample-level resolution.
  5. Audio startup delay is sufficiently low, at audio-sample-level resolution.
  6. The presented audio sample matches the one reported by currentTime.

Note: In this context, an "audio sample" is understood to be an audio "ISOBMFF media sample", the length of which may be on the order of 20 ms. (Note the difference between this "audio sample" interpretation and the "audio sample" definition meaning the data values collected at the audio ADC clock rate of 48 kHz.)

These six basic requirements all imply the use of audio samples. Each audio sample can be used to extract its actual presentation time. The shorter the audio sample, the less robust the extraction is to noise (i.e., music, echoes, whatever).

Note that this summarizes the current DPC Spec requirements, but there are also "tbd"s in that spec. We may also want ancillary test capabilities, such as vetting the test setup. So this requirement set is a starting point and a minimum at this time.

Implementation & Capabilities: If we assume a 20 ms audio sample time ("Observation Time"), calculations show a required PN-to-music ratio (CNR) of -11.4 dB. A music/PN file with the PN "noise" only 11.4 dB under the music would be fairly unlistenable. That doesn't meet the goal of "music for someone to listen to instead of PN noise."
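The arithmetic behind that figure, using the values from the link budget posted later in this thread (Eb/N0 of 7 dB at the target error rate, 48000 bit/s PN rate, 7 kHz bandwidth, 3 dB implementation loss):

$$\mathrm{SNR}_{\mathrm{req}} = \frac{E_b}{N_0}\Big|_{\mathrm{dB}} + 10\log_{10}\frac{R_b}{B} = 7 + 10\log_{10}\frac{48000}{7000} \approx 15.4\ \mathrm{dB}$$

$$G_p = 10\log_{10}(R_b \cdot T_{\mathrm{obs}}) = 10\log_{10}(48000 \times 0.02) \approx 29.8\ \mathrm{dB}$$

$$\mathrm{CNR}_{\mathrm{req}} = \mathrm{SNR}_{\mathrm{req}} - G_p + L_{\mathrm{impl}} = 15.4 - 29.8 + 3 = -11.4\ \mathrm{dB}$$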

Therefore, the mezz/encoded content that supports these six requirements must start off as: A. Wired audio and a short (e.g., 20 ms) minimum Observation Time (which equals the minimum allowed media sample time).

For some additional test capabilities beyond those basic six requirements, we can also consider: B. A longer Observation Time, with media samples (audio samples) constrained to be correspondingly longer, such as 1 second or 60 seconds.

For the enhancements of speaker-out transmission and a music overlay, we would need: C. Filtering to remove the added music and the speaker-out impairments (stretch goal). Item C may seem ambitious, but there is a theoretical path to doing it, and existing Python libraries to at least try it out. It's a longer path with less assurance of success, and some downstream inconveniences in actually using the approach, but there is a path.

Content Splice Testing: In addition, as mentioned today in the DPCTF Test Runner meeting: for content splicing, we will need to distinguish the timestamps of the different spliced contents. So if we have Main content playing and an Ad spliced into it, the Main will use PN1 and the Ad must use a PN2 that is different from PN1.
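A minimal illustration of the idea, assuming (as in the sketches above) that PN sequences are derived from seeds; the seeds and lengths here are arbitrary:

```python
import numpy as np

# Distinct seeds yield statistically independent PN sequences, so the
# correlator can tell Main (PN1) content apart from Ad (PN2) content.
pn1 = np.random.default_rng(seed=1).integers(0, 2, size=48000) * 2 - 1  # Main
pn2 = np.random.default_rng(seed=2).integers(0, 2, size=48000) * 2 - 1  # Ad
assert not np.array_equal(pn1, pn2)
```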

This is an update, not a final conclusion. The next step is to build a test jig for wired sync, plus API development for the Observation Framework.

cta-source commented 2 years ago

Here, for the record (as Jon said, "two years later someone trying to figure out what we did") is the link budget for the reception of the PN code.


| Index | Item | Value | Units | How | Comment |
| -- | -- | -- | -- | -- | -- |
| A. | Modulation | BPSK | | | Actually On-Off Keying or Amplitude Shift Keying; BPSK is equivalent for Eb/N0 determination. |
| B. | BER | 1.00E-03 | | | Required "bit error rate"; in this case it will be a false sync rate. 1E-3 --> 7 dB. |
| C. | Eb/N0 | 7 | dB | | Theoretical ratio of energy per bit (Eb) to noise power spectral density (N0); determined by modulation scheme (from Eb/N0 vs. BER graph, BPSK/QPSK curve). |
| D. | Data rate | 48000 | bps | | The PN bits are WAV encoded as one bit per audio sample; audio sample rate is 48 kHz. |
| E. | BW | 7000 | Hz | | We apply a 7 kHz filter as pre-conditioning for compression schemes like HE-AAC, which uses SBR for higher-band content on decode. |
| F. | SNR | 15.4 | dB | = C + 10*log10(D/E) | Required signal-to-noise ratio at the required failure rate. "Signal" is the PN sequence. |
| G. | TOP | 0.02 | sec | | Observation period. Longer is better. |
| H. | Chipping rate | 960 | bps | = D * G | Number of bits used per observation; in spread spectrum this would be the number of spreading bits per data bit. |
| I. | Process gain | 29.8 | dB | = 10*log10(H) | Gain from using multiple bits for the decision, expressed in dB (i.e., 10*log10(chipping rate)). |
| J. | Implementation losses | 3 | dB | | Budget for real hardware, plus audio compression artifacts post-decode. Need to measure with real audio compression as well; this number will go up. Can add additional "comfort factor" margin here as well. |
| K. | CNR actual | -11.4 | dB | = (F - I) + J | Usable ratio of PN signal to (music+noise+etc.). Negative implies louder music than PN data. |
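The "How" column as executable arithmetic; a sketch for checking the numbers (variable names are mine):

```python
import math

eb_n0_db = 7.0       # C: Eb/N0 at BER 1E-3 (BPSK curve)
data_rate = 48000    # D: PN bits per second (one per audio sample)
bandwidth = 7000     # E: Hz, after the 7 kHz band-limiting filter
t_obs = 0.02         # G: observation period, seconds
impl_loss_db = 3.0   # J: implementation loss budget

snr_req_db = eb_n0_db + 10 * math.log10(data_rate / bandwidth)  # F: ~15.4 dB
bits_per_obs = data_rate * t_obs                                # H: 960
process_gain_db = 10 * math.log10(bits_per_obs)                 # I: ~29.8 dB
cnr_db = snr_req_db - process_gain_db + impl_loss_db            # K: ~-11.4 dB
# The table rounds intermediate values; computed exactly this is -11.46 dB.
print(f"Required CNR: {cnr_db:.1f} dB")
```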

A basic modulation Eb/N0 vs. BER curve (from the Wikipedia article on Eb/N0). See the dotted red line for the operating point of 1E-3 (one failure in 1000 tests):

[Figure: BER vs. Eb/N0 curves for common modulation schemes, with the 1E-3 operating point marked on the BPSK/QPSK curve.]

cta-source commented 2 years ago

We now have code (in a "research" repo) that matches the performance predicted here, except that the Implementation Loss is not 3 dB; it appears closer to 1.5 dB. The actual required CNR (a.k.a. audio SNR) appears to be about -13 dB. A negative SNR requirement means the PN sequence audio track can be "buried" under an audio track 13 dB stronger before failure occurs. Some margin will be required, so not exactly 13 dB.

Also, this required SNR is relative to the peak audio power in the music track, not the average power. Actual music has instantaneous peaks higher than the running average power (a non-zero peak-to-average power ratio, expressed in dB). This is controllable to some extent when selecting and conditioning audio files prior to encoding (techniques include loudness compression and peak suppression), but at the cost of perceived sound quality.
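Since the margin is set by peak rather than average power, a quick check of a candidate track's peak-to-average power ratio is useful when conditioning audio. A small helper (my own, for illustration):

```python
import numpy as np

def papr_db(audio):
    """Peak-to-average power ratio of an audio track, in dB."""
    audio = np.asarray(audio, dtype=np.float64)
    peak_power = np.max(audio ** 2)
    mean_power = np.mean(audio ** 2)
    return 10 * np.log10(peak_power / mean_power)

# A track with a high papr_db() needs extra headroom (or loudness
# compression) before the ~-13 dB CNR margin applies to its peaks.
```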

This wraps up the research stage; the next step is integration of the result into the OF, assuming wired connections. For "futures", we're still hoping to get speakers+mic working as well, but after wired connections.

Closing this issue as the research phase is done.