Closed: bobcampbell-resillion closed this issue 1 year ago.
I'm not an audio expert but here are some of the possibilities that occurred to me.
The use of 0.2s is purely arbitrary in these examples. It could be longer or shorter depending on how long a sample would be needed for analysis tools to determine the frequency.
In the automated (or semi-automated) case, I have no idea of what open source tool might be able to analyse an audio file and provide information on the frequency and how that varies over time within the audio file. It may be easier to find a tool that works with step changes in the frequency than with smoothly increasing or decreasing frequencies.
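On the tooling question: even without a dedicated open source analyser, a crude per-window zero-crossing count can recover a step-change frequency profile. Below is a minimal, hypothetical sketch (not part of any existing mezzanine tooling), assuming a mono capture as a Python list of float samples and the arbitrary 0.2 s window mentioned above:

```python
import math

def estimate_freq(samples, sample_rate):
    """Estimate the frequency of a roughly sinusoidal window
    by counting zero crossings (two crossings per cycle)."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if a < 0 <= b or b < 0 <= a
    )
    duration = len(samples) / sample_rate
    return crossings / (2 * duration)

def freq_profile(samples, sample_rate, window_s=0.2):
    """One frequency estimate per window_s slice of the signal."""
    win = int(window_s * sample_rate)
    return [
        estimate_freq(samples[i:i + win], sample_rate)
        for i in range(0, len(samples) - win + 1, win)
    ]

# Synthetic test signal with a step change:
# 0.2 s at 440 Hz followed by 0.2 s at 880 Hz.
rate = 48000
signal = [math.sin(2 * math.pi * 440 * t / rate) for t in range(int(0.2 * rate))]
signal += [math.sin(2 * math.pi * 880 * t / rate) for t in range(int(0.2 * rate))]

print(freq_profile(signal, rate))  # roughly [440, 880], within a few Hz
```

A real tool would use an FFT or spectrogram instead, but this illustrates why step changes are easier to work with than smooth sweeps: each window contains a single stable frequency to estimate.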
For this sort of audio testing, which is about proper presentation of the streams, the key element to test is usually audio sync. This is standard practice: video test patterns should always include an audio sync element. Where human testing is involved, dialogue is also a useful addition, as the human ear is very adept at detecting out-of-sync dialogue, though it doesn’t replace good sync-mark testing.
Next in line would be proper reproduction of the channel configuration. Test patterns we often ship include a visual representation, synchronized with the audio, of a channel-isolated sound (i.e. “left channel, left channel”).
I don’t see much value in testing the audio quality itself (for noise, jitter, whatever) as you would be testing things that are out-of-scope such as the actual decoders or compression algorithms. However, it would always be welcome to have audio tone (typically 1kHz) as part of a test pattern – this is always handy for audio chain calibration, and could even be used in automated testing to make sure there are no changes in playback speed or wild audio degradations.
Richard
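The 1 kHz tone suggested above could back a very simple automated playback-speed check: measure the tone's frequency in a capture and compare it to nominal. A hedged sketch (hypothetical helper, same zero-crossing idea, assuming a mono float capture):

```python
import math

def playback_speed_ratio(samples, sample_rate, nominal_hz=1000.0):
    """Ratio of the measured tone frequency to the nominal reference.
    1.0 means nominal playback speed; 1.02 means 2% fast."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if a < 0 <= b or b < 0 <= a)
    measured_hz = crossings / 2 / (len(samples) / sample_rate)
    return measured_hz / nominal_hz

# Synthetic capture of the 1 kHz reference tone played back ~2% fast,
# so it arrives as a ~1020 Hz tone over 1 second.
rate = 48000
capture = [math.sin(2 * math.pi * 1020 * t / rate) for t in range(rate)]
print(playback_speed_ratio(capture, rate))  # close to 1.02
```

A gross deviation from 1.0 would also flag the "wild audio degradations" case, since heavily corrupted audio would no longer read as a clean tone.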
Thanks both. Assuming the definition of "jointly" discussed in #64 can be resolved, I agree one should add some synchronised flashes and beeps overlaid on the underlying content (if no sync mark exists already), which work OK for both manual and automated observations. But, in the context of a parallel conversation about alternative “open source” mezzanine content, it sounds like Tears of Steel would be better than Big Buck Bunny, due to the live-action segments where lip sync might be more obvious.
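For the automated side of the flash-and-beep approach, beep onsets in the captured audio can be located with a short-window energy threshold and then compared against flash times recovered from the video frames. A rough, hypothetical sketch (assuming a mono float capture; the window size and threshold would need tuning for real content):

```python
import math

def beep_onsets(samples, sample_rate, window_s=0.01, threshold=0.1):
    """Approximate beep onset times in seconds: the points where
    short-window RMS energy first rises above the threshold."""
    win = int(window_s * sample_rate)
    onsets, loud = [], False
    for i in range(0, len(samples) - win + 1, win):
        w = samples[i:i + win]
        rms = (sum(x * x for x in w) / win) ** 0.5
        if rms > threshold and not loud:
            onsets.append(i / sample_rate)
        loud = rms > threshold
    return onsets

# Synthetic capture: 0.5 s silence, a 0.1 s beep, 0.5 s silence.
rate = 48000
capture = ([0.0] * (rate // 2)
           + [math.sin(2 * math.pi * 1000 * t / rate) for t in range(rate // 10)]
           + [0.0] * (rate // 2))
print(beep_onsets(capture, rate))  # -> [0.5]
```

The A/V offset would then be `beep_time - flash_time` for each matched pair; a consistently non-zero offset indicates a sync error.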
I think we've lost the original issue here. Features to include in audio to test audio/video sync are one thing. They are somewhat understood and the current version of the mezzanine content script includes flashes and beeps based on the work from the BBC.
The current "single track media playback" requirements include a variation on this as a "required observation" for all media formats:
Every sample S[k,s] shall be rendered and the samples shall be rendered in increasing presentation time order.
For video, every sample means every frame, and the mezzanine content script adds a distinct QR code, a timecode and a frame number to every video frame.
Is there something similar for audio?
If not, then perhaps this requirement should be moved from under "required observations" / "general" to being under "video".
FWIW I don't think it's practical or useful to verify this requirement in audio:
Every sample S[k,s] shall be rendered and the samples shall be rendered in increasing presentation time order.
...but if someone thinks up a means to do so then great. If it's moved so it doesn't apply to audio, I suggest replacing it with something softer that relates to "the audio plays". Otherwise a WAVE device could fail to play certain audio "properly", as I'm not sure under what other requirement that would be verified.
One sentence needs to be moved so it applies only to video and not to both. Otherwise, input is still wanted for audio.
Following today's DPCTF call, here is a summary of what the mezzanine content currently includes:
DPCTF 2022/01/26 this has been covered by quite some efforts on creating the mezzanine content. Please check here for detailed discussion: https://github.com/cta-wave/mezzanine.
A code is proposed: https://github.com/cta-source/audio-watermark-study
Can we use this as mezzanine content and document it in the specification in the Annex? The Test TF will address whether this approach is agreeable. Once completed, we will address documentation in the specification.
@cta-source please correct or add additional information @jpiesing please discuss this in Test TF
I suggest we close this issue.
Closed per recommendation by @cta-source following discussion on DPCTF call held March 8, 2023.
In most of the spec sections a set of generic observation requirements are stated:
(or something similar adjusted for e.g. random access cases, see also #62 ) and also:
Note also under 9.2.5.1, 9.3.5.1 and 9.4.5.1 there is an implied AV sync requirement:
(definition of "jointly" under discussion in issue #64)
For video, some annotations are possible on each frame that would be both human readable, and facilitate later automation.
For audio, some observations seem challenging to achieve: do any audio experts have suggestions or even examples of suitable streams that would be a good basis for a mezzanine audio track?
Note that there may or may not be background audio from the source mezzanine video; it is assumed this is unlikely to be useful in determining the above requirements...
The spec may also benefit from articulating more explicitly the human-verifiable means of confirming the required observations in the case of audio media...