cta-wave / mezzanine

This repo contains scripts that will build annotated test content from specific source content, compatible with the WAVE device playback test suite.
BSD 3-Clause "New" or "Revised" License

Generate a second audio mezzanine #20

Closed haudiobe closed 2 years ago

haudiobe commented 3 years ago

Create an additional audio mezzanine that permits emulating different languages. The mezzanine should be clearly identifiable as a different language in order to support switching observation. Ideas welcome.

haudiobe commented 3 years ago

Options:

jpiesing commented 3 years ago

It would be preferable if this could be detected automatically. The pitch of the beep might (or might not) be detectable in an mp4 file downloaded from a camera or captured from a webcam. Another option would be the ATSC 3 audio watermark although the Verance example code for this isn't under an open source license.

@haudiobe Is this relevant for anything except NGA with multiple preselections multiplexed in the same stream? If not, then is it relevant at all, as MSE doesn't support exposing preselections as AudioTrack objects?
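If track identity were encoded as a beep pitch as suggested above, automatic detection could be sketched with the Goertzel algorithm. This is a hypothetical sketch, not part of the repo: the function names and candidate frequencies are assumptions, and a real detector would also need to locate the beep in time within the recording.

```python
import math

def goertzel_power(samples, sample_rate, freq):
    """Estimate signal power at a single frequency using the Goertzel algorithm."""
    n = len(samples)
    k = round(n * freq / sample_rate)          # nearest DFT bin for freq
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

def detect_beep_pitch(samples, sample_rate, candidates):
    """Return the candidate frequency with the most energy in `samples`."""
    return max(candidates, key=lambda f: goertzel_power(samples, sample_rate, f))
```

Goertzel only evaluates the few frequencies of interest, which is cheaper than a full FFT when each track is tagged with one of a small set of known pitches.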

haudiobe commented 3 years ago

Jon, we need multiple languages or variants at least, when using the test vectors for switching audio in DASH tests. This is not a CTA WAVE problem in the first place, but if we want to rely on CTA WAVE test vectors for client testing as well, then it would be good to add this here.

jpiesing commented 3 years ago

One thing you can do is take an existing audio track with suitable rights and mix something else into it. For example, record someone saying "German" and someone saying "English", then make one version with "German" mixed in every 15 s and another version with "English" mixed in every 15 s.
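A minimal sketch of the mixing step described above, operating on decoded PCM samples as floats in [-1.0, 1.0]. The function name and parameters are illustrative; real mezzanine would be mixed with an audio tool rather than sample-by-sample Python.

```python
def mix_cue_into_track(track, cue, interval_s, sample_rate):
    """Mix the short `cue` clip into `track` at every `interval_s` seconds.

    Samples are floats in [-1.0, 1.0]; the sum is clipped to that range.
    """
    mixed = list(track)
    step = int(interval_s * sample_rate)
    for start in range(0, len(mixed), step):
        for i, c in enumerate(cue):
            if start + i >= len(mixed):
                break
            mixed[start + i] = max(-1.0, min(1.0, mixed[start + i] + c))
    return mixed
```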

The HbbTV DASH-DRM reference app does this. To see this, go to http://refapp.hbbtv.org/production/catalogue/, then navigate to "1.7 Multiple audio (v4)" using the cursor keys - down, right, right and press enter to play the stream.

Big Buck Bunny (and similar content) uses a version of the Creative Commons license that permits derivative works.

nicholas-fr commented 3 years ago

We could generate the audio for that using the Python pyttsx3 library. For example, this gives decent voice output that can be mixed into existing (or new) audio tracks:

```python
import pyttsx3

engine = pyttsx3.init()
engine.setProperty('rate', 100)
engine.say("English")
engine.runAndWait()
```

It can easily be saved to a file for mixing into mezzanine content:

```python
engine.save_to_file("English", 'EnglishAudio.mp3')
engine.runAndWait()
```

More info here: https://pyttsx3.readthedocs.io/en/latest/engine.html#module-pyttsx3.voice

bobcampbell-resillion commented 3 years ago

Adding a link to a related preceding discussion on audio: "Audio experts: what are suitable audio features needed in mezzanine content to enable required observations (manual or automated) for audio streams?"

Some other thoughts:

jpiesing commented 3 years ago
> * Spoken "english channel", "spanish channel" might not be the best choice for generic mezzanine representing "another language" (think of the tester hearing "spanish audio" when selecting "not spanish" language because the device is intended for a different region). "Alternate audio A/B/C" might be more generic, or else you want a handful of regions covered.

While I agree with most of the comment, I don't completely agree with the above. If the manifest can signal a language (like DASH can) then the audio tracks would need to contain something that can be linked back to the language signalled in the manifest. Using DASH as an example (apologies to HLS), if the manifest contains an English and a Spanish Adaptation Set then having the English audio identify itself as "alternate audio A" and the Spanish Adaptation Set identify itself as "alternate audio B" would IMHO be really confusing. I would expect that a device only intended for (say) Germany or Dubai or Taiwan would still be able to do something sensible when presented with a manifest with only English and Spanish Adaptation Sets.
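For context on the manifest signalling mentioned above: in DASH, each audio Adaptation Set carries a `lang` attribute that the spoken content in the mezzanine audio would need to match. A hypothetical MPD excerpt (IDs, codecs, and bandwidths are illustrative, not from this repo):

```xml
<!-- Two audio Adaptation Sets distinguished by @lang -->
<AdaptationSet contentType="audio" lang="en" mimeType="audio/mp4">
  <Representation id="audio-en" bandwidth="128000" codecs="mp4a.40.2"/>
</AdaptationSet>
<AdaptationSet contentType="audio" lang="es" mimeType="audio/mp4">
  <Representation id="audio-es" bandwidth="128000" codecs="mp4a.40.2"/>
</AdaptationSet>
```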

bobcampbell-resillion commented 3 years ago

> I would expect that a device only intended for (say) Germany or Dubai or Taiwan would still be able to do something sensible when presented with a manifest with only English and Spanish Adaptation Sets.

I'm not sure, and was trying to think of a generic approach to mezzanine so we didn't have to think ahead to what might happen. But...

> having the English audio identify itself as "alternate audio A" and the Spanish Adaptation Set identify itself as "alternate audio B" would IMHO be really confusing.

...I won't argue with that. I did think having say Taiwan audio identify itself as "Spanish audio" would be equally confusing but concede confusing testers in most common cases for the sake of a few (possibly corner) cases doesn't seem a good trade off.

haudiobe commented 3 years ago

2020/2/12 DPCTF: Let's go with the Proposal from Jon and Nicholas - thanks

bobcampbell-resillion commented 3 years ago

> Let's go with the Proposal from Jon and Nicholas

I don't know what that means. Jon originally proposed something like watermarks, then agreed with several of my points, and not one - which I am not arguing with. Nicholas described how it might be created, not what it contained, I think.

It would help to close this thread with a clearer statement of what the mezzanine requirements for this use case that have been agreed actually are...

Then, who is going to implement them?

jpiesing commented 3 years ago

January 5th meeting: still an issue to be resolved with levels. Discussion about how many mezzanine streams need 2 audio tracks. Proposal to just do it for the highest resolution of each frame rate family, 1920x1080p. One is better than none. Next steps: will include this when mezzanine content is regenerated.

nicholas-fr commented 3 years ago

Pull request #22 includes a first implementation of this. The add_second_audio_track.py script generates a copy of a mezzanine stream, with an additional (2nd) audio track that contains the word "English" spoken every 15 seconds, mixed with the original audio.
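One way such a script could drive ffmpeg is sketched below. This is illustrative only, not the actual add_second_audio_track.py logic: the function name, the amix filtergraph, and the assumption that the spoken cue file has been pre-padded with silence to 15 s (so looping it yields one utterance every 15 s) are all assumptions.

```python
def build_second_audio_cmd(src, cue, out):
    """Build an ffmpeg command that adds a second audio track to `src`.

    The second track is the original audio mixed with `cue`, a spoken-word
    clip assumed to be pre-padded with silence to 15 s.
    """
    return [
        "ffmpeg",
        "-i", src,                # input 0: mezzanine with video + original audio
        "-stream_loop", "-1",     # loop the next input indefinitely...
        "-i", cue,                # ...input 1: 15 s spoken-word clip
        "-filter_complex",
        "[0:a][1:a]amix=inputs=2:duration=first[mixed]",
        "-map", "0:v",            # keep original video
        "-map", "0:a",            # keep original audio as the first track
        "-map", "[mixed]",        # add the mixed audio as a second track
        "-c:v", "copy",           # don't re-encode video
        out,
    ]
```

The command list could then be run with `subprocess.run(build_second_audio_cmd(...), check=True)`.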

Feedback and proposals for further improvement are welcome.

nicholas-fr commented 2 years ago

As we can now generate mezzanine with a second audio track for emulating language selection, and no further discussion has occurred for over 8 months, I propose to close this issue and raise more specific issues to address any problems with the current implementation in add_second_audio_track.py.

Further improvements will likely be possible following the work on audio mezzanine as discussed in #39, #41 and #44.