Explicit Padding of Subtitles and Timed Text tracks to comply with CMAF track/CMAF presentation model

RufaelDev commented 4 years ago

CMAF presentations composed of audio, video and timed text should have tracks defining them, and tracks are composed of CMAF fragments or segments.

In some practical cases, subtitles or timed text are not available (e.g. at the end of a presentation). To comply with the CMAF presentation model, it would be nice if CMAF could include a recommendation for using empty subtitle or timed text samples that have a timespan but do not contain text. This is supported in MPEG-4 Part 30 but it is not explicit or required in CMAF. My recommendation is to define a default method for padding fragments when no timed text or subtitle is available. This way it will be more explicit how media presentations with timed text or subtitles can comply with the CMAF presentation model.

My recommendation would be to recommend fragments with a sample carrying a VTTEmptyCueBox or a sample containing a valid TTML document.

It would be great if section 11 could suggest how CMAF tracks that partially have no subtitles can be supported by padding fragments, perhaps with an example.

Again, I think the padding can be done in different ways, but making this an explicit recommendation would be helpful. Too many times we see a subtitle track that is much shorter than the audio video or has a gap.
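
As an illustration of the WebVTT padding idea above, here is a minimal Python sketch of the payload of an empty sample: a single VTTEmptyCueBox, which MPEG-4 Part 30 defines as a plain box of type `vtte` with no payload. The function name is hypothetical, and wrapping the sample in a moof/mdat fragment is left out.

```python
import struct

def vtt_empty_cue_sample() -> bytes:
    """Return the bytes of a sample holding one VTTEmptyCueBox.

    A plain ISOBMFF box is a 32-bit big-endian size followed by a
    four-character type. 'vtte' has no payload, so the size is 8.
    """
    return struct.pack(">I4s", 8, b"vtte")

sample = vtt_empty_cue_sample()
print(sample.hex())  # 0000000876747465
```

The sample's timespan is not stored in the box itself; it comes from the sample duration signalled in the fragment.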

jeanlf commented 4 years ago

Wouldn't the duration-is-empty flag in tfhd fit this? This would avoid inserting blank samples which would need removal when de-fragmenting the file.
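
For readers unfamiliar with the flag, a rough Python sketch of a TrackFragmentHeaderBox that sets duration-is-empty together with a default sample duration. The flag values are those of ISOBMFF; the function name and the choice to also set default-base-is-moof are assumptions made for illustration, and whether players accept such a fragment is the open question in this thread.

```python
import struct

# tf_flags values defined for the TrackFragmentHeaderBox in ISOBMFF.
DEFAULT_SAMPLE_DURATION_PRESENT = 0x000008
DURATION_IS_EMPTY = 0x010000
DEFAULT_BASE_IS_MOOF = 0x020000

def empty_tfhd(track_id: int, duration: int) -> bytes:
    """Build a tfhd (hypothetical helper) signalling an empty interval.

    Layout: size, 'tfhd', version (8 bits) + flags (24 bits), track_ID,
    then default_sample_duration because its presence flag is set.
    """
    flags = DEFAULT_SAMPLE_DURATION_PRESENT | DURATION_IS_EMPTY | DEFAULT_BASE_IS_MOOF
    version_and_flags = flags  # version 0 occupies the top byte
    payload = struct.pack(">III", version_and_flags, track_id, duration)
    return struct.pack(">I4s", 8 + len(payload), b"tfhd") + payload

box = empty_tfhd(track_id=1, duration=2000)
print(len(box))  # 20
```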

RufaelDev commented 4 years ago

I am afraid this will crash most players, as this duration-is-empty is not what is adopted in MPEG-4 Part 30 or in CMAF.
I am not sure that such gaps are allowed by CMAF; this can be debated regarding the definition of an empty sample.

So overall, no, I disagree, because we think this is not well supported in players, while inserting TTML or a VTTEmptyCueBox is always supported. For defragmenting, removing a WVTT empty cue should already be supported by any defragmenter, while removing TTML may not be supported but should be straightforward if you want to do that. Again, keeping it in a defragmented file causes no harm either.

mikedo commented 4 years ago

The minimalist conformant TTML document is:

<?xml version="1.0" encoding="UTF-8"?>
<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml"/>

ATSC will probably recommend this for sparse tracks.
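
A quick way to sanity-check such a document is to parse it and confirm that the root is `tt` in the TTML namespace with no children. The sketch below assumes the namespace attribute is spelled `xmlns`; the helper name is hypothetical and this is not a conformance checker.

```python
import xml.etree.ElementTree as ET

TTML_NS = "http://www.w3.org/ns/ttml"

EMPTY_TTML = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml"/>'
)

def is_empty_ttml(document: bytes) -> bool:
    """True if the root is tt in the TTML namespace and has no child elements."""
    root = ET.fromstring(document)
    return root.tag == f"{{{TTML_NS}}}tt" and len(root) == 0

print(is_empty_ttml(EMPTY_TTML))  # True
```

A misspelled namespace attribute still parses as well-formed XML, but the root then falls outside the TTML namespace and the check returns False.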

RufaelDev commented 4 years ago

@mikedo thanks, I would recommend something similar in DVB-DASH (this is also in EBU Tech 3381), and I would hope that such a recommendation can eventually be included in CMAF, including how and when to do the padding.

jeanlf commented 4 years ago

I am afraid this will crash most players, as this duration-is-empty is not what is adopted in MPEG-4 part 30 or in CMAF

I cannot find anything in CMAF or in Part 30 against use of the duration-is-empty flag.

I am not sure that such gaps are allowed by CMAF this can be debated regarding the definition of empty sample

Same thing, I cannot find anything in CMAF regarding this

The general problem I have with this approach is that rather than using a tool that is well documented and has no impact on the source content (hence is transparent for packagers), we now insert empty samples which are all format specific, hence make the packager codec specific. We're doing it here for WebVTT and TTML, but in a few years we'll end up with thousands of sparse metadata formats (haptics, annotations, etc ...) that will follow this same approach, each new format requiring a patch of the packager (and likely defragmenter). And that worries me.

RufaelDev commented 4 years ago

We have two issues:

a) With duration-is-empty you don't have media, i.e. you have gaps, which are not allowed in CMAF and which many players cannot handle.
b) Using this to fill a gap at the end would lead to large segments, which is not supported in DASH numbering, which allows only 50% deviation, and will also hike your max segment duration and therefore latency.

Both approaches, for empty TTML and for WebVTT, are explicit in MPEG-4 Part 30 (either VTTEmptyCue or TTML without body). I think the only gap is to describe in CMAF some recommendations for padding to fulfill the CMAF track/switching set/presentation requirements.

jeanlf commented 4 years ago

with duration-is-empty you don't have media, i.e. gaps which is not allowed in CMAF and many players (cannot handle this)

I cannot find anything in CMAF stating this is not allowed, maybe I'm missing something. For media players, what kind of behaviors do you observe ? Parsing issues, decoding issues?

at the end when using this to fill a gap would lead to large segments, which is not supported in DASH numbering that allow only 50% deviation and will also hike your max-segment duration and therefore latency

I disagree, the tfdt can still be present in the empty fragment to indicate a "current decode time" although there is nothing to decode. You can insert as many of these empty segments with updated tfdt as required for your segment duration constraints, just like you insert fake empty samples currently.
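
A small Python sketch of that idea: a long interval without subtitles is split into pieces no longer than a maximum segment duration, each with its own tfdt. The function and parameter names are hypothetical, and times are in track timescale units.

```python
from typing import List, Tuple

def empty_fragments(gap_start: int, gap_end: int, max_duration: int) -> List[Tuple[int, int]]:
    """Split [gap_start, gap_end) into (tfdt, duration) pairs.

    Each pair describes one empty fragment whose duration does not
    exceed max_duration, so segment duration constraints still hold.
    """
    if max_duration <= 0:
        raise ValueError("max_duration must be positive")
    fragments = []
    t = gap_start
    while t < gap_end:
        d = min(max_duration, gap_end - t)
        fragments.append((t, d))
        t += d
    return fragments

print(empty_fragments(90000, 315000, 90000))
# [(90000, 90000), (180000, 90000), (270000, 45000)]
```

The same split applies whichever way the fragments signal emptiness, whether by a flag or by format-specific empty samples.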

cconcolato commented 4 years ago

There are several aspects here:

A design question: If you think in terms of production of subtitle content, it seems awkward for the subtitle generator (upstream) to have to generate empty subtitles because of packaging constraints (downstream), especially for live. Your subtitle generator would have to have a heartbeat production. I'd like to have others' opinions here, e.g. @nigelmegitt. For VoD, you could imagine that it's less of a problem because you can author your documents such that there is always a document covering any point in time, e.g. by extending the duration of the previous sample. But that makes codec-agnostic file manipulations impossible.
A specification question: Looking for "gap" in CMAF, you find Annex F (informative), which explains what happens in case of production error. I don't think that covers what we are discussing here. The only normative statement is in 7.3.2.3:

CMAF chunks in a CMAF track shall not overlap or have gaps in decode time

One can interpret that as not missing tfdt, in the sense that the tfdt of a chunk should equal the tfdt of the previous plus its duration. "duration-is-empty" still means that there is a duration. Also, as Jean indicates, there is no explicit restriction regarding the use of duration-is-empty.
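
Under that interpretation, continuity can be checked from tfdt and duration alone. A minimal Python sketch, with a hypothetical `Chunk` type holding the baseMediaDecodeTime and the total duration of each chunk:

```python
from typing import List, NamedTuple

class Chunk(NamedTuple):
    tfdt: int      # baseMediaDecodeTime of the chunk
    duration: int  # total chunk duration, in timescale units

def continuity_errors(chunks: List[Chunk]) -> List[str]:
    """Report every place where tfdt != previous tfdt + previous duration."""
    errors = []
    for prev, cur in zip(chunks, chunks[1:]):
        expected = prev.tfdt + prev.duration
        if cur.tfdt > expected:
            errors.append(f"gap of {cur.tfdt - expected} before tfdt {cur.tfdt}")
        elif cur.tfdt < expected:
            errors.append(f"overlap of {expected - cur.tfdt} before tfdt {cur.tfdt}")
    return errors

print(continuity_errors([Chunk(0, 100), Chunk(100, 100), Chunk(250, 100)]))
# ['gap of 50 before tfdt 250']
```

A chunk flagged duration-is-empty would pass this check as long as it still carries a duration, which is the point being made here.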

I would suggest we create example content with duration-is-empty and have people test their implementation and report. We can then decide to restrict it (e.g. to a new structural brand such as cmf3) or to explicitly allow it or even to recommend it.

RufaelDev commented 4 years ago

If DVB, ATSC, MPEG-4 Part 30 and EBU Tech 3381 recommend using VTTEmptyCue and/or TTML without body as an empty sample, it would be safer to recommend that in CMAF as well instead of introducing another approach. This would only increase incompatibility, which should not be the goal of CMAF.

duration-is-empty is not referred to in Part 30 or in CMAF; having no samples with decode time (regardless of tfdt) implies a discontinuity or gap. CMAF writes down what is allowed (the rest is not allowed) and this is not part of it.

For content creation it may be the CMAF packager's job to do the padding, or the job of the live encoder producing DASH/CMAF, not the subtitle generator, so I disagree with your statement on design @cconcolato.

cconcolato commented 4 years ago

MPEG-4 Part 30 says:

For sections of the track timeline that have no associated subtitles or timed text content, ‘empty’ samples may be used, as defined for each format, or the duration of the preceding sample extended. Samples with a size of zero are not used.

Note the use of "may" not "should".
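
The two options in that sentence can be sketched on a toy timeline. This is a minimal Python illustration with hypothetical names: samples are (decode time, duration, payload label) tuples, and "EMPTY" stands in for whatever format-specific empty sample applies.

```python
from typing import List, Tuple

Sample = Tuple[int, int, str]  # (decode time, duration, payload label)

def pad_with_empty(samples: List[Sample], track_end: int, empty: str = "EMPTY") -> List[Sample]:
    """Fill every uncovered interval up to track_end with an 'empty' sample."""
    out = []
    cursor = 0
    for start, duration, payload in samples:
        if start > cursor:
            out.append((cursor, start - cursor, empty))
        out.append((start, duration, payload))
        cursor = start + duration
    if cursor < track_end:
        out.append((cursor, track_end - cursor, empty))
    return out

def pad_by_extending(samples: List[Sample], track_end: int) -> List[Sample]:
    """Extend each sample until the next one starts (or until track_end).

    Assumes the first sample starts at time 0, since a leading gap has
    no preceding sample to extend.
    """
    out = []
    for i, (start, _duration, payload) in enumerate(samples):
        next_start = samples[i + 1][0] if i + 1 < len(samples) else track_end
        out.append((start, next_start - start, payload))
    return out

track = [(0, 2, "A"), (5, 1, "B")]
print(pad_with_empty(track, 10))
# [(0, 2, 'A'), (2, 3, 'EMPTY'), (5, 1, 'B'), (6, 4, 'EMPTY')]
print(pad_by_extending(track, 10))
# [(0, 5, 'A'), (5, 5, 'B')]
```

Both results cover the timeline without gaps; they differ in whether the packager needs a format-specific empty payload.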

This would only increase incompatibility which should not be the goal of CMAF.

Of course, CMAF is about improving interoperability. If indeed other SDOs are frozen on a solution, we should let them use it, but that does not mean we cannot evolve CMAF into a more efficient solution.

CMAF writes down what is allowed (the rest is not allowed) and this is not part of it.

That's not correct. CMAF puts restrictions on ISOBMFF. When it does not put restriction on something, it does not mention it. For example, in the same ISOBMFF section 8.8.7.1 where duration-is-empty is defined, default-sample-duration-present is defined. It is not mentioned in CMAF. Are you saying it is not allowed?

For content creation it may be the CMAF packagers job to do the padding, or by the live encoder producing DASH/CMAF, not the subtitle generator, so i disagree with your statement on design @cconcolato .

Then the packager is not codec-agnostic, right? It has to be TTML-aware or VTT-aware or at least have a mapping between codec and an 'empty' sample definition. I was just hinting that this design is not scalable.

RufaelDev commented 4 years ago

My point is that Part 30 does not mention duration-is-empty, and neither does CMAF. The "may" is used because in a non-fragmented format, which is also supported in MPEG-4 Part 30, you do not need this. So yes, using VTTEmptyCue or empty TTML is really optional, but it is currently the only method defined and used to implement the CMAF track model for subtitles.

Sure, CMAF could define or evolve to something better, but typically technological advances should be in the technology standards first, e.g. MPEG-4 Part 30, and only after that be considered in CMAF.

Restricting ISOBMFF in my opinion implies writing what is allowed; it is a matter of wording. So I still believe I am correct, as CMAF restricts both the flags and boxes that can be used and duration-is-empty

Yes packagers are always codec agnostic, that is a fact, just as there are ISOBMFF bindings for AVC/HEVC/VVC/AV1/MPEG-H audio, you name it. Each has its own binding to the file format. So I don't really understand your point about this not being scalable.

nigelmegitt commented 4 years ago

Thanks for bringing me in here @cconcolato .

Subtitle encoder requirements

In terms of the design question I consider a subtitle encoder to be responsible for generating data that effectively encodes a continuous stream of subtitle presentation, in the same way that an audio encoder generates encoded data that, when decoded, produces a continuous stream of audio samples. Clearly the encoded data are time-division-multiplexed, according to the packaging requirements, as set by whoever is configuring the encoding and packaging chain.

So from that perspective, it is reasonable to expect the subtitle encoder to generate encoded subtitle samples which, when decoded, mean "for the duration of this subtitle sample, present nothing". If I saw a subtitle encoder simply stop producing output for a while, I would think it is broken, not that there are no subtitles to present.

Implementation experience

When we implemented the EBU-TT Live Interoperability Toolkit (LIT) we designed the Resequencer component, in its "output a new subtitle document every n seconds" mode, so that it would output documents containing no content for periods when it had received no subtitles. Feeding those documents to the EBU-TT-D encoder then generates empty documents. I mention this because it was my assumption that the subtitle encoder would generate empty documents, at that time.

Packaging

From a packaging perspective, if the input temporarily disappears, it may not be straightforward to update the manifest to remove the subtitle components and then add them back in again, and the impact on players may not be desirable either. So it probably would make sense for packager implementers and/or operators to make a call on whether they want to supply default "empty" subtitle documents or let the client device get a 404 when fetching the non-existent subtitles. And that in turn might depend on the player's behaviour on getting those 404s.

Empty subtitle segment TTML format

In terms of the precise format of an empty TTML / IMSC / EBU-TT-D document, this is something where the different profiles of TTML differ slightly in what is permitted, and the encoders I am aware of also differ. EBU-TT-D is the only profile that requires that the body element contains at least one div element and the div element contains at least one p element, which has the consequence that the only conformant EBU-TT-D document with no subtitles is one without a body element present at all. Then, additionally, as Cyril points out, EBU Tech3381 also specifies a specific empty document that shall be accepted, which also omits the head element. I believe that would be IMSC conformant too (but I haven't checked recently). It is:

<tt xml:lang="" xmlns="http://www.w3.org/ns/ttml"/>

Note that unlike @mikedo 's suggestion in https://github.com/MPEGGroup/CMAF/issues/9#issuecomment-649592099 this excludes the XML header, which is not formally required, because it is optional in XML 1.0, which is the basis used for encoding all current versions of TTML, EBU-TT-D and IMSC. (thank you to @tairt for pointing this out to me some time ago!)

Implementation experience

One encoder supplier whose EBU-TT-D output I have had the opportunity to review in detail currently creates empty subtitle documents that are not actually conformant EBU-TT-D: they contain an empty div element instead of omitting the body. This is IMSC Text Profile conformant though.

When faced with the fact that this is not conformant, naturally, the supplier wanted to know the real world impact on players, and naturally I was unable to provide an all-encompassing answer; it may well be that many players would simply continue without any user impact at all.
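
To make the two "empty" shapes discussed above concrete, here is a rough Python sketch that tells them apart. The function name and the returned labels are hypothetical, and this only inspects structure; it is not a conformance checker for EBU-TT-D or IMSC.

```python
import xml.etree.ElementTree as ET

TTML = "{http://www.w3.org/ns/ttml}"

def empty_document_kind(document: bytes) -> str:
    """Classify a TTML document by how (or whether) it is empty.

    'no-body'        : body omitted, the EBU-TT-D-style empty document
    'body-without-p' : body present but no p element, e.g. an empty div
    'has-p'          : at least one p element is present
    """
    root = ET.fromstring(document)
    body = root.find(f"{TTML}body")
    if body is None:
        return "no-body"
    if next(body.iter(f"{TTML}p"), None) is None:
        return "body-without-p"
    return "has-p"

print(empty_document_kind(b'<tt xml:lang="" xmlns="http://www.w3.org/ns/ttml"/>'))
# no-body
print(empty_document_kind(
    b'<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml"><body><div/></body></tt>'
))
# body-without-p
```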

Always omit the `body`?

Therefore I raised with EBU the possibility that a future version of EBU-TT-D relaxes this constraint, but I would be very interested to know if, from a CMAF perspective, it would be preferable instead to push in the other direction, by proposing that all empty CMAF TTML subtitle segments omit the body element.

cconcolato commented 4 years ago

@RufaelDev and I had an offline discussion. Our summary is: The discussion reveals 2 separate aspects relevant for CMAF:

In terms of structural constraints, is the flag "duration-is-empty" supported in CMAF in general? It is defined in ISOBMFF and not restricted in CMAF it seems, but what is its level of support in the wild, in players.
In terms of Timed Text, should CMAF recommend a practice to handle periods when no subtitle content is produced, in particular at the end of the stream (padding)? Or should it leave it to applications?

The suggestion would be to put these questions into a Defect Report/Tuc and welcome contributions. Maybe liaise with other SDOs to get feedback.

mikedo commented 4 years ago

Maybe a survey of current practices would best be done by an industry forum rather than MPEG? MPEG could proactively document how best to do a sparse timed text track, for encoder and player vendors to strive toward sooner rather than later?

RufaelDev commented 4 years ago

@mikedo I agree. My suggestion was to include industry fora (CTA, DASH-IF) and SDOs (DVB, ATSC and maybe EBU). I think MPEG should indeed be proactive, at least to gain an understanding of how CMAF users would solve this today and, if possible, to document a best practice.

One other point: it is not only sparse subtitle tracks. We could also ask for feedback on padding audio/video tracks to achieve approximately the same length. We see people running into this problem of tracks that are not of the same length when trying to use CMAF, so I do think the issue is important for CMAF, as it is hampering adoption or making it more painful than necessary.

mikedo commented 4 years ago

Yes, seems like the solution should be general to any kind of track. Although unusual, it could also be used for black video and muted audio padding, even if the coded data is nominally present.

nigelmegitt commented 4 years ago

Apologies if I've missed this somewhere and it already exists, but it might be helpful to be able to publish/signal a 'null' segment in the same way as an init segment is signalled now.

init + seg + seg + seg + seg + [null] + seg + [null] + [null] + ...

etc.

Then whatever encoded version of a null segment is appropriate for the media type could be created once and referenced whenever it is needed. For TTML it would be that empty document, for other types it would be some other kind of resource.
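
A toy Python sketch of how a client might resolve such a sequence, purely to illustrate the idea. Nothing in this thread indicates that DASH or CMAF signal a null segment this way; the function, its parameters and the URLs are all hypothetical.

```python
from typing import List, Optional

def resolve_segments(init_url: str, segment_urls: List[Optional[str]], null_url: str) -> List[str]:
    """Replace each missing segment (None) with one shared 'null' resource."""
    return [init_url] + [url if url is not None else null_url for url in segment_urls]

print(resolve_segments("init.mp4", ["s1.m4s", None, "s3.m4s", None], "null.m4s"))
# ['init.mp4', 's1.m4s', 'null.m4s', 's3.m4s', 'null.m4s']
```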

Just thinking out loud. Forgive me if this is already covered.

dwsinger commented 4 years ago

Can't one say in the MPD "nothing happens for this duration, please move along"? Downloading a resource which then explicitly says "fooled you! there's nothing here!" seems silly.

nigelmegitt commented 4 years ago

Can't one say in the MPD "nothing happens for this duration, please move along"?

Perhaps you can, if the meaning of "nothing happens" is completely clear for the media type concerned. Unfortunately it is not. A scheme that defines "nothing" explicitly so it can be referenced later would help tidy that up.

In the case of subtitles, say, one presentation style I have seen shows a dark rectangular area where the text would be all the time when the subtitles are enabled, even if there is no text. That area is presumably defined in the subtitle documents. If no text is present for an entire segment, how would you signal to continue showing the dark area?

(disclaimer: BBC doesn't typically use this style)

dwsinger commented 4 years ago

Perhaps you can, if the meaning of "nothing happens" is completely clear for the media type concerned. Unfortunately it is not. A scheme that defines "nothing" explicitly so it can be referenced later would help tidy that up.

Agreed, it has to be defined or obvious for each media type. Sound: well, it's silence. Video: nothing paints (not even "we regret the loss of picture", as the BBC used to say when the studio failed). For captions it seems fairly obvious?

nigelmegitt commented 4 years ago

Video, nothing paints

Already I can think of at least 3 schemes that would mean "nothing paints" and I don't know which one is right!

Last known frame continues to be displayed statically
All pixels on the screen are filled with some "nothing" colour - I don't know what that should be. Mid grey? Black?
Display analogue noise-like signal

For captions it seems fairly obvious?

Does it? I think otherwise, as per https://github.com/MPEGGroup/CMAF/issues/9#issuecomment-654072901

dwsinger commented 4 years ago

I agree video is the hardest case. In the case of captions, I think it's "as if the captions were not there or not enabled", so no, you don't get the black rectangle.

For video, it no longer obscures what's below. If there is nothing below (it's not an overlay but the bottom document in the rendering stack), we're staring into the void, it's an application-specific fill (like "we regret the loss of picture").

RufaelDev commented 4 years ago

just a few points to consider for the live/low latency streaming cases:

updating the manifest is preferably avoided, hence segments are the same duration and a numbering based addressing scheme is used for the segments, making it easy to request a next segment by incrementing the number, or a time based scheme (using the time in the segment URI); in this case a player can find the next URI by the actual segment duration and extend the timeline and find the next segment URI without a manifest update
to fit with this streaming model at the client, the same model is often applied for subtitles, that is fixed duration segments etc. That is where the use of segments with VTTEmptyCue or TTML without body comes in: a segment that says that for this duration no subtitles are displayed. In such a case duration-is-empty would not allow finding the next time as the duration is missing, hence the timeline extension without mpd update would not work for example.
the case when no next segment is available (404 or not in manifest) that may result in stalling as a typical player would need all segments at the live edge to continue playing at the live edge, this is why this "silly" presence of such segments makes some sense for live streaming, as live edge needs to be continuously updated and all representations need to be available
gaps/missing segments in DASH/CMAF can exist but the typical behaviour is to skip all A/V representations, while this is not the intended behaviour for subtitles, in DASH one can signal discontinuity using segment timeline or by introducing a new period. But this is different from the "empty" or "do nothing" segment that exists for subtitles, that do not introduce a skip in the playback
In addition, with each segment becoming available it should be possible to detect the live edge of the presentation, and these segments should be available for all representations.
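
The two addressing schemes in the first point can be sketched as follows. `$Number$` and `$Time$` are the DASH segment template identifiers; the function names and template strings are hypothetical, and times are assumed to be in timescale units measured from the start of the period.

```python
def number_url(template: str, start_number: int, segment_duration: int, time: int) -> str:
    """Number-based addressing: fixed-duration segments, index derived from time."""
    number = start_number + time // segment_duration
    return template.replace("$Number$", str(number))

def next_time_url(template: str, segment_time: int, segment_duration: int) -> str:
    """Time-based addressing: the next URI follows from the actual duration
    of the segment just received, with no manifest update."""
    return template.replace("$Time$", str(segment_time + segment_duration))

print(number_url("sub_$Number$.m4s", 1, 2000, 6000))   # sub_4.m4s
print(next_time_url("sub_$Time$.m4s", 6000, 2000))     # sub_8000.m4s
```

In both cases the client needs a duration for every segment it receives, including the ones that carry no subtitles.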

These are some things to take into account for the case of live (low latency) streaming. But also in VoD, where you use a segment index, you will need to fill gaps or missing/empty segment content to make your index segment work on byte ranges and time ranges, so such segments are also applicable there.

Note that CMAF tracks padding can be done already and we are ok in practice, but there is a risk that people pad differently, so the question was if some explicit recommendation is needed. Also my thinking was that it would help adoption of the spec if this was a bit more clear as tracks of unequal length give problems. As for timed text/subtitle the problem occurs most frequently, that was the main case. If not done in CMAF itself this issue might be better discussed and processed in an industry forum. My intention with this issue was not to be introducing new client/player behaviour, but only to recommend a best practice with CMAF as is.

nigelmegitt commented 4 years ago

gaps/missing segments in DASH/CMAF can exist but the typical behaviour is to skip all A/V representations, while this is not the intended behaviour for subtitles

@RufaelDev are you sure it's not the intended behaviour? Just wondering if this is documented anywhere: it seems weird to have predefined levels of importance for different types of representation, rather than making it content or application specific.

dwsinger commented 4 years ago

In such a case duration-is-empty would not allow finding the next time as the duration is missing, hence the timeline extension without mpd update would not work for example.

I don't understand.

duration-is-empty: this indicates that the duration provided in either default-sample-duration, or by the default-sample-duration in the TrackExtendsBox, is empty, i.e. that there are no samples for this time interval.

So you get a duration. I am not sure honestly that this flag helps much; the two useful cases are that the MPD tells you not to bother to fetch (saves a fetch); or that the file fetched tells you exactly what to do (e.g. paint a caption region with no text, as Nigel suggests). Once you've fetched something, you may as well be clear.

I understand that if you're using algorithmic segment-URL generation, you always need a segment, and so the MPD telling you that there is nothing there is not possible, as you're not fetching new MPDs.
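
To illustrate "so you get a duration", a minimal Python sketch that reads the flag and the default sample duration from a tfhd box. It is a sketch under stated assumptions: the function name is hypothetical, the input is a complete tfhd box with a 32-bit size, and when default-sample-duration is absent the value would have to come from the TrackExtendsBox, which is not handled here.

```python
import struct

BASE_DATA_OFFSET_PRESENT = 0x000001
SAMPLE_DESCRIPTION_INDEX_PRESENT = 0x000002
DEFAULT_SAMPLE_DURATION_PRESENT = 0x000008
DURATION_IS_EMPTY = 0x010000

def parse_tfhd(box: bytes) -> dict:
    """Extract track_ID, the duration-is-empty flag and default_sample_duration."""
    _size, box_type, version_flags, track_id = struct.unpack_from(">I4sII", box, 0)
    if box_type != b"tfhd":
        raise ValueError("not a tfhd box")
    flags = version_flags & 0xFFFFFF
    offset = 16  # first optional field follows track_ID
    if flags & BASE_DATA_OFFSET_PRESENT:
        offset += 8  # 64-bit base_data_offset
    if flags & SAMPLE_DESCRIPTION_INDEX_PRESENT:
        offset += 4
    duration = None  # None means: fall back to the trex default
    if flags & DEFAULT_SAMPLE_DURATION_PRESENT:
        (duration,) = struct.unpack_from(">I", box, offset)
    return {
        "track_id": track_id,
        "duration_is_empty": bool(flags & DURATION_IS_EMPTY),
        "default_sample_duration": duration,
    }

example = struct.pack(">I4sIII", 20, b"tfhd", 0x010008, 7, 3000)
print(parse_tfhd(example))
# {'track_id': 7, 'duration_is_empty': True, 'default_sample_duration': 3000}
```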

RufaelDev commented 4 years ago

gaps/missing segments in DASH/CMAF can exist but the typical behaviour is to skip all A/V representations, while this is not the intended behaviour for subtitles

@RufaelDev are you sure it's not the intended behaviour? Just wondering if this is documented anywhere: it seems weird to have predefined levels of importance for different types of representation, rather than making it content or application specific.

An example is the end of sect. 6.6.8 of CMAF. By skipping I meant A/V/T (not only A/V; note this is also in the CMAF spec text), sorry for the misunderstanding. A player may skip all A/V/T for a part that has a discontinuity (e.g. in DASH a new period may be used).

My point is that for sparse subtitles all such behavior intended for gaps/discontinuities seems rather undesirable.

RufaelDev commented 4 years ago

In such a case duration-is-empty would not allow finding the next time as the duration is missing, hence the timeline extension without mpd update would not work for example.

I don't understand.

duration-is-empty: this indicates that the duration provided in either default-sample-duration, or by the default-sample-duration in the TrackExtendsBox, is empty, i.e. that there are no samples for this time interval.

So you get a duration. I am not sure honestly that this flag helps much; the two useful cases are that the MPD tells you not to bother to fetch (saves a fetch); or that the file fetched tells you exactly what to do (e.g. paint a caption region with no text, as Nigel suggests). Once you've fetched something, you may as well be clear. I understand that if you're using algorithmic segment-URL generation, you always need a segment, and so the MPD telling you that there is nothing there is not possible, as you're not fetching new MPDs.

Yes indeed and for this functionality one needs the fragment duration, not the (default) sample duration or zero (given there are no samples one would not know how to calculate the fragment duration).

In low latency the MPD is not always updated (e.g. numbering or time extension in DASH), so it cannot always tell what (not) to fetch, and yes, in the ideal world the segment would tell me exactly what to do :-). That is why I think a segment with VTTEmptyCue in samples or TTML without body in samples may be more helpful than a segment with a duration-is-empty flag: in the first case I know what to do with that information, namely render no subtitles for the duration of the fragment, while in the second I am still not sure.

Last regarding comment #9, there is not a well established way to say in the MPD "nothing happens for this duration" for a representation or adaptationset.

RufaelDev commented 4 years ago

m55342 http://wg11.sc29.org/doc_end_user/documents/132_OnLine/wg11/m55342-v3-m55342_v2.zip studies and highlights some of the text around gaps/continuity and handling that.

gap would result in skipping all representations
cmaf tracks are continuous (both segments and samples)
sparsity and missing segment handling could be made more explicit (perhaps this can be documented in the DuI/TuC)

cconcolato commented 3 years ago

This issue is related to MPEG internal issue http://mpegx.int-evry.fr/software/MPEG/Systems/ApplicationFormat/CMAF/-/issues/30

cconcolato commented 3 years ago

The group discussed this issue as part of the discussion on contribution m55778 and decided to close this issue.
