Dash-Industry-Forum / Live

Collects issues about the Live document

Synchronization at Period start #3

Closed · haudiobe closed this issue 10 years ago

haudiobe commented 10 years ago

Kilroy,

Inline! We need a clear write-up on this, as well as a diagram that discusses the different cases. I'll try to answer via e-mail, but I do think we should move this into an editable document ASAP to make sure we have agreement on it.

- Period Start Time (PST) ==> AST + Period@start
- Earliest Presentation Time of a Segment (EPT). There exist two options:
  - decode time + composition offset + edit list
  - documented in the segment index
- Presentation Time in Period = EPT - @presentationTimeOffset
- Period presentation is not fully defined in the On-Demand case. In the dynamic case, the EPT of a segment is suggested to be presented at time PST + (EPT - @presentationTimeOffset) + SPD, where SPD is the suggested presentation delay.
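For concreteness, here is a minimal sketch of that arithmetic (the names are illustrative, not from the spec or any agreed API; all values in seconds):

```typescript
// PST = AST + Period@start
function periodStartTime(ast: number, periodStart: number): number {
  return ast + periodStart;
}

// Wall-clock time at which a segment's EPT is suggested to be presented:
// PST + (EPT - @presentationTimeOffset) + SPD
function suggestedPresentationTime(
  pst: number,
  ept: number,
  presentationTimeOffset: number,
  suggestedPresentationDelay: number
): number {
  return pst + (ept - presentationTimeOffset) + suggestedPresentationDelay;
}
```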

Based on this see my comments below

I agree with what you said, and would require implementations to use negative composition offsets to make the first sample composition time in a Segment equal to the first sample decode time, expressed in the track's timescale in the baseMediaDecodeTime field of 'tfdt'.

[TS] OK, good ;)!

Only one offset edit is allowed per track, and none in the video track. The movie presentation time is the baseMediaDecodeTime plus offset edit (if present).

[TS] What do you mean by "offset edit"? An edit list correction? And why should we presume that video is the master? It can also be the audio. The offset can be anywhere.

Using offset edits to adjust for video composition time offsets doesn't work for adaptive video and bitstream switching, because different tracks can have different DPB sizes and delays (e.g. a Representation sampled at 50% H and 50% V could contain 4x as many reference pictures and a correspondingly larger removal delay).

[TS] Agree and not agree. The first really relevant statements are:

- Synchronization within an Adaptation Set and across Adaptation Sets, as well as segment alignment, is all on presentation times.
- Presentation time means that this is either:
  - decode time + composition offset + edit list
  - earliest presentation time in the segment index

Based on this there are multiple options to resolve this:

- use different Initialization Segments with edit lists
- use different tfdt values for the same sample and do the correction with negative composition offsets (illustrated in the sketch after this list). For example, the IDR frame has tfdt 4 in one Representation, with negative offsets -4, -3, -2, -1 for the preceding pictures, and a tfdt of 2 in another Representation, with negative offsets -2 and -1. Without an edit list, an edit list of zero is assumed, so we are OK.
- add the sidx to express the earliest presentation time. Note that this can be supplementary to the second bullet point.
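A rough sketch of the second bullet, assuming one-tick sample durations and an illustrative sample layout (the exact offsets are not prescribed above); the point is only that Representations with different tfdt values and different negative composition offsets can share the same earliest presentation time:

```typescript
// EPT = min over samples of (decode time + composition offset), with decode
// times running tfdt, tfdt + 1, ... (one-tick sample durations assumed).
function earliestPresentationTime(tfdt: number, compOffsets: number[]): number {
  return Math.min(...compOffsets.map((off, i) => tfdt + i + off));
}

// Representation A: IDR at tfdt 4, reordered pictures with offsets -4..-1.
const eptA = earliestPresentationTime(4, [0, -4, -3, -2, -1]);
// Representation B: IDR at tfdt 2, reordered pictures with offsets -2, -1.
const eptB = earliestPresentationTime(2, [0, -2, -1]);
console.assert(eptA === eptB); // both Segments share the same EPT
```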

Unless each Segment had its own offset edit list (not possible) it won’t work for bitstream switching,

[TS] Bitstream switching is not well enough defined to say whether it works or not. We really need to write down what this means. I take it that by "switching" you mean feeding movie fragments from different Representations into the MSE buffer, correct?

and applying a different edit list for each Segment in the decoder, depending on where the player application got it, can be problematic.

[TS] I agree that the decoder may be confused if tfdts are not sequential, but this may still be the right way to go.

Inserting an Initialization Segment in the stream before each Representation change is required if SPS parameters aren't allowed to change. But if the SPS can change over time, for instance during live encoding, splicing, etc., then in the worst case an Initialization Segment has to be downloaded from the live encoder and inserted before each Segment. Even then, it can only change the presentation offset of the whole track under the current ISO media definition, so it can't handle changes in composition delay and SPS parameters made between movie fragments/Segments in the track.

[TS] See above. In my opinion this can be solved, and it is already solved.

Negative composition offsets in video movie fragments and ‘avc3’ sample description with SPS/PPS in the elementary stream handle all those cases with bitstream switching of dynamic subsampling and DPB size as well as adaptive resolution switching of Representations with static encoding parameters.

[TS] I am still not sure it does it at the beginning, but we should write down the details.

Thomas, what do you think happens when the first audio sample and video CVS are not aligned (as in your example), but no edit list exists because this is live encoding and there is no “file” that starts with aligned samples?

[TS] As said in the very beginning, this is straightforward. The presentationTimeOffset provides the mapping to the Period start and this provides the sync.

I think each Adaptation Set has its own @presentationTimeOffset in a timebase equal to the 'tkhd' timescale, e.g. 48,000 for audio and 90,000 for 29.97 Hz video. The presentation time offsets could indicate fractional Segments.

[TS] I agree, or to be more specific, it just addresses the sync. For example, it could be that you have a small gap at the beginning for one media component. You may even have a presentationTimeOffset that is larger than the earliest presentation time. In this case the play-out of the first samples may be omitted (or you do some other adjustments). We need to document all of this, I agree.
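A small sketch of that mapping, with illustrative numbers (a negative result corresponds to a sample lying before the Period start, whose play-out may be omitted or trimmed as noted above):

```typescript
// Period-relative presentation time in seconds for one Adaptation Set:
// (mediaTime - @presentationTimeOffset) / @timescale
function periodTimeSeconds(mediaTime: number, pto: number, timescale: number): number {
  return (mediaTime - pto) / timescale;
}

const audioStart = periodTimeSeconds(48960, 48000, 48000); // 0.02 s into the Period
const videoStart = periodTimeSeconds(93003, 90000, 90000); // one 29.97 Hz frame, ~0.0334 s
const tooEarly = periodTimeSeconds(47040, 48000, 48000);   // -0.02 s: before Period start
```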

In your example, the extra few milliseconds would happen before the Period@start, so presentation of the audio sync frame would start somewhere inside its duration of e.g. 20 ms. That isn't hard to do seamlessly.

[TS] Yes!

I think the video CVS should align with the start of the Period so that a player doesn't have to do faster-than-realtime decoding in a single video decoder (e.g. an ad Period ending in the middle of a program Segment); but that is only a recommendation, not a syntax limitation: @presentationTimeOffset is not restricted to incrementing in whole samples or Segments.

[TS] It may be reasonable to recommend.

The @presentationTimeOffset should be fed into MSE for playback sync and player timeline tracking, so those offsets should apply to decoding. An edit list might cause an error rather than fix one: the two are different ways of expressing the same relative A/V offset, so you don't want to apply it twice.

[TS] You wanna define an API? Let’s work on this based on the discussion above.
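As a starting point for that discussion, a minimal sketch of one way a player might feed the offset into MSE. SourceBuffer.timestampOffset is the standard MSE attribute, but the mapping below is an assumption, not an agreed API:

```typescript
// Shift the SourceBuffer timeline so that media time @presentationTimeOffset
// lands at the Period start; the edit list must then NOT be applied again
// for the same correction, or the offset is applied twice.
function applyPeriodOffset(
  sourceBuffer: SourceBuffer,
  periodStart: number, // Period start on the presentation timeline, seconds
  pto: number,         // @presentationTimeOffset in timescale units
  timescale: number    // @timescale of the Adaptation Set
): void {
  sourceBuffer.timestampOffset = periodStart - pto / timescale;
}
```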

Thomas