AOMediaCodec / av1-isobmff

Official specification of the AOM group for the carriage of AV1 in ISOBMFF
https://AOMediaCodec.github.io/av1-isobmff

Please explain how to compute initial_presentation_delay from the bitstream #101

Open ocrete opened 6 years ago

ocrete commented 6 years ago

The current definition of the initial_presentation_delay is "interesting" if you want to build a decoder, but it's completely non-trivial to figure out what value to put there when writing a muxer. When trying to produce it in GStreamer, I'm completely lost. I noticed that neither ffmpeg nor libaom has any code that implements it.

Can we please, either:

  - Update libaom to give that value explicitly
  - Include a procedure/algorithm to calculate this value from the sequence header

tomfinegan commented 6 years ago

The current definition of the initial_presentation_delay is "interesting" if you want to build a decoder, but it's completely non-trivial to figure out what value to put there when writing a muxer. When trying to produce it in GStreamer, I'm completely lost. I noticed that neither ffmpeg nor libaom has any code that implements it.

I agree that the initial_presentation_delay_minus_one section is fairly complex, but I think the example included makes the necessary steps pretty clear. Can you elaborate on where you're getting lost?

Update libaom to give that value explicitly

That could potentially be done, but this is not the appropriate issue tracker for such a feature request. Please file an issue in the AOM issue tracker: https://bugs.chromium.org/p/aomedia/issues/entry?template=Feature+Request.

Adding support to libaom for returning the value in terms of samples (aka Temporal Units) would not help in a remux. I don't think adding it to libaom really solves the problem for muxers.

Include a procedure/algorithm to calculate this value from the sequence header

It's not possible to calculate it given only a Sequence Header OBU. You need the initial_display_delay_minus_1 value to know the number of frames, and then the (de)muxer must calculate the value in samples by counting the number of frames in each Temporal Unit.

When initial_display_delay_minus_1 is not present in the Sequence Header OBU, the value in frames is assumed to be 10. A (de)muxer would then follow the same procedure.
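For illustration, counting the frames in one Temporal Unit can be sketched as below. This is a non-normative Python sketch: it assumes every OBU in the sample carries obu_has_size_field = 1, it counts OBU_FRAME and OBU_FRAME_HEADER OBUs, and all function names are made up for this example. A stricter count would also parse show_existing_frame (see the comment in the code).

```python
OBU_FRAME_HEADER = 3
OBU_FRAME = 6

def read_leb128(buf, pos):
    # leb128() as defined in the AV1 spec: 7 value bits per byte, LSB first
    value = 0
    for i in range(8):
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << (7 * i)
        if not (byte & 0x80):
            break
    return value, pos

def count_frames_in_sample(sample):
    """Count the frame OBUs in one ISOBMFF sample (one Temporal Unit)."""
    pos, frames = 0, 0
    while pos < len(sample):
        header = sample[pos]
        pos += 1
        obu_type = (header >> 3) & 0x0F
        if (header >> 2) & 1:          # obu_extension_flag
            pos += 1                   # skip the extension header byte
        if not ((header >> 1) & 1):    # obu_has_size_field
            raise ValueError("this sketch assumes obu_has_size_field == 1")
        size, pos = read_leb128(sample, pos)
        if obu_type in (OBU_FRAME, OBU_FRAME_HEADER):
            # Caveat: an OBU_FRAME_HEADER with show_existing_frame = 1
            # re-presents an already-decoded frame; a precise count would
            # parse the frame header and treat that case separately.
            frames += 1
        pos += size                    # skip the OBU payload
    return frames
```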

robUx4 commented 6 years ago

Technically the encoder should know this value if it fills the initial_display_delay_minus_1 field. It also knows how many frames it would pack in a single TU, which only happens in some cases with some time constraints (an encoder is probably never going to provide 2 S-frames 10 frames apart?).

If not provided by the encoder it will be very tricky to fill the value properly without parsing the whole stream, so it will only be known a posteriori.

Maybe in practice it will always be initial_display_delay_minus_1 + 1?

robUx4 commented 6 years ago

BTW, I don't see any mention of a default of 10 in the ISOBMFF document.

ocrete commented 6 years ago

If I understand the example in the spec correctly, this value is the same as initial_display_delay_minus_1 but counted in TUs instead of in frames?

Can we add something along the lines of this to the document:

To compute initial_presentation_delay_minus_one from a stream, one must:

  1. Iterate over the entire stream, count the number of frames in each TU, and take the maximum (let's name this max_frames_in_one_tu)
  2. Read the value of initial_display_delay_minus_1 from the sequence header (I assume there can't be more than one sequence header per track in ISOBMFF?)
  3. Apply the following formula: initial_presentation_delay_minus_one = (????) - 1;

Also, I wonder if it would make sense to extend the AV1 bitstream to add this in a metadata block, or to extend the Sequence Header somehow. This value seems useful even outside of MP4 files, so having it as MP4-specific information sounds like a workaround for a design gap in the bitstream.

ocrete commented 6 years ago

Maybe in practice it will always be initial_display_delay_minus_1 + 1?

If that is true, maybe we can just drop this initial_presentation_delay from the MP4 header entirely?

VFR-maniac commented 6 years ago

When initial_display_delay_minus_1 is not present in the Sequence Header OBU, the value in frames is assumed to be 10. A (de)muxer would then follow the same procedure.

The value 10 comes from BufferPoolMaxSize (=10)? But why 10 and not 9 instead? I think the decoder can't hold more than 10 frames, so the decoder can't hold more than 10 temporal units, so initial_presentation_delay_minus_one + 1 <= 10.

I'm also wondering how initial_presentation_delay_minus_one affects the presentation delay on the presentation timeline. The AV1-in-ISOBMFF spec specifies no ctts box. This means that, in general, Decoding Time == Composition Time: the decoder takes a sample at the Decoding Time and then outputs a decoded frame with the Composition Time (== Decoding Time) unless there is compositionToDTSShift > 0, so there is no delay in units of ISOBMFF samples (which here correspond to AV1 samples and Temporal Units). But initial_presentation_delay_minus_one says there may be a delay in units of ISOBMFF samples. It's really confusing. I'm reading the AV1 spec but I don't get why there is a decoder delay in units of timestamped access units; it's just like the packed bitstream that was a popular hack used for B-VOPs-in-AVI with VfW, where the decoder outputs decoded frames without delay. If there are no composition time offsets, it is strange that initial_presentation_delay_minus_one is present.

cconcolato commented 6 years ago

FYI, I just filed a feature request to aomenc, see https://bugs.chromium.org/p/aomedia/issues/detail?id=2150

tomfinegan commented 6 years ago

Technically the encoder should know this value if it fills the initial_display_delay_minus_1 field. It also knows how many frames it would pack in a single TU, which only happens in some cases with some time constraints (an encoder is probably never going to provide 2 S-frames 10 frames apart?).

It doesn't matter if an encoder knows it or not when re-muxing AV1.

If not provided by the encoder it will be very tricky to fill the value properly without parsing the whole stream, so it will only be known a posteriori.

Why do you think you need the entire bitstream? You need enough TUs (samples) to count 10 frames.

Maybe in practice it will always be initial_display_delay_minus_1 + 1?

Where is this calculation coming from?

BTW, I don't see any mention of a default of 10 in the ISOBMFF document.

It's from the AV1 spec[1]. It's the value of BufferPoolMaxSize.

edit: forgot this link: [1] https://aomediacodec.github.io/av1-spec/av1-spec.pdf#page=661

tomfinegan commented 6 years ago

The value 10 comes from BufferPoolMaxSize (=10)? But why 10 and not 9 instead? I think the decoder can't hold more than 10 frames, so the decoder can't hold more than 10 temporal units, so initial_presentation_delay_minus_one + 1 <= 10.

I was including the +1, but I should have been more clear. Sorry.

cconcolato commented 6 years ago

If not provided by the encoder it will be very tricky to fill the value properly without parsing the whole stream, so it will only be known a posteriori.

I agree. It should be possible to determine the value a posteriori. I had started doing that, see https://github.com/cconcolato/av1_decoder_model. You could input an initial_display_delay and check if the bitstream was valid according to the decoder model. It has not been updated in a while, but if anyone is interested, feel free to suggest updates.
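To make "run the decoder model" slightly more concrete, here is a deliberately simplified, non-normative sketch (nothing like the full decoder model in the AV1 spec): it assumes a fixed presentation interval, a decoder that spends exactly one interval per decoded frame (i.e., it just meets the level's speed limit), and searches for the smallest start-up delay in samples that keeps presentation smooth. The function name and the max_delay_samples default are illustrative.

```python
def smallest_smooth_delay(frames_per_tu, max_delay_samples=10):
    """Toy decoder-model check: the smallest number of samples (TUs) to decode
    before presentation starts, assuming each decoded frame costs exactly one
    presentation interval (a decoder at the level's speed limit)."""
    # Decode-finish time of each TU, in units of one presentation interval.
    finish, total = [], 0
    for n in frames_per_tu:
        total += n
        finish.append(total)
    for delay in range(1, max_delay_samples + 1):
        start = finish[min(delay, len(finish)) - 1]
        # TU i is presented at start + i; smooth iff it is decoded by then.
        if all(finish[i] <= start + i for i in range(len(finish))):
            return delay  # initial_presentation_delay_minus_one = delay - 1
    return max_delay_samples
```

A real check would have to follow the normative decoder model in the AV1 spec, which accounts for much more than decode speed.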

Maybe in practice it will always be initial_display_delay_minus_1 + 1?

It depends on the number of alt-ref images but maybe.

To compute initial_presentation_delay_minus_one from a stream, one must:

I don't think this algorithm works. Again, you have to run the decoder model and see what minimum initial_display_delay validates the model.

cconcolato commented 6 years ago

I'm also wondering how initial_presentation_delay_minus_one affects the presentation delay on the presentation timeline. The AV1-in-ISOBMFF spec specifies no ctts box. This means that, in general, Decoding Time == Composition Time: the decoder takes a sample at the Decoding Time and then outputs a decoded frame with the Composition Time (== Decoding Time) unless there is compositionToDTSShift > 0, so there is no delay in units of ISOBMFF samples (which here correspond to AV1 samples and Temporal Units).

As you mentioned, composition offsets are not used so you cannot use compositionToDTSShift > 0.

But initial_presentation_delay_minus_one says there may be a delay in units of ISOBMFF samples. It's really confusing.

Sorry for that. There is no composition offset because, if you feed TUs to a decoder, it will produce as many output frames as input TUs, with the same presentation order as the decoding order. If you assume instantaneous decoding (as usual in ISOBMFF), this means CTS = DTS.

The initial_presentation_delay concept is introduced to cope with problems happening in real implementations. When your decoder operates at the decoding speed limit of a level, if you don't wait to fill some reference buffers before starting to display, you may experience smoothness issues. The delay tells you how long your player should wait. If no information is provided, an AV1 decoder should wait for 10 frames to be decoded, but for some bitstreams you may need less than that.

I'm reading the AV1 spec but I don't get why there is a decoder delay in units of timestamped access units,

I'm not sure what you mean by "in units of timestamped access unit". There is a delay, mostly because of 'show_frame = 0'. The delay at the elementary stream level is expressed in number of decoded frames. At the ISOBMFF level, it is expressed in number of decoded samples, because a player may not have access to the internals of a decoder to know how many frames were decoded when a TU is passed.

If there are no composition time offsets, it is strange that initial_presentation_delay_minus_one is present.

Hope I clarified.

ocrete commented 6 years ago

I don't think this algorithm works. Again, you have to run the decoder model and see what minimum initial_display_delay validates the model.

Can you please give me some pseudo-code or algorithms to compute it? No theoretical decoders that can't fail please, just a real algorithm that a stupid programmer like myself can implement.

The part I don't understand is why counting 10 frames is enough? Does this delay only apply to the first 10 frames? What if there is a bigger grouping later, is that forbidden by the AV1 spec?

cconcolato commented 6 years ago

Can you please give me some pseudo-code or algorithms to compute it? No theoretical decoders that can't fail please, just a real algorithm that a stupid programmer like myself can implement.

Unfortunately, that has not been done yet ...

The part I don't understand is why counting 10 frames is enough? Does this delay only apply to the first 10 frames? What if there is a bigger grouping later, is that forbidden by the AV1 spec?

That's the upper bound according to the AV1 spec. If you decode 10 frames before presenting the first one, you are guaranteed to be able to present the bitstream smoothly (if the decoder operates at the decoding speed (or faster) given by the level definition).

agrange commented 6 years ago

Maybe a little background on the need for initial_display_delay would help?

The problem arises due to the concept of hidden frames, being defined as frames with show_frame = 0 in the frame header.

Think about a (ridiculous!) worst case - unlikely to be useful in practice:

An encoder produces a first temporal unit (TU0) containing a first keyframe. It then produces a second temporal unit (TU1) consisting of a large number of frames, say 101, the first 100 of which are hidden frames (show_frame = 0), the last one being showable (show_frame = 1). The decoder first decodes TU0 to produce the keyframe, then it decodes TU1 to produce 101 frames, of which only the last is showable. Now, if the keyframe is displayed as soon as it is decoded, it is likely that the 2nd displayable frame will not have been decoded in time for display, because the decoder has to decode 100 additional frames between the two displayable frames. Thus, playback will not be smooth. Of course, if your decoder runs 100 times faster than required to satisfy the AV1 level-defined sample throughput criteria then it may be able to keep up, but all we can assume in general is that the decoder just meets the minimum performance criteria specified by the signaled level.

In practice hidden frames are used more conservatively, to implement a pyramid coding structure for example, and the resulting GOP structure might need a maximum of 4-5 hidden frames in a single TU. In these cases we can compute a minimum time period that the display of the first frame should be delayed to ensure smooth playback for the entire stream, initial_display_delay, which we express in terms of the number of frames that are required to be decoded before display of the first frame.

One might think that this delay could be at most 9 frames, being the 8 reference buffer slots, plus a buffer to hold the frame currently being decoded. However, we routinely use GOPs that require a 10 frame delay. As seen in the ridiculous example above, the delay can "theoretically" extend beyond 10, but this was deemed to be a sensible compromise.

Hope this helps.

Regards, Adrian
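To put numbers on agrange's worst case above (an illustrative calculation only, under the same simplified assumptions as the earlier sketch: a fixed frame interval, and a decoder that spends exactly one interval per decoded frame):

```python
d = 1 / 30                 # frame interval at 30 fps; also the per-frame decode time
finish = [1 * d, 102 * d]  # decode-finish times: TU0 = 1 frame, TU1 = 101 more frames
# Present TU i at start + i*d; playback is smooth iff each TU is decoded by its slot.
smooth = lambda start: all(finish[i] <= start + i * d for i in range(len(finish)))
print(smooth(finish[0]))   # False: starting right after TU0, TU1 misses its slot badly
print(smooth(finish[1]))   # True: waiting for 2 samples (102 decoded frames) suffices
```

Under these assumptions the delay is 102 frames but only 2 samples (initial_presentation_delay_minus_one = 1), which illustrates why the ISOBMFF field is counted in samples rather than frames.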


ocrete commented 6 years ago

That all makes sense. The part that I don't get is why looking at the first 10 frames of a stream is enough. Isn't it possible to have 10 frames with show_frame=1, then the next 8 with show_frame=0, or to have 1000 shown frames before the first non-shown one? So to compute the value of the initial delay (i.e., the size of the dpb), we'd need to parse every frame in the stream, since the value in the sequence header is not useful for muxing. But Tom says that parsing the first 10 frames is enough?

cconcolato commented 6 years ago

A muxer has to either trust the encoder to give the value or run the analysis on the entire stream. 10 is an upper bound of the value you will find.

tomfinegan commented 6 years ago

But Tom says that parsing the first 10 frames is enough?

I was referring to Temporal Units or samples, not frames, since samples can contain multiple frames.

What I said was that to calculate the value a (de)muxer could count the frames in the samples that begin the stream, since 10 is an upper bound:

A (de)muxer can calculate a value in samples by counting frames in Temporal Units. When the value is not present in the Sequence Header OBU a (de)muxer would use the value 10.

Whether it is present in the Sequence Header OBU or not, a muxer would count the frames in Temporal Units until it processes the TU where frames == initial_display_delay_minus_1 + 1, and then set initial_presentation_delay_minus_one = number of TUs - 1.

edit: added '+ 1' to the first bullet point
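For illustration, that procedure could look like the following non-normative sketch (the function name is made up; frames_per_tu would come from a per-sample frame count such as the OBU-walking sketch earlier):

```python
def presentation_delay_minus_one(frames_per_tu, delay_in_frames=10):
    """Convert a delay in decoded frames (initial_display_delay_minus_1 + 1,
    or the assumed 10 when absent) into initial_presentation_delay_minus_one,
    i.e. a delay counted in ISOBMFF samples."""
    decoded = 0
    for tu_index, n_frames in enumerate(frames_per_tu):
        decoded += n_frames
        if decoded >= delay_in_frames:
            return tu_index            # number of TUs processed, minus one
    return len(frames_per_tu) - 1      # stream ends before the delay is reached
```

Per cconcolato's earlier caveat, this conversion only mirrors the frame-based value; the minimal valid value would still have to be confirmed against the decoder model.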

robUx4 commented 6 years ago

A muxer has to either trust the encoder to give the value or run the analysis on the entire stream. 10 is an upper bound of the value you will find.

As @agrange noted, it could theoretically extend beyond 10.

I still think this information can be provided by the encoder. A posteriori in all cases, and a priori if it knows exactly how it manages non-showable frames.

It's true that remuxing part of a file may change the value needed to have smooth playback, if fewer are needed for that part (it can never be more). Does it make the remuxed file invalid if it claims a value for initial_presentation_delay_minus_one but it's actually less (and the goal of that value is to find out when it's less than 10)? Or should we either void the value/presence on remux or reparse all OBUs to get the proper value?

agrange commented 6 years ago

Olivier:

Determining the value of initial_display_delay may need to be based on the analysis of more than the first 10 frames; see my example. In practice, the encoder is responsible for putting the correct number in the sequence header, and would either know it or calculate it based on the GOP structure it uses. (Note: the spec uses 4 bits to signal initial_display_delay_minus_1, so we can signal values up to 16.)

If a middlebox remuxes the stream - to extract different operating points for example - then the encoder needs to ensure that the value specified is the worst case for all the operating points.

Steve:

initial_display_delay is only a (strong) recommendation to the application that if it starts displaying frames too early then it may encounter problems later. Values that are bigger than the optimal value just mean that the application may be over-cautious and introduce an unnecessary delay in startup time. Specifying a value that is too small would run the risk of disrupted playback. The stream would still be valid either way.

Adrian


cconcolato commented 6 years ago

As @agrange noted, it could theoretically extend beyond 10.

I don't think that's what @agrange said. AV1 limits the number of buffers to 10, so the 11th frame could not be decoded. @agrange can you clarify?

I still think this information can be provided by the encoder.

I agree.

A posteriori in all cases, and a priori if it knows exactly how it manages non-showable frames.

Yes, but a posteriori it could also be done by another tool (a muxer or something else), but this requires running the decoder model.

Does it make the remuxed file invalid if it claims a value for initial_presentation_delay_minus_one but it's actually less (and the goal of that value is to find out when it's less than 10)? Or should we either void the value/presence on remux or reparse all OBUs to get the proper value?

No. However, you have to be careful that if you splice two streams that have different values, you either have to use the larger one or not give the value.

agrange commented 6 years ago

As @agrange noted, it could theoretically extend beyond 10.

I don't think that's what @agrange said. AV1 limits the number of buffers to 10, so the 11th frame could not be decoded. @agrange can you clarify?

I interpreted Steve's comment to mean that, as per the example I provided, we can conceive of a GOP structure that requires more than 10 frame buffers. Whilst AV1 restricts signaling of the delay to 4 bits, so 16 frames, a decoder that only provides the minimal 10 frame buffers that AV1 mandates would not be able to decode the 11th frame until one of the 10 available frame buffers becomes free, presumably as the result of a display event.

A decoder / application may choose to provide a larger number of frame buffers, 100 say, which would allow it to decode way beyond the 10th frame before displaying any frame. But a compliant bitstream cannot rely on that over-provisioning. And we are still able to signal a maximum delay of 16 frames.

VFR-maniac commented 6 years ago

@cconcolato I still don't get it from your explanation.

First, let's clarify the definition of the composition time in AV1-in-ISOBMFF. The absence of the Composition Time to Sample Box does not mean the absence of the definition and/or the concept of composition time as applied to AV1-in-ISOBMFF. From your explanation, I can see there is no concept of composition time.

Personally, I really dislike this indication of the delay outside the common time structure in the scope of ISOBMFF. Why not just add a Composition Time to Sample Box consisting of only one entry which indicates the presentation delay time as the sample_offset, instead?

I'm also wondering how to treat this when an edit list is applied. media_time=0 specifies that the presentation of the AV1 track starts from time=0 on the media timeline, but is the presentation delayed until the time of the (initial_presentation_delay_minus_one+1)-th AV1 sample?

VFR-maniac commented 6 years ago

I'm not sure what you mean by "in units of timestamped access unit". There is a delay, mostly because of 'show_frame = 0'. The delay at the elementary stream level is expressed in number of decoded frames. At the ISOBMFF level, it is expressed in number of decoded samples, because a player may not have access to the internals of a decoder to know how many frames were decoded when a TU is passed.

I don't get this part at all. I can understand there is a delay at the frame level, but I can't understand there is a delay at the TU level. As far as I understand, a TU is a gathering of frames delimited by timestamps, which can be assigned to an output frame. This means the decoder takes a TU with timestamp T, then the decoder can output a shown frame with T without waiting for the next TU. The AV1 spec says "Each temporal unit must have exactly one shown frame." So, I strongly think the decoder takes a TU and then can output a frame smoothly. Where am I wrong? Or, to output the first frame after TU0, could that frame depend on TU1 or a later TU?

cconcolato commented 6 years ago

a TU is a gathering of frames delimited by timestamps

Almost. I would say delimited by a temporal delimiter in the input bitstream, but they are associated with the same timestamp.

which can be assigned to an output frame

Only one of the frames in the TU will produce an output frame.

This means the decoder takes a TU with timestamp T, then the decoder can output a shown frame with T without waiting for the next TU.

Yes.

I strongly think the decoder takes a TU and then can output a frame smoothly.

If you take out the word "smoothly", yes. Given a TU, a decoder can always output a frame. The problem is that a decoder cannot always decode the TU in the time during which the previous frame has to be presented (assuming fixed frame rate for simplification here). Because the TU may contain multiple frames.

Or, to output the first frame after TU0, could that frame depend on TU1 or a later TU?

No. A TU has no dependency on future TUs.

agrange commented 6 years ago

All that we're saying here is:


cconcolato commented 6 years ago

First, let's clarify the definition of the composition time in AV1-in-ISOBMFF. The absence of the Composition Time to Sample Box does not mean the absence of the definition and/or the concept of composition time as applied to AV1-in-ISOBMFF. From your explanation, I can see there is no concept of composition time.

I'm not sure what the question is here.

Personally, I really dislike this indication of the delay outside the common time structure in the scope of ISOBMFF. Why not just add a Composition Time to Sample Box consisting of only one entry which indicates the presentation delay time as the sample_offset, instead?

We could have used the ctts box (although not with a single entry) but it introduces lots of complexity (requires an edit list for AV-sync or negative CTS offsets ...). I strongly believe the chosen approach is simpler: players can ignore initial_presentation_delay and muxers are only required to put a value if it is correct, otherwise they can omit the value. initial_presentation_delay is only an indication/hint for players if they want to reduce the playback latency.

I'm also wondering how to treat this when an edit list is applied. media_time=0 specifies that the presentation of the AV1 track starts from time=0 on the media timeline, but is the presentation delayed until the time of the (initial_presentation_delay_minus_one+1)-th AV1 sample?

The initial_presentation_delay does not affect composition or decode times. So there is no impact. Edit lists are applied as usual.

robUx4 commented 6 years ago

I interpreted Steve's comment to mean that, as per the example I provided, we can conceive of a GOP structure that requires more than 10 frame buffers. Whilst AV1 restricts signaling of the delay to 4 bits, so 16 frames, a decoder that only provides the minimal 10 frame buffers that AV1 mandates would not be able to decode the 11th frame until one of the 10 available frame buffers becomes free, presumably as the result of a display event. A decoder / application may choose to provide a larger number of frame buffers, 100 say, which would allow it to decode way beyond the 10th frame before displaying any frame. But a compliant bitstream cannot rely on that over-provisioning. And we are still able to signal a maximum delay of 16 frames.

I think I understand the nuance now. A compliant decoder should only cache 10 frames at most. So even if the TU contains 100 frames the decoder will still only have at most 10 frames in its cache. So it can never be more than 10 (minus/plus 1 depending on how you count).

agrange commented 6 years ago

A minimally compliant decoder that just about achieves all the minimum requirements of the level that it advertises is guaranteed to be able to decode a valid stream if it caches 10 frames. If it caches fewer frames then it is not. A decoder may cache more than 10 frames - if it is capable of decoding faster than required and wants to run ahead of schedule, for example - but it doesn't have to.


jeeb commented 6 years ago

It is really unfortunate that this discussion spawned around/after the v1 of the specification got "frozen", and this is probably partially because people only start implementing something when they know that it will not wildly change any more. But this is what it is.

This whole value seems like something that should be a header flag in the AV1 bit stream, à la max_decoder_latency/max_decoder_buffer_required, rather than something that should be in the container... You can always replicate the field in the container if you really want to (the HEVC-in-ISOBMFF specification writers seemed to think so), but if the flag appears for the first time at the container level and there's nothing at the bit stream level to read it from, then it becomes a parsing nightmare if you need to know the maximum reorder delay throughout the full stream to fill that value.

Also, one would think that things like this could be handled on the SW side of the hwdec implementation, where it would just return "feed me more" until it can actually return the following coded image in PTS/CTS order. I think this is how hwdec generally works for AVC/HEVC? Given that the required header/initialization values are available to the actual decoder/parser that feeds the hwdec implementation, of course.

... muxers are only required to put a value if it is correct, otherwise they can omit the value. initial_presentation_delay is only an indication/hint for players if they want to reduce the playback latency.

So do I understand it correctly that writing this value at all is 100% voluntary and that there is a boolean somewhere to mention if you could come up with a value for this field or not? Or do you mean writing the default (10 buffered frames) as "omit"?

cconcolato commented 6 years ago

So do I understand it correctly that writing this value at all is 100% voluntary and that there is a boolean somewhere to mention if you could come up with a value for this field or not?

Yes

VFR-maniac commented 6 years ago

If you take out the word "smoothly", yes. Given a TU, a decoder can always output a frame. The problem is that a decoder cannot always decode the TU in the time during which the previous frame has to be presented (assuming fixed frame rate for simplification here). Because the TU may contain multiple frames.

That is just a composition time offset at the TU where the decoder requires more TUs, isn't it?

We could have used the ctts box (although not with a single entry) but it introduces lots of complexity (requires an edit list for AV-sync or negative CTS offsets ...). I strongly believe the chosen approach is simpler: players can ignore initial_presentation_delay and muxers are only required to put a value if it is correct, otherwise they can omit the value. initial_presentation_delay is only an indication/hint for players if they want to reduce the playback latency.

I don't think this approach makes the issue simpler. If you really want to avoid negative CTS offsets, it is enough for the spec to forbid them. Also, I don't think the edit list is a complex thing. You say it is a hint for players, but if players ignore initial_presentation_delay even when it is present, there is possibly jerkiness, isn't there? So I think it is not an ignorable thing, and it is a hint for almost all players. This is similar to a demuxer or player that doesn't know about the edit list and may introduce A/V-desync. The initial_presentation_delay only makes sense for AV1. To support AV1 in ISOBMFF, the muxer and demuxer need to support initial_presentation_delay in addition to the decoder initialization record as the minimum implementation. I believe that container file formats should hide CODEC-specific properties as much as possible so that any encapsulated CODEC can be treated in a common way. If initial_presentation_delay were defined in the ISOBMFF spec itself, I wouldn't be saying such unpleasant words. :(

robUx4 commented 6 years ago

Also one would think that things like this could be handled on the SW side of the hwdec implementation, where it would just return "feed me more" until it can actually return the following coded image in PTS/CTS order. As I think this is how hwdec generally works for AVC/HEVC? Given that required header/initialization values are available to the actual decoder/parser that feeds to the hwdec implementation, of course.

The issue here is that a TU (Sample in ISOBMFF / Block in Matroska) may contain more than one frame to decode, invisibly to the container. And it's not necessarily at the beginning of the stream. The Sequence Header OBU may have the information, but in frames, not TUs. And it doesn't say how many frames can be packed in a TU during the whole Sequence it describes. So in any case, we cannot have this information from the decoder.

cconcolato commented 2 years ago

For context, a related issue in dav1d: https://code.videolan.org/videolan/dav1d/-/issues/406

tdaede commented 2 years ago

Is it always safe to copy initial_display_delay from the sequence header to initial_presentation_delay in ISOBMFF? Looking at the definitions, although one is counted per frame and one per sample, I cannot think of a case where copying the value violates the ISOBMFF wording, because every sample is guaranteed to output a frame.

(If so, at least for some encoders, producing initial_display_delay values is trivial and would make for an easy conformance bitstream)
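This observation can be sanity-checked mechanically: every sample contains at least one frame, so a delay counted in samples can never exceed the same delay counted in frames, and copying the frame-based value over-waits at worst (which, per agrange above, keeps the stream valid). A self-contained illustration with a made-up stream shape:

```python
def delay_in_samples(frames_per_tu, delay_in_frames):
    decoded = 0
    for i, n in enumerate(frames_per_tu):
        decoded += n
        if decoded >= delay_in_frames:
            return i + 1               # samples needed to cover the frame delay
    return len(frames_per_tu)

# Hypothetical stream: some TUs carry hidden frames besides the shown one.
frames_per_tu = [1, 3, 1, 2, 1, 1, 1, 1]
assert delay_in_samples(frames_per_tu, 5) <= 5  # holds since every TU has >= 1 frame
print(delay_in_samples(frames_per_tu, 5))       # 3: the 5th decoded frame arrives in sample 3
```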

cconcolato commented 1 year ago

We should revisit this issue once we have conformance streams exercising the feature.

cconcolato commented 10 months ago

We intend to close this issue when conformance files are provided (#180)