Decoder initialization feedback

cconcolato commented 2 years ago

It has been assumed in the past that whatever is needed for decoder initialization (including opaque sequences of bytes) should be in the sample entry. Doing so should be considered carefully, as it leads to the problematic dichotomy of live vs. on-demand (avc1 vs. avc3, hvc1 vs. hev1, …). In live cases, the opaque sequences of bytes for the entire session are not necessarily all known upfront, and creating a new sample entry on the fly is not possible in ISOBMFF. Usually the concern with decoder initialization is initialization latency, but that latency is often due to memory allocation, which for video, for example, can be done knowing only width, height, and depth.

It is not always clear what is actually defined as "decoder initialization". Is it simply instantiating the decoder and allocating buffers? In that case, isn't the information from the base sample entry (e.g. width, height) and some simple codec-specific information (e.g. profile, level) sufficient? Or is it codec-specific initialization (e.g. VVC-specific tools)? Also, what is the gain of doing the latter "deep" initialization, compared to the former "shallow" initialization? Is it worth the complexity of the inband/out-of-band dichotomy?
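To make the "shallow" case concrete, here is a minimal sketch of sizing decoder frame buffers from sample-entry-level fields alone. The names `ShallowConfig`, `frame_buffer_bytes`, and `pool_bytes` are hypothetical, and the sketch assumes a planar layout with no alignment padding, a 4:2:0 default, and a bit depth obtained from simple codec-specific information. Real decoders add padding and extra reference frames, so treat the numbers as a lower bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShallowConfig:
    """Fields assumed to be readable without parsing any parameter set."""
    width: int                     # from the visual sample entry
    height: int                    # from the visual sample entry
    bit_depth: int                 # assumed to come from simple codec-specific info
    chroma_format: str = "4:2:0"   # assumption: not carried by the base sample entry


# Total chroma samples per frame, expressed in halves of the luma sample count.
_CHROMA_HALVES = {"4:0:0": 0, "4:2:0": 1, "4:2:2": 2, "4:4:4": 4}


def frame_buffer_bytes(cfg: ShallowConfig) -> int:
    """Estimate the size of one decoded frame, ignoring alignment padding."""
    bytes_per_sample = (cfg.bit_depth + 7) // 8   # 8-bit -> 1 byte, 10-bit -> 2 bytes
    luma = cfg.width * cfg.height
    chroma = luma * _CHROMA_HALVES[cfg.chroma_format] // 2
    return (luma + chroma) * bytes_per_sample


def pool_bytes(cfg: ShallowConfig, num_frames: int) -> int:
    """Estimate the size of a pool of num_frames decoded frames."""
    return frame_buffer_bytes(cfg) * num_frames


if __name__ == "__main__":
    cfg = ShallowConfig(width=1920, height=1080, bit_depth=10)
    print(frame_buffer_bytes(cfg), pool_bytes(cfg, num_frames=6))
```

Nothing here needs an SPS or a sequence header, which is the point of the question: if allocation dominates start-up latency, this level of information may already be enough.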

Some common APIs expect actual codec-specific data (e.g. parameter set NAL units) as input when initializing the decoder, e.g. the MediaCodec API CSD buffer or CreateFromH264ParameterSets. As a counterpoint, here is a pointer to cuvidReconfigureDecoder in the Nvidia codec SDK: they seem to be OK with reconfiguring on the fly, as long as you set the maximum values when you first initialize.
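As an illustration of what such APIs consume, the sketch below pulls the parameter sets out of an AVCDecoderConfigurationRecord (the payload of an `avcC` box) and prefixes each with an Annex B start code. The function names are hypothetical. The mapping of SPS to a first buffer and PPS to a second follows my understanding of the MediaCodec `csd-0` / `csd-1` convention for H.264, so check it against the platform documentation before relying on it.

```python
import struct
from typing import List, Tuple

START_CODE = b"\x00\x00\x00\x01"  # Annex B start code prefix


def _read_nal_units(buf: bytes, pos: int, count: int) -> Tuple[List[bytes], int]:
    """Read `count` NAL units, each preceded by a 16-bit big-endian length."""
    units = []
    for _ in range(count):
        if pos + 2 > len(buf):
            raise ValueError("truncated parameter set length")
        (length,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        if pos + length > len(buf):
            raise ValueError("truncated parameter set")
        units.append(buf[pos:pos + length])
        pos += length
    return units, pos


def split_avcc(record: bytes) -> Tuple[List[bytes], List[bytes]]:
    """Return (sps_list, pps_list) from an AVCDecoderConfigurationRecord."""
    if len(record) < 7 or record[0] != 1:
        raise ValueError("not a version-1 AVCDecoderConfigurationRecord")
    # Bytes 1..4: profile, compatibility, level, lengthSizeMinusOne (unused here).
    num_sps = record[5] & 0x1F          # low 5 bits; the top 3 bits are reserved
    sps_list, pos = _read_nal_units(record, 6, num_sps)
    if pos >= len(record):
        raise ValueError("missing PPS count")
    num_pps = record[pos]
    pps_list, _ = _read_nal_units(record, pos + 1, num_pps)
    return sps_list, pps_list


def to_csd_buffers(record: bytes) -> Tuple[bytes, bytes]:
    """Build start-code-prefixed SPS and PPS buffers (assumed csd-0 / csd-1)."""
    sps_list, pps_list = split_avcc(record)
    csd0 = b"".join(START_CODE + nal for nal in sps_list)
    csd1 = b"".join(START_CODE + nal for nal in pps_list)
    return csd0, csd1
```

In the avc3-style case, where parameter sets travel in-band, the record may carry no parameter sets at all, and both buffers come back empty. That is exactly the situation in which an API demanding this input up front becomes awkward.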

The file format group welcomes feedback on this issue.

sambushell commented 2 years ago

I would ask the group to distinguish between codec-specific and codec-agnostic data. By codec-specific data, I'm referring to the kinds of things in parameter sets that are not seen in other codecs: generally, parameter values that affect later parsing. Codec-agnostic data encompasses Video Usability Information like colour space triples, full range, sample aspect ratio, clean aperture, HDR colour space info, etc.
From my perspective, it is the codec-agnostic data that critically needs to be available prior to sample data acquisition or delivery.

cconcolato commented 2 years ago

Thank you @sambushell. From your perspective, does the codec-agnostic data need to be provided as an opaque sequence of bytes (e.g. SPS NALU, SH OBU, i.e. structures not decomposed at the container level) or can/should it be provided in different boxes (colr, clap, mdcv, ...)?

sambushell commented 2 years ago

I think codec-agnostic representation (such as pasp, colr, mdcv, etc. boxes) of codec-agnostic information is best. The parsers for those are already there for ISOBMFF and don't need to be updated for each new codec. From the BMFF-based perspective, it would be nice not to have codec-specific configuration box parsers as fallback, but as long as they're there, we will probably need to keep them. Meanwhile, there are some kinds of codec-specific data that still have high-level scope, and that have practical value when retrievable from sample entries -- such as profiles and levels, and other information whose scope covers the whole track. The configuration of layers in multi-layer tracks, for example. Or information about sublayers and scalability. Generally I guess I'm thinking of technologies that are part of a specific codec's innovations, but still relevant to the systems programming layers outside of the codec. If information needs to be consumed by a subsystem that isn't the video decoder (interpreted narrowly), it is not a good idea to have it in the bitstream only.
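A minimal sketch of the codec-agnostic path described above: the same small parser reads `pasp` and `colr` from a sample entry's child boxes regardless of which codec the track uses. The function names are hypothetical, and the sketch assumes the caller has already isolated the bytes of the child boxes that follow the fixed visual sample entry fields. Only the `nclx` colour type is decoded; other types are returned unparsed.

```python
import struct
from typing import Dict, Iterator, Tuple


def iter_boxes(buf: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (fourcc, payload) for each box laid out back to back in buf."""
    pos, end = 0, len(buf)
    while pos + 8 <= end:
        size, fourcc = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if size == 1:                       # 64-bit largesize follows the type
            if pos + 16 > end:
                raise ValueError("truncated largesize")
            (size,) = struct.unpack_from(">Q", buf, pos + 8)
            header = 16
        elif size == 0:                     # box runs to the end of the container
            size = end - pos
        if size < header or pos + size > end:
            raise ValueError("malformed box size")
        yield fourcc.decode("latin-1"), buf[pos + header:pos + size]
        pos += size


def parse_pasp(payload: bytes) -> Dict[str, int]:
    """Pixel aspect ratio: two 32-bit unsigned integers."""
    h_spacing, v_spacing = struct.unpack_from(">II", payload, 0)
    return {"h_spacing": h_spacing, "v_spacing": v_spacing}


def parse_colr(payload: bytes) -> Dict[str, object]:
    """Colour information; only the 'nclx' colour type is decoded here."""
    colour_type = payload[:4].decode("latin-1")
    if colour_type != "nclx" or len(payload) < 11:
        return {"colour_type": colour_type}
    primaries, transfer, matrix, flags = struct.unpack_from(">HHHB", payload, 4)
    return {
        "colour_type": colour_type,
        "colour_primaries": primaries,
        "transfer_characteristics": transfer,
        "matrix_coefficients": matrix,
        "full_range": bool(flags & 0x80),   # top bit; remaining 7 bits reserved
    }


def codec_agnostic_info(child_boxes: bytes) -> Dict[str, dict]:
    """Collect pasp/colr and skip every box this parser does not know."""
    parsers = {"pasp": parse_pasp, "colr": parse_colr}
    return {
        fourcc: parsers[fourcc](payload)
        for fourcc, payload in iter_boxes(child_boxes)
        if fourcc in parsers
    }
```

Codec-specific configuration boxes (`avcC`, `hvcC`, and so on) are simply skipped, so a new codec does not require touching this code.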

MPEGGroup / FileFormat

Decoder initialization feedback #58