Closed: tdaede closed this issue 2 years ago.
Let me check my understanding of using SampleGroups to avoid excessive numbers of timed-metadata units. In the current proposal, the timed-metadata units are stored in front of each sample in `mdat`. If we use SampleGroups instead, then: during encapsulation, the contents of the timed metadata are divided into SampleGroups and stored inside `moov` and/or `moof` rather than in `mdat`; during parsing, the contents of the SampleGroups are merged to reconstruct the original timed metadata, which is placed in front of each relevant sample to form the IA bitstream passed to the decoders. Is that correct?
Yes, we could specify that the Sample Group data is reinserted into the elementary stream, for example when exporting back to elementary-stream syntax. Regarding the integration between the file parser and the decoder, ISOBMFF usually does not specify how that is done. One implementation could go back to the elementary stream; another could pass the Sample Group information to the decoder as side information.
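The reinsertion step described above can be sketched as follows. The function and parameter names are hypothetical; the sketch assumes each sample-group description entry carries the raw timed-metadata payload (e.g. a complete OBU) to be placed back in front of the samples mapped to it:

```python
def rebuild_elementary_stream(samples, group_payloads, sample_to_group):
    """Reinsert sample-group payloads in front of each sample.

    samples         -- list of raw sample bytes from 'mdat'
    group_payloads  -- dict: group_description_index -> metadata bytes
                       (e.g. a complete timed-metadata OBU, as stored in 'sgpd')
    sample_to_group -- per-sample group_description_index (as resolved from
                       'sbgp'/'csgp'), or None when no group applies
    """
    stream = bytearray()
    for sample, group_index in zip(samples, sample_to_group):
        if group_index is not None:
            # Place the timed metadata back in front of the sample,
            # recreating the original elementary-stream order.
            stream += group_payloads[group_index]
        stream += sample
    return bytes(stream)
```

A file-based exporter would resolve the per-sample indices from the boxes first; the loop itself is the whole idea.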
Then it seems to me that the purpose of using SampleGroups is to save storage. I will prepare a summary comparing the overhead of timed metadata against the overhead of using SampleGroups for timed metadata.
It is not only to save storage space. It is also a design question:
- how often is this data expected to change? If it does not change at almost every frame, why should it be in the elementary stream?
- If the audio decoder does not consume the metadata but a post-processor does, why should it be in the elementary stream?
- Do system tools (packagers, inspectors, demuxers, ...) need to access this data (e.g. to generate the 'codecs' parameter, or to determine encryption boundaries)? Having it in the sample payload (i.e. in `mdat`) is not optimal for them.

It would be good if we could have a clear understanding of what is allowed to change and when. For example, we could fill in the following table:

IAC Feature | Change possibly frame by frame | Change sometimes (but not frame by frame) | Change not foreseen at all in a track | Change requires decoder reinitialization | Change requires rendering reinitialization |
---|---|---|---|---|---|
codec | |||||
Sample rate | |||||
Ambisonics use | |||||
Ambisonics Order | |||||
Use of Ambisonics demixing | |||||
Ambisonics demixing matrix | |||||
Ambisonics channel mapping | |||||
Ambisonics coupling | |||||
Use of non-diegetic channels | |||||
Count of non-diegetic channels | |||||
Coupling of non-diegetic channels | |||||
Layout of non-diegetic channels (number of DCG, composition of DCG) | |||||
Count of non-diegetic channels in Base Channel Group | |||||
Coefficients for non-diegetic channels (Matrix Downmix Tree) | |||||
All of the features except the last one are non-timed metadata, so the proposal is based on the assumption that changes are not foreseen at all within a track. But the last one (coefficients for non-diegetic channels) changes sometimes, though not frame by frame. Based on the encoder guideline, it may change as often as every 18 frames (0.36 seconds) in the worst case. Please refer to the paper for details: https://www.aes.org/e-lib/browse.cfm?elib=21489
> It is not only to save storage space. It is also a design question:
> - how often is this data expected to change? If it does not change at almost every frame, why should it be in the elementary stream?
> - If the audio decoder does not consume the metadata but a post-processor does, why should it be in the elementary stream?
> - Do system tools (packagers, inspectors, demuxers, ...) need to access this data (e.g. to generate the 'codecs' parameter, or to determine encryption boundaries)? Having it in the sample payload (i.e. in `mdat`) is not optimal for them.
For the first and second questions: the timed metadata in the proposal consists of DemixingInfo() and ChannelGroupSpecificInfo(). DemixingInfo() applies to its associated frame (sample); its value can change as often as every 18 frames in the worst case, but its size is only 1 byte. ChannelGroupSpecificInfo() applies to each Channel Group of its associated frame (sample). It contains the sizes of each substream and of the Channel Group, plus gain values for reconstruction, both of which change frame by frame.
So we should check whether DemixingInfo() benefits from the SampleGroup scheme. Based on my calculation, the required size for the two boxes (`sbgp` and `sgpd`) is 44 + 9 × (number of entries) bytes: for `sbgp`, 20 + 8 × (number of entries), and for `sgpd`, 24 + (size of DemixingInfo()) × (number of entries). If we assume for simplicity that DemixingInfo changes 3 times per second (i.e. 3 entries per second), then it requires 71 bytes for a 1 s file, 98 bytes for a 2 s file, 125 bytes for a 3 s file, and so on. Since DemixingInfo() in the sample payload requires just 1 byte per frame, fragmented files longer than 2 s (100 frames) save storage by using the SampleGroup scheme. The less frequently DemixingInfo() changes, the more storage is saved.
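The break-even arithmetic above can be checked with a small script. The 44 + 9 × entries formula and the 50-frames-per-second figure (one 0.02 s frame, from 18 frames = 0.36 s) are taken from the text; this is an estimate of the discussed trade-off, not a normative box-size computation:

```python
def sample_group_bytes(entries, payload_size=1):
    # sbgp: 20 bytes fixed + 8 per entry; sgpd: 24 bytes fixed + payload per entry
    sbgp = 20 + 8 * entries
    sgpd = 24 + payload_size * entries
    return sbgp + sgpd

def inline_bytes(duration_s, frames_per_second=50):
    # DemixingInfo() costs 1 byte in front of every frame when stored in mdat
    return duration_s * frames_per_second

# Assuming DemixingInfo changes 3 times per second (3 entries per second):
for seconds in (1, 2, 3):
    grouped = sample_group_bytes(entries=3 * seconds)
    inline = inline_bytes(seconds)
    print(f"{seconds}s: sample groups = {grouped} B, inline = {inline} B")
```

This reproduces the 71/98/125-byte figures and shows the crossover at 2 s (98 B grouped vs. 100 B inline).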
For the third question: I believe the timed metadata in the proposal does not carry such information, except for the boundaries among substreams, which can change frame by frame. I think we need to consider this based on the scope of file parsers vs. OBU parsers, and also the location of OBU parsers relative to the decryption entity.
Thanks.
The fact that `ChannelGroupSpecificInfo` changes every frame is not sufficient to determine whether sample groups can be useful. It also depends on how many configurations of `ChannelGroupSpecificInfo` you will use. If the samples alternate between 2 configurations, sample groups are really appropriate. But if there is a large number of configurations and no pattern in how they are used, it is not a good candidate.

`DemixingInfo` seems to have only 8 possible values, so it's definitely a good candidate.

We should also consider whether it makes sense to use Sample Groups for one and not for the other. It makes processing a bit more complicated.
Note that `sbgp` is not always required. In `sgpd`, you can set `default_group_description_index`, and in that case you don't even need `sbgp`. Note also that `sbgp` can be replaced by `csgp` to encode sample group patterns more efficiently.
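The saving from `csgp` comes from compressing the sample-to-group mapping when it has runs or repeating patterns. As a rough illustration of the idea (not the actual `csgp` wire format), compare a per-run encoding to a per-sample one:

```python
from itertools import groupby

def to_runs(sample_to_group_index):
    """Collapse a per-sample group-index list into (index, run_length) pairs.
    This is the idea behind sbgp entries; csgp goes further and can also
    encode repeating patterns of such entries compactly."""
    return [(idx, len(list(run))) for idx, run in groupby(sample_to_group_index)]

# 3000 samples switching between two demixing configurations in long runs:
mapping = [1] * 1000 + [2] * 500 + [1] * 1500
print(to_runs(mapping))  # [(1, 1000), (2, 500), (1, 1500)]: 3 entries, not 3000
```

When the mapping has few distinct configurations and long runs, the box overhead stays nearly constant regardless of file duration.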
If you provide a textual representation (e.g. XML, JSON) of the `Timed_Metadata` structure for a real stream, I can generate a real MP4 file with MP4Box for you to look at.
Thanks.
Thanks for pointing it out. I will look into `csgp` to figure out the correct SampleGroup usage.
The `ChannelGroupSpecificInfo(ambisonics)` and `ChannelGroupSpecificInfo(channel_audio)` are not clear to me, but generally we should keep in the sample data the information required to parse the sample and feed the decoder(s). Anything else meant for the post-processor (downmixing instructions, gains, ...) could go into sample groups.
Some specific questions about timed metadata:
- Why do you need to repeat the stream count? It is already in the static metadata. Or do you envisage that for some samples, some channel groups will have no data?
- Can you explain the various size-related fields? Is this similar to `self-delimited` in Opus?

> The `ChannelGroupSpecificInfo(ambisonics)` and `ChannelGroupSpecificInfo(channel_audio)` are not clear to me, but generally we should keep in the sample data the information required to parse the sample and feed the decoder(s). Anything else meant for the post-processor (downmixing instructions, gains, ...) could go into sample groups.

Let me explain ChannelGroupSpecificInfo(). The first purpose of this Info() is to let the IAC file (or OBU) parser know the boundaries among substreams (mono/stereo bitstreams) within each ChannelGroup, and the ChannelGroup size. This purpose is codec dependent, so a codec may or may not need the boundaries. Opus does not need boundaries among substreams, because every substream of the CG except the last one is self-delimiting. If a following CG is present, the ChannelGroup size is required; but we could remove this as well for an optimal design, by requiring the self-delimiting structure for every substream except the last one of the frame (not of the CG). For AAC-LC, I think it does need the boundaries. The frame format for AAC-LC is ADTS (Audio Data Transport Stream, ISO/IEC 13818-7), which has a length field in its header, but the access unit is not the ADTS frame itself but the payload of ADTS. So I believe we need the boundaries for AAC-LC. Of course, we don't need the boundaries inside timed metadata in the multiple-track case (one track per substream).
The second purpose of this Info() is to let decoders know the gain values applied to channels after demixing related to the ChannelGroup. The gain values change frame by frame, and they are only required for Channel_Audio when audio scalability is applied. (In other words, if the channel audio consists of only one layer, the BCG-only case, it does not require the gain values.)
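As a toy sketch of this second purpose (the function name, the linear-gain multiplication, and the per-channel flags are illustrative assumptions, not the normative reconstruction formula): after demixing, each reconstructed channel of a scalable layer is scaled by its per-frame recon gain:

```python
def apply_recon_gains(demixed_channels, recon_gains, recon_gain_flags):
    """Scale demixed channels by per-frame reconstruction gains.

    demixed_channels -- list of per-channel sample lists after demixing
    recon_gains      -- per-channel linear gain values for this frame
    recon_gain_flags -- per-channel bools: True if a gain applies
                        (e.g. all False for a BCG-only, single-layer stream)
    """
    out = []
    for samples, gain, flagged in zip(demixed_channels, recon_gains, recon_gain_flags):
        if flagged:
            out.append([s * gain for s in samples])  # scalable layer: apply gain
        else:
            out.append(list(samples))                # base layer: pass through
    return out
```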
> Some specific questions about timed metadata:
> - Why do you need to repeat the stream count? It is already in the static metadata. Or do you envisage that for some samples, some channel groups will have no data?

The stream count is duplicated, so we can remove it.

> - Can you explain the various size-related fields? Is this similar to `self-delimited` in Opus?

I think it is better for you to refer to this updated version of the timed metadata (sorry for the confusion):
> If you provide a textual representation (e.g. XML, JSON) of the `Timed_Metadata` structure for a real stream, I can generate a real MP4 file with MP4Box for you to look at.

The JSON file is uploaded to the IAC folder.
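For readers without access to the IAC folder, here is a hypothetical shape for that JSON. The field names follow the identifiers mentioned in this thread (`Demixing_Info`, `Channel_Group_Specific_Info`, `Recon_Gain_Flags`, `Channel_Group_Size`, `Recon_Gain`); the exact schema and values in the uploaded file may differ:

```python
import json

# Hypothetical per-sample timed-metadata record; field names follow the
# identifiers used in this thread, not a confirmed schema.
timed_metadata_sample = {
    "Demixing_Info": {"demixing_mode": 1},   # 1-byte field, 8 possible values
    "Channel_Group_Specific_Info": [
        {
            "Channel_Group_Size": 742,       # bytes in this channel group
            "Recon_Gain_Flags": 3,           # bitmask of channels carrying gains
            "Recon_Gain": [0.87, 0.91],      # per-flagged-channel gain values
        },
        # ... one entry per channel group (4 per sample in the example file)
    ],
}
print(json.dumps(timed_metadata_sample, indent=2))
```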
What I see in your file: `TimedMetadata` objects, each with 1 `Demixing_Info` and 4 `Channel_Group_Specific_Info` (the latter with `Recon_Gain_Flags`, `Channel_Group_Size`, and `Recon_Gain` arrays).

I attach an mp4 file. The audio content is garbage, don't try to listen to it, but I faked an `admi` (audio demixing sample group) based on the info in your JSON file.
https://user-images.githubusercontent.com/1830314/159753464-f7269cf0-8a59-4ce4-9e29-b35126bceb79.mp4
You can view the file in ... It will show ... and you can see that, instead of the 3000 × 1 byte, the whole signaling of demixing takes 29 `sgpd` + 284 `sbgp` bytes.
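The numbers above are easy to verify: for that ~3000-frame file, sample-group signaling replaces 3000 inline bytes with 313 bytes of box data:

```python
inline = 3000 * 1    # 1 byte of DemixingInfo() per sample, stored inline in mdat
grouped = 29 + 284   # sgpd + sbgp bytes reported for the generated file
print(inline, grouped, inline - grouped)  # 3000 313 2687
```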
I need to think if there are ways to optimize the storage of other data not required to decode the sample (e.g. Gain).
Many thanks! Thanks to your mp4 file, now I clearly understand the usage of Sample Group.
We agree to create a generic sample group where the payload is exactly the OBU content (with header and length). This is to be used for demixing OBU and possibly others in the future. We still need to discuss if sample groups shall or should or can be used, possibly based on the OBU type.
I am closing this as sample groups are now used as described in the spec.
Unlike video, audio does not have keyframes (alternately, every sample is a keyframe). Consider using SampleGroups to avoid excessive numbers of timed metadata units.