AOMediaCodec / iamf

Immersive Audio Model and Formats
https://aomediacodec.github.io/iamf/

Use SampleGroups to collect timed metadata #21

Closed tdaede closed 2 years ago

tdaede commented 2 years ago

Unlike video, audio does not have keyframes (alternatively, every sample is a keyframe). Consider using SampleGroups to avoid excessive numbers of timed metadata units.

sunghee-hwang commented 2 years ago

Let me check my understanding of what "using SampleGroups to avoid excessive numbers of timed metadata units" means. In the current proposal, the timed metadata units are stored in front of each sample in mdat. If we use SampleGroups instead, the idea would be: during encapsulation the contents of the timed metadata are divided into SampleGroups, which are stored inside moov and/or moof rather than in mdat; when parsing the file, the contents of the SampleGroups are merged to reconstruct the original timed metadata, which is placed in front of each relevant sample to form the IA bitstream passed to the decoders. Am I correct?

cconcolato commented 2 years ago

Yes, we could specify that the Sample Group data is reinserted into the elementary stream, for example when exporting back to elementary stream syntax. Regarding the integration between the file parser and the decoder, ISOBMFF usually does not specify how that is done. One implementation could do it by going back to the elementary stream; another implementation could pass the Sample Group information as side information.

sunghee-hwang commented 2 years ago

Then, it seems to me that the purpose of using SampleGroups is to save storage. I will prepare a summary comparing "the overhead of timed metadata" vs. "the overhead of using SampleGroups for timed metadata".

cconcolato commented 2 years ago

It is not only to save storage space. It is also a design question:

  • How often is this data expected to change? If it does not change at almost every frame, why should it be in the elementary stream?
  • If the audio decoder does not consume the metadata but a post-processor consumes it, why should it be in the elementary stream?
  • Do system tools (packagers, inspectors, demuxers, ...) need to access this data (e.g. to generate the 'codecs' parameter, or to determine encryption boundaries)? Having it in the sample payload (i.e. in 'mdat') is not optimal.

cconcolato commented 2 years ago
It would be good if we could have a clear understanding of what is allowed to change and when. For example, we could fill in the following table:

| IAC Feature | Change possibly frame by frame | Change sometimes (but not frame by frame) | Change not foreseen at all in a track | Change requires decoder reinitialization | Change requires rendering reinitialization |
| --- | --- | --- | --- | --- | --- |
| Codec | | | | | |
| Sample rate | | | | | |
| Ambisonics use | | | | | |
| Ambisonics Order | | | | | |
| Use of Ambisonics demixing | | | | | |
| Ambisonics demixing matrix | | | | | |
| Ambisonics channel mapping | | | | | |
| Ambisonics coupling | | | | | |
| Use of non-diegetic channels | | | | | |
| Count of non-diegetic channels | | | | | |
| Coupling of non-diegetic channels | | | | | |
| Layout of non-diegetic channels (number of DCG, composition of DCG) | | | | | |
| Count of non-diegetic channels in Base Channel Group | | | | | |
| Coefficients for non-diegetic channels (Matrix Downmix Tree) | | | | | |

sunghee-hwang commented 2 years ago

All of the features except the last one are non-timed metadata, so the proposal assumes that changes to them are not foreseen at all within a track. The last one (Coefficients for non-diegetic channels) changes sometimes, but not frame by frame; based on the encoder guideline, it may change as often as every 18 frames (0.36 seconds) in the worst case. Please refer to the paper for details: https://www.aes.org/e-lib/browse.cfm?elib=21489

sunghee-hwang commented 2 years ago

> It is not only to save storage space. It is also a design question:
>
>   • How often is this data expected to change? If it does not change at almost every frame, why should it be in the elementary stream?
>   • If the audio decoder does not consume the metadata but a post-processor consumes it, why should it be in the elementary stream?
>   • Do system tools (packagers, inspectors, demuxers, ...) need to access this data (e.g. to generate the 'codecs' parameter, or to determine encryption boundaries)? Having it in the sample payload (i.e. in 'mdat') is not optimal.

For the first and second points: the timed metadata in the proposal consists of DemixingInfo() and ChannelGroupSpecificInfo(). DemixingInfo() applies to its associated frame (sample); its information can change as often as every 18 frames in the worst case, but its size is only 1 byte. ChannelGroupSpecificInfo() applies to each Channel Group of its associated frame (sample); it contains the sizes of each substream and of the Channel Group, as well as gain values for reconstruction, both of which change frame by frame.

So, we need to check whether DemixingInfo() benefits from the SampleGroup scheme. Based on my calculation, the size required for the two boxes (sbgp and sgpd) is 44 + 9 × (# of entries) bytes (for sbgp, 20 + 8 × (# of entries); for sgpd, 24 + size of DemixingInfo() × (# of entries)). If we assume for simplicity that DemixingInfo() changes 3 times per second (i.e. 3 entries per second are required), then the boxes take 71 bytes for a 1 s file, 98 bytes for a 2 s file, 125 bytes for a 3 s file, and so on. As DemixingInfo() in the sample payload requires just 1 byte per frame, fragmented files longer than about 2 s (100 frames) save storage with the SampleGroup scheme, and the less frequently DemixingInfo() changes, the greater the saving.
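
For what it's worth, here is a small Python sketch of the same break-even calculation; the 50 frames-per-second figure (20 ms frames) and the helper names are assumptions used only for illustration.

```python
FRAMES_PER_SECOND = 50    # assumed frame rate (20 ms frames)
DEMIX_INFO_SIZE = 1       # DemixingInfo() is 1 byte

def sample_group_bytes(entries: int) -> int:
    """sbgp (20 + 8 * entries) + sgpd (24 + DemixingInfo size * entries)."""
    return (20 + 8 * entries) + (24 + DEMIX_INFO_SIZE * entries)

def in_band_bytes(duration_s: int) -> int:
    """One DemixingInfo() byte in front of every sample in mdat."""
    return duration_s * FRAMES_PER_SECOND * DEMIX_INFO_SIZE

for duration_s in (1, 2, 3, 10):
    entries = 3 * duration_s  # assume DemixingInfo() changes 3 times per second
    print(duration_s, sample_group_bytes(entries), in_band_bytes(duration_s))
# 1 s: 71 vs 50, 2 s: 98 vs 100, 3 s: 125 vs 150, 10 s: 314 vs 500
```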

For the third point, I believe that the timed metadata in the proposal does not carry such information, except for the boundaries among substreams, which can change frame by frame. I think we need to consider this based on the scope of file parsers vs. OBU parsers, and also the location of the OBU parser relative to the decryption entity.

cconcolato commented 2 years ago

Thanks.

The fact that ChannelGroupSpecificInfo changes every frame is not sufficient to determine if sample groups can be useful. It also depends on how many configurations of ChannelGroupSpecificInfo you will use. If the samples alternate between 2 configurations, sample groups are really appropriate. But if there is a large number of configurations and no pattern in how they are used, it is not a good candidate.

DemixingInfo seems to have only 8 possible values, so it's definitely a good candidate.

We should also consider whether it makes sense to use Sample Groups for one and not for the other. It makes processing a bit more complicated.

Note that sbgp is not always required. In sgpd, you can set default_group_description_index and in that case, you don't even need sbgp. Note also that sbgp can be replaced by csgp to encode sample group patterns more efficiently.
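
To illustrate the default_group_description_index point, here is a minimal Python sketch of how a reader might resolve the group description for a sample; the class and function names are hypothetical, and only a subset of the box fields is modeled (sgpd version 2, a run-based sbgp; csgp is not shown).

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Sgpd:
    """Subset of SampleGroupDescriptionBox ('sgpd'), version 2."""
    default_group_description_index: int  # 0 means "no default"
    entries: List[bytes]                  # group description entries (e.g. DemixingInfo payloads)

@dataclass
class SbgpRun:
    """One entry of SampleToGroupBox ('sbgp')."""
    sample_count: int
    group_description_index: int          # 1-based; 0 means "not in any group of this type"

def group_entry_for_sample(sample_index: int, sgpd: Sgpd,
                           sbgp_runs: List[SbgpRun]) -> Optional[bytes]:
    """Return the group description entry that applies to the given sample."""
    first = 0
    for run in sbgp_runs:
        if first <= sample_index < first + run.sample_count:
            idx = run.group_description_index
            return sgpd.entries[idx - 1] if idx else None
        first += run.sample_count
    # Samples not covered by sbgp (or files with no sbgp at all) use the default.
    idx = sgpd.default_group_description_index
    return sgpd.entries[idx - 1] if idx else None
```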

cconcolato commented 2 years ago

If you provide a textual representation (e.g. XML, JSON) of the Timed_Metadata structure for a real stream, with MP4Box, I can generate a real MP4 file for you to look at.

sunghee-hwang commented 2 years ago

> Thanks.
>
> The fact that ChannelGroupSpecificInfo changes every frame is not sufficient to determine if sample groups can be useful. It also depends on how many configurations of ChannelGroupSpecificInfo you will use. If the samples alternate between 2 configurations, sample groups are really appropriate. But if there is a large number of configurations and no pattern in how they are used, it is not a good candidate.
>
> DemixingInfo seems to have only 8 possible values, so it's definitely a good candidate.
>
> We should also consider whether it makes sense to use Sample Groups for one and not for the other. It makes processing a bit more complicated.
>
> Note that sbgp is not always required. In sgpd, you can set default_group_description_index and in that case, you don't even need sbgp. Note also that sbgp can be replaced by csgp to encode sample group patterns more efficiently.

Thanks for pointing that out. I will look into csgp to figure out the correct SampleGroup usage.

cconcolato commented 2 years ago

The ChannelGroupSpecificInfo(ambisonics) and ChannelGroupSpecificInfo(channel_audio) are not clear to me, but generally we should keep in the sample data the required information to parse the sample and feed the decoder(s). Anything else meant for the post-processor (downmixing instructions, gain, ...) could go in sample groups.

Some specific questions about timed metadata:

  1. Why do we need to repeat the stream count? It is already in the static metadata. Or do you envisage that for some samples, some channel groups will have no data?
  2. Can you explain the various size-related fields? Is this similar to self-delimited in Opus?
sunghee-hwang commented 2 years ago

> The ChannelGroupSpecificInfo(ambisonics) and ChannelGroupSpecificInfo(channel_audio) are not clear to me, but generally we should keep in the sample data the required information to parse the sample and feed the decoder(s). Anything else meant for the post-processor (downmixing instructions, gain, ...) could go in sample groups.

Let me explain ChannelGroupSpecificInfo(). The 1st purpose of this Info() is to let the IAC file (or OBU) parser know the boundaries among substreams (mono/stereo bitstreams) within each ChannelGroup, and the ChannelGroup size. This 1st purpose is codec dependent, so a codec may or may not need the boundaries. Opus does not need the boundaries among substreams, because every substream of the CG except the last one is self-delimiting; if a following CG is present, the ChannelGroup size is still required, but we could remove this as well, for an optimal design, by requiring a self-delimiting structure on every substream except the last one of the frame rather than of the CG. For AAC-LC, I think we do need the boundaries. The frame format for AAC-LC is ADTS (Audio Data Transport Stream, ISO/IEC 13818-7), which has a length field in its header, but the access unit is not the ADTS frame itself, only its payload, so I believe we need the boundaries for AAC-LC. Of course, we don't need the boundaries inside timed metadata for the multi-track case (one track per substream).
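
As a rough illustration of that 1st purpose (hypothetical names, not the proposal's actual syntax), a parser might use per-substream sizes like this:

```python
from typing import List

def split_channel_group(cg_payload: bytes, substream_sizes: List[int]) -> List[bytes]:
    """Split one Channel Group of a sample into per-substream payloads.

    substream_sizes would come from ChannelGroupSpecificInfo(); the exact
    field layout here is hypothetical. The point is that codecs such as
    AAC-LC, whose access units are not self-delimiting, need these sizes,
    while self-delimited Opus substreams would not.
    """
    substreams, offset = [], 0
    for size in substream_sizes:
        substreams.append(cg_payload[offset:offset + size])
        offset += size
    if offset != len(cg_payload):
        raise ValueError("substream sizes do not cover the Channel Group payload")
    return substreams
```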

The 2nd purpose of this Info() is to let decoders know the gain values applied to the channels after demixing related to the ChannelGroup. The gain values change frame by frame, and they are only required for Channel_Audio when audio scalability is applied (in other words, if the channel audio consists of only one layer, the BCG-only case, the gain values are not required).

> Some specific questions about timed metadata:
>
>   1. Why do we need to repeat the stream count? It is already in the static metadata. Or do you envisage that for some samples, some channel groups will have no data?
>   2. Can you explain the various size-related fields? Is this similar to self-delimited in Opus?

  1. The stream count is duplicated, so we can remove it.
  2. I think it is better to refer to this updated version of the timed metadata (sorry for the confusion): [image of the updated Timed_Metadata structure]
sunghee-hwang commented 2 years ago

> If you provide a textual representation (e.g. XML, JSON) of the Timed_Metadata structure for a real stream, with MP4Box, I can generate a real MP4 file for you to look at.

The JSON file has been uploaded to the IAC folder.

cconcolato commented 2 years ago

What I see in your file:

I am attaching an mp4 file. The audio content is garbage (don't try to listen to it), but I faked an admi (audio demixing) sample group based on the info in your JSON file.

https://user-images.githubusercontent.com/1830314/159753464-f7269cf0-8a59-4ce4-9e29-b35126bceb79.mp4

You can view the file in [...]. It will show: [image]

You can see that, instead of 3000 × 1 byte, the whole signaling of demixing takes 29 bytes of sgpd + 284 bytes of sbgp.
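
For reference, the comparison in plain numbers (values taken directly from the figures above):

```python
frames = 3000
in_band = frames * 1            # 1-byte DemixingInfo() in front of every sample
via_sample_groups = 29 + 284    # sgpd + sbgp sizes reported above
print(in_band, via_sample_groups)  # 3000 vs 313 bytes, roughly a 10x reduction
```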

I need to think if there are ways to optimize the storage of other data not required to decode the sample (e.g. Gain).

sunghee-hwang commented 2 years ago

Many thanks! Thanks to your mp4 file, I now clearly understand the usage of Sample Groups.

cconcolato commented 2 years ago

We agreed to create a generic sample group whose payload is exactly the OBU content (with header and length). This is to be used for the demixing OBU, and possibly others in the future. We still need to discuss whether sample groups shall, should, or can be used, possibly depending on the OBU type.
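
For illustration only (a sketch, not the spec's actual syntax), such a generic entry could simply carry the OBU bytes verbatim; the one-byte header and one-byte size coding below are simplified stand-ins, and the real grouping_type and OBU syntax are whatever the spec defines.

```python
def obu_group_description_entry(obu_type: int, payload: bytes) -> bytes:
    """Pack one sample group description entry that is exactly one OBU."""
    if len(payload) >= 128:
        raise ValueError("this toy example only handles payloads < 128 bytes")
    header = bytes([(obu_type & 0x1F) << 3])  # simplified 1-byte OBU header
    size = bytes([len(payload)])              # simplified 1-byte size field
    return header + size + payload            # entry payload == complete OBU

# e.g. a 1-byte demixing payload carried verbatim inside the sample group entry
entry = obu_group_description_entry(obu_type=1, payload=bytes([0x00]))
```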

tdaede commented 2 years ago

I am closing this as sample groups are now used as described in the spec.