AOMediaCodec / iamf

Immersive Audio Model and Formats
https://aomediacodec.github.io/iamf/

Need to clarify who will trim during decapsulation of ISOBMFF file. #271

Closed sunghee-hwang closed 1 year ago

sunghee-hwang commented 1 year ago

During encapsulation into ISOBMFF, trimming information will be reflected in the relevant boxes according to the IAMF specification. Meanwhile, some samples carry trimming information when their Audio Frame OBUs have trimming data. Who, then, should trim the data during decapsulation of an IAMF-ISOBMFF file: the IAMF decoder or the IAMF-ISOBMFF parser?

tdaede commented 1 year ago

We will do track level trimming via ISOBMFF edit lists. The information will be copied from bitstream level trimming (see e.g. Opus in ISOBMFF mapping). The Simple and Base profiles will be restricted such that start times / trimming are the same to make this a simple process.
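
As a rough illustration of copying bitstream-level trimming into an edit list, here is a minimal sketch. It assumes a single edit list entry, a media timescale equal to the audio sample rate, and trimming given as sample counts; the function and class names are hypothetical and are not taken from the IAMF or ISOBMFF specifications.

```python
from dataclasses import dataclass


@dataclass
class EditListEntry:
    segment_duration: int  # presented duration, in movie timescale units
    media_time: int        # first presented sample, in media timescale units


def edit_list_from_trimming(total_samples, trim_start, trim_end,
                            media_timescale, movie_timescale):
    """Map bitstream-level trimming onto one edit list entry (illustrative only)."""
    presented = total_samples - trim_start - trim_end
    if presented < 0:
        raise ValueError("trimming exceeds the number of coded samples")
    # Rescale the presented length from the media timescale to the movie timescale.
    segment_duration = presented * movie_timescale // media_timescale
    return EditListEntry(segment_duration=segment_duration, media_time=trim_start)


# 2 s of 48 kHz audio with a 10 ms pre-amble (480 samples) and 240 padding samples.
entry = edit_list_from_trimming(total_samples=96_000, trim_start=480, trim_end=240,
                                media_timescale=48_000, movie_timescale=1_000)
print(entry)
```

With these example numbers the entry starts presentation at media time 480 and lasts 1985 movie timescale units.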

sunghee-hwang commented 1 year ago

Does 'start times' mean start PTS? In terms of AV synchronization, if we assume a 10 ms audio pre-amble and no video pre-amble, which values are set as the audio start PTS and the video start PTS when each elementary stream is passed to its decoder?

sunghee-hwang commented 1 year ago

How about this? The figures may not be perfect or final, but they show the trimming-related procedures. NOTE: IAMF decoders, for both Standalone and ISOBMFF Interlocked, shall trim the audio samples that need trimming in mid-stream (in the case of short-lived audio elements, which neither start at the start of the stream nor end at the end of it).

[Standalone-IAMF Decoder]

graph TD
OBU_Parser(OBU Parser) --> |PTS1| Substream_Decoder("Substream Decoder")
OBU_Parser --> |Substreams| Substream_Decoder
Substream_Decoder --> |PTS1| Mix_Presentation("Mixing and Presentation")
OBU_Parser --> |Trimming Information| Mix_Presentation
Substream_Decoder --> |Audio Samples before trimming| Mix_Presentation
Mix_Presentation --> Output("Audio Samples after trimming at PTS2") 

[ISOBMFF Interlocked-IAMF Decoder]

graph TD
OBU_Parser(OBU Parser) --> |PTS1| Substream_Decoder("Substream Decoder")
OBU_Parser --> |Substreams| Substream_Decoder
Substream_Decoder --> |PTS1| Mix_Presentation("Mixing and Presentation")
Substream_Decoder --> |Audio Samples before trimming| Mix_Presentation
Mix_Presentation --> Output("Audio Samples before trimming at PTS1") 

[ISOBMFF Interlocked]

graph TD
ISOBMFF_Parser(ISOBMFF Parser) --> |Descriptor OBUs| IAMF_Decoder("IAMF Decoder")
ISOBMFF_Parser --> |PTS1| IAMF_Decoder("IAMF Decoder")
ISOBMFF_Parser --> |Samples| IAMF_Decoder
IAMF_Decoder --> |PTS1| ISOBMFF_Player
IAMF_Decoder --> |Audio Samples before trimming| ISOBMFF_Player
ISOBMFF_Parser --> |Trimming Information| ISOBMFF_Player
ISOBMFF_Player --> Output("Audio Samples after trimming at PTS2")

sunghee-hwang commented 1 year ago

How about this option?

While trying to resolve this issue, I realized that the inputs an IAMF decoder receives are the same in both cases: either an IAMF parser or an IAMF-ISOBMFF parser passes descriptor OBUs, PTS, and samples (or Temporal Units) to the IAMF decoder.

NOTE: The IAMF decoder needs to trim the audio samples that need trimming at the end of substreams which do not end at the end of the stream and at the start of substreams which do not start at the start of the stream.

graph TD
Parser("IAMF Parser <br> (or IAMF-ISOBMFF Parser)") --> |Descriptor OBUs| IAMF_Decoder("IAMF Decoder")
Parser --> |PTS1| IAMF_Decoder
Parser --> |Samples or Temporal Units| IAMF_Decoder
IAMF_Decoder --> |PTS1| Player
IAMF_Decoder --> |Audio Samples before trimming| Player("IAMF Player or ISOBMFF Player")
Parser --> |Trimming Information| Player
Player --> |Audio Samples after trimming starting at PTS2| Speakers("Loudspeakers")

sunghee-hwang commented 1 year ago

Let me modify the above figure as shown below.

graph TD
Parser("IAMF File Parser <br> (or IAMF-ISOBMFF Parser)") --> |Descriptor OBUs| IAMF_Decoder("IAMF Decoder")
Parser --> |PTS1| IAMF_Decoder
Parser --> |Samples or Temporal Units| IAMF_Decoder
IAMF_Decoder --> |PTS2| Player
IAMF_Decoder --> |Audio Samples after trimming| Player("IAMF File Player <br> (or IAMF-ISOBMFF Player)")
Parser --> |Zero Trimming Information| Player
Player --> |Audio Samples after trimming starting at PTS2| Speakers("Loudspeakers") 

Here is the reason for the update. When we consider random access, the accessed sample may include the first Audio Frame OBUs of short-lived content. In this case, some of the Audio Frame OBUs have trimming information but the others do not, which raises the question of how the IAMF decoder handles this. It is not impossible to implement, but it is definitely complicated. So a simple and safe approach is for the IAMF-ISOBMFF parser/player not to care about trimming. Instead, the IAMF decoder guarantees that its output contains no samples that should be trimmed, and it outputs a PTS updated accordingly.

NOTE: ISOBMFF boxes such as 'edts' and 'stts' still reflect the trimming information properly, based on the IA sequence.
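
To make the "decoder guarantees trimmed output and an updated PTS" idea concrete, here is a minimal sketch for a single decoded frame. It assumes the PTS is counted in samples and that the trimming information is a pair of sample counts; `trim_frame` and its parameters are hypothetical names, and the mixing of several audio elements with different trimming is deliberately not covered.

```python
def trim_frame(pts1, samples, trim_start, trim_end):
    """Return (pts2, kept_samples) for one decoded frame (illustrative only).

    pts1       -- PTS of the first decoded sample, in sample units
    samples    -- decoded samples before trimming
    trim_start -- number of samples to drop at the start of the frame
    trim_end   -- number of samples to drop at the end of the frame
    """
    if trim_start < 0 or trim_end < 0 or trim_start + trim_end > len(samples):
        raise ValueError("invalid trimming for this frame")
    kept = samples[trim_start:len(samples) - trim_end]
    # Dropping samples at the start moves the start PTS forward by the same amount.
    pts2 = pts1 + trim_start
    return pts2, kept


frame = list(range(8))
print(trim_frame(0, frame, 3, 0))   # start trimming: PTS moves from 0 to 3
print(trim_frame(8, frame, 0, 2))   # end trimming: PTS is unchanged
```

Only trimming at the start changes the output PTS; trimming at the end shortens the frame and leaves the PTS as it was.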

tdaede commented 1 year ago

Hi, sorry for the late reply on this, but these first diagrams look fine. The one thing I think would improve them is explicitly showing the two trimming paths when in ISOBMFF (start/end versus middle), or having two separate diagrams, one for each.

For the last diagram, I'm not 100% clear on it. Are you saying in this special case, the decoder would do all the trimming, and edts would reflect zero trim? Or would this be all the time, and not just in the special case?

sunghee-hwang commented 1 year ago

Based on the current operation of the reference software decoder, we have realized that the figures below would be more appropriate.

PTS1: Start Presentation Time Stamp before trimming (i.e. PTS of the first audio sample before trimming)
PTS2: Start Presentation Time Stamp after trimming (i.e. PTS of the first audio sample after trimming)
PTS: PTS1 or PTS2

IAMF decoders output PTS and audio samples. For the IAMF-ISOBMFF Player (ISOBMFF Inter-locked player), IAMF decoders may or may not trim the data, but they output audio samples and their start PTS. For the IAMF Player (standalone player), IAMF decoders trim the data. In other words, they output PTS2 and audio samples after trimming.
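
One possible reading of the ISOBMFF inter-locked case is that the player can apply the track-level presentation range to whatever the decoder hands over, as long as the decoder reports the start PTS of its output. The sketch below illustrates that reading only; it assumes all times are in sample units, and `apply_edit`, `media_time`, and `duration` are hypothetical names for the presentation range derived from the trimming information.

```python
def apply_edit(decoded, first_pts, media_time, duration):
    """Keep the samples whose PTS lies in [media_time, media_time + duration).

    decoded   -- samples output by the decoder (trimmed or not)
    first_pts -- PTS of decoded[0], as reported by the decoder (PTS1 or PTS2)
    """
    start = max(media_time - first_pts, 0)
    end = max(media_time + duration - first_pts, 0)
    return first_pts + start, decoded[start:end]


# Decoder did not trim: output starts at PTS1 = 0.
untrimmed = apply_edit(list(range(10)), first_pts=0, media_time=2, duration=6)
# Decoder already trimmed the start: output starts at PTS2 = 2.
pretrimmed = apply_edit(list(range(2, 10)), first_pts=2, media_time=2, duration=6)
print(untrimmed)
print(pretrimmed)
```

Both calls yield the same samples starting at the same PTS, which is why, under this assumption, it would not matter to the player whether the decoder trimmed or not.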

[ISOBMFF Inter-locked IAMF Player]

graph TD
Parser("IAMF-ISOBMFF Parser") --> |Descriptor OBUs| IAMF_Decoder("IAMF Decoder")
Parser --> |PTS1| IAMF_Decoder
Parser --> |Samples or Temporal Units| IAMF_Decoder
IAMF_Decoder --> |PTS & Audio Samples| Player("IAMF-ISOBMFF Player")
Parser --> |PTS1 & Trimming Information| Player
Player --> |Audio Samples after trimming starting at PTS2| Speakers("Loudspeakers") 

[Standalone IAMF Player]

graph TD
Parser("IAMF File Parser") --> |Descriptor OBUs| IAMF_Decoder("IAMF Decoder")
Parser --> |PTS1| IAMF_Decoder
Parser --> |Samples or Temporal Units| IAMF_Decoder
IAMF_Decoder --> |PTS2 & Audio Samples after trimming| Player("IAMF File Player")
Player --> |Audio Samples after trimming starting at PTS2| Speakers("Loudspeakers")

sunghee-hwang commented 1 year ago

Hi Shawn,

As discussed at the IAMF call, please share your note for the standalone IAMF player. Here is a draft note from my side; please review and update it if needed. NOTE: A third entity may manage AV synchronization. In this case, the third entity may provide the PTSs for synchronization, so the figure for [Standalone IAMF Player] needs no PTSs.