Eyevinn / mp4ff

Library and tools for parsing and writing MP4 files including video, audio and subtitles. The focus is on fragmented files. Includes mp4ff-info, mp4ff-encrypt, mp4ff-decrypt and other tools.

Information on times in the SampleComplete object #8

Closed ivanjaros closed 4 years ago

ivanjaros commented 4 years ago

Hi, could you provide some information on what the times in the SampleComplete object mean and what type of values they are (sec, msec, nsec, ...)?

    mp4.SampleComplete{
        Sample: mp4.Sample{
            Flags uint32
            Dur   uint32 <----
            Size  uint32
            Cto   int32   <----
        },
        DecodeTime:         uint64 <----
        PresentationTime: uint64 <----
        Data                      []byte
    }
TobbeEdgeware commented 4 years ago

In mp4 files, and in this structure, the times are always in the timescale of the track. This timescale is stored in the mdhd box (moov.trak.mdia.mdhd) and can be set in a new init segment using the call mp4.CreateEmptyMP4Init(timescale, mediaType, lang) as in https://github.com/edgeware/mp4ff/blob/master/examples/initcreator/main.go

Many libraries use a common timescale like milliseconds for all tracks, but this leads to rounding errors since video frame rates and audio sample rates have different time scales. Audio tracks should typically have a timescale equal to their sampling rate (e.g. 48000), and 29.97 Hz video should have a timescale that is a multiple of 30000 so that each picture has an integer duration (a multiple of 1001).
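To make the rounding point concrete, here is a tiny arithmetic sketch (plain Go, not library code):

    package main

    import "fmt"

    func main() {
        // 29.97 Hz video is really 30000/1001 frames per second.
        // With a track timescale of 90000, the frame duration is an exact integer:
        fmt.Println(90000 * 1001 / 30000) // 3003 ticks per frame, no drift
        // With a common millisecond (1000) timescale it is not:
        fmt.Println(1000.0 * 1001.0 / 30000.0) // 33.3666... ms, must be rounded
    }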

TobbeEdgeware commented 4 years ago

RTMP has a timescale of 1000, so you can use the same when converting to mp4 segments. However, RTMP has a circular timeline which starts at zero at the beginning of the session, while live DASH has an absolute, ever-increasing timeline determined by the server, so you may need to add an offset to the timestamps and handle any wrap-arounds of the RTMP clock.
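One possible way to handle the offset and wrap-around, sketched under the assumption that the RTMP timestamp is a 32-bit millisecond value (not library code):

    // unwrapRTMPTime maps a circular 32-bit RTMP timestamp (ms) onto an
    // ever-increasing timeline. The caller keeps prevTS and offset between calls.
    func unwrapRTMPTime(ts, prevTS uint32, offset uint64) (newOffset, absoluteTime uint64) {
        if ts < prevTS { // the 32-bit RTMP clock wrapped around
            offset += 1 << 32
        }
        return offset, offset + uint64(ts)
    }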

ivanjaros commented 4 years ago

Thanks for the info, it sure will be useful. But I still don't know what the values are supposed to represent. I presume SampleComplete.Sample.Dur is the duration of the segment, SampleComplete.Sample.Cto is the composition time, SampleComplete.DecodeTime is the duration of the segment again, and SampleComplete.PresentationTime is the position of the segment on the timeline?

TobbeEdgeware commented 4 years ago

The latter two are accumulated values, so there is some redundancy.

Dur and Cto are the sample duration and composition_time_offset for an individual sample and are found in the trun box for fragments. DecodeTime and PresentationTime are accumulated values representing absolute times, or the time since the start of the VoD. These are the values used in the media timeline for DASH.

For a fragment:

 DecodeTime = tfdt.BaseMediaDecodeTime + Sum of Dur for all previous samples in that fragment
 PresentationTime = DecodeTime + Cto

For a monolithic file, the same applies, except that there is no tfdt box so one starts counting from zero at the start of the whole file.
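As a sketch of that accumulation (a hypothetical helper, using the field names from the SampleComplete struct above and the mp4 package from github.com/edgeware/mp4ff/mp4):

    // fillTimes sets DecodeTime and PresentationTime for the samples of a fragment.
    // baseDecodeTime is tfdt.BaseMediaDecodeTime for a fragment, or 0 when counting
    // from the start of a monolithic file.
    func fillTimes(samples []*mp4.SampleComplete, baseDecodeTime uint64) {
        decodeTime := baseDecodeTime
        for _, s := range samples {
            s.DecodeTime = decodeTime
            s.PresentationTime = uint64(int64(decodeTime) + int64(s.Sample.Cto))
            decodeTime += uint64(s.Sample.Dur)
        }
    }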

ivanjaros commented 4 years ago

I am still unable to produce playable videos. I have packets that can be simply written into mdat one after another, and with a moov header the result plays just fine. Each packet has a time.Duration available that indicates the position on the timeline since the beginning of the stream - so that is the DecodeTime (presentation time is the same value). I have produced fragmented files that look fine according to mp4parser.com, but they won't play. I am 100% sure it has something to do with these times. Even in your example in the neighbouring issue the duration is not time related but frame-rate based: dur := uint32(180000 / vw.vmd.Framerate) // TODO. Improve to remove drift. So something is not right here.

TobbeEdgeware commented 4 years ago

Nice that you're making progress. My complete code produced playable video, but assumed a known frame rate of 25. With a little more effort, I could have calculated it from the time differences of the incoming samples.

However, you can get dur more directly since it is simply the difference in time between the current sample and the next, converted to the right timescale. If you choose the 1000 timescale of the incoming RTMP timeline, you can calculate dur[i] = time[i+1] - time[i]. In my case, with timescale=180000, I should have multiplied by 180.
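For example, a hypothetical helper for that conversion, assuming the incoming timestamps are RTMP milliseconds:

    // durInTrackTimescale derives a sample duration from two consecutive RTMP
    // timestamps (milliseconds) and converts it to the track timescale.
    // Assumes small deltas so the multiplication fits in uint32.
    func durInTrackTimescale(curMs, nextMs, timescale uint32) uint32 {
        return (nextMs - curMs) * timescale / 1000 // e.g. 40 ms at timescale 180000 -> 7200
    }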

To get forward quickly you can assume a reasonable dur value, and you should get video that you can play with ffplay by concatenating the init segment with a media segment.

My comment about drift was for future cases where there could be possible accumulated rounding errors. You can neglect it.

ivanjaros commented 4 years ago

OK, so I have been able to produce playable videos now. The main issue was that I was setting both the video and the audio codec on the track, as per the init example, which is wrong - only one can be set. As for the times, I have tried every possible combination I could think of but nothing works. I never get the proper video length and the fps is never correct. I wonder if I should differentiate between audio and video tracks when creating the init stuff and setting segments?

ivanjaros commented 4 years ago

I have spent at least 20 hours with this library, still no success. It would be very helpful to have some kind of documentation, specifically about the tracks, durations/times, track IDs, fragment sequences and so on. Also, the create-init function has a hardcoded 90000 rate, which I think is a bug since you are providing the rate as an argument. And again, it creates only one track, and I still have no idea whether that is OK or not if you have audio and video packets (i.e. two streams?).

ivanjaros commented 4 years ago

I know I am beating a dead horse, but out of curiosity I have tried the mp4 library I originally had with the RTMP server to create just the headers (with proper track information), and even though I cannot play the video (yet), at least now I have the correct duration for video and audio in the file properties, which was never the case with this library. So I am now 100% certain that the issue here is the headers/tracks. The CreateEmptyMP4Init() is utterly useless and provides no value whatsoever because it needs to be back-filled with real data. Also the examples that use this function are misleading because creating a file with headers/init from this will never work.

I think this library truly needs to link segments with the init/file, and when a new segment/sample... is added, it has to compute the headers on its own, and only then can the entire container be encoded/written somewhere. The user should only be required to define tracks (and the codecs they use) and the rate (although that is always 90000 so setting it is pointless), and add data - either via multiple fragments or via a single mdat. Without this, there is too much complexity and a ton of boxes/atoms to handle manually in order to make this work like one would expect it to work (i.e. it is way too low-level at this point, imho).

TobbeEdgeware commented 4 years ago

It is quite complex to work with video and audio at this level. This library is meant as a toolbox and has some high-level functions to help out, but it does not provide any introductory description of the whole file format setup. I can hopefully add more info over time, based on questions like these. Given your questions, I will fill in some information about things that may not be obvious.

For me, who has worked in this area for a very long time, it is obvious that one should not mix audio and video in the same segment. It is allowed by the file format, but not supported by the DASH-IF, CMAF and HLS specs. All code assumes that there is only one track in the init segment and in the media segments, and it is given track number 1.

The CreateEmptyMP4Init() is not useless since it creates quite a tree of boxes. However, it does not include the sample descriptions, which are very codec dependent. Therefore they need to be filled in afterwards. For AVC/H.264 video these are created from SPS and PPS as the example code in writeVideoAVCInitSegment() in https://github.com/edgeware/mp4ff/blob/master/examples/initcreator/main.go shows.

In the partial example I provided in issue #7, there is code to extract SPS and PPS from the incoming RTMP video, which is outside the realm of this library. That example code is even less documented since it was not meant to be public.

Regarding the hardcoded timescale 90000, it is only on the movie level, which is used to describe a common audio and video duration for a progressive mp4 file. For the tracks, and especially for segments, it is the track timescale that matters, and that is what is set when you call CreateEmptyMP4Init().

I'd suggest that you try to make the video playable first, before you try to fix the audio, but here is what I recall for making AAC mp4 segments from RTMP AAC samples: For AAC audio, it is enough to provide the AAC profile and sampling rate as in writeAudioAACInitSegment(). Extracting that information from RTMP AAC samples can be done since an AAC sample of type 0 is an AudioSpecificConfig that can be decoded by mp4.AudioSpecificConfig().
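A rough sketch of that audio init path, reusing the calls that appear elsewhere in this thread (CreateEmptyMP4Init and SetAACDescriptor); exact names and signatures may differ between library versions:

    // makeAACInit builds an audio-only init segment whose track timescale equals
    // the AAC sampling rate, as recommended above. AAC-LC is assumed here; the
    // real profile would come from the decoded AudioSpecificConfig.
    func makeAACInit(samplingRate int) *mp4.InitSegment {
        initSeg := mp4.CreateEmptyMP4Init(uint32(samplingRate), "audio", "und")
        initSeg.Moov.Trak[0].SetAACDescriptor(mp4.AAClc, samplingRate)
        return initSeg
    }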

ivanjaros commented 4 years ago

As mentioned above, if you set both video and audio, the result will not work, so you have to have two tracks no matter what. Plus they will have different rates, as mentioned. So I don't get the "one track" or "don't mix audio and video" notes above. The packets from RTMP are demuxed from FLV, so these are essentially H.264 and AAC encoded data that need to be muxed into an mp4 container (or TS or whatever is needed). So these are essentially two separate tracks (a/v) that were present in the FLV container and now need to be put into a different container.

ivanjaros commented 4 years ago

I used this to create the init with one track per stream (a or v) before I tried the original mp4 muxer's approach to headers.

func initContainer(codecs []av.CodecData) *mp4.InitSegment {
    initSeg := mp4.NewMP4Init()
    initSeg.AddChild(mp4.CreateFtyp())
    moov := mp4.NewMoovBox()
    initSeg.AddChild(moov)
    mvhd := mp4.CreateMvhd()
    mvhd.NextTrackID = int32(len(codecs) + 1)
    moov.AddChild(mvhd)

    mvex := mp4.NewMvexBox()
    moov.AddChild(mvex)

    for k := range codecs {
        initTrack(moov, k, codecs[k])
        trex := mp4.CreateTrex()
        trex.TrackID = uint32(k + 1)
        mvex.AddChild(trex)
    }

    return initSeg
}

func initTrack(moov *mp4.MoovBox, tid int, codec av.CodecData) {
    isA := codec.Type().IsAudio()
    mediaType := "video"
    if isA {
        mediaType = "audio"
    }

    trak := &mp4.TrakBox{}
    moov.AddChild(trak)
    tkhd := mp4.CreateTkhd()
    tkhd.TrackID = uint32(tid + 1)
    if isA {
        tkhd.Volume = 0x0100 // Fixed 16 value 1.0
    }
    trak.AddChild(tkhd)

    mdia := &mp4.MdiaBox{}
    trak.AddChild(mdia)
    mdhd := &mp4.MdhdBox{}
    mdhd.Timescale = 90000
    mdia.AddChild(mdhd)
    hdlr, _ := mp4.CreateHdlr(mediaType)
    mdia.AddChild(hdlr)
    mdhd.SetLanguage("und")
    elng := mp4.CreateElng("und")
    mdia.AddChild(elng)
    minf := mp4.NewMinfBox()
    mdia.AddChild(minf)
    if isA {
        minf.AddChild(mp4.CreateSmhd())
    } else {
        minf.AddChild(mp4.CreateVmhd())
    }
    dinf := &mp4.DinfBox{}
    dinf.AddChild(mp4.CreateDref())
    minf.AddChild(dinf)
    stbl := mp4.NewStblBox()
    minf.AddChild(stbl)
    stsd := mp4.NewStsdBox()
    stbl.AddChild(stsd)
    stbl.AddChild(&mp4.SttsBox{})
    stbl.AddChild(&mp4.StscBox{})
    stbl.AddChild(&mp4.StszBox{})
    stbl.AddChild(&mp4.StcoBox{})

    if isA {
        info := codec.(av.AudioCodecData)
        trak.SetAACDescriptor(mp4.AAClc, info.SampleRate())
    } else {
        info := codec.(h264parser.CodecData)
        trak.SetAVCDescriptor("avc1", info.SPS(), [][]byte{info.PPS()})
    }
}
TobbeEdgeware commented 4 years ago

OK. Looks quite similar to what is done in CreateInitSegment() followed by setting the sample descriptors, except that you create two tracks.

I think the init segment should be fine here, but the media segment generation is pretty tied to having just one track, so it would be broken. The particular thing I come to think of is the DataOffset in the trun box. It must be adjusted to account for one media being displaced by the other in the mdat box. Of course, you can calculate this and adjust the value, but I recommend that you simply switch to the norm of having one media type per segment and use separate init and media segments for both video and audio. If you do that, you should also be able to replace the code above with a call to CreateInitSegment and then a call to set the corresponding descriptor.
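For the video track that could then look roughly like this (a sketch; info stands for the h264parser.CodecData from your code, and the audio init is built the same way with "audio", the sampling rate as timescale, and SetAACDescriptor):

    // One single-track init segment per media type, each with its own timescale.
    videoInit := mp4.CreateEmptyMP4Init(90000, "video", "und")
    videoInit.Moov.Trak[0].SetAVCDescriptor("avc1", info.SPS(), [][]byte{info.PPS()})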

I will try to add some documentation about the timescales and values in SampleComplete and how they are used to generate media segments.

TobbeEdgeware commented 4 years ago

Another comment, it seems that you really want to mux the tracks. Why?

It is typically the DASH manifest or HLS master playlist which tells how to combine the video and audio by reading the corresponding segments separately, so the muxing is done in the client.

For an example, look at the DASH-IF live-source simulator:

VoD: https://livesim.dashif.org/dash/vod/testpic_2s/Manifest.mpd
Live: https://livesim.dashif.org/livesim/testpic_2s/Manifest.mpd

In these streams, the video timescale is 90000 and the audio timescale is 48000, as you can see by looking into the mdhd boxes of the init segments.

ivanjaros commented 4 years ago

For me, who has worked in this area for a very long time

Another comment, it seems that you really want to mux the tracks. Why? It is typically the DASH manifest or HLS master playlist which tells how to combine the video and audio by reading the corresponding segments separately, so the muxing is done in the client.

I have been programming for twenty years but I have never worked with video, so I am coming at this from a layman's point of view. My presumptions and terminology are probably off since I am butting heads with this library.

In essence, I am working on a pet project where I want to capture a stream from OBS and proxy it into the browser's native video HTML element via media source extensions. I am using this RTMP server https://github.com/nareix/joy4 that takes the FLV packets from the connection and transforms them into a/v packets https://github.com/nareix/joy4/blob/master/format/flv/flv.go#L165 which are plain H.264 encoded video and AAC encoded audio.

So this is where the "audio" and "video" tracks come from. I want to take these a/v packets and create mp4 files where each file is one GOP (about 2 seconds, depends on the settings in the originating stream), which I will then process in a way that can be streamed into the browser.

The media source https://developer.mozilla.org/en-US/docs/Web/API/MediaSource supports ONLY fragmented MP4 containers. Even hls.js does this by taking the TS container, extracting the data and repackaging it into fragmented mp4 before passing it to the media source's segment buffer.

So I hope now you better understand where I am coming from.

I have checked the code of that RTMP server, which has its own mp4 de/muxer, and it too uses different tracks for each stream (stream = either audio or video packets). The only difference is that it does not use fragmenting, but instead uses one single mdat, puts moov at the end instead of at the beginning, and does not write ftyp. But the files play anyway.

TobbeEdgeware commented 4 years ago

OK. Thanks for the background. I've been working with video and streaming for 25 years in standardization, prototyping, products and some open source tools via DASH-IF so this is my area of expertise.

MSE takes segments as you say, but as far as I recall you should create separate SourceBuffers for video and audio and feed them with separate segments: first an init segment (essentially a moov box), and then media segments (moof + mdat boxes). Since the timescales are included in the init segments, you can have different timescales for different media.

You can of course write your own player, but I'd avoid that and use a standard DASH or HLS player, and have the receiver create an updated manifest or HLS playlist every time you've created a new segment. For DASH, one can do things more compactly and use a static MPD with $Number$ and timing information and let the player calculate the time of the latest segment based on the wall clock. This requires some time sync and control of drift, so it is probably easier to start by generating a segment list.

The single mdat + moov is the simplest output format to create, but you can of course not use it for live video.

TobbeEdgeware commented 4 years ago

Btw, I think you should be able to segment your output mdat + moov into moov + n * (moof + mdat) with the new example code (examples/segmenter) I just merged.

ivanjaros commented 4 years ago

So which of these init fields need to be filled in WriteTrailer() in order for the header to actually work? And I still don't understand splitting the audio and video into separate files... how does that even work? Like why would I want an mp3 file and a soundless mp4 video? And having a/v split for media source is also something really twisted 😅

package main

import (
    "errors"
    "github.com/edgeware/mp4ff/mp4"
    "github.com/nareix/joy4/av"
    "github.com/nareix/joy4/codec/h264parser"
    "io"
    "time"
)

func NewVideoMuxer(w io.Writer) av.Muxer {
    return &videoMuxer{w: w, scale: 90000}
}

type videoMuxer struct {
    w     io.Writer
    init  *mp4.InitSegment
    frag  *mp4.Fragment
    prev  *av.Packet
    dur   time.Duration
    cIdx  int
    scale uint32
    size  int
}

func (mux *videoMuxer) WriteHeader(codecs []av.CodecData) error {
    var codec av.CodecData
    for k := range codecs {
        if codecs[k].Type() == av.H264 {
            codec = codecs[k]
            mux.cIdx = k
            break
        }
    }

    if codec == nil {
        return errors.New("no video stream present")
    }

    info := codec.(h264parser.CodecData)
    mux.init = mp4.CreateEmptyMP4Init(mux.scale, "video", "und")
    mux.init.Moov.Trak[0].SetAVCDescriptor("avc1", info.SPS(), [][]byte{info.PPS()})

    mux.frag = mp4.CreateFragment(0, 1)

    return nil
}

func (mux *videoMuxer) WritePacket(p av.Packet) error {
    if int(p.Idx) != mux.cIdx {
        return nil
    }

    if mux.prev != nil {
        mux.addPacket(p, p.Time-mux.prev.Time)
    }

    mux.prev = &p

    return nil
}

func (mux *videoMuxer) addPacket(p av.Packet, ts time.Duration) {
    sample := &mp4.SampleComplete{
        Sample: mp4.Sample{
            Flags: mp4.NonSyncSampleFlags,
            Dur:   uint32(mux.toTime(ts)),
            Size:  uint32(len(p.Data)),
            Cto:   int32(mux.toTime(p.CompositionTime)),
        },
        DecodeTime:       uint64(mux.toTime(p.Time)),
        PresentationTime: uint64(mux.toTime(mux.dur)),
        Data:             p.Data,
    }

    if p.IsKeyFrame {
        sample.Sample.Flags = mp4.SyncSampleFlags
    }

    mux.frag.AddSample(sample)

    mux.dur += ts
    mux.size += len(p.Data)
}

func (mux *videoMuxer) WriteTrailer() error {
    if mux.prev != nil {
        mux.addPacket(*mux.prev, 0)
    }

    _ = mux.init.Moov.Mvex.Trex.DefaultSampleDescriptionIndex
    _ = mux.init.Moov.Mvex.Trex.DefaultSampleDuration
    _ = mux.init.Moov.Mvex.Trex.DefaultSampleFlags
    _ = mux.init.Moov.Mvex.Trex.DefaultSampleSize
    _ = mux.init.Moov.Mvex.Trex.Flags

    _ = mux.init.Moov.Mvhd.Flags
    _ = mux.init.Moov.Mvhd.Timescale
    _ = mux.init.Moov.Mvhd.Duration
    _ = mux.init.Moov.Mvhd.CreationTime
    _ = mux.init.Moov.Mvhd.ModificationTime
    _ = mux.init.Moov.Mvhd.Rate

    for _, trak := range mux.init.Moov.Trak {
        _ = trak.Edts.Elst.Flags
        _ = trak.Edts.Elst.MediaRateFraction
        _ = trak.Edts.Elst.MediaRateInteger
        _ = trak.Edts.Elst.MediaTime
        _ = trak.Edts.Elst.SegmentDuration

        _ = trak.Mdia.Elst.Flags
        _ = trak.Mdia.Elst.MediaRateFraction
        _ = trak.Mdia.Elst.MediaRateInteger
        _ = trak.Mdia.Elst.MediaTime
        _ = trak.Mdia.Elst.SegmentDuration

        _ = trak.Mdia.Mdhd.ModificationTime
        _ = trak.Mdia.Mdhd.CreationTime
        _ = trak.Mdia.Mdhd.Flags
        _ = trak.Mdia.Mdhd.Duration
        _ = trak.Mdia.Mdhd.Timescale

        _ = trak.Mdia.Minf..... THIS NEVER ENDS.....
    }

    if err := mux.init.Encode(mux.w); err != nil {
        return err
    }

    seg := mp4.NewMediaSegment()
    seg.AddFragment(mux.frag)

    return seg.Encode(mux.w)
}

func (mux *videoMuxer) toTime(dur time.Duration) int64 {
    return int64(dur * time.Duration(mux.scale) / time.Second)
}
TobbeEdgeware commented 4 years ago

See my text and pseudo-code below, but I think that your main problem is that you try to make one big write to the source buffer at the end, whereas you should do a write of every single segment as it is produced. I just see that you have a single Encode in the trailer. That will not work.

Regarding separate files, there are different rendering pipelines for the screen and the speaker, so the content must be split at some time. Of course, the underlying engine needs to sync things by appropriate timestamps (synchronized values after division with timescale). In the case of MSE, you make two source buffers like in https://stackoverflow.com/questions/51861938/javascript-mediasource-play-video-and-audio-simultaneously

 my_media_source.addEventListener("sourceopen",function (){
    var video_source_buffer = my_media_source.addSourceBuffer(video_mimeCodec);
    var audio_source_buffer = my_media_source.addSourceBuffer(audio_mimeCodec);

    //.......

    video_source_buffer.appendBuffer(...);
    audio_source_buffer.appendBuffer(...);
 });


So you first make an init segment once and feed that into the source buffer (audio or video). Then you add media segments with increasing timestamps to the two source buffers. You need to choose an offset for your timeline, which for RTMP is 0 when you start the session. You can keep the same offset here. The trailer should be removed since the moov is sent first in the init segment and a moof is sent before every mdat. You cannot have another moov after the whole thing.

Regarding what timestamps in SampleComplete are used by my code, that could definitely be clearer, so I will improve the documentation there, but you can see what is propagated downwards in func (f *Fragment) AddSample(s *SampleComplete).

trun needs all the Sample data fields (flags, dur, size, cto) so they need to be properly filled and you seem to do that. tfdt has the offset of the segment so it needs the DecodeTime to be set for the first sample of the segment, but it doesn't care about it later. This should be the start offset + the accumulated value of all dur of all previous samples. I don't know your parameters, but I think you fill the right value.

PresentationTime = DecodeTime + cto, so it is wrong in your code, but it is not used for generation, so I will probably drop that, and replace it with a method that calculates it.

For audio, things look very similar except that there is no CTO (its value should be set to 0), and all samples are sync samples.
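For instance, an AAC sample could be filled roughly like this (same struct as in your code; data and decodeTime are placeholders, and Dur 1024 assumes a track timescale equal to the sampling rate):

    audioSample := &mp4.SampleComplete{
        Sample: mp4.Sample{
            Flags: mp4.SyncSampleFlags, // every AAC frame is a sync sample
            Dur:   1024,                // 1024 PCM samples per AAC frame
            Size:  uint32(len(data)),
            Cto:   0, // no composition time offset for audio
        },
        DecodeTime:       decodeTime, // accumulated sum of previous durations
        PresentationTime: decodeTime, // equals DecodeTime since Cto is 0
        Data:             data,
    }
    // then: frag.AddSample(audioSample)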

There seems to be one level missing in your case, and that is the generation of a sequence of media segments and the repeated feeding of the source buffer. (Segments and fragments are sort of similar, but a segment in the code has one styp box and can have multiple fragments.) I use segments below and in my example code. They just add one styp box compared to going for plain fragments.

So basically you should do something like:

    w := writerToVideoSourceBuffer
    videoInit := createInit()
    videoInit.Encode(w)

    segNr := 1
    seg := mp4.NewMediaSegment()
    frag := mp4.CreateFragment(uint32(segNr), 1)
    seg.AddFragment(frag)
    for {
        // Loop over packets, create a new segment every 2 s and write it to videoSourceBuffer
        sample := dataFromPacket(p) // fill in all values as you do
        frag.AddSample(sample)
        if time >= segNr * 2s { // Write the current segment to the source buffer and start filling a new one.
            // You may also look for a sync frame here. Segments (in contrast to fragments) should actually
            // start with a sync frame.
            seg.Encode(w)
            segNr++
            seg = mp4.NewMediaSegment()
            frag = mp4.CreateFragment(uint32(segNr), 1)
            seg.AddFragment(frag)
        }
    }
ivanjaros commented 4 years ago

The front-end/media source part is far ahead. Currently I just want to produce a playable fragmented mp4. This is what I have spent some hours on today, but it still does not work. I have added the "if audio then sync = true" as mentioned above, but that is just a minor detail. I think the issue is with the trex inside mvex, since they both expect only a single track, which means one or the other track (a/v) won't work. My understanding is that trex is metadata for a track and holds the track id, duration, size... for it, so I think this is what breaks the end result. Or at least one of the things :D

package main

import (
    "github.com/edgeware/mp4ff/mp4"
    "github.com/nareix/joy4/av"
    "github.com/nareix/joy4/codec/aacparser"
    "github.com/nareix/joy4/codec/h264parser"
    "io"
    "time"
)

func NewVideoMuxer(w io.Writer) av.Muxer {
    return &videoMuxer{w: w}
}

type videoMuxer struct {
    w       io.Writer
    streams []stream
}

func (mux *videoMuxer) WriteHeader(codecs []av.CodecData) error {
    mux.streams = make([]stream, len(codecs))
    for k, v := range codecs {
        mux.streams[k] = newStream(k, v)
    }
    return nil
}

func (mux *videoMuxer) WritePacket(p av.Packet) error {
    mux.streams[p.Idx].addPacket(p)
    return nil
}

func (mux *videoMuxer) WriteTrailer() error {
    var dur time.Duration

    encoders := make([]interface{ Encode(io.Writer) error }, 0, len(mux.streams)+2)
    fTyp := mp4.CreateFtyp()
    encoders = append(encoders, fTyp)

    moov := mp4.NewMoovBox()
    encoders = append(encoders, moov)

    mvhd := mp4.CreateMvhd()
    mvhd.NextTrackID = int32(len(mux.streams)+1)
    moov.AddChild(mvhd)

    //mvex := mp4.NewMvexBox()
    //moov.AddChild(mvex)

    for _, stream := range mux.streams {
        stream.writeLastPacket()
        if stream.dur > dur {
            dur = stream.dur
        }

        track, segment := stream.compile()
        encoders = append(encoders, segment)
        moov.AddChild(track)

        //trex := mp4.CreateTrex()
        //trex.TrackID = track.Tkhd.TrackID
        //trex.DefaultSampleSize = uint32(stream.size)
        //trex.DefaultSampleDuration = uint32(stream.toTime(stream.dur))
        //mvex.AddChild(trex)
    }

    moov.Mvhd.Duration = uint64(toTime(dur, int(moov.Mvhd.Timescale)))

    for k := range encoders {
        if err := encoders[k].Encode(mux.w); err != nil {
            return err
        }
    }

    return nil
}

func newStream(idx int, c av.CodecData) stream {
    s := stream{
        codec: c,
        frag:  mp4.CreateFragment(1, uint32(idx+1)),
        scale: 90000,
        idx:   idx,
    }

    if c.Type().IsAudio() {
        info := c.(aacparser.CodecData)
        s.scale = info.SampleRate()
    }

    return s
}

type stream struct {
    codec av.CodecData
    frag  *mp4.Fragment
    dur   time.Duration
    size  int
    prev  *av.Packet
    scale int
    idx   int
}

func (s *stream) addPacket(p av.Packet) {
    if s.prev != nil {
        s.writePacket(p, p.Time-s.prev.Time)
    }
    s.prev = &p
}

func (s *stream) writePacket(p av.Packet, dur time.Duration) {
    sample := &mp4.SampleComplete{
        Sample: mp4.Sample{
            Flags: mp4.NonSyncSampleFlags,
            Dur:   uint32(s.toTime(dur)), // @todo last one is always 0
            Size:  uint32(len(p.Data)),
            Cto:   int32(s.toTime(p.CompositionTime)),
        },
        DecodeTime:       uint64(s.toTime(p.Time)),
        PresentationTime: uint64(s.toTime(s.dur)),
        Data:             p.Data,
    }

    if p.IsKeyFrame || s.codec.Type().IsAudio() {
        sample.Sample.Flags = mp4.SyncSampleFlags
    }

    s.frag.AddSample(sample)

    s.dur += dur
    s.size += len(p.Data)
}

func (s *stream) writeLastPacket() {
    if s.prev != nil {
        s.writePacket(*s.prev, 0)
        s.prev = nil
    }
}

func (s *stream) compile() (*mp4.TrakBox, *mp4.MediaSegment) {
    seg := mp4.NewMediaSegment()
    seg.AddFragment(s.frag)

    // there are only audio and video codecs so we can use if/else just fine
    if s.codec.Type().IsVideo() {
        info := s.codec.(h264parser.CodecData)
        return getVideoTrack(info, s.idx+1, s.scale, s.dur), seg
    } else {
        info := s.codec.(aacparser.CodecData)
        return getAudioTrack(info, s.idx+1, s.scale, s.dur), seg
    }
}

func (s *stream) toTime(dur time.Duration) int64 {
    return toTime(dur, s.scale)
}

func toTime(dur time.Duration, scale int) int64 {
    return int64(dur * time.Duration(scale) / time.Second)
}

func getVideoTrack(info h264parser.CodecData, trackId int, scale int, dur time.Duration) *mp4.TrakBox {
    track := new(mp4.TrakBox)

    tkhd := mp4.CreateTkhd()
    tkhd.TrackID = uint32(trackId)
    tkhd.Duration = uint64(toTime(dur, scale))
    track.AddChild(tkhd)

    mdia := new(mp4.MdiaBox)
    track.AddChild(mdia)

    mdhd := new(mp4.MdhdBox)
    mdhd.Timescale = uint32(scale)
    mdia.AddChild(mdhd)
    mdhd.SetLanguage("und")

    hdlr := &mp4.HdlrBox{HandlerType: "vide", Name: "Video Encoder"}
    mdia.AddChild(hdlr)

    minf := mp4.NewMinfBox()
    mdia.AddChild(minf)
    minf.AddChild(mp4.CreateVmhd())

    dinf := new(mp4.DinfBox)
    dinf.AddChild(mp4.CreateDref())
    minf.AddChild(dinf)

    stbl := mp4.NewStblBox()
    minf.AddChild(stbl)

    stsd := mp4.NewStsdBox()
    stbl.AddChild(stsd)

    stbl.AddChild(&mp4.SttsBox{})
    stbl.AddChild(&mp4.StscBox{})
    stbl.AddChild(&mp4.StszBox{})
    stbl.AddChild(&mp4.StcoBox{})

    track.Tkhd.Width = mp4.Fixed32(info.Width() << 16)
    track.Tkhd.Height = mp4.Fixed32(info.Height() << 16)

    avc := mp4.CreateAvcC(info.SPS(), [][]byte{info.PPS()})
    vs := mp4.CreateVisualSampleEntryBox("avc1", uint16(info.Width()), uint16(info.Height()), avc)
    vs.CompressorName = "Video Packager"
    track.Mdia.Minf.Stbl.Stsd.AddChild(vs)

    return track
}

func getAudioTrack(info aacparser.CodecData, trackId int, scale int, dur time.Duration) *mp4.TrakBox {
    track := new(mp4.TrakBox)

    tkhd := mp4.CreateTkhd()
    tkhd.TrackID = uint32(trackId)
    tkhd.Duration = uint64(toTime(dur, scale))
    tkhd.Volume = 0x0100
    track.AddChild(tkhd)

    mdia := new(mp4.MdiaBox)
    track.AddChild(mdia)

    mdhd := new(mp4.MdhdBox)
    mdhd.Timescale = uint32(scale)
    mdia.AddChild(mdhd)
    mdhd.SetLanguage("und")

    hdlr := &mp4.HdlrBox{HandlerType: "soun", Name: "Audio Encoder"}
    mdia.AddChild(hdlr)

    minf := mp4.NewMinfBox()
    mdia.AddChild(minf)
    minf.AddChild(mp4.CreateSmhd())

    dinf := new(mp4.DinfBox)
    dinf.AddChild(mp4.CreateDref())
    minf.AddChild(dinf)

    stbl := mp4.NewStblBox()
    minf.AddChild(stbl)

    stsd := mp4.NewStsdBox()
    stbl.AddChild(stsd)
    stbl.AddChild(&mp4.SttsBox{})
    stbl.AddChild(&mp4.StscBox{})
    stbl.AddChild(&mp4.StszBox{})
    stbl.AddChild(&mp4.StcoBox{})

    //asc := &mp4.AudioSpecificConfig{
    //  ObjectType:           byte(info.Config.ObjectType),
    //  ChannelConfiguration: 2, //byte(info.ChannelLayout().Count()),
    //  SamplingFrequency:    info.SampleRate(),
    //}
    //switch asc.ObjectType {
    //case mp4.HEAACv1:
    //  asc.ExtensionFrequency = 2 * info.SampleRate()
    //  asc.SBRPresentFlag = true
    //case mp4.HEAACv2:
    //  asc.ExtensionFrequency = 2 * info.SampleRate()
    //  asc.SBRPresentFlag = true
    //  asc.ChannelConfiguration = 1
    //  asc.PSPresentFlag = true
    //}
    //
    //buf := bytes.NewBuffer(nil)
    //err := asc.Encode(buf)
    //if err != nil {
    //  panic(err)
    //}
    //ascBytes := buf.Bytes()
    esds := mp4.CreateEsdsBox(info.MPEG4AudioConfigBytes())
    mp4a := mp4.CreateAudioSampleEntryBox(
        "mp4a",
        uint16(info.ChannelLayout().Count()),
        uint16(info.SampleFormat().BytesPerSample()),
        uint16(info.SampleRate()),
        esds,
    )
    stsd.AddChild(mp4a)

    return track
}

Also, I have just one fragment per track and just one segment for all fragments, since each clip is just a GOP, which is about 2 seconds long, and I do not need to fragment the data more granularly at this time.

TobbeEdgeware commented 4 years ago

I think you have the wrong intermediate goal. Don't try to mux audio and video into one file. This is going off in the wrong direction. Your final goal is to send data segment by segment to a player as two parallel streams without any muxing.

If you want to stay with just one fragment for now, it's fine. I'd still suggest that you make 4 different files:

video_init.mp4 video_1.m4s
audio_init.mp4 audio_1.m4s

You can then concatenate these and get something playable:

cat video_init.mp4 video_1.m4s > video.mp4
cat audio_init.mp4 audio_1.m4s > audio.mp4
ffplay video.mp4
ffplay audio.mp4

As a temporary check that they are aligned, you can mux them using MP4Box:

MP4Box -add video.mp4 -add audio.mp4 combined.mp4
ffplay combined.mp4

This will move you back to a progressive mp4 file, but it will show that the timing is correct, and that is all you need.

Another way of playing the video and audio together is to make an HLS asset by defining a master playlist pointing to two media playlists, which in turn have an init segment and a media segment corresponding to the segments above. That should also play.

Anyway, your end goal should be to generate two separate sequences of segments, one for audio and one for video and feed them to MSE.

ivanjaros commented 4 years ago

OK, I will give it a go. I thought I would just have one source buffer into which I would push either one mp4 file after another (then I learnt that is not how that works) or, more likely, just the ftyp+moov header and then one moof+mdat after another as they become available. But I still don't understand why the end result won't play (VLC). The original muxer https://github.com/nareix/joy4/blob/master/format/mp4/muxer.go just takes all packets and writes them into a single buffer that represents a single mdat, closes the entire blob with moov, and that is it. And it plays just fine. There are still separate tracks, so no magic there. But it works and this does not. I just can't wrap my head around why. When I inspected the file with mp4parser.com the structure was correct; there was nothing indicating it cannot play. Also, mp4 is supposed to be a container, so why so many track-related issues?

TobbeEdgeware commented 4 years ago

You can make things work, but you cannot interleave audio and video samples without extra lists of offsets into mdat. For traditional mp4 files this is done using chunk boxes. They are not available for fragmented files so you need to use some other tricks to do that here. In particular, you need to set some offsets in trun in another way for the player to be able to find the samples. That is not supported in this library which sets the offsets for finding the media samples as the start of the mdat right after the moof box. If you're interested how to make combined muxed segments, I have some old Python code for muxing audio and video segments in https://github.com/Dash-Industry-Forum/dash-live-source-simulator/blob/develop/dashlivesim/dashlib/segmentmuxer.py. However, as I said nobody is doing that any longer.

TobbeEdgeware commented 4 years ago

You can possibly also generate a playable muxed segment file with this library by generating a muxed init segment and concatenating it with segment 1 for video and segment 1 for audio (with different trackIDs for video and audio). In this case, all offsets should be OK since the full structure is moov + moof_v_1 + mdat_v_1 + moof_a_1 + mdat_a_1 and each moof points to the mdat that follows it. That would again mostly be an intermediate result to verify that the extracted content is playable, but it should be relatively easy to achieve with your code.

ivanjaros commented 4 years ago

OK, thanks. Now I have a better picture of the situation :)

ivanjaros commented 4 years ago

You can then concatenate these and get something playable:

cat video_init.mp4 video_1.m4s > video.mp4
cat audio_init.mp4 audio_1.m4s > audio.mp4
ffplay video.mp4
ffplay audio.mp4

When I do this I get "illegal short term buffer state detected" for each segment I add to the final result. It seems like either the segment (one per file) or its fragment (one per segment) has some missing information.

I have tried to use mp4.CreateFragment(0, 1) and mp4.CreateFragment(1, 1), just in case, since I am not sure which sequence number the fragment should start with (0 or 1), unlike the track, which is always 1. But it made no difference.

..

Also, I am splitting the stream by GOP. So the first frame is always a key frame, and when the next one comes in, I close the file (the GOP is done) and create a new one. The simple mdat muxer played the files just fine, so I presume this is not the issue, but I still want to point it out, just in case.

ivanjaros commented 4 years ago

I tried to set the fragment sequence to match the chunk sequence instead of always being 1, but it made no impact.

When I join the video track and audio track (init+chunks) into individual mp4 files (a.mp4, v.mp4), the video plays but it is just a green image with some artifacts over it that resemble fragments of the original recording. If I check the file info I see resolution, bitrate and fps but no duration.

When I play the audio file, it plays just fine, but when I open the file info I only see the sampling frequency and number of channels and nothing else. Since these fragments are dynamic, the missing a/v duration in the metadata is OK.

So all in all it works, but the video is corrupt for some reason. As mentioned before, using one mdat worked just fine, so I am guessing there is some data in the audio stream packets that is missing in the video stream... but that is as far as my understanding goes.

ivanjaros commented 4 years ago

And sorry for polluting this issue, since it was originally about something else, but I think this will at least help someone in the future.

TobbeEdgeware commented 4 years ago

Well, I think you should first try to make proper video segments without audio and vice versa and make them play. It seems that you tried to do that, but I guess that you somehow got a few bytes too many or too few in the video data.

The green color comes from decoded zero pixel values in the YCbCr color space, indicating that the video decoder has some issue with decoding the video.

The duration in the moov box describes the duration of the samples in the moov box, which here is zero. You can put the duration of the combined fragments in an optional mehd box inside mvex, but that is rarely done. Instead, the duration is most often described in a DASH manifest or HLS playlist on the side.

The sequence number in the segment does not normally matter, but I haven't tried the case of a mixed init segment followed by a video and then an audio segment.

I don't know exactly what you do wrong, but one can look at the binary data of the video at the beginning of the mdat box and see if it looks correct. With 'hexdump -C | more' you can see where the mdat starts. Right after it, the first picture starts, and it consists of NAL units preceded by 4-byte length fields. Each NAL unit starts with a one-byte header, and the lowest 5 bits of that byte are the type.

At the start of a GoP you should find NAL type 7 (SPS), type 8 (PPS, there can be more than one) and type 5, the actual video IDR NAL. However, you may also see a type 6 SEI NAL before the video NAL, and a short type 9 (AUD) at the start.

Below is hexdump -C video_0.m4s from the first segment extracted from an RTMP source:

    ...
    00000350 00 00 74 b8 00 00 4f 6d 6d 64 61 74 00 00 00 19 |..t...Ommdat....|
    00000360 67 64 00 1e ac d9 40 a0 2f f9 61 00 00 03 00 01 |gd....@./.a.....|
    00000370 00 00 03 00 3c 8f 16 2d 96 00 00 00 05 68 eb ec |....<..-.....h..|
    00000380 b2 2c 00 00 02 f2 06 05 ff ff ee dc 45 e9 bd e6 |.,..........E...|

After mdat, there is a length 00 00 00 19 hex = 25 dec followed by the 1-byte header 67, meaning NAL type 7, so this should be the 25-byte SPS of the content. 25 bytes later, there is a length 00 00 00 05 followed by the 1-byte header 68, corresponding to the 5-byte PPS, and 5 bytes later there is a length field and the start of a type 6 SEI NAL unit. At the end of that, the actual video NAL unit of type 5 starts.

I suspect that your mdat body does not look like this, so you may compare and try to find out if you have included a byte too much or too little.
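If you prefer to do that check programmatically instead of by eye, a small standalone sketch like this (plain Go, not part of mp4ff) walks the length-prefixed NAL units of an mdat payload and prints their types:

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // dumpNALUnits walks an AVC mdat payload (4-byte length + NAL unit, repeated)
    // and prints each NAL unit type: 9 = AUD, 7 = SPS, 8 = PPS, 6 = SEI,
    // 5 = IDR, 1 = non-IDR.
    func dumpNALUnits(payload []byte) {
        for len(payload) >= 4 {
            naluLen := binary.BigEndian.Uint32(payload[:4])
            payload = payload[4:]
            if naluLen == 0 || uint32(len(payload)) < naluLen {
                fmt.Println("truncated or empty NAL unit")
                return
            }
            nalType := payload[0] & 0x1f // lowest 5 bits of the header byte
            fmt.Printf("NAL type %d, %d bytes\n", nalType, naluLen)
            payload = payload[naluLen:]
        }
    }

    func main() {
        // Example payload: one 2-byte AUD NAL unit (length 00 00 00 02, header 0x09).
        dumpNALUnits([]byte{0x00, 0x00, 0x00, 0x02, 0x09, 0x30})
    }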

You could also try to take the output from the joy4 RTMP server and try to segment it with my segmenter example code. It should produce init segments and media segments that you can compare your generated files with.

Btw, I have made a PR #11 with some more documentation, where I also change some names and make it clear that one cannot have multiple tracks in media segments in this library. It may be a bit late for you, but if you have any feedback, it would be nice to know.

ivanjaros commented 4 years ago

The duration in the moov box describes the duration of the samples in the moov box, which here is zero. You can put the duration of the combined fragments in an optional mehd box inside mvex, but that is rarely done. Instead, the duration is most often described in a DASH manifest or HLS playlist on the side.

Well, I actually don't need the standalone video to work; if the init is OK and each segment is OK, then that is not of any concern to me. I was just trying to quickly play it, as you mentioned above, as a quick indicator, but... it just brought another issue I don't need to be dealing with.

The sequence number in the segment does not normally matter, but I haven't tried the case of a mixed init segment followed by a video and then an audio segment.

I haven't tried to mix them together and, as mentioned, have no need. My main goal is to make the final stream work. So far I have the entire backend done; now I am just struggling a bit with reading a "chunked transfer encoding" response in JavaScript properly (for some reason text responses work but a blob response merges the chunks). After that I think it should work as you described above.

If it won't work I will look at the raw data, as you described. Sounds "fun" :D

As for documentation, once I am done and everything works, I will have a look and maybe give some pointers via a new ticket.

... Firefox shows an Invalid Top-Level Box error for the init segment, so there is still some work ahead :D

TobbeEdgeware commented 4 years ago

OK. Good luck. I could have a look at your output segments if you get stuck.

Regarding chunked transfer encoding reception in JavaScript, there is as far as I know no way with the fetch API to guarantee that you get the actual chunks with the transport boundaries preserved; you'll get some "arbitrary" concatenations at some "arbitrary" times. There are time resolutions and buffer levels which prohibit direct propagation on chunk reception. You can add a parser to find the fragment boundaries, but from what I've learned, you can just add the more or less partial data to the source buffer as you receive it and it should work.

Having a long http request and sending all fragments as HTTP chunks may not be optimal, though. A series of HTTP GETs like in HLS or DASH, or a web socket where the fragments are pushed is probably more robust.

ivanjaros commented 4 years ago

Interestingly, I can open the video track in Chrome and it works just fine (still a green video, but I see movement, so the basics work), but media source will always crash and detach the source buffer. So I guess I will be working on that concatenated video first, and when I get rid of that green issue it might just fix it altogether.

ivanjaros commented 4 years ago

I have created just one segment instead of one per GOP and the video now works. There is still the green at the beginning, but then it catches up and renders properly. Even Windows shows a proper thumbnail for the video. So neither the headers nor the segments are the issue here, but the packets themselves.

..

Looking at the packets, the only thing I can see is that sample.Sample.Cto is always 0 in my case, since all packets have composition time = 0, which is the value I use for Cto. On the other hand, you have mentioned that the green is an issue with the init, so I am quite lost at this point (not that I wasn't until now anyway :D).

..

The duration in the moov box describes the duration of the samples in the moov box, which here is zero. You can put the duration of the combined fragments in an optional mehd box inside mvex, but that is rarely done. Instead, the duration is most often described in a DASH manifest or HLS playlist on the side.

Since the moov is in the header, I cannot do that, but I am wondering if there is something in the fragment that needs to be manually adjusted? Maybe just frag := mp4.CreateFragment(1, 1) .. frag.AddSample(sample) is not enough?

TobbeEdgeware commented 4 years ago

No, green is not an issue with init segment. It is an issue with missing pixels. It is probably because there is no, or a corrupt, key frame (IDR frame) at the start of the video. In that case, the first shown image may be just a delta image superposed on a green background.

ivanjaros commented 4 years ago

The first packet is always a key frame. I have now tried, out of curiosity, not splitting the packets by streams (tracks), so the video fragment got the audio packets as well, and I got the same result - green video. So I am still thinking the fragment has something to do with it.

As for the key frames, this is the logic that determines which packet is a key frame and which is not: https://github.com/nareix/joy4/blob/master/format/flv/flv.go#L174

Maybe interestingly, this is what the "dumb" mp4 mdat muxer does: https://github.com/nareix/joy4/blob/master/format/mp4/demuxer.go#L382

TobbeEdgeware commented 4 years ago

Looks like standard signaling in RTMP and monolithic mp4 files, so it is not fetched from the media itself by analyzing the NAL units. Anyway, this boolean information is probably correct, so if you try to write that frame into a sample and it does not show up as a proper image, there is probably some error in that process, or in the SPS/PPS writing.

What version of the library are you using? There was a bug in the AVCDecoderConfigurationRecord parsing that was fixed in commit 2d0f5fc after the latest release (0.9.0). Depending on the AVC content it could make a difference, so please use the latest code.

Note that I changed the names of some structs and methods in the PR I just merged, so the external API changed and you need to do some small changes to your code.

ivanjaros commented 4 years ago

OK, I will check out the latest version. As for SPS/PPS, that is one area where I think the issue might lie. I tried to use your AVC code and I tried using SPS/PPS directly, but it made no difference.

ivanjaros commented 4 years ago

Tried the latest version, no change. Audio plays fine, video still green.

ivanjaros commented 4 years ago

I think, after almost a week with this, I must draw the conclusion that if the "dumb" mp4 moov+mdat muxer from joy4 works and this does not produce the correct/expected end result, the fault must lie with this library. A Sherlockian conclusion, if you will. I know very little about video containers and codecs, as mentioned in my early post, so I cannot debug this myself. But after the time I have spent trying to make it work, I fail to see an issue on the joy4 library's side, and I don't do any transformation in my own code, so what is left is just this library. Apparently the audio works, so the issue is in the video track. Could it be that there is a "lack" of support for some codec "stuff"? I am using OBS with the same settings I use for YouTube:

renderer: direct 3d 11
color format: nv12
color space: 601
color range: partial
1080p scaled down to 720p
bicubic filter
15 fps
1000 kb/s bitrate
qsv encoder
TobbeEdgeware commented 4 years ago

Well, this is not enough for me to find a bug, since I need to be able to reproduce it. I've used the code in various ways myself and had no problems with the output video I've created, independent of whether it comes from RTMP or other MP4 files.

But please open a new ticket and provide me with some media segments/samples I can look at, and I can try to see what is wrong. The best would be to get the same samples both packed in something that plays properly, like the mdat+moov muxer output, and as init and media segments generated by your code.

If you cannot get exactly the same samples, it will be some more work, with somewhat less chance of success of pinpointing the issue, but I can give it a try.

ivanjaros commented 4 years ago

Would you want the samples, or the code so you can try it with OBS yourself? ...

If you want to have a look, I have created https://github.com/ivanjaros/streamer. It has all dependencies, so just build it and run it. It will start an RTMP server on port 1935; then just stream into it with OBS using the address rtmp://127.0.0.1 in "custom service" (not YouTube, Twitch, ...), and the key you use will be a cwd directory where the data will be saved.

ivanjaros commented 4 years ago

This is a hex dump of the first segment I just made, highlighted according to your description above (I hope I got it right).

TobbeEdgeware commented 4 years ago

I just wanted an init segment and a media segment so that I could analyze it, but the hexdump also reveals quite a lot.

The mdat content shows that you have:

2-byte type 9 AUD NAL unit: 09 30
No SPS
4-byte type 8 PPS NAL unit: 28 EE 3C B0
13-byte type 6 SEI NAL unit: 01 07 00 00 03 02 00 00 02 .. 04 80
253-byte (0xFD) type 1 non-IDR frame: 41 9A 02 3F ... (this should have been NAL type 5 = IDR, and much bigger than 253 bytes)

Thus the fragment does not start with an I-frame, so it is understandable that it looks green.

You're also missing an SPS in the sample and just have a PPS, which is quite strange. One can choose to have them in the moov box and/or in the media, but normally one has SPS and PPS together. One case where it may occur that there is only a PPS in the media is when one has multiple PPS and switches at a non-IDR frame, which is probably the case here.

So, it seems to me that you start your fragment with a non-IDR frame. Maybe the IDR frame is instead at the end of the previous segment.

Thanks for the link to your code. I took a very quick look at it, but I'm not sure I can get the time to find the likely bug.

TobbeEdgeware commented 4 years ago

Something I saw in your code is that you're too focused on calculating dur by taking differences in time.

Therefore you drop the first packet in the method

func (s *stream) addPacket(p av.Packet) {
    if s.prev != nil {
        s.writePacket(p, p.Time-s.prev.Time)
    }
    s.prev = &p
}

and you set the duration of the last sample to zero.

This is not the way it should work, although I mentioned it to explain that dur is a delta time (I saw in the joy4 code that it is described as a decode time, which it is not). dur is essentially a constant. For 48kHz AAC audio it is 1024, since there are 1024 audio samples in each frame (an mp4 sample). Similarly, the video dur for 29.97Hz video should be exactly 3003 in a 90000 timescale. So don't drop any sample, but try to estimate the frame rate (fixed duration) and use it. Then you can with great confidence assume that the duration of the last sample is the same as the others.
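A minimal sketch of the fixed-duration idea (the frame-rate fraction is an assumed input; for RTMP you would estimate it once from a few packet time differences):

    // frameDur returns the constant per-sample duration in the track timescale for
    // a frame rate given as a fraction, e.g. 30000/1001 for 29.97 Hz video.
    // frameDur(90000, 30000, 1001) == 3003. For 48 kHz AAC with timescale 48000,
    // the duration is simply 1024, since each AAC frame carries 1024 audio samples.
    func frameDur(timescale, fpsNum, fpsDen uint64) uint64 {
        return timescale * fpsDen / fpsNum
    }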

An advantage of having a constant dur value is that it can be set in a default field of tfhd and be removed on the sample level from the trun box. This will shrink the moof box. I'm preparing a PR to introduce that optimization.

There is of course always an exception, and that is if you have a web-cam or other camera with variable frame rate.

Anyway, maybe the first packet drop has some influence on the missing IDR-frame (although I cannot see directly that it should since you should wait a second or more until the next comes)?

ivanjaros commented 4 years ago

I was writing a long answer describing the duration computation for a packet, since there is no frame rate information, and then I noticed the bug that you have pointed out yourself 😂

this s.writePacket(p, p.Time-s.prev.Time)

should have been s.writePacket(*s.prev, p.Time-s.prev.Time)

When fixed, the concatenated video file works. I will be testing this tomorrow, but it seems this bug in my code caused the key frame to drop and caused the aforementioned color issues. As for the missing SPS/PPS, they are set only in the init part and not in the segment, so I am not sure how I should set them when I am writing fragments/segments.

TobbeEdgeware commented 4 years ago

Glad that you found the bug. You don't need to care about the SPS/PPS in the media sample. The single PPS without SPS just emphasized that the first sample could not be an IDR frame.

ivanjaros commented 4 years ago

I will finally close this issue and possibly open a new one if something comes up.

The last thing I will mention regards the documentation - not the code but the project. There should be some mention of not mixing the tracks and the fact that, as in my case, once the data is out of the original mp4 container (i.e. I had audio packets and video packets), mixing them together will not fly with this library. I guess it depends on the person - whether they know what they are doing or not (like me). I guess I was misled(?) by this example of media source http://nickdesaulniers.github.io/netfix/demo/bufferAll.html linked from https://developer.mozilla.org/en-US/docs/Web/API/MediaSource because that particular video is fragmented, but it has audio and video tracks and plays as a single mp4. Hence I thought I could have one "normal" mp4 that I just somehow restructure into multiple segments instead of one big chunk. Obviously that was incorrect information that led me in the wrong direction from the get-go.

TobbeEdgeware commented 4 years ago

OK. Thanks for the feedback.

I have added some documentation text about single tracks now, and I also return an error if one tries to add another track to the moof box (moov is OK to have multiple tracks since that is a possible input that we want to split). That should hopefully be an early indication that multiplexed fragmented MP4 files are not supported.

I've chosen the one-track-limited approach instead of spending time on getting muxed segments working, since I have not been close to using them in many years, and the general direction of interest is more towards multi-fragment low-latency streaming, where the tracks are definitely not muxed.

Regarding mime-types, as you hint, such utility functions don't really fit into this library but the essential information about different AAC cases is decoded in mp4/aac.go, so it makes sense to expand the API a little.

I only handle the main three AAC configurations:

name      mime-type 
AAC-LC    mp4a.40.2
HE-AACv1  mp4a.40.5
HE-AACv2  mp4a.40.29

Similarly for AVC/H.264, the utility functions could be expanded a bit.

Good luck with the rest of your implementation!