ffmpeg's AV_PKT_FLAG_KEY is not the same as type=key in WebCodecs

reinhrst commented 11 months ago

In packetToEncodedVideoChunk the type is set to "key" if packet.flags & AV_PKT_FLAG_KEY:

type: (packet.flags & 1) ? "key" : "delta",

One annoying thing, however, is that ffmpeg sets this flag for any I-frame with a recovery message, whereas WebCodecs demands an IDR frame to start decoding from (at least in H.264). I previously made a request to allow decoding to start at a recovery message, but this has not been implemented (yet). Current thinking (from the linked issue) seems to be to add an extra type (recover rather than key) for these frames.

I don't have a good solution for this right now (since I'm not sure ffmpeg internally keeps track of the difference between I and IDR frames), but I wanted to make sure this issue is logged somewhere.

Yahweasel commented 11 months ago

I am aware of this issue, but have no solution either. This would require parsing packets, and that's quite a heavyweight behavior. My "solution" for now has been "keep wandering through keyframes until you get to one VideoDecoder doesn't choke on". Personally, I'm with FFmpeg on this one; recovery frames are keyframes, that's what recovery is. But, my personal opinion on VideoDecoder being lame doesn't change it ;)
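For reference, a minimal sketch of what "parsing packets" could look like for H.264 only: scan the packet for a NAL unit with nal_unit_type 5 (IDR slice). The function names are made up for this example and are not part of libav.js or the bridge. The length-prefixed case assumes 4-byte lengths (the real value comes from the avcC description), and other codecs would need their own parsers.

```typescript
// Hypothetical helpers (not part of libav.js or libavjs-webcodecs-bridge).
// H.264 only: the low 5 bits of the first NAL byte are nal_unit_type; 5 = IDR slice.
const NAL_TYPE_IDR = 5;

// Annex B framing: NAL units are preceded by a 00 00 01 (or 00 00 00 01) start code.
function containsIdrAnnexB(data: Uint8Array): boolean {
  for (let i = 0; i + 3 < data.length; i++) {
    if (data[i] === 0 && data[i + 1] === 0 && data[i + 2] === 1) {
      if ((data[i + 3] & 0x1f) === NAL_TYPE_IDR) return true;
      i += 2; // skip the rest of the start code
    }
  }
  return false;
}

// Length-prefixed ("avc") framing: each NAL unit is preceded by a big-endian length.
// ASSUMPTION: lengthSize defaults to 4; the real value is stored in the avcC description.
function containsIdrAvc(data: Uint8Array, lengthSize = 4): boolean {
  let offset = 0;
  while (offset + lengthSize <= data.length) {
    let nalLength = 0;
    for (let i = 0; i < lengthSize; i++) {
      nalLength = nalLength * 256 + data[offset + i];
    }
    offset += lengthSize;
    if (offset + nalLength > data.length) break; // truncated or malformed packet
    if (nalLength > 0 && (data[offset] & 0x1f) === NAL_TYPE_IDR) return true;
    offset += nalLength;
  }
  return false;
}

// Only report "key" if the packet is flagged as key AND carries an IDR slice.
function strictChunkType(
  flags: number,
  data: Uint8Array,
  annexB: boolean
): "key" | "delta" {
  const flaggedKey = (flags & 1) !== 0; // AV_PKT_FLAG_KEY
  const hasIdr = annexB ? containsIdrAnnexB(data) : containsIdrAvc(data);
  return flaggedKey && hasIdr ? "key" : "delta";
}
```

Even this small scan has to run on every packet and know the framing in use, which is why it counts as heavyweight for a generic bridge.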

Yahweasel commented 11 months ago

(And, luckily, VideoDecoder in both Chrome and Safari don't behave incorrectly if you overspecify keyframes except for the first frame. So, once you've got VideoDecoder into a steady state, this marking doesn't matter.)

reinhrst commented 11 months ago

I'll see if I can get this other ticket moving again a bit, by adding the information that libav is not able to distinguish between an IDR frame and an I + recovery frame. One of the "promises" of WebCodecs is after all that you could use something like libav to demux.... It also seems to me that changing the behaviour (i.e. allow I + recovery frames as start frame) is not much more than a spec change; the H.264 decoders in browsers (at least the ones used for the HTMLVideoElement) can start just fine on an I + recovery frame....

marcello3d commented 11 months ago

Random idea I haven't tried, what if you:

1. keep around the very first keyframe from the file
2. try catch around decoder.decode()
3. if you hit the error, decode(firstKeyframe) then immediately the recovery keyframe again
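Roughly, as an untested sketch of those steps: `DecoderLike` is a stand-in for the one VideoDecoder method used here, and `decodeWithFallback` is a made-up name. It assumes that decode() throws synchronously when it rejects the chunk and that the decoder is still usable afterwards, neither of which is guaranteed.

```typescript
// Stand-in for the single VideoDecoder method used here, so the idea can be
// tried without a browser. C would be EncodedVideoChunk in real code.
interface DecoderLike<C> {
  decode(chunk: C): void;
}

// Hypothetical helper: try the chunk; if the decoder rejects it, feed the
// file's first keyframe and then the same chunk again.
// Returns true when the fallback was used, so the caller knows that one extra
// output frame (from firstKeyframe) will arrive and should be discarded.
function decodeWithFallback<C>(
  decoder: DecoderLike<C>,
  chunk: C,
  firstKeyframe: C
): boolean {
  try {
    decoder.decode(chunk);
    return false;
  } catch {
    // ASSUMPTION: the error was the "key frame required" rejection and the
    // decoder did not close itself.
    decoder.decode(firstKeyframe);
    decoder.decode(chunk);
    return true;
  }
}
```

A real version would probably check the error type before retrying, so that unrelated failures are not swallowed.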

Yahweasel commented 11 months ago

> try catch around decoder.decode()

VideoDecoder doesn't throw exceptions, it just calls an error callback (and then stops working).

If this technique works, the practical way to use it would be to always send the "known extra-key-flavored" keyframe every time you seek and just ignore the first output frame.

marcello3d commented 11 months ago

In my experience Chrome actually throws this specific error synchronously. Which is weird given the async callback, but that's what happened in my experiments. Are you seeing differently? Maybe Chrome changed its behavior since I tried it.

reinhrst commented 11 months ago

The spec specifies that decoder.decode() should throw an error if a key frame is expected and not received (and it's my experience on Chrome that it does this). Note that it's only a SHOULD and not a MUST in the spec (i.e. if chunk.type !== "key" it MUST throw an error; otherwise it SHOULD inspect the packet to see if the decoder thinks it's a key frame).

I'm quite sure that @marcello3d's idea works (just to wait until you have something that it's willing to decode, and then re-feed the first frame). Something I was playing with myself was to have VideoEncoder (with the same config) encode 1 KeyFrame from a green HTMLCanvas and feed this as first frame (so that the system would still work even in the rare case there is no IDR frame in the whole video file).

I didn't get this to work, since I was working on an annexB file, and the first frame is expected to contain SPS/PPS info, but I would expect this to work for an AVC stream...

Yahweasel commented 11 months ago

Re throwing, OK, I'm just nuts, ignore me :) . Misremembering the various error modes.

Using an encoder to get a known-good frame is a great idea! ... assuming your system always has all the same encoders as it has decoders. I guarantee you that in 2037 when Firefox gets WebCodecs support, it'll only have decoders for everything but VP8 ;)

reinhrst commented 11 months ago

> Using an encoder to get a known-good frame is a great idea

I'm going to give this approach a try in the next couple of days (and should be easier to implement now that #2 is fixed), will report here!

As a practical matter to move this issue towards closing: Maybe just make a link "limitations" from the documentation for packetToEncodedVideoChunk to this issue, since it's not something that I think can be fixed anytime soon (unless we find out how to query ffmpeg if something is an IDR frame), at least not from this side...

Yahweasel commented 11 months ago

Just as a point of why this is so difficult: this isn't really a disagreement between FFmpeg and WebCodecs, it's a disagreement between every file format and WebCodecs. The keyframe flag in a packet from FFmpeg/libav isn't coming from a decoder, and it doesn't even need a decoder or parser for the appropriate codec to set it. It comes from the file format. The reason recovery frames are marked as keyframes to FFmpeg is because they're marked as keyframes in .mp4 and all other file formats, and the reason they're marked as keyframes there is because they are.

I assume that the real bone of contention is that WebCodecs sees itself more as a realtime multimedia framework, and in the realtime space, you certainly wouldn't want B-frames sneaking in at the beginning of decoding. But when dealing with arbitrary files, we don't have the luxury of being so picky.

Yahweasel commented 11 months ago

I've added the suggested warning to the API documentation.

marcello3d commented 11 months ago

@Yahweasel want to post that perspective on https://github.com/w3c/webcodecs/issues/650 ? The more feedback/perspectives the more likely something will happen

reinhrst commented 11 months ago

> Something I was playing with myself was to have VideoEncoder (with the same config) encode 1 KeyFrame from a green HTMLCanvas and feed this as first frame.

So this strategy works (at least in two of my example videos that start with I frames with recovery flag; one in annexb and one in avc format). It should be noted that I explicitly request the software decoder, which is supposedly more forgiving on a lot of issues (I use the software decoder since my video is interlaced and Chrome doesn't have hardware support for it, and has a bug that it doesn't auto-select the software decoder for interlaced stuff). I'm quite sure the encoded frame will not be interlaced, but still, it is accepted as the first (key) frame and afterwards the stream plays just fine.

async function createFakeKeyFrameChunk(
  decoderConfig: VideoDecoderConfig
): Promise<EncodedVideoChunk> {
  // next 6 lines could be made in one on platforms that support Promise.withResolvers()
  let resolve: (value: EncodedVideoChunk) => void
  let reject: (error: any) => void
  const promise = new Promise<EncodedVideoChunk>((res, rej) => {
    resolve = res
    reject = rej
  })
  const encoderConfig = {...decoderConfig} as VideoEncoderConfig
  // encoderConfig needs a width and height set; in my tests these dimensions
  // do not have to match the actual video dimensions, so I'm just using something
  // random for them
  // UPDATE: see below for new insights!!!!!
  encoderConfig.width = 640
  encoderConfig.height = 360
  encoderConfig.avc = {format: decoderConfig.description ? "avc" : "annexb"}
  const videoEncoder = new VideoEncoder({
    output: (chunk, _metadata) => resolve(chunk),
    error: e => reject(e)
  })
  try {
    videoEncoder.configure(encoderConfig)
    const oscanvas = new OffscreenCanvas(encoderConfig.width, encoderConfig.height)
    // getting context seems to be minimal needed before it can be used as VideoFrame source
    oscanvas.getContext("2d")
    const videoFrame = new VideoFrame(
      oscanvas, {timestamp: Number.MIN_SAFE_INTEGER})
    try {
      videoEncoder.encode(videoFrame)
      await videoEncoder.flush()
      const chunk = await promise
      return chunk
    } finally {
      videoFrame.close()
    }
  } finally {
    videoEncoder.close()
  }
}

(In case anyone is wondering: the WebCodecs software decoder is still about 10x faster at decoding the image than libav.js on my MacBook M2.)

Update: the code above says that width and height don't matter. This was true when decoding an annexB stream, but was NOT true when decoding an avc stream. So best to make sure that width and height match!
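A sketch of how the fake chunk might be wired in: feed it first after every flush or seek, then drop the one output frame it produces, recognised by its sentinel timestamp. `FrameLike`, `ChunkLike`, `ChunkDecoder`, `PRIMER_TIMESTAMP`, `dropPrimerFrames` and `feedAfterSeek` are names made up for this example; in real code the primer would come from createFakeKeyFrameChunk() and the frames from the VideoDecoder output callback. It assumes the decoder passes the chunk timestamp through to the output frame.

```typescript
// Minimal structural types so the sketch runs outside a browser.
interface FrameLike {
  timestamp: number;
  close(): void;
}
interface ChunkLike {
  timestamp: number;
}
interface ChunkDecoder<C> {
  decode(chunk: C): void;
}

// Same sentinel as the timestamp given to the fake frame above.
const PRIMER_TIMESTAMP = Number.MIN_SAFE_INTEGER;

// Wraps the real output handler: frames produced by the primer chunk are
// closed and dropped, everything else is passed through.
function dropPrimerFrames<F extends FrameLike>(
  onFrame: (frame: F) => void
): (frame: F) => void {
  return (frame: F): void => {
    if (frame.timestamp === PRIMER_TIMESTAMP) {
      frame.close();
      return;
    }
    onFrame(frame);
  };
}

// After a flush or seek, feed the primer first, then the real chunks in
// decoding order.
function feedAfterSeek<C extends ChunkLike>(
  decoder: ChunkDecoder<C>,
  primer: C,
  chunks: readonly C[]
): void {
  decoder.decode(primer);
  for (const chunk of chunks) {
    decoder.decode(chunk);
  }
}
```

Using a timestamp that no real frame can have keeps the filter independent of how many frames the decoder buffers before it emits the primer's output.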

Yahweasel commented 10 months ago

For the moment I'm going to close this ticket as E_NOTMYPROBLEM. This discrepancy exists, but there's nothing that can be done about it in libavjs-webcodecs-bridge.

reinhrst commented 10 months ago

> So this strategy works (at least in two of my example videos that start with I frames with recovery flag; one in annexb and one in avc format) [...]

I have an update. I'm not sure whether there is an H.264 wizard here who could help me, but I wanted to share it so that others running into this issue see that my solution above does not solve everything.

As background: the video I'm working with is an MTS (mpegts) file from a JVC camcorder (annexb). The results below are what I get when I remux this file to mp4 (avcc), because random seeking in MTS files seems to be broken in libav (for the last 11 years). Indeed, when calling the avformat_seek_file*() functions on the MTS file, the first packet is not a keyframe most of the time (and the pts found is not accurate).

A more in-depth description of the file I'm working with is in this StackOverflow answer; for here, the important thing is that there are IDR frames once every 300 frames, I frames every 12 frames. There are P frames every 3 frames, with 2 B frames in between: IBBPBBPBBPBBIBBPBB....

Since the stream starts with two B frames (in presentation order; unless otherwise stated, everything in here is in presentation order, first frame has framenr=0), IDR frames are found when frameNr % 300 == 2.
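To make the layout concrete, here is a small sketch that maps a frame number (presentation order) to its frame type for this particular file. The name `frameTypeAt` and the constant names are made up; the values are just the numbers from the description above and will differ for other files.

```typescript
type FrameType = "IDR" | "I" | "P" | "B";

// Layout of this specific file, in presentation order (frame 0 is the first frame):
// B B I B B P B B P B B P | B B I B B P ...
// with an IDR frame taking the place of the I frame once every 300 frames.
const IDR_INTERVAL = 300; // frames between IDR frames
const I_INTERVAL = 12; // frames between I frames
const REF_INTERVAL = 3; // frames between reference (I or P) frames
const LEADING_B_FRAMES = 2; // the stream starts with two B frames

function frameTypeAt(frameNr: number): FrameType {
  if (frameNr % IDR_INTERVAL === LEADING_B_FRAMES) return "IDR";
  if (frameNr % I_INTERVAL === LEADING_B_FRAMES) return "I";
  if (frameNr % REF_INTERVAL === LEADING_B_FRAMES) return "P";
  return "B";
}

// Example: types of the first 15 frames in presentation order.
const firstFrames: FrameType[] = [];
for (let n = 0; n < 15; n++) {
  firstFrames.push(frameTypeAt(n));
}
console.log(firstFrames.join(" "));
// prints "B B IDR B B P B B P B B P B B I"
```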

In order to enable random access in the stream, I use the trick above; after flush()ing the VideoDecoder, I feed it one fake video frame, then do an avformat_seek_file_max() with the pts of the frame I want[^1] (which seeks to the keyframe that is before (in decoding order!!) the frame with the requested pts), and start feeding packets from there to the VideoDecoder.decode() method. Then I feed enough packets until the VideoFrame that I want pops out (how many packets you have to feed exactly is a bit wishy-washy, especially if you don't want to call flush() on the decoder, but not important for this discussion).

Whether this is successful depends on what framenr % 300 you ask for (remember, there are IDR frames every 300 frames):

* Things will go well if decoding starts at an I frame that has framenr % 300 <= 216, so one of the first 18 I frames after the IDR frame
* If decoding starts at framenr % 300 >= 228 (so not one of the first 18, out of 24 inter-IDR I-frames), VideoDecoder will only return the I and P frames (not the B frames) -- it will continue to do so until the next IDR frame, after which B frames are returned again.

If this looks familiar to anyone, I would love to hear it. In the meantime I'll look for a solution or root cause (keeping in mind that the seen behaviour may very well be an implementation artifact of the codec in Chrome).

[^1]: Actually I look for "requested pts minus 2 frames", since looking for the requested pts will not work when it's one of the B-frames directly preceding (in presentation order; directly following in decoding order) the non-IDR I-frame, since it's dependent on a P frame that was before the I-frame. This is exactly the difference between an IDR and a non-IDR I frame as described in the StackOverflow answer mentioned above.
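The seek-and-feed procedure, including the "minus 2 frames" adjustment from the footnote, could be sketched like this. `seekTargetPts` and `makeFramePicker` are hypothetical names, clamping at the first pts is an assumption of this sketch, and it assumes the pts passed to the seek call and the VideoFrame timestamps use the same units.

```typescript
interface FrameLike {
  timestamp: number;
  close(): void;
}

// pts to hand to the seek call: two frames before the wanted frame, so that a
// B frame directly preceding a non-IDR I frame (in presentation order) is
// decoded starting from the earlier I frame it depends on.
// ASSUMPTION: never seek before firstPts.
function seekTargetPts(
  requestedPts: number,
  frameDuration: number,
  firstPts = 0
): number {
  return Math.max(firstPts, requestedPts - 2 * frameDuration);
}

// Output handler that keeps only the frame with the wanted timestamp and
// closes every other frame the decoder emits along the way.
function makeFramePicker<F extends FrameLike>(wantedTimestamp: number) {
  let picked: F | null = null;
  return {
    onFrame(frame: F): void {
      if (picked === null && frame.timestamp === wantedTimestamp) {
        picked = frame;
      } else {
        frame.close();
      }
    },
    get picked(): F | null {
      return picked;
    },
  };
}
```

The caller keeps feeding packets until `picked` is no longer null, which avoids having to guess up front how many packets are needed.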

Yahweasel commented 10 months ago

I have no conclusion, but I would suggest that you may not be looking for an H.264 wizard. To me, this sniffs of avformat's seeking (either the general part or the MOV-specific part) being too clever by negative half.

reinhrst commented 10 months ago

> I have no conclusion, but I would suggest that you may not be looking for an H.264 wizard. To me, this sniffs of avformat's seeking (either the general part or the MOV-specific part) being too clever by negative half.

Interesting idea. Do you mean that seeking for e.g. frame 245 does not result in starting at frame 240 (the keyframe) but something else? I guess I should be able to debug that quickly enough, by printing the pts of every packet that I receive after the seek.... Or do you mean something else?

Yahweasel commented 10 months ago

> Interesting idea. Do you mean that seeking for e.g. frame 245 does not result in starting at frame 240 (the keyframe) but something else? I guess I should be able to debug that quick enough, by printing the pts's of all packages that I receive after the seek.... Or do you mean something else?

Well, it certainly never seeks to the exact frame you ask for unless that frame happens to be a keyframe, but what I'm suggesting is that if you seek to 245, it will certainly seek to the keyframe 240, but might "intelligently" drop the B-frames that it thinks you don't need: the pts it seeked to is 240, so it'll drop anything with pts<240. Or, equivalently, it might just drop anything with a pts lower than where it ended up seeking to, which may be different from where you asked it to seek to. The fact that the frames are vanishing to me suggests the possibility of avformat shenanigans rather than avcodec or WebCodecs shenanigans. But, just guessing and grasping at straws here :)
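One way to check this guess would be to summarise the packets that come out right after the seek. The sketch below uses a simplified, hypothetical packet shape (`PacketInfo`, with only `pts` and `flags`) and a made-up function name; real libav.js packets carry more fields.

```typescript
// Hypothetical, simplified packet shape for this sketch.
interface PacketInfo {
  pts: number;
  flags: number;
}

interface SeekSummary {
  startsOnKeyframe: boolean; // is the very first packet flagged as key?
  landedPts: number | null; // pts of the first key-flagged packet, if any
  ptsBeforeLanding: number[]; // pts values lower than landedPts
}

const AV_PKT_FLAG_KEY = 1;

// Summarises the packets read after a seek, in the order they were read.
function summarizeAfterSeek(packets: readonly PacketInfo[]): SeekSummary {
  const firstKeyIndex = packets.findIndex(
    (p) => (p.flags & AV_PKT_FLAG_KEY) !== 0
  );
  if (firstKeyIndex < 0) {
    return { startsOnKeyframe: false, landedPts: null, ptsBeforeLanding: [] };
  }
  const landedPts = packets[firstKeyIndex].pts;
  const ptsBeforeLanding = packets
    .filter((p) => p.pts < landedPts)
    .map((p) => p.pts);
  return {
    startsOnKeyframe: firstKeyIndex === 0,
    landedPts,
    ptsBeforeLanding,
  };
}
```

If the B frames that follow the keyframe in decoding order normally show up in `ptsBeforeLanding` but are missing after some seeks, that would point at avformat dropping them rather than at the decoder.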

Yahweasel / libavjs-webcodecs-bridge

ffmpeg's AV_PKT_FLAG_KEY is not the same as type=key in WebCodecs #3