chemag / h265nal

Library and Tool to parse H265 NAL units

AV1 OBU Support #28

Open morgabm opened 5 months ago

morgabm commented 5 months ago

Sorry to open an issue for this, but I was wondering if you have considered using the work in h264nal and h265nal as the basis for a similar project to parse AV1 OBU structures? If so, I may have a need and would be interested in helping with such a project.

Please close this at your discretion.

chemag commented 5 months ago

Quick answer is yes.

The long answer is: what is the value? I use h265nal and h264nal on a regular basis, mostly to understand the contents of Annex B (h265 or h264) streams. I also script around the tools (e.g. see https://github.com/chemag/itools, which uses h265nal to provide information on the SPS colorimetry of HEIC images).

Now, there are other parsers that do a similar job. First, ffmpeg BSF filter does a very similar job. My main issues with ffmpeg are two:

For h265, there is also https://github.com/strukturag/libde265. It is a deeper parser than either h265nal or ffmpeg's BSF. For example, the itools project mentioned above uses a fork of libde265 (https://github.com/chemag/libde265) to analyze the QP values used on a per-CTU, per-frame basis. The main issue IMO is that the author is not interested (see https://github.com/strukturag/libde265/pull/201).

Now, what I would like is a parser that goes in both directions. Basically a way to convert a binary format (preferably defined using a structured format for binary definition, like kaitai) into a structured text format, and back.

Use cases:

In order to implement this, the idea is to use an intermediate structured format that allows (a) binary to structured-format conversion, and (b) structured-format to text conversion. For the structured format, I like protobuf, which gives you (b) for free by using protobuf-text (I also like the protobuf text format).

+----------+ Mpeg2TsParser::Parse* +---------------+ protobuf +-----------+
|bin mpegts|---------------------->|protobuf mpegts|<-------->|text mpegts|
+----------+<--------------------- +---------------+          +-----------+
            Mpeg2TsParser::Dump*

This is the Section 4 Figure in https://github.com/chemag/m2pb, where the idea was applied to mpeg2-ts streams.
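
As a rough illustration of the two arrows in that figure, here is a minimal Python sketch that round-trips the fixed 4-byte MPEG-2 TS packet header through a structured object and a text form. It is not the m2pb code: `TsHeader`, `parse_header`, `dump_header`, `to_text` and `from_text` are hypothetical names, and JSON over a dataclass stands in for protobuf and its text format only so that the example stays self-contained.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TsHeader:
    """Fixed 4-byte MPEG-2 TS packet header (stand-in for a protobuf message)."""
    sync_byte: int
    transport_error_indicator: int
    payload_unit_start_indicator: int
    transport_priority: int
    pid: int
    transport_scrambling_control: int
    adaptation_field_control: int
    continuity_counter: int


def parse_header(data: bytes) -> TsHeader:
    """Binary -> structured (the Parse* arrow)."""
    if len(data) < 4:
        raise ValueError("need at least 4 bytes")
    word = int.from_bytes(data[:4], "big")
    return TsHeader(
        sync_byte=(word >> 24) & 0xFF,
        transport_error_indicator=(word >> 23) & 0x1,
        payload_unit_start_indicator=(word >> 22) & 0x1,
        transport_priority=(word >> 21) & 0x1,
        pid=(word >> 8) & 0x1FFF,
        transport_scrambling_control=(word >> 6) & 0x3,
        adaptation_field_control=(word >> 4) & 0x3,
        continuity_counter=word & 0xF,
    )


def dump_header(h: TsHeader) -> bytes:
    """Structured -> binary (the Dump* arrow)."""
    word = (
        ((h.sync_byte & 0xFF) << 24)
        | ((h.transport_error_indicator & 0x1) << 23)
        | ((h.payload_unit_start_indicator & 0x1) << 22)
        | ((h.transport_priority & 0x1) << 21)
        | ((h.pid & 0x1FFF) << 8)
        | ((h.transport_scrambling_control & 0x3) << 6)
        | ((h.adaptation_field_control & 0x3) << 4)
        | (h.continuity_counter & 0xF)
    )
    return word.to_bytes(4, "big")


def to_text(h: TsHeader) -> str:
    """Structured -> text (protobuf-text would provide this for free)."""
    return json.dumps(asdict(h), indent=2)


def from_text(text: str) -> TsHeader:
    """Text -> structured."""
    return TsHeader(**json.loads(text))


if __name__ == "__main__":
    raw = bytes([0x47, 0x40, 0x11, 0x10])
    header = parse_header(raw)
    print(to_text(header))
    # Going through text and back to binary should reproduce the input.
    assert dump_header(from_text(to_text(header))) == raw
```

Editing a stream then amounts to changing a value in the text form before converting it back to binary.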

Now, the Parse/Dump functionality in Mpeg2TsParser is hand-written. I would prefer to use something like https://github.com/kaitai-io/kaitai_struct.git, but last time I checked, kaitai did not have (a) the functionality to implement structures based on the RBSP syntax (or whatever it is called in AV1), (b) a dumper, and (c) a way to produce protobuf structures. If we get this, adding support for a new codec would be as easy as feeding the RBSPs to a generic tool.

morgabm commented 5 months ago

I like these thoughts, and while my interests are primarily related to implementing Vulkan video extensions which mandate apps handle muxing/demuxing and parsing, I think those use cases would benefit as well from your proposed solution. Is this something that has an established effort anywhere? If not, have you considered owning such a project? Regardless, I also like protobufs for this approach and would be interested in joining efforts.

Have you looked at Hammer? I tried looking at it briefly, and it seems to implement a software-defined intermediate structure. But from the limited documentation I was unable to determine if it supported defining grammars capable of handling entropy-encoded values such as the ones present in h264/h265.

I have experience with libav and ffmpeg as well but in addition to some of the reasons you mentioned, I take issue with it due to the code base being ancient, and as a result it seems to suffer from many issues including the somewhat ideological nature of its development as well as the horrid documentation.

I think a tool built on modern technologies such as protobuf and modern C++, with an emphasis on code quality and a succinct API, would do very well in this market.

morgabm commented 5 months ago

Rereading your last paragraph again, it occurs to me that maybe some of my comments were not super clear. Basically it seems like some existing projects take a similar approach to this problem, but a generic solution may solve this issue for all of these use cases and beyond. One where a developer may bring their own grammar (regardless of the form factor of such a grammar), and within a succinct and lightweight framework provided by this solution they are able to provide any additional complexities needed by their parsing algorithm. Something like protobufs alone does not allow for the control needed to parse dynamic/entropy-encoded values to my knowledge; please correct me if you know differently.
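
To make the concern concrete, here is a minimal Python sketch of one kind of variable-length field a grammar would have to describe. It assumes that "entropy-encoded values" refers to the unsigned Exp-Golomb `ue(v)` fields found in h264/h265 parameter sets and slice headers (CABAC-coded slice data is a separate, harder problem and is not covered). `BitReader` and `encode_ue` are hypothetical helpers written for this example, not part of h265nal or Hammer.

```python
class BitReader:
    """Reads bits MSB-first from a byte string."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position, in bits

    def read_bits(self, n: int) -> int:
        value = 0
        for _ in range(n):
            if self.pos >= 8 * len(self.data):
                raise EOFError("ran out of bits")
            byte = self.data[self.pos >> 3]
            bit = (byte >> (7 - (self.pos & 7))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value

    def read_ue(self) -> int:
        """Unsigned Exp-Golomb: count leading zeros, skip the 1, read the suffix."""
        leading_zeros = 0
        while self.read_bits(1) == 0:
            leading_zeros += 1
        return (1 << leading_zeros) - 1 + self.read_bits(leading_zeros)


def encode_ue(value: int) -> str:
    """Return the ue(v) codeword for a non-negative value as a bit string."""
    if value < 0:
        raise ValueError("ue(v) encodes non-negative integers only")
    code = bin(value + 1)[2:]  # binary of value+1, without the '0b' prefix
    return "0" * (len(code) - 1) + code


if __name__ == "__main__":
    # Codewords for 0, 1, 2, 3 are "1", "010", "011", "00100".
    bits = "".join(encode_ue(v) for v in (0, 1, 2, 3))
    bits += "0" * (-len(bits) % 8)  # pad to a whole number of bytes
    data = int(bits, 2).to_bytes(len(bits) // 8, "big")
    reader = BitReader(data)
    print([reader.read_ue() for _ in range(4)])
```

The point is that the width of each field is only known while reading it, so a purely declarative fixed-width layout is not enough; the grammar needs a field type like `ue(v)` backed by code.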

chemag commented 4 months ago

> I like these thoughts, and while my interests are primarily related to implementing Vulkan video extensions which mandate apps handle muxing/demuxing and parsing, I think those use cases would benefit as well from your proposed solution. Is this something that has an established effort anywhere? If not, have you considered owning such a project? Regardless, I also like protobufs for this approach and would be interested in joining efforts.

I'm not sure exactly what you are describing here ("video extensions which mandate apps handle muxing/demuxing and parsing"), and how h265nal (or an AV1 parser) would fit in. These parsers are just binary-to-text converters. Right now we're just printing the text values we convert to. My idea is to get the reverse conversion, so as to allow editing the binary streams (Annex B).

> Have you looked at Hammer? I tried looking at it briefly, and it seems to implement a software-defined intermediate structure. But from the limited documentation I was unable to determine if it supported defining grammars capable of handling entropy-encoded values such as the ones present in h264/h265.

I took a look at it. It looks like a layer that facilitates the traditional C parsing approach by defining a series of functions so that you do not have to read byte by byte, and then do hton/ntoh[ls] conversions.

What I really want is to be able to feed in a video bitstream syntax. The MPEG formats use the acronym "RBSP", and AV1's is very similar. For example, for AV1, I'd like to start with all the syntax definitions, e.g.

obu_header() {
  obu_forbidden_bit  : f(1)
  obu_type  : f(4) 
  obu_extension_flag  : f(1) 
  obu_has_size_field  : f(1)
  obu_reserved_1bit  : f(1) 
  if (obu_extension_flag == 1) {
    obu_extension_header() 
  }
}

This should autogenerate (a) a parser that accepts a raw (Annex B) AV1 stream and produces a set of protobuf objects representing OBUs and their descendant objects, and (b) a dumper that does the opposite operation. That means that the only work for creating an AV1 parser would be to collect the whole list of syntax definitions from the standard, and then write the skeleton of a full parser/dumper.
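
A minimal Python sketch of that idea, under stated assumptions: the syntax is held as a table of (name, bit width, condition) entries, and one generic `parse` plus one generic `dump` are driven by it. `OBU_HEADER`, `parse` and `dump` are hypothetical names, a plain dict stands in for the protobuf object, and the three fields of `obu_extension_header()` (`temporal_id`, `spatial_id`, `extension_header_reserved_3bits`, as in the AV1 spec) are flattened into the same table for brevity. Only fixed-width `f(n)` fields are handled.

```python
def has_ext(fields: dict) -> bool:
    """Condition for the 'if (obu_extension_flag == 1)' branch."""
    return fields.get("obu_extension_flag") == 1


# (field name, width in bits, condition on already-known fields or None)
OBU_HEADER = [
    ("obu_forbidden_bit", 1, None),
    ("obu_type", 4, None),
    ("obu_extension_flag", 1, None),
    ("obu_has_size_field", 1, None),
    ("obu_reserved_1bit", 1, None),
    # obu_extension_header(), flattened into this table
    ("temporal_id", 3, has_ext),
    ("spatial_id", 2, has_ext),
    ("extension_header_reserved_3bits", 3, has_ext),
]


def parse(spec, data: bytes) -> dict:
    """Binary -> structured: walk the table, reading MSB-first."""
    total = int.from_bytes(data, "big")
    nbits = 8 * len(data)
    pos = 0
    fields = {}
    for name, width, cond in spec:
        if cond is not None and not cond(fields):
            continue  # field not present in this instance
        if pos + width > nbits:
            raise ValueError(f"not enough data for {name}")
        shift = nbits - pos - width
        fields[name] = (total >> shift) & ((1 << width) - 1)
        pos += width
    return fields


def dump(spec, fields: dict) -> bytes:
    """Structured -> binary: walk the same table, writing MSB-first."""
    total = 0
    nbits = 0
    for name, width, cond in spec:
        if cond is not None and not cond(fields):
            continue
        total = (total << width) | (fields[name] & ((1 << width) - 1))
        nbits += width
    pad = -nbits % 8  # zero-pad to a byte boundary
    return (total << pad).to_bytes((nbits + pad) // 8, "big")


if __name__ == "__main__":
    header = parse(OBU_HEADER, bytes([0x36, 0x30]))
    print(header)
    header["temporal_id"] = 3  # edit a value, then write it back
    print(dump(OBU_HEADER, header).hex())
```

Because the parser and the dumper share one table, adding a syntax element means adding one row rather than writing two functions.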

The closest thing I've seen for this is the kaitai syntax. Syntax is not very nice IMO, but e.g. it allows defining an ethernet header like this:

meta:
  id: ethernet_frame
  license: CC0-1.0
  ks-version: 0.7
  imports:
    - ipv4_packet
    - ipv6_packet
seq:
  - id: dst_mac
    size: 6
  - id: src_mac
    size: 6
  - id: ether_type
    type: u2be
    enum: ether_type_enum
  - id: body
    size-eos: true
    type:
      switch-on: ether_type
      cases:
        'ether_type_enum::ipv4': ipv4_packet
        'ether_type_enum::ipv6': ipv6_packet
-includes:
  - ipv4_packet.ksy
enums:
  # http://www.iana.org/assignments/ieee-802-numbers/ieee-802-numbers.xhtml
  ether_type_enum:
    0x0800: ipv4
    0x0801: x_75_internet
    0x0802: nbs_internet
    0x0803: ecma_internet
    0x0804: chaosnet
    0x0805: x_25_level_3
    0x0806: arp
    0x86dd: ipv6
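
For comparison, here is roughly what the same Ethernet header looks like when parsed by hand in Python with the standard `struct` module. This is only a sketch: `parse_ethernet` and `ETHER_TYPES` are hypothetical names, the enum is cut down to three entries, and the body is left as raw bytes instead of being dispatched to an IPv4 or IPv6 parser.

```python
import struct

# Subset of the ether_type enum from the definition above.
ETHER_TYPES = {0x0800: "ipv4", 0x0806: "arp", 0x86DD: "ipv6"}


def format_mac(mac: bytes) -> str:
    """Render 6 raw bytes as the usual colon-separated hex string."""
    return ":".join(f"{b:02x}" for b in mac)


def parse_ethernet(frame: bytes) -> dict:
    """Parse dst_mac, src_mac and ether_type; keep the rest as the body."""
    if len(frame) < 14:
        raise ValueError("an Ethernet header is 14 bytes")
    # '!' selects network (big-endian) byte order, so no explicit ntohs is needed.
    dst_mac, src_mac, ether_type = struct.unpack("!6s6sH", frame[:14])
    return {
        "dst_mac": format_mac(dst_mac),
        "src_mac": format_mac(src_mac),
        "ether_type": ETHER_TYPES.get(ether_type, hex(ether_type)),
        "body": frame[14:],
    }


if __name__ == "__main__":
    frame = bytes.fromhex("ffffffffffff" "001122334455" "0800") + b"payload"
    print(parse_ethernet(frame))
```

The declarative version carries the same information (sizes, byte order, enum, switch on `ether_type`) without the procedural code, which is the property worth keeping in a bitstream syntax language.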

I think a better syntax would allow a more generic if/then mechanism to drive the parser or set default values. In fact, I think a good solution will start with a better syntax for a language like this.

> Rereading your last paragraph again, it occurs to me that maybe some of my comments were not super clear. Basically it seems like some existing projects take a similar approach to this problem, but a generic solution may solve this issue for all of these use cases and beyond. One where a developer may bring their own grammar (regardless of the form factor of such a grammar), and within a succinct and lightweight framework provided by this solution they are able to provide any additional complexities needed by their parsing algorithm. Something like protobufs alone does not allow for the control needed to parse dynamic/entropy-encoded values to my knowledge; please correct me if you know differently.

The parsing process needs to produce something that can be operated upon. My idea of "operation" includes getting text-based versions of that something (so we get the binary-to-text conversion feature), editing that something (changing values, removing items, etc.), and writing back to binary.

The usual processing in my case does not have strict performance requirements: I typically use small Annex B streams, so I don't mind paying the overhead that protobufs impose.