nvdec hardware accelerated decoder

totaam commented 1 year ago

Split from #202. There are other frameworks for hardware acceleration, but nvidia's offering is generally the most stable option.

https://docs.nvidia.com/video-technologies/video-codec-sdk/nvdec-video-decoder-api-prog-guide/

totaam commented 1 year ago

Now working for jpeg decoding without OpenGL. That is not very useful on its own, since we already have nvjpeg (#3504), which does the same job better in every way: it is more lightweight, decodes jpega, and supports OpenGL acceleration (the decoded image stays on the GPU). But this is a start; hopefully I can figure out how to enable h264, hevc, vp8, vp9, etc. Still TODO:

fix no device errors when used as a video decoder for jpeg - allocate the CUDA device once in the decode thread?
OpenGL - CUDA buffer sharing (very similar to nvjpeg)
add scaling to libyuv's NV12 to RGB
jpega
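
One way the "allocate the CUDA device once in the decode thread" idea could look is a per-thread cache. This is only a sketch: `get_thread_device_context` and the `create_context` callable are hypothetical names, and the real CUDA context creation is left to the caller.

```python
import threading

# Per-thread storage: each decode thread sees its own "context" attribute.
_thread_state = threading.local()


def get_thread_device_context(create_context):
    """Return the calling thread's context, creating it on first use only.

    `create_context` is a hypothetical zero-argument callable standing in
    for whatever actually selects the device and creates the CUDA context.
    """
    ctx = getattr(_thread_state, "context", None)
    if ctx is None:
        ctx = create_context()
        _thread_state.context = ctx
    return ctx
```

Every decode call in the same thread then reuses the same context, while a different thread gets its own.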

totaam commented 1 year ago

It paints, but it needs a new shader: the current one only paints the Y plane for now (better than a corrupted screen or, worse, crashes). Perhaps something like https://git.linuxtv.org/libcamera.git/tree/src/qcam/assets/shader/NV_2_planes_UV_f.glsl?id=9db6ce0ba499eba53db236558d783a4ff7aa3896, which is explained in "Accelerating libcamera Qcam format conversion using OpenGL shaders":

/* SPDX-License-Identifier: LGPL-2.1-or-later */
/*
 * Copyright (C) 2020, Linaro
 *
 * NV_2_planes_UV_f.glsl - Fragment shader code for NV12, NV16 and NV24 formats
 */

#ifdef GL_ES
precision mediump float;
#endif

varying vec2 textureOut;
uniform sampler2D tex_y;
uniform sampler2D tex_u;

void main(void)
{
    vec3 yuv;
    vec3 rgb;
    mat3 yuv2rgb_bt601_mat = mat3(
        vec3(1.164,  1.164, 1.164),
        vec3(0.000, -0.392, 2.017),
        vec3(1.596, -0.813, 0.000)
    );

    yuv.x = texture2D(tex_y, textureOut).r - 0.063;
    yuv.y = texture2D(tex_u, textureOut).r - 0.500;
    yuv.z = texture2D(tex_u, textureOut).g - 0.500;

    rgb = yuv2rgb_bt601_mat * yuv;
    gl_FragColor = vec4(rgb, 1.0);
}
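
To check a shader's output against known values, the same arithmetic can be reproduced on the CPU for a single pixel. The sketch below mirrors the matrix above (GLSL `mat3` constructors take columns, so each `vec3` is the contribution of Y, U and V respectively). The function name and the 8-bit rounding are my own choices for illustration.

```python
def nv12_pixel_to_rgb(y, u, v):
    """Convert one 8-bit Y, U, V sample triple to 8-bit RGB.

    Uses the same BT.601 limited-range coefficients and offsets as the
    GLSL fragment shader above.
    """
    # Textures are sampled as normalised floats in the shader, hence / 255.
    yn = y / 255.0 - 0.063
    un = u / 255.0 - 0.500
    vn = v / 255.0 - 0.500

    # rgb = column0 * Y + column1 * U + column2 * V
    r = 1.164 * yn + 1.596 * vn
    g = 1.164 * yn - 0.392 * un - 0.813 * vn
    b = 1.164 * yn + 2.017 * un

    def to_byte(x):
        # The framebuffer clamps to [0, 1]; do the same before scaling.
        return max(0, min(255, round(x * 255)))

    return to_byte(r), to_byte(g), to_byte(b)
```

Limited-range black (16, 128, 128) and white (235, 128, 128) should come out close to 0 and 255 on every channel, which makes for a quick sanity check of the shader.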

For now I have used this Y to RGBA shader instead, completely ignoring the UV combined plane:

struct pixel_in {
    float2 texcoord1 : TEXCOORD0;
    float2 texcoord2 : TEXCOORD1;
    uniform samplerRECT texture1 : TEXUNIT0;
    uniform samplerRECT texture2 : TEXUNIT1;
    float4 color : COLOR0;
};

struct pixel_out {
    float4 color : COLOR0;
};

pixel_out
main (pixel_in IN)
{
    pixel_out OUT;

    float3 pre;

    pre.r = texRECT (IN.texture1, IN.texcoord1).x - (16.0 / 256.0);
    pre.g = texRECT (IN.texture2, IN.texcoord2).x - (128.0 / 256.0);
    pre.b = texRECT (IN.texture2, IN.texcoord2).y - (128.0 / 256.0);

    const float3 red   = float3 (1.0/219.0, 0.0, 1.371/219.0) * 255.0;
    const float3 green = float3 (1.0/219.0, -0.336/219.0, -0.698/219.0) * 255.0;
    const float3 blue  = float3 (1.0/219.0, 1.732/219.0, 0.0) * 255.0;

    OUT.color.r = pre.r;
    OUT.color.g = pre.r;
    OUT.color.b = pre.r;
    OUT.color.a = IN.color.a;

    return OUT;
}

More NV12 shader examples:

https://gist.github.com/tsangz189/18fa42ed46a3cad367241c239b7b3adb (uses GL_RG / GL_UNSIGNED_BYTE for UV data TexImage2D)
https://gist.github.com/crearo/0d50442145b63c6c288d1c1675909990 (uses GL_LUMINANCE_ALPHA / GL_UNSIGNED_BYTE for UV data TexImage2D)
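
Both examples upload the UV data as a two-channel texture at half resolution, which follows from how NV12 is laid out: a full-resolution Y plane followed by one plane of interleaved U, V pairs, subsampled 2x2. A sketch of that layout in Python, assuming even dimensions and a single contiguous buffer where both planes share the same stride (real decoder surfaces may add pitch or height padding, which this ignores):

```python
def split_nv12(buf, width, height, stride=None):
    """Split a contiguous NV12 buffer into (y_plane, uv_plane).

    Assumes even width/height and that both planes use the same row
    stride, which defaults to the width (tightly packed).
    """
    stride = stride or width
    y_size = stride * height
    uv_size = stride * (height // 2)  # half the rows, U and V interleaved
    if len(buf) < y_size + uv_size:
        raise ValueError("buffer too small for NV12 %dx%d" % (width, height))
    return buf[:y_size], buf[y_size:y_size + uv_size]


def sample_uv(uv_plane, stride, x, y):
    """Return the (U, V) pair that covers luma pixel (x, y)."""
    offset = (y // 2) * stride + (x // 2) * 2
    return uv_plane[offset], uv_plane[offset + 1]
```

The second plane is what ends up in the `.r` / `.g` (GL_RG) or `.r` / `.a` (GL_LUMINANCE_ALPHA) channels of the UV texture.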

For jpeg mode, do we need to keep the decoder context? (initialization looks costly)

totaam commented 1 year ago

Trying to get nvdec to work via gstreamer.

It finds the same valid encodings and chroma formats:

GST_DEBUG=nvdec*:4 gst-inspect-1.0 nvh264dec 
0:00:00.902346129 655952 0x55f02fa992d0 INFO               nvdecoder gstnvdecoder.c:1171:gst_nv_decoder_check_device_caps: mpegvideo bit-depth 8 with chroma format 0 [48 - 4080] x [16 - 4080]
0:00:00.908428330 655952 0x55f02fa992d0 INFO               nvdecoder gstnvdecoder.c:1171:gst_nv_decoder_check_device_caps: mpeg2video bit-depth 8 with chroma format 0 [48 - 4080] x [16 - 4080]
0:00:00.914007625 655952 0x55f02fa992d0 INFO               nvdecoder gstnvdecoder.c:1171:gst_nv_decoder_check_device_caps: mpeg4video bit-depth 8 with chroma format 0 [48 - 2032] x [16 - 2032]
0:00:00.920519713 655952 0x55f02fa992d0 INFO               nvdecoder gstnvdecoder.c:1094:gst_nv_decoder_check_device_caps: No codec map corresponding to codec 3
0:00:00.921501455 655952 0x55f02fa992d0 INFO               nvdecoder gstnvdecoder.c:1171:gst_nv_decoder_check_device_caps: h264 bit-depth 8 with chroma format 0 [48 - 4096] x [16 - 4096]
0:00:00.927115589 655952 0x55f02fa992d0 INFO               nvdecoder gstnvdecoder.c:1171:gst_nv_decoder_check_device_caps: jpeg bit-depth 8 with chroma format 0 [64 - 32768] x [64 - 16384]
0:00:00.929860858 655952 0x55f02fa992d0 INFO               nvdecoder gstnvdecoder.c:1171:gst_nv_decoder_check_device_caps: jpeg bit-depth 8 with chroma format 1 [64 - 32768] x [64 - 16384]
0:00:00.932570852 655952 0x55f02fa992d0 INFO               nvdecoder gstnvdecoder.c:1094:gst_nv_decoder_check_device_caps: No codec map corresponding to codec 6
0:00:00.932581141 655952 0x55f02fa992d0 INFO               nvdecoder gstnvdecoder.c:1094:gst_nv_decoder_check_device_caps: No codec map corresponding to codec 7
0:00:00.933641584 655952 0x55f02fa992d0 INFO               nvdecoder gstnvdecoder.c:1171:gst_nv_decoder_check_device_caps: h265 bit-depth 8 with chroma format 0 [144 - 8192] x [144 - 8192]
0:00:00.934768056 655952 0x55f02fa992d0 INFO               nvdecoder gstnvdecoder.c:1171:gst_nv_decoder_check_device_caps: h265 bit-depth 10 with chroma format 0 [144 - 8192] x [144 - 8192]
0:00:00.935946641 655952 0x55f02fa992d0 INFO               nvdecoder gstnvdecoder.c:1171:gst_nv_decoder_check_device_caps: h265 bit-depth 12 with chroma format 0 [144 - 8192] x [144 - 8192]
0:00:00.939812713 655952 0x55f02fa992d0 INFO               nvdecoder gstnvdecoder.c:1171:gst_nv_decoder_check_device_caps: vp8 bit-depth 8 with chroma format 0 [48 - 4096] x [16 - 4096]
0:00:00.945379570 655952 0x55f02fa992d0 INFO               nvdecoder gstnvdecoder.c:1171:gst_nv_decoder_check_device_caps: vp9 bit-depth 8 with chroma format 0 [128 - 8192] x [128 - 8192]
Factory Details:
  Rank                     primary (256)
  Long-name                NVDEC h264 Video Decoder
  Klass                    Codec/Decoder/Video/Hardware
  Description              NVDEC video decoder
  Author                   Ericsson AB, http://www.ericsson.com, Seungha Yang <seungha.yang@navercorp.com>

Plugin Details:
  Name                     nvcodec
  Description              GStreamer NVCODEC plugin
  Filename                 /usr/lib64/gstreamer-1.0/libgstnvcodec.so
  Version                  1.20.5
  License                  LGPL
  Source module            gst-plugins-bad
  Source release date      2022-12-19
  Binary package           Fedora GStreamer-plugins-bad package
  Origin URL               http://download.fedoraproject.org

GObject
 +----GInitiallyUnowned
       +----GstObject
             +----GstElement
                   +----GstVideoDecoder
                         +----GstNvDec
                               +----nvh264dec

Pad Templates:
  SINK template: 'sink'
    Availability: Always
    Capabilities:
      video/x-h264
          stream-format: byte-stream
              alignment: au
                profile: { (string)constrained-baseline, (string)baseline, (string)main, (string)high, (string)constrained-high, (string)progressive-high }
                  width: [ 48, 4096 ]
                 height: [ 16, 4096 ]

  SRC template: 'src'
    Availability: Always
    Capabilities:
      video/x-raw
                  width: [ 48, 4096 ]
                 height: [ 16, 4096 ]
              framerate: [ 0/1, 2147483647/1 ]
                 format: { (string)NV12 }
      video/x-raw(memory:GLMemory)
                  width: [ 48, 4096 ]
                 height: [ 16, 4096 ]
              framerate: [ 0/1, 2147483647/1 ]
                 format: { (string)NV12 }
      video/x-raw(memory:CUDAMemory)
                  width: [ 48, 4096 ]
                 height: [ 16, 4096 ]
              framerate: [ 0/1, 2147483647/1 ]
                 format: { (string)NV12 }

Element has no clocking capabilities.
Element has no URI handling capabilities.

Pads:
  SINK: 'sink'
    Pad Template: 'sink'
  SRC: 'src'
    Pad Template: 'src'

Element Properties:
  automatic-request-sync-point-flags: Flags to use when automatically requesting sync points
                        flags: readable, writable
                        Flags "GstVideoDecoderRequestSyncPointFlags" Default: 0x00000003, "corrupt-output+discard-input"
                           (0x00000001): discard-input    - GST_VIDEO_DECODER_REQUEST_SYNC_POINT_DISCARD_INPUT
                           (0x00000002): corrupt-output   - GST_VIDEO_DECODER_REQUEST_SYNC_POINT_CORRUPT_OUTPUT
  automatic-request-sync-points: Automatically request sync points when it would be useful
                        flags: readable, writable
                        Boolean. Default: false
  discard-corrupted-frames: Discard frames marked as corrupted instead of outputting them
                        flags: readable, writable
                        Boolean. Default: false
  max-display-delay   : Improves pipelining of decode with display, 0 means no delay (auto = -1)
                        flags: readable, writable
                        Integer. Range: -1 - 2147483647 Default: -1 
  max-errors          : Max consecutive decoder errors before returning flow error
                        flags: readable, writable
                        Integer. Range: -1 - 2147483647 Default: 10 
  min-force-key-unit-interval: Minimum interval between force-keyunit requests in nanoseconds
                        flags: readable, writable
                        Unsigned Integer64. Range: 0 - 18446744073709551615 Default: 0 
  name                : The name of the object
                        flags: readable, writable, 0x2000
                        String. Default: "nvh264dec0"
  parent              : The parent of the object
                        flags: readable, writable, 0x2000
                        Object of type "GstObject"
  qos                 : Handle Quality-of-Service events from downstream
                        flags: readable, writable
                        Boolean. Default: true
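
To compare these limits with what our own codec reports, the `gst_nv_decoder_check_device_caps` lines can be scraped into tuples. A small sketch; the `DecoderCap` type and `parse_caps` helper are made up for this purpose, and the chroma format number is kept as logged without interpreting it:

```python
import re
from typing import NamedTuple


class DecoderCap(NamedTuple):
    codec: str
    bit_depth: int
    chroma_format: int
    min_width: int
    max_width: int
    min_height: int
    max_height: int


# Matches e.g. "h264 bit-depth 8 with chroma format 0 [48 - 4096] x [16 - 4096]"
CAPS_RE = re.compile(
    r"(\w+) bit-depth (\d+) with chroma format (\d+) "
    r"\[(\d+) - (\d+)\] x \[(\d+) - (\d+)\]"
)


def parse_caps(log_text):
    """Extract one DecoderCap per capability line in the debug log."""
    caps = []
    for match in CAPS_RE.finditer(log_text):
        codec = match.group(1)
        numbers = [int(g) for g in match.groups()[1:]]
        caps.append(DecoderCap(codec, *numbers))
    return caps
```

Lines such as "No codec map corresponding to codec 3" simply do not match and are skipped.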

And h264 works:

gst-launch-1.0 videotestsrc ! x264enc ! queue ! nvh264dec ! videoconvert ! autovideosink

Building the nvdec element from source for debugging is easy:

tar -Jxvf ~/Downloads/gst-plugins-bad-1.20.5.tar.xz
cd gst-plugins-bad-1.20.5/
mkdir build
cd build/
meson ..
ninja
sudo ninja install

rm -fr ~/.cache/gstreamer-1.0
GST_PLUGIN_PATH=/usr/local/lib64/gstreamer-1.0/ GST_DEBUG=nvdec*:5 gst-inspect-1.0 nvh264dec

I'm hoping to dump the decoder's data structures. Using the sample data from https://github.com/Xpra-org/xpra/blob/5be5799b7e1b1cc5a5f982e6141b948582a6d9a0/xpra/codecs/codec_checks.py#L23-L37 and saving each frame to a different file, I can replay the stream using:

gst-launch-1.0 multifilesrc location="%02d.h264" index=0 \
    caps="video/x-h264,stream-format=byte-stream,alignment=nal,width=128,height=128,framerate=1/1" \
   ! avdec_h264 ! videorate ! autovideosink
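
Saving the frames so that `multifilesrc` can find them only takes a few lines. A sketch, where `save_frames` is a hypothetical helper and `frames` is assumed to be a list of `bytes` objects, one per compressed frame (how they are pulled out of codec_checks.py is not shown):

```python
import os


def save_frames(frames, directory, pattern="%02d.h264"):
    """Write each frame to its own file, numbered from 0.

    The default pattern matches location="%02d.h264" with index=0 in
    the multifilesrc pipeline above.
    """
    paths = []
    for index, data in enumerate(frames):
        path = os.path.join(directory, pattern % index)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
    return paths
```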

But not with nvh264dec or openh264dec:

WARNING: erroneous pipeline: could not link multifilesrc0 to nvh264dec0

openh264dec can work by adding an h264parse element before it.

This one works with all decoders:

gst-launch-1.0 videotestsrc pattern=white ! x264enc ! nvh264dec ! videoconvert ! autovideosink

But only thanks to caps negotiation. Saving the frames using:

gst-launch-1.0 videotestsrc pattern=white \
    ! video/x-raw,width=320,height=240 \
    ! x264enc \
    ! multifilesink location="frame%d.h264"

Then trying to replay them with multifilesrc as above hits the same issues again.

What makes it work is to specify format="NV12" (or format="I420"); otherwise the encoder produces a high-4:4:4 10-bit stream that nvh264dec rejects:

0:00:00.231785272 714276 0x55563ad7e120 WARN                GST_CAPS gstpad.c:5757:pre_eventfunc_check:<nvh264dec0:sink> caps video/x-h264, pixel-aspect-ratio=(fraction)1/1, width=(int)320, height=(int)240,
    framerate=(fraction)30/1, chroma-format=(string)4:4:4, bit-depth-luma=(uint)10,
    bit-depth-chroma=(uint)10, colorimetry=(string)bt601, parsed=(boolean)true,
    stream-format=(string)byte-stream, alignment=(string)au, profile=(string)high-4:4:4, level=(string)1.3 not accepted
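
For scripting this, the frame-saving pipeline can be assembled with the pixel format pinned. This is a sketch that assumes the `format` field goes on the `video/x-raw` caps filter ahead of `x264enc`, which is not spelled out above; `frame_dump_pipeline` is a hypothetical helper.

```python
def frame_dump_pipeline(width, height, pixel_format="NV12",
                        location="frame%d.h264"):
    """Build the gst-launch-1.0 argument list for dumping encoded frames.

    Pinning the raw format to NV12 or I420 keeps x264enc on an 8-bit
    4:2:0 profile, which is within the sink caps listed for nvh264dec.
    """
    raw_caps = "video/x-raw,format=%s,width=%d,height=%d" % (
        pixel_format, width, height)
    return [
        "gst-launch-1.0",
        "videotestsrc", "pattern=white",
        "!", raw_caps,
        "!", "x264enc",
        "!", "multifilesink", "location=%s" % location,
    ]
```

The resulting list can be handed directly to `subprocess.run()`, which avoids any shell quoting issues with the `!` separators.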

totaam commented 1 year ago

Correction, this works for openh264 without h264parse (switching from alignment=nal to alignment=au):

gst-launch-1.0 multifilesrc location="frame%d.h264" index=0 \
    caps="video/x-h264,stream-format=byte-stream,alignment=au,width=320,height=240,framerate=1/1" \
    ! openh264dec ! videorate ! autovideosink
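
The difference between the two alignments is what each buffer holds: with `alignment=nal` a buffer is a single NAL unit, with `alignment=au` it is a whole access unit (every NAL unit of one frame). To see what a saved file actually contains, the byte-stream can be split on its start codes. A minimal sketch (`split_annexb` and `nal_unit_type` are my own helper names; emulation prevention bytes are left untouched):

```python
START_CODE = b"\x00\x00\x01"


def split_annexb(data):
    """Split an H.264 Annex B byte-stream into NAL units (no start codes)."""
    nals = []
    pos = data.find(START_CODE)
    while pos != -1:
        start = pos + len(START_CODE)
        nxt = data.find(START_CODE, start)
        end = len(data) if nxt == -1 else nxt
        # A NAL unit never ends in 0x00, so trailing zeros are either
        # padding or the leading zero of a 4-byte start code.
        nal = data[start:end].rstrip(b"\x00")
        if nal:
            nals.append(nal)
        pos = nxt
    return nals


def nal_unit_type(nal):
    """Low 5 bits of the NAL header: 1=slice, 5=IDR, 7=SPS, 8=PPS."""
    return nal[0] & 0x1F
```

A file that yields several NAL units (for example SPS, PPS and an IDR slice) is access-unit aligned and will not be accepted as `alignment=nal` without a parser in front of the decoder.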

And nvdec loads from file:

vp8:

gst-launch-1.0 multifilesrc location="frame%d.vp8" index=0 \
caps="video/x-vp8,stream-format=byte-stream,alignment=au,width=320,height=240" \
! nvvp8dec ! videoconvert ! autovideosink

h264:
```
gst-launch-1.0 multifilesrc location="frame%d.h264" index=0 \
caps="video/x-h264,stream-format=byte-stream,alignment=au,width=320,height=240" \
! nvh264dec ! videoconvert ! autovideosink
```
Though that's only half the problem. The bigger issue is that for h264 we must use a parser to populate the decoder's data structures (things like h264.num_ref_idx_l1_active_minus1!). Not sure why pfnDecodePicture (aka parser_decode_callback) is called in the gstreamer decoder and not in our codec. That's the callback that provides the CUVIDPICPARAMS.
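
For a sense of what such a parser has to do: fields like num_ref_idx_l1_active_minus1 are stored in the bitstream as unsigned Exp-Golomb codes, ue(v), so they can only be recovered by walking the headers bit by bit. A sketch of just that primitive, not a full slice header parser; the `BitReader` class is illustrative and expects the payload with emulation prevention bytes (00 00 03) already removed:

```python
class BitReader:
    """Read bits MSB-first from a bytes object."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # current position, in bits

    def read_bit(self):
        byte = self.data[self.pos >> 3]
        bit = (byte >> (7 - (self.pos & 7))) & 1
        self.pos += 1
        return bit

    def read_bits(self, count):
        value = 0
        for _ in range(count):
            value = (value << 1) | self.read_bit()
        return value

    def read_ue(self):
        """Unsigned Exp-Golomb: N zero bits, a 1 bit, then N more bits."""
        zeros = 0
        while self.read_bit() == 0:
            zeros += 1
        return (1 << zeros) - 1 + self.read_bits(zeros)

    def read_se(self):
        """Signed Exp-Golomb, mapped from ue(v): 0, 1, -1, 2, -2, ..."""
        k = self.read_ue()
        return (k + 1) // 2 if k & 1 else -(k // 2)
```

Because every field has a variable length, nothing can be read at a fixed offset, which is why the decoder relies on a parser callback to fill in the picture parameters.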

nvdec hardware accelerated decoder #3703