any1 / wayvnc

A VNC server for wlroots based Wayland compositors
ISC License

WayVNC crashes if GPU H.264 encoding initialization fails (and doesn't provide debug logs). #327

Open biergaizi opened 2 weeks ago

biergaizi commented 2 weeks ago

Overview

I'm trying to enable hardware H.264 encoding on a PC with a discrete AMD GPU, using AMDGPU and Mesa. Unfortunately it keeps failing in mysterious ways, and even the debug log via --log-level=debug doesn't produce any helpful information (and is sometimes misleading to untrained eyes). I decided to review the source code and identified three problems. The first two are trivial to fix. I initially suspected that the third meant WayVNC's H.264 encoding was only compatible with DRM PRIME but not VAAPI, likely making it incompatible with most (if not all) discrete GPUs. It turned out to be a user error, which can be fixed by enabling both VAAPI and libdrm in FFmpeg at the same time.

Problem 1: Crash in have_working_h264_encoder()

Symptom

When wayvnc is started via wayvnc --gpu, connecting a TigerVNC client with H.264 enabled immediately crashes WayVNC.

Info: Capturing output HEADLESS-1
Info: >> Headless output 1 1280x720+0x0 Power:UNKNOWN
DEBUG: ../src/ctl-server.c: 809: Initializing wayvncctl socket: /run/user/1000/wayvncctl
DEBUG: ../subprojects/neatvnc/src/server.c: 1819: Trying address: 10.3.0.2
DEBUG: ../subprojects/neatvnc/src/server.c: 1839: Successfully bound to address
Info: Listening for connections on 10.3.0.2:5900
Info: New client connection from 10.3.0.1: 0x5607288511b0 (ref 1)
DEBUG: ../subprojects/neatvnc/src/server.c: 327: Client chose security type: 1
Info: Starting screen capture
DEBUG: ../src/main.c: 1009: Acquired power state management. Waiting for power event to start capturing
DEBUG: ../src/main.c: 1362: Client connected, new client count: 1
DEBUG: ../src/ctl-server.c: 941: Enqueueing client-connected event: {"id":"1","address":"10.3.0.1","username":null,"seat":"seat0","connection_count":1}
DEBUG: ../src/ctl-server.c: 968: Enqueued client-connected event for 0 clients
DEBUG: ../subprojects/neatvnc/src/server.c: 584: Client 0x5607288511b0 set encodings: cursor,desktop-size,extended-desktop-size,qemu-led-state,vmware-led-state,extended-clipboard,qemu-extended-key-event,open-h264,copyrect,zrle,tight,hextile,rre,copyrect,raw
DEBUG: ../subprojects/neatvnc/src/server.c: 2365: Keyboard LED state changed: ffffffff -> 0
Info: Choosing zrle encoding for client 0x5607288511b0
DEBUG: ../src/main.c: 1018: screencopy_start_immediate
./wayland-vnc.sh: line 10: 1212123 Segmentation fault      (core dumped) /home/user/wayvnc-project/wayvnc/build/wayvnc --gpu --keyboard=us-dvp --log-level=debug 10.3.0.2

The ZRLE log message printed before WayVNC crashed is misleading; don't let it fool you. The gdb trace showed that the actual crash occurred in the H.264 codepath, before it printed any debug messages to the console.

Thread 1 "wayvnc" received signal SIGSEGV, Segmentation fault.
0x00007ffff7fa4539 in h264_encoder_destroy (self=0x0) at ../subprojects/neatvnc/src/enc/h264/encoder.c:52
52      self->impl->destroy(self);
(gdb) bt
#0  0x00007ffff7fa4539 in h264_encoder_destroy (self=0x0) at ../subprojects/neatvnc/src/enc/h264/encoder.c:52
#1  0x00007ffff7f85311 in have_working_h264_encoder () at ../subprojects/neatvnc/src/server.c:128
#2  0x00007ffff7f8b039 in choose_frame_encoding (client=0x5555557e4d00, fb=0x555555804d60) at ../subprojects/neatvnc/src/server.c:2057
#3  0x00007ffff7f86fdb in ensure_encoder (client=0x5555557e4d00, fb=0x555555804d60) at ../subprojects/neatvnc/src/server.c:672
#4  0x00007ffff7f8751c in process_fb_update_requests (client=0x5555557e4d00) at ../subprojects/neatvnc/src/server.c:778
#5  0x00007ffff7f87918 in on_client_fb_update_request (client=0x5555557e4d00) at ../subprojects/neatvnc/src/server.c:862
#6  0x00007ffff7f89943 in on_client_message (client=0x5555557e4d00) at ../subprojects/neatvnc/src/server.c:1579
#7  0x00007ffff7f89ae3 in try_read_client_message (client=0x5555557e4d00) at ../subprojects/neatvnc/src/server.c:1627
#8  0x00007ffff7f89d38 in on_client_event (stream=0x5555557e5fd0, event=STREAM_EVENT_READ) at ../subprojects/neatvnc/src/server.c:1673
#9  0x00007ffff7f92298 in stream_tcp__on_readable (self=0x5555557e5fd0) at ../subprojects/neatvnc/src/stream/tcp.c:156
#10 0x00007ffff7f92311 in stream_tcp__on_event (obj=0x5555557e6fe0) at ../subprojects/neatvnc/src/stream/tcp.c:183
#11 0x00007ffff7fbbf04 in aml__handle_event (self=0x5555556021f0, obj=0x5555557e6fe0) at ../subprojects/aml/src/aml.c:801
#12 0x00007ffff7fbc197 in aml_dispatch (self=0x5555556021f0) at ../subprojects/aml/src/aml.c:853
#13 0x0000555555563bfa in main (argc=5, argv=0x7fffffffddc8) at ../src/main.c:2000

Analysis

The root cause of the crash is a variant of the double free bug. When H.264 is requested, NeatVNC attempts to initialize an H.264 hardware device via have_working_h264_encoder() in src/server.c:

static bool have_working_h264_encoder(void)
{
        struct h264_encoder *encoder = h264_encoder_create(1920, 1080,
                        DRM_FORMAT_XRGB8888, 5);
        cached_result = encoder ? 1 : -1;
        h264_encoder_destroy(encoder);

        nvnc_log(NVNC_LOG_DEBUG, "H.264 encoding is %s",
                        cached_result == 1 ? "available" : "unavailable");
}

Internally, h264_encoder_create() calls the FFmpeg backend, h264_encoder_ffmpeg_create() in src/enc/h264/ffmpeg-impl.c. Due to a failure in av_hwdevice_ctx_create() in FFmpeg, the function frees its resources and returns NULL.

static struct h264_encoder* h264_encoder_ffmpeg_create(uint32_t width,
                uint32_t height, uint32_t format, int quality)
{
        rc = av_hwdevice_ctx_create(&self->hw_device_ctx,
                        AV_HWDEVICE_TYPE_DRM, render_node, NULL, 0);
        if (rc != 0)
                goto hwdevice_ctx_failure;

hwdevice_ctx_failure:
render_node_failure:
        aml_unref(self->work);
worker_failure:
        vec_destroy(&self->current_packet);
packet_failure:
        free(self);
        return NULL;
}

After control returns to have_working_h264_encoder(), it records that the H.264 device creation has failed, then calls h264_encoder_destroy(encoder) to free resources and prints a debug log to the console.

        h264_encoder_destroy(encoder);

        nvnc_log(NVNC_LOG_DEBUG, "H.264 encoding is %s",
                        cached_result == 1 ? "available" : "unavailable");

Unfortunately, h264_encoder_destroy(encoder) calls the backend h264_encoder_ffmpeg_destroy(), which then attempts to free resources that have already been freed (or were never successfully allocated to begin with) after the initial error occurred, crashing the program. Furthermore, because the crash occurred before the H.264 debug log, the failure is never printed to the console.

This is likely the root cause of #258 (and perhaps #318 and #319?).

static void h264_encoder_ffmpeg_destroy(struct h264_encoder* base)
{
        vec_destroy(&self->current_packet);
        av_buffer_unref(&self->hw_frames_ctx);
        avcodec_free_context(&self->codec_ctx);
        av_buffer_unref(&self->hw_device_ctx);
        avfilter_graph_free(&self->filter_graph);
        aml_unref(self->work);
        free(self);
}

Patch

Just delete h264_encoder_destroy(encoder); to avoid the double free.

diff --git a/src/server.c b/src/server.c
index 2d894a4..ac6ae01 100644
--- a/src/server.c
+++ b/src/server.c
@@ -125,7 +125,6 @@ static bool have_working_h264_encoder(void)
        struct h264_encoder *encoder = h264_encoder_create(1920, 1080,
                        DRM_FORMAT_XRGB8888, 5);
        cached_result = encoder ? 1 : -1;
-       h264_encoder_destroy(encoder);

        nvnc_log(NVNC_LOG_DEBUG, "H.264 encoding is %s",
                        cached_result == 1 ? "available" : "unavailable");

Problem 2: Debug Log on H.264 Encoding is Inadequate or Nonexistent

Symptom

After the initial bug has been patched, continued testing showed H.264 encoding is still unavailable.

DEBUG: ../src/main.c: 1362: Client connected, new client count: 1
DEBUG: ../src/ctl-server.c: 941: Enqueueing client-connected event: {"id":"1","address":"10.3.0.1","username":null,"seat":"seat0","connection_count":1}
DEBUG: ../src/ctl-server.c: 968: Enqueued client-connected event for 0 clients
DEBUG: ../subprojects/neatvnc/src/server.c: 583: Client 0x55a22a1f33b0 set encodings: cursor,desktop-size,extended-desktop-size,qemu-led-state,vmware-led-state,extended-clipboard,qemu-extended-key-event,open-h264,copyrect,zrle,tight,hextile,rre,copyrect,raw
DEBUG: ../subprojects/neatvnc/src/server.c: 2364: Keyboard LED state changed: ffffffff -> 0
Info: Choosing zrle encoding for client 0x55a22a1f33b0
DEBUG: ../src/main.c: 1018: screencopy_start_immediate
DEBUG: ../subprojects/neatvnc/src/server.c: 129: H.264 encoding is unavailable
DEBUG: ../subprojects/neatvnc/src/server.c: 2364: Keyboard LED state changed: 0 -> 2
Info: Client 0x55a22a1f33b0 (1) hung up
Info: Closing client connection 0x55a22a1f33b0: ref 0
DEBUG: ../src/main.c: 1309: Client disconnected, new client count: 0

Analysis

When failures occur in NeatVNC's H.264 codepath, it returns immediately without logging any debug information about where and how it failed. Without any error code or message, it's impossible for an end user to troubleshoot hardware encoding without tracing the source code in a debugger.

The source code of h264_encoder_ffmpeg_create() shows that there are at least six possible ways in which H.264 initialization can fail.

        if (find_render_node(render_node, sizeof(render_node)) < 0)
                goto render_node_failure;

        rc = av_hwdevice_ctx_create(&self->hw_device_ctx,
                        AV_HWDEVICE_TYPE_DRM, render_node, NULL, 0);
        if (rc != 0)
                goto hwdevice_ctx_failure;

        const AVCodec* codec = avcodec_find_encoder_by_name("h264_vaapi");
        if (!codec)
                goto codec_failure;

        if (h264_encoder__init_hw_frames_context(self) < 0)
                goto hw_frames_context_failure;

        if (h264_encoder__init_filters(self) < 0)
                goto filter_failure;

        if (h264_encoder__init_codec_context(self, codec, quality) < 0)
                goto codec_context_failure;

But none of the failures produces any log, not even an NVNC_LOG_DEBUG log. Ideally, all failure codepaths should have debug logs. If the error occurs in FFmpeg, one should also call av_err2str(err) or av_strerror() to produce human-readable error messages.
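As a rough illustration of what "a log on every failure path" could look like, here is a self-contained sketch of the goto-cleanup pattern with one message per failure. It is not NeatVNC code: every demo_* name is hypothetical, fprintf() stands in for nvnc_log(), and strerror() stands in for av_err2str().

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the real encoder state. */
struct demo_encoder {
	char *packet; /* stands in for current_packet */
	char *device; /* stands in for hw_device_ctx */
};

/* Records the device name; returns 0 or a negative errno-style code. */
static int demo_pick_device(struct demo_encoder *self, const char *node)
{
	assert(self != NULL);
	if (node == NULL || node[0] == '\0')
		return -ENOENT;
	self->device = malloc(strlen(node) + 1);
	if (self->device == NULL)
		return -ENOMEM;
	strcpy(self->device, node);
	return 0;
}

/* Safe to call with NULL. */
static void demo_encoder_destroy(struct demo_encoder *self)
{
	if (self == NULL)
		return;
	free(self->device);
	free(self->packet);
	free(self);
}

/* Every failure path logs what failed and why before unwinding. */
static struct demo_encoder *demo_encoder_create(const char *node)
{
	int rc;
	struct demo_encoder *self = calloc(1, sizeof(*self));
	if (self == NULL) {
		fprintf(stderr, "demo: failed to allocate encoder\n");
		return NULL;
	}

	self->packet = malloc(64);
	if (self->packet == NULL) {
		fprintf(stderr, "demo: failed to allocate packet buffer\n");
		goto packet_failure;
	}

	rc = demo_pick_device(self, node);
	if (rc != 0) {
		fprintf(stderr, "demo: failed to pick device '%s': %s\n",
				node ? node : "(null)", strerror(-rc));
		goto device_failure;
	}

	return self;

device_failure:
	free(self->packet);
packet_failure:
	free(self);
	return NULL;
}
```

Calling demo_encoder_create("") returns NULL and leaves a message on stderr naming the step that failed, which is the piece of information missing from the logs above.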

Patch

Ideally, all failure paths should have logs. For brevity, I'll just point out the two critical places that provided the information necessary for my own troubleshooting.

diff --git a/src/enc/h264/ffmpeg-impl.c b/src/enc/h264/ffmpeg-impl.c
index 6868137..91e43d5 100644
--- a/src/enc/h264/ffmpeg-impl.c
+++ b/src/enc/h264/ffmpeg-impl.c
@@ -530,11 +530,16 @@ static struct h264_encoder* h264_encoder_ffmpeg_create(uint32_t width,
        char render_node[64];
        if (find_render_node(render_node, sizeof(render_node)) < 0)
                goto render_node_failure;
+       nvnc_log(NVNC_LOG_DEBUG, "Found render node %s", render_node);

        rc = av_hwdevice_ctx_create(&self->hw_device_ctx,
                        AV_HWDEVICE_TYPE_DRM, render_node, NULL, 0);
-       if (rc != 0)
+       if (rc != 0) {
+               char err[256];
+               av_strerror(rc, err, sizeof(err));
+               nvnc_log(NVNC_LOG_WARNING, "Failed to create hwdevice context: %s", err);
                goto hwdevice_ctx_failure;
+       }

        self->base.next_frame_should_be_keyframe = true;
        TAILQ_INIT(&self->fb_queue);

However, I noticed that FFmpeg already has the macro av_err2str() in error.h (https://www.ffmpeg.org/doxygen/2.0/error_8h_source.html), which automatically creates a temporary buffer in order to simplify user code and is used in the official examples.

#define AV_ERROR_MAX_STRING_SIZE 64

/**
 * Convenience macro, the return value should be used only directly in
 * function arguments but never stand-alone.
 */
#define av_err2str(errnum) \
    av_make_error_string((char[AV_ERROR_MAX_STRING_SIZE]){0}, AV_ERROR_MAX_STRING_SIZE, errnum)

Is there any reason this macro is not used instead of doing it manually? In my opinion, the following code is clearer:

diff --git a/src/enc/h264/ffmpeg-impl.c b/src/enc/h264/ffmpeg-impl.c
index 6868137..f2ef2aa 100644
--- a/src/enc/h264/ffmpeg-impl.c
+++ b/src/enc/h264/ffmpeg-impl.c
@@ -439,9 +439,7 @@ static void h264_encoder__do_work(void* handle)

        int rc = h264_encoder__encode(self, frame);
        if (rc != 0) {
-               char err[256];
-               av_strerror(rc, err, sizeof(err));
-               nvnc_log(NVNC_LOG_ERROR, "Failed to encode packet: %s", err);
+               nvnc_log(NVNC_LOG_ERROR, "Failed to encode packet: %s", av_err2str(rc));
                goto failure;
        }

@@ -530,11 +528,15 @@ static struct h264_encoder* h264_encoder_ffmpeg_create(uint32_t width,
        char render_node[64];
        if (find_render_node(render_node, sizeof(render_node)) < 0)
                goto render_node_failure;
+       nvnc_log(NVNC_LOG_DEBUG, "Found render node %s", render_node);

        rc = av_hwdevice_ctx_create(&self->hw_device_ctx,
                        AV_HWDEVICE_TYPE_DRM, render_node, NULL, 0);
-       if (rc != 0)
+       if (rc != 0) {
+               nvnc_log(NVNC_LOG_WARNING, "Failed to create hwdevice context: %s",
+                       av_err2str(rc));
                goto hwdevice_ctx_failure;
+       }

        self->base.next_frame_should_be_keyframe = true;
        TAILQ_INIT(&self->fb_queue);
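To illustrate why the macro needs no explicit buffer at the call site, here is a minimal sketch of the same compound-literal pattern without FFmpeg. toy_make_error_string() and toy_err2str() are hypothetical stand-ins written for this example; they are not FFmpeg APIs, and the message format is made up.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_ERROR_MAX_STRING_SIZE 64

/* Hypothetical analogue of av_make_error_string(): fill buf, return buf. */
static char *toy_make_error_string(char *buf, size_t size, int errnum)
{
	assert(buf != NULL && size > 0);
	snprintf(buf, size, "toy error %d", errnum);
	return buf;
}

/*
 * The compound literal (char[N]){0} creates an unnamed, zero-filled array
 * with automatic storage in the enclosing block, so the caller does not
 * have to declare a buffer. The returned pointer must not outlive that
 * block.
 */
#define toy_err2str(errnum) \
	toy_make_error_string((char[TOY_ERROR_MAX_STRING_SIZE]){0}, \
			TOY_ERROR_MAX_STRING_SIZE, errnum)
```

A call such as printf("failed: %s\n", toy_err2str(12)) then formats the message in place, which is the usage the FFmpeg comment recommends for the real macro.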

Problem 3: Discrete GPUs are not Supported by AV_HWDEVICE_TYPE_DRM

Symptom

After the first two problems have been patched, continued testing showed that FFmpeg H.264 initialization failed in av_hwdevice_ctx_create() with the error "Cannot allocate memory".

DEBUG: ../subprojects/neatvnc/src/server.c: 583: Client 0x558ec811c270 set encodings: cursor,desktop-size,extended-desktop-size,qemu-led-state,vmware-led-state,extended-clipboard,qemu-extended-key-event,open-h264,copyrect,zrle,tight,hextile,rre,copyrect,raw
DEBUG: ../subprojects/neatvnc/src/server.c: 2364: Keyboard LED state changed: ffffffff -> 0
Info: Choosing zrle encoding for client 0x558ec811c270
DEBUG: ../src/main.c: 1018: screencopy_start_immediate
xkbcommon: ERROR: couldn't find a Compose file for locale "C.UTF8" (mapped to "C.UTF8")
DEBUG: ../subprojects/neatvnc/src/enc/h264/ffmpeg-impl.c: 531: Found render node /dev/dri/renderD128
Warning: ../subprojects/neatvnc/src/enc/h264/ffmpeg-impl.c: 536: Failed to create hwdevice context: Cannot allocate memory

This suggests FFmpeg has an internal problem, and it's likely caused by hardware incompatibility.

Analysis

To understand and isolate the problem, I wrote this test program based on FFmpeg's official examples. This can show whether FFmpeg itself has working hardware encoding.

The following program can be compiled via cc -Wall -Wextra -pedantic -std=c99 -g -O2 vaapi_encode.c -lavdevice -lavformat -lavfilter -lavcodec -lswresample -lswscale -lavutil -o vaapi_encode

#include <stdio.h>
#include <string.h>

#include <libavutil/error.h>
#include <libavutil/hwcontext.h>

int main(int argc, char *argv[])
{
    AVBufferRef *hw_device_ctx = NULL;
    enum AVHWDeviceType type;
    int err;

    /* Usage: vaapi_encode <render node> <vaapi|drm> */
    if (argc < 3)
        goto fail;
    else if (strcmp(argv[2], "vaapi") == 0)
        type = AV_HWDEVICE_TYPE_VAAPI;
    else if (strcmp(argv[2], "drm") == 0)
        type = AV_HWDEVICE_TYPE_DRM;
    else
        goto fail;

    err = av_hwdevice_ctx_create(&hw_device_ctx, type, argv[1], NULL, 0);
    if (err < 0) {
        fprintf(stderr, "Failed to create a hwdevice: %s\n", av_err2str(err));
        goto fail;
    }

    /* Release the device context before exiting. */
    av_buffer_unref(&hw_device_ctx);
    fprintf(stderr, "No error.\n");
    return 0;

fail:
    fprintf(stderr, "Error has occurred.\n");
    return 1;
}

The tests showed that the DRM render node can be opened as AV_HWDEVICE_TYPE_VAAPI, but it cannot be opened as AV_HWDEVICE_TYPE_DRM.

$ ./vaapi_encode /dev/dri/renderD128 vaapi
No error.
$ ./vaapi_encode /dev/dri/renderD128 drm
Failed to create a hwdevice: Cannot allocate memory
Error has occurred.

Thus, an obvious idea is to see what happens if WayVNC opens the GPU render node as an AV_HWDEVICE_TYPE_VAAPI device.

diff --git a/src/enc/h264/ffmpeg-impl.c b/src/enc/h264/ffmpeg-impl.c
index 6868137..d7bc477 100644
--- a/src/enc/h264/ffmpeg-impl.c
+++ b/src/enc/h264/ffmpeg-impl.c
@@ -532,7 +530,7 @@ static struct h264_encoder* h264_encoder_ffmpeg_create(uint32_t width,
                goto render_node_failure;

        rc = av_hwdevice_ctx_create(&self->hw_device_ctx,
-                       AV_HWDEVICE_TYPE_DRM, render_node, NULL, 0);
+                       AV_HWDEVICE_TYPE_VAAPI, render_node, NULL, 0);
        if (rc != 0)
                goto hwdevice_ctx_failure;

A test showed that the troubleshooting has made immense progress, but it's far from enough.

DEBUG: ../src/ctl-server.c: 968: Enqueued client-connected event for 0 clients
DEBUG: ../subprojects/neatvnc/src/server.c: 583: Client 0x5576fcfc00f0 set encodings: cursor,desktop-size,extended-desktop-size,qemu-led-state,vmware-led-state,extended-clipboard,qemu-extended-key-event,open-h264,copyrect,zrle,tight,hextile,rre,copyrect,raw
DEBUG: ../subprojects/neatvnc/src/server.c: 2364: Keyboard LED state changed: ffffffff -> 0
Info: Choosing zrle encoding for client 0x5576fcfc00f0
DEBUG: ../src/main.c: 1018: screencopy_start_immediate
xkbcommon: ERROR: couldn't find a Compose file for locale "C.UTF8" (mapped to "C.UTF8")
Info: libva: VA-API version 1.22.0
Info: libva: Trying to open /usr/lib64/va/drivers/radeonsi_drv_video.so
Info: libva: Found init function __vaDriverInit_1_22
Info: libva: va_openDriver() returns 0
Info: Initialised VAAPI connection: version 1.22
DEBUG: libav: 0: Format 0x3231564e -> nv12.
DEBUG: libav: 0: Format 0x30313050 -> p010le.
DEBUG: libav: 0: Format 0x36313050 -> unknown.
DEBUG: libav: 0: Format 0x30323449 -> yuv420p.
DEBUG: libav: 0: Format 0x32315659 -> yuv420p.
DEBUG: libav: 0: Format 0x56595559 -> unknown.
DEBUG: libav: 0: Format 0x32595559 -> yuyv422.
DEBUG: libav: 0: Format 0x59565955 -> uyvy422.
DEBUG: libav: 0: Format 0x30303859 -> gray.
DEBUG: libav: 0: Format 0x50343434 -> yuv444p.
DEBUG: libav: 0: Format 0x56323234 -> yuv440p.
DEBUG: libav: 0: Format 0x50424752 -> unknown.
DEBUG: libav: 0: Format 0x41524742 -> bgra.
DEBUG: libav: 0: Format 0x41424752 -> rgba.
DEBUG: libav: 0: Format 0x42475241 -> argb.
DEBUG: libav: 0: Format 0x58524742 -> bgr0.
DEBUG: libav: 0: Format 0x58424752 -> rgb0.
DEBUG: libav: 0: Format 0x30335241 -> unknown.
DEBUG: libav: 0: Format 0x30334241 -> unknown.
DEBUG: libav: 0: Format 0x30335258 -> x2rgb10le.
DEBUG: libav: 0: Format 0x30334258 -> unknown.
Info: VAAPI driver: Mesa Gallium driver 24.2.0 for AMD Radeon Pro VII (radeonsi, vega20, LLVM 18.1.8, DRM 3.57, 6.10.6-gentoo-dist).
Info: Driver not found in known nonstandard list, using standard behaviour.
ERROR: libav: 0: The hardware pixel format 'drm_prime' is not supported by the device type 'VAAPI'
DEBUG: libav: 0: detected 56 logical cores
DEBUG: libav: 0: Setting 'width' to value '1'
DEBUG: libav: 0: Setting 'height' to value '1'
DEBUG: libav: 0: Setting 'pix_fmt' to value 'drm_prime'
DEBUG: libav: 0: Setting 'time_base' to value '1/1'
Info: w:1 h:1 pixfmt:drm_prime tb:1/1 fr:0/1 sar:0/1
DEBUG: libav: 0: Setting 'mode' to value 'direct'
DEBUG: libav: 0: Setting 'derive_device' to value 'vaapi'
DEBUG: libav: 0: Setting 'format' to value 'nv12'
DEBUG: libav: 0: Setting 'mode' to value 'fast'
DEBUG: libav: 0: Setting 'out_color_matrix' to value 'bt709'
DEBUG: libav: 0: Setting 'out_range' to value 'limited'
DEBUG: libav: 0: Setting 'out_color_primaries' to value 'bt709'
DEBUG: libav: 0: Setting 'out_color_transfer' to value 'bt709'
DEBUG: libav: 0: query_formats: 4 queried, 3 merged, 0 already done, 0 delayed
DEBUG: libav: 0: Configure hwmap drm_prime -> vaapi.
ERROR: libav: 0: Unsupported format: drm_prime.
ERROR: libav: 0: Failed to create frame context for reverse mapping: -22.
ERROR: libav: 0: Failed to configure output pad on Parsed_hwmap_0
DEBUG: ../subprojects/neatvnc/src/server.c: 129: H.264 encoding is unavailable

Now FFmpeg can be initialized and the GPU is recognized, but it fails at:

ERROR: libav: 0: The hardware pixel format 'drm_prime' is not supported by the device type 'VAAPI'

So it's not as simple as a type flag change.

I'm not an expert on VAAPI or drm_prime, so I had a hard time understanding their differences. But after some web searches, it seems that hardware encoding on drm_prime is a special zero-copy mechanism that depends on shared memory between the CPU and GPU, whereas VAAPI is the general-purpose GPU hardware transcoding API.

Furthermore, I found this explanation of DRM PRIME in a Mesa code example:

    /* Map the exported buffer, using the PRIME File descriptor */
    /* That ONLY works if the DRM driver implements gem_prime_mmap.
     * This function is not implemented in most of the DRM drivers for
     * GPU with discrete memory. Meaning that it will surely fail with
     * Radeon, AMDGPU and Nouveau drivers for desktop cards ! */
    uint8_t * primed_framebuffer = mmap(
        0, create_request.size, PROT_READ | PROT_WRITE, MAP_SHARED,
        dma_buf_fd, 0);
    ret = errno;

If my deduction is correct (please correct me if I'm wrong), this shows the actual H.264 encoding problem is systemic. The current H.264 implementation relies on DRM PRIME rather than the generic VAAPI, so it only supports embedded SoCs. So it's likely incompatible with most (if not all) discrete GPUs? This seems to suggest that the present code is only tested with single-board computers like Raspberry Pi, but discrete GPUs are never fully tested, and a full fix requires extending the current GPU encoding backend with a VAAPI codepath.

any1 commented 2 weeks ago

> Just delete h264_encoder_destroy(encoder); to avoid the double free.

That would cause a memory leak when it succeeds. The problem is that h264_encoder_destroy should be idempotent (do nothing with NULL), but the NULL check is missing in the function.
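A minimal sketch of that idea, assuming a simplified encoder with a single destroy callback: all toy_* names are hypothetical, and only the shape (a NULL check before self->impl->destroy(self), and a probe that still destroys the encoder on success) mirrors the discussion here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_encoder;

struct toy_encoder_impl {
	void (*destroy)(struct toy_encoder *self);
};

struct toy_encoder {
	const struct toy_encoder_impl *impl;
};

/* Counts backend destroy calls so the behaviour can be observed. */
static int toy_destroy_calls;

static void toy_impl_destroy(struct toy_encoder *self)
{
	toy_destroy_calls++;
	free(self);
}

static const struct toy_encoder_impl toy_impl = {
	.destroy = toy_impl_destroy,
};

/* should_fail simulates a failed hardware initialization. */
static struct toy_encoder *toy_encoder_create(bool should_fail)
{
	struct toy_encoder *self;

	if (should_fail)
		return NULL;
	self = calloc(1, sizeof(*self));
	if (self == NULL)
		return NULL;
	self->impl = &toy_impl;
	return self;
}

/* Does nothing when given NULL, so callers may destroy unconditionally. */
static void toy_encoder_destroy(struct toy_encoder *self)
{
	if (self == NULL)
		return;
	assert(self->impl != NULL && self->impl->destroy != NULL);
	self->impl->destroy(self);
}

/* Probe: create, remember the outcome, and always clean up. */
static bool toy_have_working_encoder(bool should_fail)
{
	struct toy_encoder *encoder = toy_encoder_create(should_fail);
	bool available = encoder != NULL;

	toy_encoder_destroy(encoder); /* no leak on success, no crash on NULL */
	return available;
}
```

With this shape the probe neither leaks when creation succeeds nor dereferences NULL when it fails.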

> This seems to suggest that the present code is only tested with single-board computers like Raspberry Pi, but discrete GPUs are never fully tested.

A different encoder is used for Raspberry Pi. This is tested on my Intel based laptop and I think I also tested it on my workstation at work, which has an AMD processor and a dedicated GPU. The AMD chip does not have an integrated GPU.

It's probably just selecting the wrong render node. If you have multiple render nodes, you need to select the one that's used by the compositor. This is not implemented.
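To check whether a machine has more than one render node, a small sketch like the following can list the candidates. It assumes the nodes appear as renderD* entries in a directory such as /dev/dri (the path seen in the logs above); list_render_nodes() is a hypothetical helper, not part of WayVNC or NeatVNC, and it does not determine which node the compositor is using.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Print every entry named renderD* in dir_path and return how many were
 * found, or -1 if the directory cannot be opened.
 */
static int list_render_nodes(const char *dir_path)
{
	DIR *dir;
	struct dirent *entry;
	int count = 0;

	assert(dir_path != NULL);

	dir = opendir(dir_path);
	if (dir == NULL)
		return -1;

	while ((entry = readdir(dir)) != NULL) {
		if (strncmp(entry->d_name, "renderD", 7) != 0)
			continue;
		printf("%s/%s\n", dir_path, entry->d_name);
		count++;
	}

	closedir(dir);
	return count;
}
```

Calling list_render_nodes("/dev/dri") and getting a result greater than 1 would mean the wrong-node explanation is worth checking.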

A DRM PRIME is just another name for a Linux DMA-BUF.

biergaizi commented 2 weeks ago

> That would cause a memory leak when it succeeds. The problem is that h264_encoder_destroy should be idempotent (do nothing with NULL)

Right, it was my first thought too, but I didn't consider the "succeeds" case and I saw no point of doing so...

> It's probably just selecting the wrong render node. If you have multiple render nodes, you need to select the one that's used by the compositor. This is not implemented. A DRM PRIME is just another name for a Linux DMA-BUF.

Thanks for the tip.

It turned out that I didn't enable libdrm for FFmpeg, but only VAAPI. After enabling libdrm, both DRM and VAAPI modes are now working as expected. I'll report this as a bug to Gentoo... But since it can be a problem for any self-built FFmpeg, the debug log should probably mention libdrm hwdevice context instead of just hwdevice context as a debugging aid:

diff --git a/src/enc/h264/ffmpeg-impl.c b/src/enc/h264/ffmpeg-impl.c
index 6868137..dfb407f 100644
--- a/src/enc/h264/ffmpeg-impl.c
+++ b/src/enc/h264/ffmpeg-impl.c
@@ -533,8 +533,11 @@ static struct h264_encoder* h264_encoder_ffmpeg_create(uint32_t width,

        rc = av_hwdevice_ctx_create(&self->hw_device_ctx,
                        AV_HWDEVICE_TYPE_DRM, render_node, NULL, 0);
-       if (rc != 0)
+       if (rc != 0) {
+               nvnc_log(NVNC_LOG_WARNING, "Failed to create libdrm hwdevice context: %s",
+                               av_err2str(rc));
                goto hwdevice_ctx_failure;
+       }

        self->base.next_frame_should_be_keyframe = true;
        TAILQ_INIT(&self->fb_queue);

I'm still seeing a black screen in VNC, but it's likely yet another separate GPU problem unrelated to this problem. Update: After a fresh rebuild with all debug logging turned off, it's now working perfectly.

Info: Capturing output HEADLESS-1
Info: >> Headless output 1 1280x720+0x0 Power:UNKNOWN
Info: Listening for connections on 10.3.0.2:5900
Info: New client connection from 10.3.0.1: 0x556d4ad3b690 (ref 1)
Info: Starting screen capture
Info: Choosing zrle encoding for client 0x556d4ad3b690
xkbcommon: ERROR: couldn't find a Compose file for locale "C.UTF8" (mapped to "C.UTF8")
Info: Opened DRM device /dev/dri/renderD128: driver amdgpu version 3.57.0.
Info: w:1 h:1 pixfmt:drm_prime tb:1/1 fr:0/1 sar:0/1
Info: libva: VA-API version 1.22.0
Info: libva: Trying to open /usr/lib64/va/drivers/radeonsi_drv_video.so
Info: libva: Found init function __vaDriverInit_1_22
Info: libva: va_openDriver() returns 0
Info: Initialised VAAPI connection: version 1.22
Info: VAAPI driver: Mesa Gallium driver 24.2.0 for AMD Radeon Pro VII (radeonsi, vega20, LLVM 18.1.8, DRM 3.57, 6.10.6-gentoo-dist).
Info: Driver not found in known nonstandard list, using standard behaviour.
Info: Input surface format is nv12.
Info: Using VAAPI profile VAProfileH264ConstrainedBaseline (13).
Info: Using VAAPI entrypoint VAEntrypointEncSlice (6).
Info: Using VAAPI render target format YUV420 (0x1).
Info: RC mode: CQP.
Info: RC quality: 5.
Info: RC framerate: 65535/1 (65535.00 fps).
Info: Driver does not report any additional prediction constraints.
Info: Using intra and P-frames (supported references: 1 / 0).
Warning: libav: 0: Driver does not support some wanted packed headers (wanted 0xd, found 0x1).
Info: Using level 4.
Info: Choosing open-h264 encoding for client 0x556d4ad3b690
Info: Opened DRM device /dev/dri/renderD128: driver amdgpu version 3.57.0.
Info: w:1 h:1 pixfmt:drm_prime tb:1/1 fr:0/1 sar:0/1
Info: libva: VA-API version 1.22.0
Info: libva: Trying to open /usr/lib64/va/drivers/radeonsi_drv_video.so
Info: libva: Found init function __vaDriverInit_1_22
Info: libva: va_openDriver() returns 0
Info: Initialised VAAPI connection: version 1.22
Info: VAAPI driver: Mesa Gallium driver 24.2.0 for AMD Radeon Pro VII (radeonsi, vega20, LLVM 18.1.8, DRM 3.57, 6.10.6-gentoo-dist).
Info: Driver not found in known nonstandard list, using standard behaviour.
Info: Input surface format is nv12.
Info: Using VAAPI profile VAProfileH264ConstrainedBaseline (13).
Info: Using VAAPI entrypoint VAEntrypointEncSlice (6).
Info: Using VAAPI render target format YUV420 (0x1).
Info: RC mode: CQP.
Info: RC quality: 7.
Info: RC framerate: 65535/1 (65535.00 fps).
Info: Driver does not report any additional prediction constraints.
Info: Using intra and P-frames (supported references: 1 / 0).
Warning: libav: 0: Driver does not support some wanted packed headers (wanted 0xd, found 0x1).
Info: Using level 3.1.
Info: Opened DRM device /dev/dri/renderD128: driver amdgpu version 3.57.0.
Info: w:1 h:1 pixfmt:drm_prime tb:1/1 fr:0/1 sar:0/1
Info: libva: VA-API version 1.22.0
Info: libva: Trying to open /usr/lib64/va/drivers/radeonsi_drv_video.so
Info: libva: Found init function __vaDriverInit_1_22
Info: libva: va_openDriver() returns 0
Info: Initialised VAAPI connection: version 1.22
Info: VAAPI driver: Mesa Gallium driver 24.2.0 for AMD Radeon Pro VII (radeonsi, vega20, LLVM 18.1.8, DRM 3.57, 6.10.6-gentoo-dist).
Info: Driver not found in known nonstandard list, using standard behaviour.
Info: Input surface format is nv12.
Info: Using VAAPI profile VAProfileH264ConstrainedBaseline (13).
Info: Using VAAPI entrypoint VAEntrypointEncSlice (6).
Info: Using VAAPI render target format YUV420 (0x1).
Info: RC mode: CQP.
Info: RC quality: 7.
Info: RC framerate: 65535/1 (65535.00 fps).
Info: Driver does not report any additional prediction constraints.
Info: Using intra and P-frames (supported references: 1 / 0).
Warning: libav: 0: Driver does not support some wanted packed headers (wanted 0xd, found 0x1).
Info: Using level 3.1.
Info: Opened DRM device /dev/dri/renderD128: driver amdgpu version 3.57.0.
Info: w:1 h:1 pixfmt:drm_prime tb:1/1 fr:0/1 sar:0/1
Info: libva: VA-API version 1.22.0
Info: libva: Trying to open /usr/lib64/va/drivers/radeonsi_drv_video.so
Info: libva: Found init function __vaDriverInit_1_22
Info: libva: va_openDriver() returns 0
Info: Initialised VAAPI connection: version 1.22
Info: VAAPI driver: Mesa Gallium driver 24.2.0 for AMD Radeon Pro VII (radeonsi, vega20, LLVM 18.1.8, DRM 3.57, 6.10.6-gentoo-dist).
Info: Driver not found in known nonstandard list, using standard behaviour.
Info: Input surface format is nv12.
Info: Using VAAPI profile VAProfileH264ConstrainedBaseline (13).
Info: Using VAAPI entrypoint VAEntrypointEncSlice (6).
Info: Using VAAPI render target format YUV420 (0x1).
Info: RC mode: CQP.
Info: RC quality: 7.
Info: RC framerate: 65535/1 (65535.00 fps).
Info: Driver does not report any additional prediction constraints.
Info: Using intra and P-frames (supported references: 1 / 0).
Warning: libav: 0: Driver does not support some wanted packed headers (wanted 0xd, found 0x1).
Info: Using level 3.1.
Info: Opened DRM device /dev/dri/renderD128: driver amdgpu version 3.57.0.
Info: w:1 h:1 pixfmt:drm_prime tb:1/1 fr:0/1 sar:0/1
Info: libva: VA-API version 1.22.0
Info: libva: Trying to open /usr/lib64/va/drivers/radeonsi_drv_video.so
Info: libva: Found init function __vaDriverInit_1_22
Info: libva: va_openDriver() returns 0
Info: Initialised VAAPI connection: version 1.22
Info: VAAPI driver: Mesa Gallium driver 24.2.0 for AMD Radeon Pro VII (radeonsi, vega20, LLVM 18.1.8, DRM 3.57, 6.10.6-gentoo-dist).
Info: Driver not found in known nonstandard list, using standard behaviour.
Info: Input surface format is nv12.
Info: Using VAAPI profile VAProfileH264ConstrainedBaseline (13).
Info: Using VAAPI entrypoint VAEntrypointEncSlice (6).
Info: Using VAAPI render target format YUV420 (0x1).
Info: RC mode: CQP.
Info: RC quality: 7.
Info: RC framerate: 65535/1 (65535.00 fps).
Info: Driver does not report any additional prediction constraints.
Info: Using intra and P-frames (supported references: 1 / 0).
Warning: libav: 0: Driver does not support some wanted packed headers (wanted 0xd, found 0x1).
Info: Using level 4.

any1 commented 2 weeks ago

In case you're not aware of this, there is an even lower log level named "trace". Feel free to open a PR for improved logging.

any1 commented 2 weeks ago

I fixed the null-dereference bug. It was introduced 3 weeks ago in 60f86fd04c500c6dd32e39a7e166d286dae68cb9