games-on-whales / wolf

Stream virtual desktops and games running in Docker
https://games-on-whales.github.io/wolf/stable/
MIT License

Nvidia ctk using CDI #118

Open leiserfg opened 2 weeks ago

leiserfg commented 2 weeks ago

I'm on NixOS, where the method currently documented for Nvidia in the guide does not work. The cause is that NixOS moved from using nvidia-wrapper to using CDI, which is the new and recommended way of combining Nvidia and Docker. I'm using Driver Version: 555.58.02 and this docker-compose file:

services:
  wolf:
    image: my-wolf
    build: .
    environment:
      - XDG_RUNTIME_DIR=/tmp/sockets
      - HOST_APPS_STATE_FOLDER=/etc/wolf
      - NVIDIA_DRIVER_CAPABILITIES=all
      - NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all
    volumes:
      - /etc/wolf/:/etc/wolf/
      - /tmp/sockets:/tmp/sockets:rw
      - /var/run/docker.sock:/var/run/docker.sock:rw
      - /dev/:/dev/:rw
      - /run/udev:/run/udev:rw
    device_cgroup_rules:
      - 'c 13:* rmw'
    devices:
      - /dev/dri
      - /dev/uinput
      - /dev/uhid
    deploy:
      resources:
        reservations:
          devices:
           - driver: cdi
             device_ids:
                - nvidia.com/gpu=all
    network_mode: host
    restart: unless-stopped

together with this Dockerfile:

FROM ghcr.io/games-on-whales/wolf:stable
RUN cp /usr/local/nvidia/lib/libnvrtc* /usr/lib/

This works around the fact that CDI overrides the /usr/local/nvidia/lib/ folder. Without the copy, Wolf can't find libnvrtc and therefore fails to find the Nvidia encoders.
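To confirm which copy of libnvrtc is actually present inside the running container, a small check like the following may help. This is only a sketch: `find_lib` is a hypothetical helper (not part of Wolf or the Nvidia toolkit), and it only lists matching files; it does not prove that the dynamic loader will pick them up.

```shell
#!/bin/sh
# Sketch: list files whose name starts with a given prefix in the given
# directories. find_lib is a hypothetical helper name used for illustration.
find_lib() {
  prefix=$1
  shift
  for dir in "$@"; do
    for f in "$dir"/"$prefix"*; do
      # An unmatched glob stays literal in sh, so verify the path exists.
      if [ -e "$f" ]; then
        printf '%s\n' "$f"
      fi
    done
  done
  return 0
}
```

Run inside the Wolf container, `find_lib libnvrtc /usr/local/nvidia/lib /usr/lib` would show whether the files copied by the Dockerfile above survived the CDI mount.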

However, I get this error when I try to start any app:

wolf-1  |    0:     0x7f573666c575 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hd736fd5964392270
wolf-1  |    1:     0x7f57366900fb - core::fmt::write::hc6043626647b98ea
wolf-1  |    2:     0x7f57366697df - std::io::Write::write_fmt::h0d24b3e0473045db
wolf-1  |    3:     0x7f573666c34e - std::sys_common::backtrace::print::h45eb8174d25a1e76
wolf-1  |    4:     0x7f573666d899 - std::panicking::default_hook::{{closure}}::haf3f0170eb4f3b53
wolf-1  |    5:     0x7f573666d63a - std::panicking::default_hook::hb5d3b27aa9f6dcda
wolf-1  |    6:     0x7f573666dd33 - std::panicking::rust_panic_with_hook::h6b49d59f86ee588c
wolf-1  |    7:     0x7f573666dc14 - std::panicking::begin_panic_handler::{{closure}}::hd4c2f7ed79b82b70
wolf-1  |    8:     0x7f573666ca39 - std::sys_common::backtrace::__rust_end_short_backtrace::h2946d6d32d7ea1ad
wolf-1  |    9:     0x7f573666d947 - rust_begin_unwind
wolf-1  |   10:     0x7f5736456f93 - core::panicking::panic_fmt::ha02418e5cd774672
wolf-1  |   11:     0x7f5736457426 - core::result::unwrap_failed::h55f86ada3ace5ed2
wolf-1  |   12:     0x7f57364ba2f5 - waylanddisplaycore::comp::init::h0646e1a9d0f59f64
wolf-1  |   13:     0x7f5736474cd0 - std::sys_common::backtrace::__rust_begin_short_backtrace::h3e12ecad4fbffac1
wolf-1  |   14:     0x7f5736478a7e - core::ops::function::FnOnce::call_once{{vtable.shim}}::h66c8be57d5da151d
wolf-1  |   15:     0x7f573667074b - std::sys::pal::unix::thread::Thread::new::thread_start::hb85dbfa54ba503d6
wolf-1  |   16:     0x7f573329ca94 - <unknown>
wolf-1  |   17:     0x7f5733329a34 - __clone
wolf-1  | 2024-09-09T18:31:57.170706Z ERROR waylanddisplaycore: Compositor thread panic'ed! err=Any { .. }
wolf-1  | thread '<unnamed>' panicked at /tmp/gst-wayland-display/wayland-display-core/src/lib.rs:87:41:
wolf-1  | called `Result::unwrap()` on an `Err` value: RecvError
wolf-1  | stack backtrace:
wolf-1  |    0:     0x7f573666c575 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hd736fd5964392270
wolf-1  |    1:     0x7f57366900fb - core::fmt::write::hc6043626647b98ea
wolf-1  |    2:     0x7f57366697df - std::io::Write::write_fmt::h0d24b3e0473045db
wolf-1  |    3:     0x7f573666c34e - std::sys_common::backtrace::print::h45eb8174d25a1e76
wolf-1  |    4:     0x7f573666d899 - std::panicking::default_hook::{{closure}}::haf3f0170eb4f3b53
wolf-1  |    5:     0x7f573666d63a - std::panicking::default_hook::hb5d3b27aa9f6dcda
wolf-1  |    6:     0x7f573666dd33 - std::panicking::rust_panic_with_hook::h6b49d59f86ee588c
wolf-1  |    7:     0x7f573666dc14 - std::panicking::begin_panic_handler::{{closure}}::hd4c2f7ed79b82b70
wolf-1  |    8:     0x7f573666ca39 - std::sys_common::backtrace::__rust_end_short_backtrace::h2946d6d32d7ea1ad
wolf-1  |    9:     0x7f573666d947 - rust_begin_unwind
wolf-1  |   10:     0x7f5736456f93 - core::panicking::panic_fmt::ha02418e5cd774672
wolf-1  |   11:     0x7f5736457426 - core::result::unwrap_failed::h55f86ada3ace5ed2
wolf-1  |   12:     0x7f573645e0a3 - waylanddisplaycore::MaybeRecv<T>::get::hfdc57fb8b9f1c5c8
wolf-1  |   13:     0x7f573645f74a - display_get_devices_len
wolf-1  |   14:     0x5602cd93be74 - _ZN4wolf4core15virtual_display22create_wayland_displayERKN5immer5arrayINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS2_13memory_policyINS2_21free_list_heap_policyINS2_8cpp_heapELm1024EEENS2_15refcount_policyENS2_15spinlock_policyENS2_20no_transience_policyELb0ELb1EEEEERKS9_
wolf-1  |                                at /wolf/src/core/src/platforms/linux/virtual-display/wayland-display.cpp:30:36
wolf-1  |   15:     0x5602cd735971 - _ZZZ23setup_sessions_handlersRKN5immer3boxIN5state8AppStateENS_13memory_policyINS_21free_list_heap_policyINS_8cpp_heapELm1024EEENS_15refcount_policyENS_15spinlock_policyENS_20no_transience_policyELb0ELb1EEEEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt8optionalI11AudioServerEENK3$_2clERKNS0_INS1_13StreamSessionESA_EEENKUlvE_clEv
wolf-1  |                                at /wolf/src/moonlight-server/wolf.cpp:225:29
wolf-1  |   16:     0x7f57336eabb4 - <unknown>
wolf-1  |   17:     0x7f573329ca94 - <unknown>
wolf-1  |   18:     0x7f5733329a34 - __clone
wolf-1  | fatal runtime error: failed to initiate panic, error 5

I'm currently using the second way of running Wolf documented in the guide (the volume with all the Nvidia files), and that works fine, but it will probably drive me crazy the next time there is an Nvidia update.

ABeltramo commented 2 weeks ago

There's just been a report on Discord for NixOS:

Apparently, virtualisation.docker.enableNvidia is deprecated; they say to use hardware.nvidia-container-toolkit.enable = true; instead, which is broken. Neither of them installs libnvidia-container, which contains nvidia-container-cli.

After further debugging, it seems that NixOS bundles a very old Nvidia container toolkit version:

└> sudo nvidia-container-cli -V
cli-version: 1.9.0
lib-version: 1.9.0
build date: 1980-01-01T00:00+00:00
build revision: v1.9.0
build compiler: gcc 13.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -DWITH_TIRPC -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

1.9.0 was released in March 2022, nowhere near 1.16.0, which contains the fixes required to run Wolf.
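The comparison above can be scripted. The following is a minimal sketch, assuming the `cli-version:` line format shown in the output above and a `sort` that supports `-V` (version sort, as in GNU coreutils). The function names are hypothetical, and pre-release suffixes of the same base version may not be ordered as you expect.

```shell
#!/bin/sh
# Sketch: compare an installed toolkit version against the 1.16.0 minimum
# mentioned in this thread. All function names are hypothetical.

# Read `nvidia-container-cli -V` output on stdin and print the value of
# the "cli-version:" line.
parse_cli_version() {
  awk -F': ' '/^cli-version:/ { print $2; exit }'
}

# Succeed when version $1 is greater than or equal to version $2.
# Relies on `sort -V`: the smaller of the two versions sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print a verdict for version $1 against a minimum $2 (default 1.16.0).
check_toolkit() {
  if version_ge "$1" "${2:-1.16.0}"; then
    echo "ok: $1"
  else
    echo "too old: $1"
  fi
}
```

On a host with the toolkit installed, this could be used as `check_toolkit "$(nvidia-container-cli -V | parse_cli_version)"`.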

I'm currently using the second way of running wolf documented in the guide (the volume with all the nvidia files) and that works fine

I'm glad that works. It seems that this should be reported upstream to NixOS!

Azelphur commented 2 weeks ago

Just to provide some useful information: it seems that Wolf requires Nvidia container toolkit >= 1.16, but NixOS provides 1.15.0-rc3. The package needs some work too, as it retrieves nvidia-container-toolkit from GitLab, but Nvidia has migrated to GitHub and the 1.16 release is not available on GitLab.

Edit: There's an open update request waiting to be fulfilled here: https://github.com/NixOS/nixpkgs/issues/341911