canonical / intel-npu-driver-snap

Snap recipe for the Intel® NPU Driver (https://github.com/intel/linux-npu-driver/)
MIT License
5 stars 0 forks source link

Automate firmware loading and persist custom path #8

Closed frenchwr closed 1 month ago

frenchwr commented 1 month ago

Internal Jira card: PEK-1255

This addresses issues raised in https://github.com/canonical/intel-npu-driver-snap/issues/2 by:

Testing

Test daemon updates path and re-loads kernel module

ubuntu@hercules-542967:~$ sudo snap install --dangerous ./intel-npu-driver_1.6.0_amd64.snap 
intel-npu-driver 1.6.0 installed

ubuntu@hercules-542967:~$ sudo snap connect intel-npu-driver:intel-npu-fw

ubuntu@hercules-542967:~$ sudo snap connect intel-npu-driver:intel-npu-kmod

ubuntu@hercules-542967:~$ sudo snap logs intel-npu-driver.load-npu-firmware
2024-09-25T20:52:07Z systemd[1]: Started snap.intel-npu-driver.load-npu-firmware.service - Service for snap application intel-npu-driver.load-npu-firmware.
2024-09-25T20:52:07Z intel-npu-driver.load-npu-firmware[29034]:   Error: intel-npu-kmod plug must be connected in order
2024-09-25T20:52:07Z intel-npu-driver.load-npu-firmware[29034]:   for NPU firmware to be re-loaded by intel_vpu kernel module.
2024-09-25T20:52:07Z intel-npu-driver.load-npu-firmware[29034]:   
2024-09-25T20:52:07Z systemd[1]: snap.intel-npu-driver.load-npu-firmware.service: Main process exited, code=exited, status=1/FAILURE
2024-09-25T20:52:07Z systemd[1]: snap.intel-npu-driver.load-npu-firmware.service: Failed with result 'exit-code'.
2024-09-25T20:52:17Z systemd[1]: snap.intel-npu-driver.load-npu-firmware.service: Scheduled restart job, restart counter is at 3.
2024-09-25T20:52:17Z systemd[1]: Started snap.intel-npu-driver.load-npu-firmware.service - Service for snap application intel-npu-driver.load-npu-firmware.
2024-09-25T20:52:17Z intel-npu-driver.load-npu-firmware[29200]: Firmware search path is set to: /var/snap/intel-npu-driver/x1
2024-09-25T20:52:17Z systemd[1]: snap.intel-npu-driver.load-npu-firmware.service: Deactivated successfully.

ubuntu@hercules-542967:~$ sudo cat /sys/module/firmware_class/parameters/path
/var/snap/intel-npu-driver/x1

The daemon initially fails because the snap interfaces are not connected. Once connected, the daemon runs successfully and updates the search path. Ultimately this will (pending approval from the snapstore team) be automated with autoconnections.

Test search path updated following reboot

ubuntu@hercules-542967:~$ sudo reboot

Broadcast message from root@hercules-542967 on pts/1 (Wed 2024-09-25 20:53:01 UTC):

The system will reboot now!

ubuntu@hercules-542967:~$ Connection to 192.168.102.126 closed by remote host.
Connection to 192.168.102.126 closed.
will@xps13:~ $ ssh hercules 
Welcome to Ubuntu 24.04.1 LTS (GNU/Linux 6.8.0-45-generic x86_64)
.
.
Last login: Wed Sep 25 18:54:54 2024 from 192.168.102.1
ubuntu@hercules-542967:~$ sudo snap logs intel-npu-driver.load-npu-firmware
2024-09-25T20:52:07Z intel-npu-driver.load-npu-firmware[29034]:   
2024-09-25T20:52:07Z systemd[1]: snap.intel-npu-driver.load-npu-firmware.service: Main process exited, code=exited, status=1/FAILURE
2024-09-25T20:52:07Z systemd[1]: snap.intel-npu-driver.load-npu-firmware.service: Failed with result 'exit-code'.
2024-09-25T20:52:17Z systemd[1]: snap.intel-npu-driver.load-npu-firmware.service: Scheduled restart job, restart counter is at 3.
2024-09-25T20:52:17Z systemd[1]: Started snap.intel-npu-driver.load-npu-firmware.service - Service for snap application intel-npu-driver.load-npu-firmware.
2024-09-25T20:52:17Z intel-npu-driver.load-npu-firmware[29200]: Firmware search path is set to: /var/snap/intel-npu-driver/x1
2024-09-25T20:52:17Z systemd[1]: snap.intel-npu-driver.load-npu-firmware.service: Deactivated successfully.
2024-09-25T20:53:45Z systemd[1]: Started snap.intel-npu-driver.load-npu-firmware.service - Service for snap application intel-npu-driver.load-npu-firmware.
2024-09-25T20:53:45Z intel-npu-driver.load-npu-firmware[1089]: Firmware search path is set to: /var/snap/intel-npu-driver/x1
2024-09-25T20:53:45Z systemd[1]: snap.intel-npu-driver.load-npu-firmware.service: Deactivated successfully.

ubuntu@hercules-542967:~$ sudo dmesg | grep intel_vpu
[    4.706576] intel_vpu 0000:00:0b.0: enabling device (0000 -> 0002)
[    4.714294] intel_vpu 0000:00:0b.0: [drm] Firmware: intel/vpu/vpu_37xx_v0.0.bin, version: 20230726*MTL_CLIENT_SILICON-release*2101*ci_tag_mtl_pv_vpu_rc_20230726_2101*648a666b8b9
[    4.796516] [drm] Initialized intel_vpu 1.0.0 20230117 for 0000:00:0b.0 on minor 0
[   11.084478] intel_vpu 0000:00:0b.0: [drm] Firmware: intel/vpu/vpu_37xx_v0.0.bin, version: 20240726*MTL_CLIENT_SILICON-release*0004*ci_tag_ud202428_vpu_rc_20240726_0004*e4a99ed6b3e
[   11.211998] [drm] Initialized intel_vpu 1.0.0 20230117 for 0000:00:0b.0 on minor 0

ubuntu@hercules-542967:~$ sudo cat /sys/module/firmware_class/parameters/path
/var/snap/intel-npu-driver/x1

Note: there is a ~7 second period at boot where the device has the OS-provided firmware loaded instead of the one shipped by the snap. This is in the very early stages of the boot sequence so I think should be acceptable but would love to hear others' thoughts.

I also ran intel-npu-driver.vpu-emd-test --config=basic.yaml (as described in the README) and verified that the expected tests passed following a reboot.

Test OS-provided firmware loaded on snap removal

buntu@hercules-542967:~$ sudo snap remove intel-npu-driver 
intel-npu-driver removed

ubuntu@hercules-542967:~$ sudo cat /sys/module/firmware_class/parameters/path

ubuntu@hercules-542967:~$ sudo dmesg | grep intel_vpu
[    4.706576] intel_vpu 0000:00:0b.0: enabling device (0000 -> 0002)
[    4.714294] intel_vpu 0000:00:0b.0: [drm] Firmware: intel/vpu/vpu_37xx_v0.0.bin, version: 20230726*MTL_CLIENT_SILICON-release*2101*ci_tag_mtl_pv_vpu_rc_20230726_2101*648a666b8b9
[    4.796516] [drm] Initialized intel_vpu 1.0.0 20230117 for 0000:00:0b.0 on minor 0
[   11.084478] intel_vpu 0000:00:0b.0: [drm] Firmware: intel/vpu/vpu_37xx_v0.0.bin, version: 20240726*MTL_CLIENT_SILICON-release*0004*ci_tag_ud202428_vpu_rc_20240726_0004*e4a99ed6b3e
[   11.211998] [drm] Initialized intel_vpu 1.0.0 20230117 for 0000:00:0b.0 on minor 0
[  331.098190] intel_vpu 0000:00:0b.0: [drm] Firmware: intel/vpu/vpu_37xx_v0.0.bin, version: 20230726*MTL_CLIENT_SILICON-release*2101*ci_tag_mtl_pv_vpu_rc_20230726_2101*648a666b8b9
[  331.196499] [drm] Initialized intel_vpu 1.0.0 20230117 for 0000:00:0b.0 on minor 0

ubuntu@hercules-542967:~$ snap tasks --last=remove
Status  Spawn               Ready               Summary
Done    today at 20:59 UTC  today at 20:59 UTC  Stop snap "intel-npu-driver" services
Done    today at 20:59 UTC  today at 20:59 UTC  Run remove hook of "intel-npu-driver" snap if present
Done    today at 20:59 UTC  today at 20:59 UTC  Disconnect interfaces of snap "intel-npu-driver"
Done    today at 20:59 UTC  today at 20:59 UTC  Save data of snap "intel-npu-driver" in automatic snapshot set #12
Done    today at 20:59 UTC  today at 20:59 UTC  Remove aliases for snap "intel-npu-driver"
Done    today at 20:59 UTC  today at 20:59 UTC  Make snap "intel-npu-driver" unavailable to the system
Done    today at 20:59 UTC  today at 20:59 UTC  Remove security profile for snap "intel-npu-driver" (x1)
Done    today at 20:59 UTC  today at 20:59 UTC  Remove data for snap "intel-npu-driver" (x1)
Done    today at 20:59 UTC  today at 20:59 UTC  Remove snap "intel-npu-driver" (x1) from the system
Done    today at 20:59 UTC  today at 20:59 UTC  Disconnect intel-npu-driver:intel-npu-plug from intel-npu-driver:intel-npu
Done    today at 20:59 UTC  today at 20:59 UTC  Disconnect intel-npu-driver:intel-npu-kmod from snapd:kernel-module-control
Done    today at 20:59 UTC  today at 20:59 UTC  Disconnect intel-npu-driver:intel-npu-fw from snapd:kernel-firmware-control

The last command shows the sequence of steps run by snapd when a snap is removed. An important detail is that the remove hook is the second step, and runs before the snap interfaces are disconnected. This is important because the remove hook needs permissions provided by these interfaces to unset the firmware path and to re-load the kernel module with the OS-provided firmware.

frenchwr commented 1 month ago

At first glance it looks like the automated build may be failing due to an issue with our self-hosted runners or with GitHub itself. I will try re-running in the morning.

[ 22%] Performing download step (download, verify and extract) for 'node-api-headers-populate'
:: -- Downloading...
::    dst='/root/parts/npu-driver/build/third_party/vpux_plugin/build-cid/_deps/node-api-headers-subbuild/node-api-headers-populate-prefix/src/v1.1.0.tar.gz'
::    timeout='none'
::    inactivity timeout='none'
:: -- Using src='https://github.com/nodejs/node-api-headers/archive/refs/tags/v1.1.0.tar.gz'
:: CMake Error at node-api-headers-subbuild/node-api-headers-populate-prefix/src/node-api-headers-populate-stamp/download-node-api-headers-populate.cmake:170 (message):
::   Each download failed!
::
::     error: downloading 'https://github.com/nodejs/node-api-headers/archive/refs/tags/v1.1.0.tar.gz' failed
::           status_code: 22
::           status_string: "HTTP response code said error"
::           log:
::           --- LOG BEGIN ---
::           Host github.com:443 was resolved.
hector-cao commented 1 month ago

@frenchwr Awesome ! I m not sure of technical feasability here but I m wondering if there is a way to do load the kernel as part of the hook called upon the intel-npu-fw interface connection as before.

something like this in the snapcraft:

hooks:
  connect-plug-intel-npu-fw:
    plugs:
      - intel-npu-kmod

It would make sense to me to tight the kernel module load on this interface connection and also to load the kernel module on actual use (instead of having a daemon that continuously tries and fails if the users do not yet connect the appropriate interfaces - because they do not want to use NPU even if the snap is installed on the system).

What do you think ?

frenchwr commented 1 month ago

@frenchwr Awesome ! I m not sure of technical feasability here but I m wondering if there is a way to do load the kernel as part of the hook called upon the intel-npu-fw interface connection as before.

something like this in the snapcraft:

hooks:
  connect-plug-intel-npu-fw:
    plugs:
      - intel-npu-kmod

It would make sense to me to tight the kernel module load on this interface connection and also to load the kernel module on actual use (instead of having a daemon that continuously tries and fails if the users do not yet connect the appropriate interfaces - because they do not want to use NPU even if the snap is installed on the system).

What do you think ?

I originally pursued this path but there were problems:

hector-cao commented 1 month ago

@frenchwr Awesome ! I m not sure of technical feasability here but I m wondering if there is a way to do load the kernel as part of the hook called upon the intel-npu-fw interface connection as before. something like this in the snapcraft:

hooks:
  connect-plug-intel-npu-fw:
    plugs:
      - intel-npu-kmod

It would make sense to me to tight the kernel module load on this interface connection and also to load the kernel module on actual use (instead of having a daemon that continuously tries and fails if the users do not yet connect the appropriate interfaces - because they do not want to use NPU even if the snap is installed on the system). What do you think ?

I originally pursued this path but there were problems:

  • The custom firmware path in /sys/module/firmware_class/parameters/path does not survive reboots, and hook-based mechanism does not trigger on reboot. Snap interface connections (even manual ones) persist on reboot, so a plug's connect-plug* hook does not run on reboot. I don't love the daemon solution - I don't think it's perfect - but it does solve the reboot issue as it's run as a systemd service on boot. A more permanent solution would be writing a new snapd interface that allows the snap to write the firmware binary blobs to a conventional path on the host (e.g. /usr/lib/firmware/intel).
  • You need access to two interfaces from a hook that runs when only one of these interfaces connects - so which interfaces's connect-plug* hook should be used? Whichever you choose, you need to ensure that the other interface has first connected. If the snapstore team approves autoconnecting both interfaces, is there a mechanism for controlling the order at which they connect? As you can see it gets messy.

I did know that the manual connections stay on reboot and hooks are not invoked again.

A more permanent solution would be writing a new snapd interface that allows the snap to write the firmware binary blobs to a conventional path on the host (e.g. `/usr/lib/firmware/intel`).

Did you think also to the approach to modify the kernel boot command line (via grub) to add the firmware path to the vpu module ? This change is going to be permanent across reboot also.

For your second bullet, i see, let me think about it but probably you make the point

frenchwr commented 1 month ago

Did you think also to the approach to modify the kernel boot command line (via grub) to add the firmware path to the vpu module ? This change is going to be permanent across reboot also.

Is that possible? Does the intel_vpu kernel module accept the path as a param? I don't see anything like that in the source. If it is supported then I agree it's definitely worth consideration.

hector-cao commented 1 month ago

Did you think also to the approach to modify the kernel boot command line (via grub) to add the firmware path to the vpu module ? This change is going to be permanent across reboot also.

Is that possible? Does the intel_vpu kernel module accept the path as a param? I don't see anything like that in the source. If it is supported then I agree it's definitely worth consideration.

I believe yes, i already comment about that in the first MR, here is the source code : https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/accel/ivpu/ivpu_fw.c?h=linux-6.9.y#n48

frenchwr commented 1 month ago

I believe yes, i already comment about that in the first MR, here is the source code : https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/accel/ivpu/ivpu_fw.c?h=linux-6.9.y#n48

Awesome - now I remember. Let me think about it and test.

frenchwr commented 1 month ago

I've given this some thought and researched a bit. My overall thought is that updating a system's kernel cmdline feels too obtrusive and risky. This snap is designed for laptop and desktop usage, so I think we should strongly prefer a design that limits changes to the whole system.

The custom firmware path at /sys/module/firmware_class/parameters/path, while less permanent, does offer good flexibility. If for some reason the binary blobs are not in the custom path, the kernel module will next search in the conventional system locations. It's not as clear to me (from the source code) how the kernel module would behave if we passed the path as a param but the firmware was not found on that path.

There's also the issue of kernel upgrades. I guess we would need some mechanism for also updating the kernel boot command any time a new kernel is installed.

Gadget snaps do give you the ability to customize the kernel boot command, but these are intended for IoT devices. I do not see any other type of grub interface support within snapd.

hector-cao commented 1 month ago
  • intel_vpu

I believe kernel command-line is persistent through kernel upgrades, However, i agree that doing system-wide configuration change is not ideal

But, i was thinking if we can give the custom firmware path to vpu driver directly (through driver parameter) instead of using the parameter path of firmware_class

So we can remove the need for kernel-firmware-control and the contents of the script load-npu-firmware will look like

...
# Load NPU firmware from custom path
rmmod intel_vpu
modprobe intel_vpu ivpu_firmware=xxx

Of course, this approach depends on vpu driver feature to take custom firmware path and will no longer work if the vpu driver removes the support for this feature in the future.

frenchwr commented 1 month ago
  • intel_vpu

I believe kernel command-line is persistent through kernel upgrades, However, i agree that doing system-wide configuration change is not ideal

But, i was thinking if we can give the custom firmware path to vpu driver directly (through driver parameter) instead of using the parameter path of firmware_class

So we can remove the need for kernel-firmware-control and the contents of the script load-npu-firmware will look like

...
# Load NPU firmware from custom path
rmmod intel_vpu
modprobe intel_vpu ivpu_firmware=xxx

Of course, this approach depends on vpu driver feature to take custom firmware path and will no longer work if the vpu driver removes the support for this feature in the future.

I've been testing and just noticed from the intel_vpu source you referenced:

MODULE_PARM_DESC(firmware, "NPU firmware binary in /lib/firmware/..");

Note the /lib/firmware as the base location, as though that is required. I've attempted multiple variations of the path to the firmware shipped with the snap and nothing has worked yet, but I'll look again tomorrow.

[90402.526319] intel_vpu 0000:00:0b.0: Direct firmware load for /var/snap/intel-npu-driver/current/ failed with error -2
[90657.399847] intel_vpu 0000:00:0b.0: Direct firmware load for /var/snap/intel-npu-driver/current/intel/vpu/vpu_37xx_v0.0.bin failed with error -2
[90683.971593] intel_vpu 0000:00:0b.0: Direct firmware load for ../../var/snap/intel-npu-driver/current/intel/vpu/vpu_37xx_v0.0.bin failed with error -2
[90720.210621] intel_vpu 0000:00:0b.0: Direct firmware load for ../var/snap/intel-npu-driver/current/intel/vpu/vpu_37xx_v0.0.bin failed with error -2
hector-cao commented 1 month ago

modprobe intel_vpu ivpu_firmware=

@frenchwr I m sorry to make you loose your time for probably nothing

I took a closer look at the source code and seems that the firmware parameter of the vpu driver is not a path but only the firmware file name.

frenchwr commented 1 month ago

modprobe intel_vpu ivpu_firmware=

@frenchwr I m sorry to make you loose your time for probably nothing

I took a closer look at the source code and seems that the firmware parameter of the vpu driver is not a path but only the firmware file name.

No worries! I think this is a good time for us to make sure we're considering all possibilities.