canonical / nvidia-assemble

Get rid of devmode #1

Closed xnox closed 1 year ago

xnox commented 1 year ago

Ideally the snap should work fully confined.

I started working on the hardware-control interface in snapd some time ago (https://github.com/snapcore/snapd/pull/11104) to allow the udev operations, but that stalled because my sample snaps didn't seem to have proper access to either udev or the devices. I can build experimental snapd snaps with it.

Also there is potential to optimize pc-kernel snap packaging to make it more friendly to use with nvidia.

xnox commented 1 year ago

@jocado any thoughts and tests would be welcomed here.

jocado commented 1 year ago

Hi :wave: - sorry for the delayed response.

A general issue to start with: when I tried to run snapcraft on the snap, it failed for me. I'm not sure how you are building; I'm using snapcraft and multipass. I seem to have found two issues that, for some reason, may not be present in the build system you're using.

There seems to be a bug in craft_parts (so core22-specific at the moment): if the part used to set the version is itself called version, it raises an error.

2023-02-17 17:22:07.986 :: 2023-02-17 17:22:06.839 :: error: 'override-build' in part 'version' executed an invalid control API call: variable 'version' can be set only once.
2023-02-17 17:22:07.986 :: 2023-02-17 17:22:07.043 'override-build' in part 'version' failed with code 1.

If I rename the part to set_version, it all works.

The second issue was an error from git when running git describe --tags in the src dir for the part.

fatal: detected dubious ownership in repository at '/root/parts/version/src'
To add an exception for this directory, call:

    git config --global --add safe.directory /root/parts/version/src

This can be fixed by moving the versioning to the build stage, where the perms are more likely to be correct.

There's also an opportunity to move to the newer craftctl commands, but I imagine you may prefer to keep the snapcraftctl commands, as that reduces the variance between the core20 and core22 branches.

I created a PR for the first two things: https://github.com/xnox/nvidia-assemble/pull/2 what do you think ?

jocado commented 1 year ago

After successfully compiling the snap, with the config from the above MR, it's not quite working with current changes.

I connected the kernel-module-control interface plug, and re-ran the nvidia-assemble service. This is what I get in the logs:

2023-02-17T18:53:43Z systemd[1]: Starting Service for snap application nvidia-assemble.nvidia-assemble...
2023-02-17T18:53:43Z nvidia-assemble.nvidia-assemble[4774]: + nvidia_dir=/lib/modules/*/kernel/nvidia-*
2023-02-17T18:53:43Z nvidia-assemble.nvidia-assemble[4774]: + [ -e /lib/modules/5.15.0-60-generic/kernel/nvidia-515srv/bits/SHA256SUMS ]
2023-02-17T18:53:43Z nvidia-assemble.nvidia-assemble[4774]: + cmp -s /lib/modules/5.15.0-60-generic/kernel/nvidia-515srv/bits/SHA256SUMS /var/snap/nvidia-assemble/common/nvidia-driver/bits/SHA256SUMS
2023-02-17T18:53:43Z nvidia-assemble.nvidia-assemble[4774]: + modprobe nvidia-drm modeset=1
2023-02-17T18:53:43Z nvidia-assemble.nvidia-assemble[4798]: modprobe: ERROR: ../libkmod/libkmod-module.c:191 kmod_module_parse_depline() ctx=0x55b25a8382a0 path=/lib/modules/5.15.0-60-generic/kernel/nvidia-515srv/nvidia-modeset.ko error=No such file or directory
2023-02-17T18:53:43Z nvidia-assemble.nvidia-assemble[4798]: modprobe: ERROR: ../libkmod/libkmod-module.c:191 kmod_module_parse_depline() ctx=0x55b25a8382a0 path=/lib/modules/5.15.0-60-generic/kernel/nvidia-515srv/nvidia-modeset.ko error=No such file or directory
2023-02-17T18:53:43Z nvidia-assemble.nvidia-assemble[4798]: modprobe: ERROR: could not insert 'nvidia_drm': Unknown symbol in module, or unknown parameter (see dmesg)
2023-02-17T18:53:43Z nvidia-assemble.nvidia-assemble[4774]: + echo connect kernel-module-control
2023-02-17T18:53:43Z nvidia-assemble.nvidia-assemble[4774]: connect kernel-module-control
2023-02-17T18:53:43Z nvidia-assemble.nvidia-assemble[4774]: + modprobe nvidia-uvm
2023-02-17T18:53:43Z nvidia-assemble.nvidia-assemble[4799]: modprobe: ERROR: ../libkmod/libkmod-module.c:191 kmod_module_parse_depline() ctx=0x5616c53ca2a0 path=/lib/modules/5.15.0-60-generic/kernel/nvidia-515srv/nvidia.ko error=No such file or directory
2023-02-17T18:53:43Z nvidia-assemble.nvidia-assemble[4799]: modprobe: ERROR: ../libkmod/libkmod-module.c:191 kmod_module_parse_depline() ctx=0x5616c53ca2a0 path=/lib/modules/5.15.0-60-generic/kernel/nvidia-515srv/nvidia.ko error=No such file or directory
2023-02-17T18:53:43Z nvidia-assemble.nvidia-assemble[4799]: modprobe: ERROR: could not insert 'nvidia_uvm': Unknown symbol in module, or unknown parameter (see dmesg)
2023-02-17T18:53:43Z nvidia-assemble.nvidia-assemble[4774]: + echo connect kernel-module-control
2023-02-17T18:53:43Z nvidia-assemble.nvidia-assemble[4774]: connect kernel-module-control
2023-02-17T18:53:43Z systemd[1]: snap.nvidia-assemble.nvidia-assemble.service: Deactivated successfully.

No obvious confinement issues. In fact, I re-installed it with --devmode and it was the same.

jocado commented 1 year ago

It's probably obvious from the above info, but I'm testing on UC22. pc-kernel 22/stable

xnox commented 1 year ago

So i have been experimenting with changing how the kernel snap is packaged and what nvidia-assemble is doing.

Can you please try refreshing to snaps in https://people.canonical.com/~xnox/nvidia/ ?

These have the following changes:

1) kernel snap contains nvidia.ko symlinks to the nvidia-assemble common location
2) kernel snap contains nvidia device nodes in modules.devname
3) kernel snap has nouveau drivers excluded, meaning there is no need to deny loading them
4) nvidia-assemble is modified to only copy bits of the modules
5) nvidia-assemble is modified to only assemble the .ko and load them
6) nvidia-assemble now only needs kernel-module-load permissions for modules to be correctly loaded on second boot of a new kernel abi; kernel-module-control is still needed to load new nvidia modules upon nvidia-assemble installation / first boot of a new kernel abi.

Let me know how you like the above. It seems to almost work for me in confined mode, however i still need sudo to access and make deviceQuery run. I am checking further if the permissions on the /dev/nvidia* devices are incorrect, or if snapd needs to know to map / allow access to more devices via opengl interface.

jocado commented 1 year ago

That looks very interesting and promising :+1: I am going to test it, but I will also have to create a new image to do it [ as you can't replace a signed kernel snap in a secured image ].

I will come back to you as soon as I've been able to.

xnox commented 1 year ago

That looks very interesting and promising +1 I am going to test it, but I will also have to create a new image to do it [ as you can't replace a signed kernel snap in a secured image ].

I will come back to you as soon as I've been able to.

Let me try putting this kernel into a branch in the store, as I should be able to do that on a temporary basis.

xnox commented 1 year ago

@jocado pc-kernel is now published in 22/candidate/xnox-nvidia-pc 5.15.0-60.66.1 1233 - 2023-03-23T00:00:00Z (expiry date)

xnox commented 1 year ago

For the other udev rules, I am proposing to just add them to the base snap itself for core22 and core24: https://github.com/snapcore/core-base/pull/96/files and https://github.com/snapcore/core-base/pull/97/files

jocado commented 1 year ago

pc-kernel is now published in 22/candidate/xnox-nvidia-pc

Thanks - that made it much easier to test :+1:

It seems like it works well. Obviously, I had to connect the interface plug and then restart the nvidia-assemble service, but this is looking generally very promising from my point of view.
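
For anyone repeating the test, the two manual steps mentioned above can be sketched as a small helper. This is only an illustration: `run` and `connect_and_restart` are hypothetical names, and it assumes the plug is named after the `kernel-module-control` interface, as the thread implies. `snap connect` and `snap restart` are the standard snap CLI commands.

```sh
#!/bin/sh
# Hypothetical helper: echo commands instead of executing them when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Connect one plug of a snap, then restart the snap's services so that
# the service re-runs with the new permissions.
connect_and_restart() {
    snap_name=$1
    plug=$2
    run snap connect "$snap_name:$plug"
    run snap restart "$snap_name"
}
```

Setting `DRY_RUN=1` before calling `connect_and_restart nvidia-assemble kernel-module-control` only prints the two commands, which is a safe way to check them before running as root.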

In case it's useful, here are the logs before interface connection:

2023-02-21T17:18:57Z systemd[1]: Starting Service for snap application nvidia-assemble.nvidia-assemble...
2023-02-21T17:18:57Z nvidia-assemble.nvidia-assemble[4509]: + nvidia_dir=/lib/modules/*/kernel/nvidia-*
2023-02-21T17:18:57Z nvidia-assemble.nvidia-assemble[4509]: + [ -e /lib/modules/5.15.0-60-generic/kernel/nvidia-515srv/bits/SHA256SUMS ]
2023-02-21T17:18:57Z nvidia-assemble.nvidia-assemble[4509]: + cmp -s /lib/modules/5.15.0-60-generic/kernel/nvidia-515srv/bits/SHA256SUMS /var/snap/nvidia-assemble/common/nvidia-driver/bits/SHA256SUMS
2023-02-21T17:18:57Z nvidia-assemble.nvidia-assemble[4509]: + rm -rf /var/snap/nvidia-assemble/common/nvidia-driver
2023-02-21T17:18:57Z nvidia-assemble.nvidia-assemble[4509]: + mkdir -p /var/snap/nvidia-assemble/common/nvidia-driver
2023-02-21T17:18:57Z nvidia-assemble.nvidia-assemble[4509]: + cp -r /lib/modules/5.15.0-60-generic/kernel/nvidia-515srv/bits /var/snap/nvidia-assemble/common/nvidia-driver/
2023-02-21T17:19:00Z nvidia-assemble.nvidia-assemble[4509]: + cd /var/snap/nvidia-assemble/common/nvidia-driver/bits
2023-02-21T17:19:00Z nvidia-assemble.nvidia-assemble[4509]: + sed -i s|/usr/bin/ld.bfd|ld.bfd| BUILD
2023-02-21T17:19:00Z nvidia-assemble.nvidia-assemble[4509]: + sh BUILD
2023-02-21T17:19:01Z nvidia-assemble.nvidia-assemble[4569]: nvidia-drm.ko: OK
2023-02-21T17:19:01Z nvidia-assemble.nvidia-assemble[4569]: nvidia-modeset.ko: OK
2023-02-21T17:19:01Z nvidia-assemble.nvidia-assemble[4569]: nvidia-peermem.ko: OK
2023-02-21T17:19:01Z nvidia-assemble.nvidia-assemble[4569]: nvidia-uvm.ko: OK
2023-02-21T17:19:02Z nvidia-assemble.nvidia-assemble[4569]: nvidia.ko: OK
2023-02-21T17:19:02Z nvidia-assemble.nvidia-assemble[4509]: + modprobe nvidia-drm modeset=1
2023-02-21T17:19:02Z nvidia-assemble.nvidia-assemble[4581]: /snap/nvidia-assemble/x1/commands/nvidia-assemble: 18: modprobe: Permission denied
2023-02-21T17:19:02Z nvidia-assemble.nvidia-assemble[4509]: + echo connect kernel-module-control
2023-02-21T17:19:02Z nvidia-assemble.nvidia-assemble[4509]: connect kernel-module-control
2023-02-21T17:19:02Z nvidia-assemble.nvidia-assemble[4509]: + modprobe nvidia-uvm
2023-02-21T17:19:02Z nvidia-assemble.nvidia-assemble[4582]: /snap/nvidia-assemble/x1/commands/nvidia-assemble: 19: modprobe: Permission denied
2023-02-21T17:19:02Z nvidia-assemble.nvidia-assemble[4509]: + echo connect kernel-module-control
2023-02-21T17:19:02Z nvidia-assemble.nvidia-assemble[4509]: connect kernel-module-control
2023-02-21T17:19:02Z systemd[1]: snap.nvidia-assemble.nvidia-assemble.service: Deactivated successfully.
2023-02-21T17:19:02Z systemd[1]: Finished Service for snap application nvidia-assemble.nvidia-assemble.
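
Reading the trace above, the service appears to rebuild the modules only when the SHA256SUMS shipped with the kernel differ from the cached copy. A minimal sketch of that check follows; `needs_rebuild` is a hypothetical name, the logic is reconstructed from the `set -x` output rather than taken from the snap's script, and the behaviour when the kernel ships no SHA256SUMS is an assumption.

```sh
#!/bin/sh
# Hypothetical reconstruction of the staleness check seen in the trace.
# Returns 0 when the cached driver bits are missing or out of date.
needs_rebuild() {
    kernel_bits=$1   # e.g. /lib/modules/<abi>/kernel/nvidia-<series>/bits
    cached_bits=$2   # e.g. /var/snap/nvidia-assemble/common/nvidia-driver/bits
    # Assumption: with no checksums from the kernel snap there is nothing to assemble.
    [ -e "$kernel_bits/SHA256SUMS" ] || return 1
    # cmp -s is silent; it exits non-zero if the files differ or the cache is absent.
    if cmp -s "$kernel_bits/SHA256SUMS" "$cached_bits/SHA256SUMS"; then
        return 1
    fi
    return 0
}
```

In the first log the check fails, so the bits are copied and BUILD is run; in the later logs the checksums match and the script goes straight to modprobe.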

Logs after interface connection:

2023-02-21T17:20:01Z systemd[1]: Starting Service for snap application nvidia-assemble.nvidia-assemble...
2023-02-21T17:20:01Z nvidia-assemble.nvidia-assemble[5024]: + nvidia_dir=/lib/modules/*/kernel/nvidia-*
2023-02-21T17:20:01Z nvidia-assemble.nvidia-assemble[5024]: + [ -e /lib/modules/5.15.0-60-generic/kernel/nvidia-515srv/bits/SHA256SUMS ]
2023-02-21T17:20:01Z nvidia-assemble.nvidia-assemble[5024]: + cmp -s /lib/modules/5.15.0-60-generic/kernel/nvidia-515srv/bits/SHA256SUMS /var/snap/nvidia-assemble/common/nvidia-driver/bits/SHA256SUMS
2023-02-21T17:20:01Z nvidia-assemble.nvidia-assemble[5024]: + modprobe nvidia-drm modeset=1
2023-02-21T17:20:04Z nvidia-assemble.nvidia-assemble[5024]: + modprobe nvidia-uvm
2023-02-21T17:20:04Z systemd[1]: snap.nvidia-assemble.nvidia-assemble.service: Deactivated successfully.
2023-02-21T17:20:04Z systemd[1]: Finished Service for snap application nvidia-assemble.nvidia-assemble.
2023-02-21T17:20:04Z systemd[1]: snap.nvidia-assemble.nvidia-assemble.service: Consumed 3.276s CPU time.
2023-02-21T17:23:03Z systemd[1]: Starting Service for snap application nvidia-assemble.nvidia-assemble...
2023-02-21T17:23:07Z nvidia-assemble.nvidia-assemble[1592]: + nvidia_dir=/lib/modules/*/kernel/nvidia-*
2023-02-21T17:23:07Z nvidia-assemble.nvidia-assemble[1592]: + [ -e /lib/modules/5.15.0-60-generic/kernel/nvidia-515srv/bits/SHA256SUMS ]
2023-02-21T17:23:07Z nvidia-assemble.nvidia-assemble[1592]: + cmp -s /lib/modules/5.15.0-60-generic/kernel/nvidia-515srv/bits/SHA256SUMS /var/snap/nvidia-assemble/common/nvidia-driver/bits/SHA256SUMS
2023-02-21T17:23:07Z nvidia-assemble.nvidia-assemble[1592]: + modprobe nvidia-drm modeset=1
2023-02-21T17:23:07Z nvidia-assemble.nvidia-assemble[1592]: + modprobe nvidia-uvm
2023-02-21T17:23:07Z systemd[1]: snap.nvidia-assemble.nvidia-assemble.service: Deactivated successfully.
2023-02-21T17:23:07Z systemd[1]: Finished Service for snap application nvidia-assemble.nvidia-assemble.

Everything seems to be in place:

# ls -al /dev/nvidia*
crw------- 1 root root 195, 254 Feb 21 17:22 /dev/nvidia-modeset
crw------- 1 root root 505,   0 Feb 21 17:22 /dev/nvidia-uvm
crw------- 1 root root 505,   1 Feb 21 17:22 /dev/nvidia-uvm-tools
crw------- 1 root root 195,   0 Feb 21 17:22 /dev/nvidia0
crw------- 1 root root 195, 255 Feb 21 17:22 /dev/nvidiactl
# lsmod |grep nvid
nvidia_uvm           1327104  0
nvidia_drm             73728  1
nvidia_modeset       1146880  1 nvidia_drm
nvidia              40849408  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        311296  4 mgag200,nvidia_drm
drm                   622592  7 drm_kms_helper,nvidia,mgag200,nvidia_drm
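
The same check can be scripted. The sketch below lists any of the four modules shown in the lsmod output above that are not loaded; `missing_nvidia_modules` is a hypothetical helper and the module list is taken from that output, not from the snap.

```sh
#!/bin/sh
# Hypothetical helper: read lsmod-style lines on stdin and print every
# expected NVIDIA module that does not appear in the first column.
missing_nvidia_modules() {
    awk '
        BEGIN { n = split("nvidia nvidia_modeset nvidia_drm nvidia_uvm", want, " ") }
        { loaded[$1] = 1 }
        END { for (i = 1; i <= n; i++) if (!(want[i] in loaded)) print want[i] }
    '
}
```

Running `lsmod | missing_nvidia_modules` prints nothing when all four modules are present.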

jocado commented 1 year ago

For the other udev rules, I am proposing to just add them to the base snap itself for core22 and core24:

I'm not an expert here, but just for my understanding, it seems like those rules are more about the graphical use case, not CUDA. Is that true ?

jocado commented 1 year ago

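So, it seems like, if you are able to continue this path, the remaining changes are:

  • Merge changes pc-kernel snap

  • Get interface auto-connection approved for nvidia-assemble snap

  • Merge changes for core22 and core24 snaps

Is there anything I'm missing ?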

xnox commented 1 year ago

So, it seems like, if you are able to continue this path, the remaining changes are:

  • Merge changes pc-kernel snap

Yeap.

  • Get interface auto-connection approved for nvidia-assemble snap

I am not sure if this is a suitable request for the global store, however one should be able to make such a change in their own gadget and/or their own brand store. I will consult with relevant people about this.

  • Merge changes for core22 and core24 snaps

Yeap

Is there anything I'm missing ?

We are still trying to see and check if everything works correctly, as we are experiencing some odd things with graphics-core22-samples right now.

xnox commented 1 year ago

For the other udev rules, I am proposing to just add them to the base snap itself for core22 and core24:

I'm not an expert here, but just for my understanding, it seems like those rules are more about the graphical use case, not CUDA. Is that true ?

It is related to logind not randomly revoking access to the GPU device, and related to power-management, but yeah, it's not strictly related to running things as root for CUDA.

xnox commented 1 year ago

@jocado because nvidia-uvm dynamically allocates its character device major number, and is not allowed to use devtmpfs APIs (GPL symbols), userspace must check for and create character devices with the correct matching major number for uvm. Thus a new snapd interface is needed to allow this, as there currently isn't an mknod interface.

https://github.com/snapcore/snapd/pull/12591
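
To illustrate what userspace has to do here, a rough sketch: look up the major number the kernel assigned to nvidia-uvm in /proc/devices, then create the nodes with mknod. The function names are hypothetical, and the minor numbers (0 and 1) and mode 600 are assumptions based on the `ls -al /dev/nvidia*` listing earlier in this thread. This version only creates missing nodes; it does not repair a node that exists with a stale major number.

```sh
#!/bin/sh
# Hypothetical helper: print the major number of a character device, read
# from a /proc/devices-style file (second argument, default /proc/devices).
chardev_major() {
    name=$1
    devices_file=${2:-/proc/devices}
    awk -v name="$name" '
        /^Block devices:/ { exit }                 # ignore the block-device section
        $2 == name { print $1; found = 1; exit }
        END { exit !found }                        # non-zero status if not registered
    ' "$devices_file"
}

# Create the uvm nodes if they are missing. Needs root and, when confined,
# a snapd interface that permits mknod.
create_uvm_nodes() {
    major=$(chardev_major nvidia-uvm) || return 1
    [ -e /dev/nvidia-uvm ] || mknod -m 600 /dev/nvidia-uvm c "$major" 0
    [ -e /dev/nvidia-uvm-tools ] || mknod -m 600 /dev/nvidia-uvm-tools c "$major" 1
}
```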

xnox commented 1 year ago

Sample snaps with the new interface are at https://people.canonical.com/~xnox/nvidia/

jocado commented 1 year ago

Hi @xnox

Apologies for the radio silence.

I really appreciate your continued work on this.

As my testing options are limited, and I would like to start building and testing some of our use cases that will use the nvidia support before they are available, is my best bet to use the previous nvidia-assemble_3-13-gcb0be00_amd64.snap and pc-kernel from 22/candidate/xnox-nvidia-pc ? I guess if that's working for me, it's enough to proceed with for now on my side ?

I'm more than happy to try and test anything you would like on my hardware generally. But I guess there's limited value in my testing the current nvidia-assemble snap in https://people.canonical.com/~xnox/nvidia/ until the new snap interface is reviewed and agreed on a bit more. Please let me know if you would like me to test it in any case :)

Thanks.

xnox commented 1 year ago

Hi @xnox

Apologies for the radio silence.

I really appreciate your continued work on this.

As my testing options are limited, and I would like to start building and testing some of our use cases that will use the nvidia support before they are available, is my best bet to use the previous nvidia-assemble_3-13-gcb0be00_amd64.snap and pc-kernel from 22/candidate/xnox-nvidia-pc ? I guess if that's working for me, it's enough to proceed with for now on my side ?

I'm more than happy to try and test anything you would like on my hardware generally. But I guess there's limited value in my testing the current nvidia-assemble snap in https://people.canonical.com/~xnox/nvidia/ until the new snap interface is reviewed and agreed on a bit more. Please let me know if you would like me to test it in any case :)

Thanks.

Indeed we are a bit stuck, as i need a new interface, and snapstore will not accept this snap into the store if it uses an undefined interface. Hence we are sort of back to devmode territory.

Let me do things to publish stuff into the store such that they are usable again, at least in devmode. And let me try to land the kernel change too. Such that you can at least install everything confined; but nvidia-assemble in devmode, and test things from there.

jocado commented 1 year ago

That would be really great :+1:

xnox commented 1 year ago

snapd edge has the interface, but i cannot upload this snap to use it yet, because of https://code.launchpad.net/~xnox/review-tools/+git/review-tools/+merge/438546 not being merged or deployed yet. It will still be automatically rejected.