aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Error during driver installation: aws-neuron #945

Closed: ssotank closed this issue 3 weeks ago

ssotank commented 4 weeks ago
  1. I'm using an inf1.xlarge EC2 instance in eu-central-1 with AMI ami-06e89bbb5f88b3a34 (Ubuntu 22.04 LTS, build 20240801).
  2. I'm installing the drivers from the AWS apt repository.

It fails:

$ sudo apt install aws-neuron-dkms
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages will be upgraded:
  aws-neuron-dkms
1 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.
1 not fully installed or removed.
Need to get 0 B/86.9 kB of archives.
After this operation, 41.0 kB of additional disk space will be used.
(Reading database ... 71498 files and directories currently installed.)
Preparing to unpack .../aws-neuron-dkms_2.3.26.0_amd64.deb ...
Deleting module aws-neuron-2.2.14.0 completely from the DKMS tree.
Unpacking aws-neuron-dkms (2.3.26.0) over (2.2.14.0) ...
Setting up aws-neuron-dkms (2.3.26.0) ...
Loading new aws-neuron-2.3.26.0 DKMS files...
Building for 6.5.0-1024-aws
Building for architecture x86_64
Building initial module for 6.5.0-1024-aws
ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/aws-neuron-dkms.0.crash'
Error! Bad return status for module build on kernel: 6.5.0-1024-aws (x86_64)
Consult /var/lib/dkms/aws-neuron/2.3.26.0/build/make.log for more information.
dpkg: error processing package aws-neuron-dkms (--configure):
 installed aws-neuron-dkms package post-installation script subprocess returned error exit status 10
Errors were encountered while processing:
 aws-neuron-dkms
needrestart is being skipped since dpkg has failed
E: Sub-process /usr/bin/dpkg returned an error code (1)

because it can't compile the sources:

$ cat /var/lib/dkms/aws-neuron/2.3.26.0/build/make.log
DKMS make.log for aws-neuron-2.3.26.0 for kernel 6.5.0-1024-aws (x86_64)
Wed Aug 14 13:25:16 UTC 2024
make: Entering directory '/usr/src/linux-headers-6.5.0-1024-aws'
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  You are using:           gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  CC [M]  /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_module.o
  CC [M]  /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_pci.o
  CC [M]  /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_mempool.o
  CC [M]  /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_dma.o
  CC [M]  /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_ring.o
  CC [M]  /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_ds.o
  CC [M]  /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_core.o
  CC [M]  /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_crwl.o
  CC [M]  /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_cdev.o
  CC [M]  /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_topsp.o
  CC [M]  /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_pid.o
  CC [M]  /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_reset.o
In file included from ./include/linux/linkage.h:7,
                 from ./include/linux/kernel.h:17,
                 from /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_cdev.c:12:
/var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_cdev.c: In function ‘ncdev_module_init’:
./include/linux/export.h:29:22: error: passing argument 1 of ‘class_create’ from incompatible pointer type [-Werror=incompatible-pointer-types]
   29 | #define THIS_MODULE (&__this_module)
      |                     ~^~~~~~~~~~~~~~~
      |                      |
      |                      struct module *
/var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_cdev.c:1668:41: note: in expansion of macro ‘THIS_MODULE’
 1668 |         neuron_dev_class = class_create(THIS_MODULE, "neuron_device");
      |                                         ^~~~~~~~~~~
In file included from ./include/linux/device.h:31,
                 from ./include/linux/cdev.h:8,
                 from /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_cdev.c:14:
./include/linux/device/class.h:230:54: note: expected ‘const char *’ but argument is of type ‘struct module *’
  230 | struct class * __must_check class_create(const char *name);
      |                                          ~~~~~~~~~~~~^~~~
/var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_cdev.c:1668:28: error: too many arguments to function ‘class_create’
 1668 |         neuron_dev_class = class_create(THIS_MODULE, "neuron_device");
      |                            ^~~~~~~~~~~~
In file included from ./include/linux/device.h:31,
                 from ./include/linux/cdev.h:8,
                 from /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_cdev.c:14:
./include/linux/device/class.h:230:29: note: declared here
  230 | struct class * __must_check class_create(const char *name);
      |                             ^~~~~~~~~~~~
cc1: all warnings being treated as errors
make[2]: *** [scripts/Makefile.build:251: /var/lib/dkms/aws-neuron/2.3.26.0/build/neuron_cdev.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [/usr/src/linux-headers-6.5.0-1024-aws/Makefile:2039: /var/lib/dkms/aws-neuron/2.3.26.0/build] Error 2
make: *** [Makefile:234: __sub-make] Error 2
make: Leaving directory '/usr/src/linux-headers-6.5.0-1024-aws'

aws-taylor commented 4 weeks ago

Hello @ssotank,

This issue occurs because of a breaking change to class_create (reached via the linux/device.h header) that was introduced in the 6.4 kernel. I'll notify the responsible team and we'll work on support for newer kernels. In the meantime, you can work around this issue by using an older version of Ubuntu that has a pre-6.4 kernel.
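
As a rough way to tell whether a given machine runs a kernel at or above the 6.4 threshold mentioned here, the sketch below compares the running kernel release against 6.4. The function name `kernel_at_least` is a hypothetical helper, not part of any Neuron tooling, and the sketch assumes GNU `sort -V` is available (as it is on Ubuntu).

```shell
#!/bin/sh
# kernel_at_least RELEASE VERSION
# Hypothetical helper: returns 0 if the kernel release string RELEASE
# (e.g. "6.5.0-1024-aws") is at least VERSION (major.minor, e.g. "6.4").
kernel_at_least() {
    release=$1
    wanted=$2
    # Keep only major.minor, e.g. 6.5.0-1024-aws -> 6.5
    current=$(printf '%s\n' "$release" | sed -E 's/^([0-9]+\.[0-9]+).*/\1/')
    # sort -V orders version strings; if the wanted version sorts first
    # (or the two are equal), the current kernel is at least that version.
    lowest=$(printf '%s\n%s\n' "$wanted" "$current" | sort -V | head -n 1)
    [ "$lowest" = "$wanted" ]
}

if kernel_at_least "$(uname -r)" 6.4; then
    echo "kernel $(uname -r) is 6.4 or newer"
else
    echo "kernel $(uname -r) is older than 6.4"
fi
```

With the kernel from the log above (6.5.0-1024-aws), the first branch is taken, which is consistent with the two-argument class_create call failing to compile.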

aws-taylor commented 4 weeks ago

Hello again @ssotank,

I was incorrect in my diagnosis. The aws-neuron-dkms package has been deprecated in favor of the aws-neuronx-dkms package. Please use that package instead. I'll work with the relevant teams to see if we can come up with a mechanism to make this clearer.
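
A minimal sketch of the package swap, assuming the AWS Neuron apt repository is already configured as in the original report. The `run` and `swap_neuron_driver` names and the `DRY_RUN` variable are hypothetical helpers introduced for illustration. Only the two package names come from this thread, and the exact sequence in the official setup guide may differ.

```shell
#!/bin/sh
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Remove the deprecated driver package, then install its replacement.
# Each step runs only if the previous one succeeded.
swap_neuron_driver() {
    run sudo apt-get remove -y aws-neuron-dkms &&
    run sudo apt-get update &&
    run sudo apt-get install -y aws-neuronx-dkms
}
```

Setting `DRY_RUN=1` before calling `swap_neuron_driver` prints the three commands without executing them, which is a cheap way to review them before touching the system.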

ssotank commented 3 weeks ago

Thank you, Taylor.

It does indeed work with aws-neuronx-dkms. So the driver installation is now unified for both Inf1 and Inf2?

aws-taylor commented 3 weeks ago

Correct.