aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Error when trying to install aws-neuronx-dkms 2.* on the ubuntu-latest runner in GitHub Action #843

Closed tmotegi closed 7 months ago

tmotegi commented 8 months ago

I have developed a torch-neuronx application and run its unit tests in GitHub Actions on the ubuntu-latest runner. This worked until now, but today the following command started failing in GitHub Actions: sudo apt-get install --no-install-recommends -y aws-neuronx-dkms=2.*

Here is the log I retrieved:

(Reading database ... 300630 files and directories currently installed.)
Preparing to unpack .../dctrl-tools_2.24-3build2_amd64.deb ...
Unpacking dctrl-tools (2.24-3build2) ...
Selecting previously unselected package dkms.
Preparing to unpack .../dkms_2.8.7-2ubuntu2.2_all.deb ...
Unpacking dkms (2.8.7-2ubuntu2.2) ...
Selecting previously unselected package aws-neuronx-dkms.
Preparing to unpack .../aws-neuronx-dkms_2.15.9.0_amd64.deb ...
Unpacking aws-neuronx-dkms (2.15.9.0) ...
Setting up dctrl-tools (2.24-3build2) ...
Setting up dkms (2.8.7-2ubuntu2.2) ...
Setting up aws-neuronx-dkms (2.15.9.0) ...
Loading new aws-neuronx-2.15.9.0 DKMS files...
Building for 6.5.0-1015-azure
Building for architecture x86_64
Building initial module for 6.5.0-1015-azure
Error! Bad return status for module build on kernel: 6.5.0-1015-azure (x86_64)
Consult /var/lib/dkms/aws-neuronx/2.15.9.0/build/make.log for more information.
dpkg: error processing package aws-neuronx-dkms (--configure):
 installed aws-neuronx-dkms package post-installation script subprocess returned error exit status 10
Processing triggers for man-db (2.10.2-1) ...
Errors were encountered while processing:
 aws-neuronx-dkms
needrestart is being skipped since dpkg has failed
E: Sub-process /usr/bin/dpkg returned an error code (1)
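
The apt output above only points at make.log, which is normally lost when a CI runner is torn down. A minimal sketch of a step that prints it when the install fails, assuming the default DKMS tree under /var/lib/dkms (the path shown in the log); dump_dkms_logs is a hypothetical helper name, not part of DKMS or the Neuron SDK:

```shell
# Hypothetical helper: print every DKMS make.log found under a root directory.
# The default root is /var/lib/dkms, the location reported in the log above.
dump_dkms_logs() {
    root="${1:-/var/lib/dkms}"
    find "$root" -type f -name make.log 2>/dev/null | while IFS= read -r log; do
        echo "==== $log ===="
        cat "$log"
    done
}

# Possible use in a CI step: show the logs only on failure, then fail the step.
# sudo apt-get install --no-install-recommends -y 'aws-neuronx-dkms=2.*' \
#     || { dump_dkms_logs; exit 1; }
```

Quoting the version pattern ('aws-neuronx-dkms=2.*') keeps the shell from trying to expand the asterisk as a filename glob.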

The contents of /var/lib/dkms/aws-neuronx/2.15.9.0/build/make.log are as follows:

DKMS make.log for aws-neuronx-2.15.9.0 for kernel 6.5.0-1015-azure (x86_64)
Fri Mar  1 08:23:39 UTC 2024
make: Entering directory '/usr/src/linux-headers-6.5.0-1015-azure'
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  You are using:           gcc-11 (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  CC [M]  /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_arch.o
  CC [M]  /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_dhal.o
  CC [M]  /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_reg_access.o
  CC [M]  /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_module.o
  CC [M]  /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_pci.o
  CC [M]  /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_mempool.o
  CC [M]  /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_dma.o
  CC [M]  /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_ring.o
  CC [M]  /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_ds.o
  CC [M]  /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_core.o
  CC [M]  /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_crwl.o
  CC [M]  /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_cdev.o
  CC [M]  /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_topsp.o
In file included from ./include/linux/linkage.h:7,
                 from ./include/linux/kernel.h:17,
                 from /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_cdev.c:12:
/var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_cdev.c: In function ‘ncdev_module_init’:
./include/linux/export.h:29:22: error: passing argument 1 of ‘class_create’ from incompatible pointer type [-Werror=incompatible-pointer-types]
   29 | #define THIS_MODULE (&__this_module)
      |                     ~^~~~~~~~~~~~~~~
      |                      |
      |                      struct module *
/var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_cdev.c:2219:41: note: in expansion of macro ‘THIS_MODULE’
 2219 |         neuron_dev_class = class_create(THIS_MODULE, "neuron_device");
      |                                         ^~~~~~~~~~~
In file included from ./include/linux/device.h:31,
                 from ./include/linux/cdev.h:8,
                 from /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_cdev.c:14:
./include/linux/device/class.h:230:54: note: expected ‘const char *’ but argument is of type ‘struct module *’
  230 | struct class * __must_check class_create(const char *name);
      |                                          ~~~~~~~~~~~~^~~~
/var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_cdev.c:2219:28: error: too many arguments to function ‘class_create’
 2219 |         neuron_dev_class = class_create(THIS_MODULE, "neuron_device");
      |                            ^~~~~~~~~~~~
In file included from ./include/linux/device.h:31,
                 from ./include/linux/cdev.h:8,
                 from /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_cdev.c:14:
./include/linux/device/class.h:230:29: note: declared here
  230 | struct class * __must_check class_create(const char *name);
      |                             ^~~~~~~~~~~~
cc1: all warnings being treated as errors
make[2]: *** [scripts/Makefile.build:251: /var/lib/dkms/aws-neuronx/2.15.9.0/build/neuron_cdev.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [/usr/src/linux-headers-6.5.0-1015-azure/Makefile:2039: /var/lib/dkms/aws-neuronx/2.15.9.0/build] Error 2
make: *** [Makefile:234: __sub-make] Error 2
make: Leaving directory '/usr/src/linux-headers-6.5.0-1015-azure'

I've confirmed that the ubuntu-latest runner image has been updated, and the kernel version was updated along with it: https://github.com/actions/runner-images/releases/tag/ubuntu22%2F20240225.1

How can I install aws-neuronx-dkms 2.* on the new kernel version of ubuntu-latest?
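
The make.log shows why the build breaks: the 6.5.0 headers declare class_create(const char *name) with a single argument, while neuron_cdev.c in 2.15.9.0 still passes THIS_MODULE as a first argument. A rough pre-flight check for CI is sketched below. It assumes the cutoff is kernel 6.4, which is where the upstream kernel dropped the module argument as far as I know; this thread itself only confirms that 6.5.0 fails and 6.2.0 works. The function name kernel_at_least_6_4 is hypothetical.

```shell
# Hypothetical helper: succeed if a kernel release string is 6.4 or newer.
# ASSUMPTION: 6.4 is the release where class_create() lost its module argument.
kernel_at_least_6_4() {
    base="${1%%-*}"   # "6.5.0-1015-azure" -> "6.5.0"
    # sort -V orders version strings; if "6.4" sorts first, base is >= 6.4.
    [ "$(printf '%s\n' "6.4" "$base" | sort -V | head -n1)" = "6.4" ]
}

running="$(uname -r)"
if kernel_at_least_6_4 "$running"; then
    echo "kernel $running: aws-neuronx-dkms 2.15.9.0 is likely to fail to build"
else
    echo "kernel $running: uses the older two-argument class_create()"
fi
```

A check like this only reports the mismatch early; it does not make 2.15.9.0 build on the newer kernel.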

jeffhataws commented 8 months ago

@tmotegi thanks for reporting the issue. We have made a fix and it will be available in the upcoming release.

owu-1 commented 8 months ago

The latest Ubuntu Server AMI that works for now is ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240220

jamesbuddrige commented 8 months ago

I'm also seeing this on Ubuntu 22.04 on AWS, and I've tried the older AMI with no luck:

amazon-ebs.linux: Setting up aws-neuronx-dkms (2.15.9.0) ...
amazon-ebs.linux: debconf: unable to initialize frontend: Dialog
amazon-ebs.linux: debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
amazon-ebs.linux: debconf: falling back to frontend: Readline
amazon-ebs.linux: Loading new aws-neuronx-2.15.9.0 DKMS files...
amazon-ebs.linux: Building for 6.5.0-1015-aws
amazon-ebs.linux: Building for architecture x86_64
amazon-ebs.linux: Building initial module for 6.5.0-1015-aws
amazon-ebs.linux: Error! Bad return status for module build on kernel: 6.5.0-1015-aws (x86_64)
amazon-ebs.linux: Consult /var/lib/dkms/aws-neuronx/2.15.9.0/build/make.log for more information.
amazon-ebs.linux: dpkg: error processing package aws-neuronx-dkms (--configure):
amazon-ebs.linux:  installed aws-neuronx-dkms package post-installation script subprocess returned error exit status 10
amazon-ebs.linux: Processing triggers for man-db (2.10.2-1) ...
amazon-ebs.linux: Processing triggers for libc-bin (2.35-0ubuntu3.6) ...
amazon-ebs.linux: Errors were encountered while processing:
amazon-ebs.linux:  aws-neuronx-dkms
amazon-ebs.linux: needrestart is being skipped since dpkg has failed
==> amazon-ebs.linux: E: Sub-process /usr/bin/dpkg returned an error code (1)

lanbochen-anyscale commented 7 months ago

> @tmotegi thanks for reporting the issue. We have made a fix and it will be available in the upcoming release.

Hi, what is the ETA for the next release? Thanks.

ftimyo commented 7 months ago

The current aws-neuronx-dkms does not build with kernel 6.5.0-. I tried several Ubuntu 22.04 images (240207.1, 240220, 240223); the build only works with kernel 6.2.0-.

ntriller-plaid commented 7 months ago

It appears the kernel incompatibility problem has been resolved based on my tests.

[Image: screenshot from this PR]

james-aws commented 7 months ago

This is fixed in the latest release. Thanks for reporting. Closing this now.
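
To confirm the fix on a particular runner or AMI, one option is to check what DKMS reports after the package installs. The sketch below is a hypothetical helper that reads `dkms status` output on stdin. It assumes each status line starts with the module name followed by "," or "/" and ends its state with ": installed"; the exact format differs between DKMS versions, so adjust the patterns to what your system prints.

```shell
# Hypothetical helper: read `dkms status` output on stdin and succeed if an
# aws-neuronx entry is marked as installed.
# ASSUMPTION: lines look like "aws-neuronx, <version>, <kernel>, x86_64: installed"
# or "aws-neuronx/<version>, <kernel>, x86_64: installed".
neuron_module_installed() {
    grep -E '^aws-neuronx[,/]' | grep -q ': installed'
}

# Possible use after the apt-get install step:
# dkms status | neuron_module_installed && echo "aws-neuronx module is built"
```

Reading from stdin keeps the check independent of how `dkms status` is invoked, so the same function can be run against saved CI logs.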