amzn / amzn-drivers

Official AWS drivers repository for Elastic Network Adapter (ENA) and Elastic Fabric Adapter (EFA)

Issues while upgrading AMI to v1.12.1 #175

Closed afernandezody closed 3 years ago

afernandezody commented 3 years ago

Hello, I'm trying to update the EFA drivers to v1.12.0 or later. Starting from an AWS ParallelCluster AMI (CentOS 8 flavor) and after upgrading the kernel to 4.18.0-305.3.1.el8.x86_64, I ran the full installer, which returns the following error:

= Starting Amazon Elastic Fabric Adapter Installation Script =
= EFA Installer Version: 1.12.1 =

efa-config already installed
== Installing EFA dependencies ==
Main config did not have a skip_missing_names_on_install attr. before setopt
Main config did not have a skip_missing_names_on_install attr. before setopt
Last metadata expiration check: 0:32:03 ago on Mon 14 Jun 2021 03:00:11 PM UTC.
Package kernel-devel-4.18.0-305.3.1.el8.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!
Main config did not have a skip_missing_names_on_install attr. before setopt
Main config did not have a skip_missing_names_on_install attr. before setopt
Last metadata expiration check: 0:32:05 ago on Mon 14 Jun 2021 03:00:11 PM UTC.
Package pciutils-3.7.0-1.el8.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!
== Installing EFA packages ==
dkms is already installed
Unloading EFA kernel module
Error: No matching Packages to list
Installing RPMS/CENT8/x86_64/rdma-core/ibacm-32.1-1.el8.x86_64.rpm
Main config did not have a skip_missing_names_on_install attr. before setopt
Main config did not have a skip_missing_names_on_install attr. before setopt
Last metadata expiration check: 0:32:22 ago on Mon 14 Jun 2021 03:00:11 PM UTC.
Dependencies resolved.
================================================================================
 Package                     Arch       Version          Repository        Size
================================================================================
Upgrading:
 ibacm                       x86_64     32.1-1.el8       @commandline      85 k
 infiniband-diags            x86_64     32.1-1.el8       @commandline     315 k
 infiniband-diags-compat     x86_64     32.1-1.el8       @commandline      29 k
 libibumad                   x86_64     32.1-1.el8       @commandline      24 k
 libibverbs                  x86_64     32.1-1.el8       @commandline     330 k
 libibverbs-utils            x86_64     32.1-1.el8       @commandline      66 k
 librdmacm                   x86_64     32.1-1.el8       @commandline      68 k
 librdmacm-utils             x86_64     32.1-1.el8       @commandline      91 k
 rdma-core                   x86_64     32.1-1.el8       @commandline      53 k
 rdma-core-devel             x86_64     32.1-1.el8       @commandline     319 k

Transaction Summary
================================================================================
Upgrade  10 Packages

Total size: 1.3 M
Downloading Packages:
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                        1/1 
  Running scriptlet: rdma-core-32.1-1.el8.x86_64                            1/1 
  Upgrading        : rdma-core-32.1-1.el8.x86_64                           1/20 
  Running scriptlet: rdma-core-32.1-1.el8.x86_64                           1/20 
  Upgrading        : libibverbs-32.1-1.el8.x86_64                          2/20 
  Running scriptlet: libibverbs-32.1-1.el8.x86_64                          2/20 
  Upgrading        : libibumad-32.1-1.el8.x86_64                           3/20 
  Running scriptlet: libibumad-32.1-1.el8.x86_64                           3/20 
  Upgrading        : infiniband-diags-32.1-1.el8.x86_64                    4/20 
  Running scriptlet: infiniband-diags-32.1-1.el8.x86_64                    4/20 
  Upgrading        : librdmacm-32.1-1.el8.x86_64                           5/20 
  Running scriptlet: librdmacm-32.1-1.el8.x86_64                           5/20 
  Upgrading        : ibacm-32.1-1.el8.x86_64                               6/20 
  Running scriptlet: ibacm-32.1-1.el8.x86_64                               6/20 
  Upgrading        : rdma-core-devel-32.1-1.el8.x86_64                     7/20 
  Upgrading        : librdmacm-utils-32.1-1.el8.x86_64                     8/20 
  Upgrading        : infiniband-diags-compat-32.1-1.el8.x86_64             9/20 
  Upgrading        : libibverbs-utils-32.1-1.el8.x86_64                   10/20 
  Cleanup          : rdma-core-devel-32.0-4.el8.x86_64                    11/20 
  Cleanup          : infiniband-diags-compat-31.2amzn-1.el8.x86_64        12/20 
  Running scriptlet: ibacm-32.0-4.el8.x86_64                              13/20 
  Cleanup          : ibacm-32.0-4.el8.x86_64                              13/20 
  Running scriptlet: ibacm-32.0-4.el8.x86_64                              13/20 
  Cleanup          : librdmacm-utils-32.0-4.el8.x86_64                    14/20 
  Cleanup          : librdmacm-32.0-4.el8.x86_64                          15/20 
  Running scriptlet: librdmacm-32.0-4.el8.x86_64                          15/20 
  Cleanup          : libibverbs-utils-32.0-4.el8.x86_64                   16/20 
  Cleanup          : infiniband-diags-32.0-4.el8.x86_64                   17/20 
  Running scriptlet: infiniband-diags-32.0-4.el8.x86_64                   17/20 
  Cleanup          : libibumad-32.0-4.el8.x86_64                          18/20 
  Running scriptlet: libibumad-32.0-4.el8.x86_64                          18/20 
  Cleanup          : libibverbs-32.0-4.el8.x86_64                         19/20 
  Running scriptlet: libibverbs-32.0-4.el8.x86_64                         19/20 
  Cleanup          : rdma-core-32.0-4.el8.x86_64                          20/20 
  Running scriptlet: rdma-core-32.0-4.el8.x86_64                          20/20 
  Verifying        : ibacm-32.1-1.el8.x86_64                               1/20 
  Verifying        : ibacm-32.0-4.el8.x86_64                               2/20 
  Verifying        : infiniband-diags-32.1-1.el8.x86_64                    3/20 
  Verifying        : infiniband-diags-32.0-4.el8.x86_64                    4/20 
  Verifying        : infiniband-diags-compat-32.1-1.el8.x86_64             5/20 
  Verifying        : infiniband-diags-compat-31.2amzn-1.el8.x86_64         6/20 
  Verifying        : libibumad-32.1-1.el8.x86_64                           7/20 
  Verifying        : libibumad-32.0-4.el8.x86_64                           8/20 
  Verifying        : libibverbs-32.1-1.el8.x86_64                          9/20 
  Verifying        : libibverbs-32.0-4.el8.x86_64                         10/20 
  Verifying        : libibverbs-utils-32.1-1.el8.x86_64                   11/20 
  Verifying        : libibverbs-utils-32.0-4.el8.x86_64                   12/20 
  Verifying        : librdmacm-32.1-1.el8.x86_64                          13/20 
  Verifying        : librdmacm-32.0-4.el8.x86_64                          14/20 
  Verifying        : librdmacm-utils-32.1-1.el8.x86_64                    15/20 
  Verifying        : librdmacm-utils-32.0-4.el8.x86_64                    16/20 
  Verifying        : rdma-core-32.1-1.el8.x86_64                          17/20 
  Verifying        : rdma-core-32.0-4.el8.x86_64                          18/20 
  Verifying        : rdma-core-devel-32.1-1.el8.x86_64                    19/20 
  Verifying        : rdma-core-devel-32.0-4.el8.x86_64                    20/20 
Installed products updated.

Upgraded:
  ibacm-32.1-1.el8.x86_64                    infiniband-diags-32.1-1.el8.x86_64 
  infiniband-diags-compat-32.1-1.el8.x86_64  libibumad-32.1-1.el8.x86_64        
  libibverbs-32.1-1.el8.x86_64               libibverbs-utils-32.1-1.el8.x86_64 
  librdmacm-32.1-1.el8.x86_64                librdmacm-utils-32.1-1.el8.x86_64  
  rdma-core-32.1-1.el8.x86_64                rdma-core-devel-32.1-1.el8.x86_64  

Complete!
Installing RPMS/CENT8/x86_64/efa-config-1.8-1.el8.noarch.rpm
Main config did not have a skip_missing_names_on_install attr. before setopt
Main config did not have a skip_missing_names_on_install attr. before setopt
Last metadata expiration check: 0:32:29 ago on Mon 14 Jun 2021 03:00:11 PM UTC.
Error: 
 Problem: conflicting requests
  - nothing provides libhwloc.so.5()(64bit) needed by openmpi40-aws-4.1.1-1.el8.x86_64
(try to add '--skip-broken' to skip uninstallable packages)
Error: Failed to install packages.
Error: failed to install EFA packages, exiting

A minimal installation seems to look for parts of the old kernel, maybe because v1.10.2 was installed under the old kernel (I'm not sure what to make of some of the error messages):

= Starting Amazon Elastic Fabric Adapter Installation Script =
= EFA Installer Version: 1.12.1 =

EFA is already installed. Would you like to reinstall EFA? [y/n]: y
efa-config already installed
== Installing EFA dependencies ==
Main config did not have a skip_missing_names_on_install attr. before setopt
Main config did not have a skip_missing_names_on_install attr. before setopt
Last metadata expiration check: 0:41:45 ago on Mon 14 Jun 2021 03:00:11 PM UTC.
Package kernel-devel-4.18.0-305.3.1.el8.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!
Main config did not have a skip_missing_names_on_install attr. before setopt
Main config did not have a skip_missing_names_on_install attr. before setopt
Last metadata expiration check: 0:41:48 ago on Mon 14 Jun 2021 03:00:11 PM UTC.
Package pciutils-3.7.0-1.el8.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!
== Installing EFA packages ==
dkms is already installed
skipping efa-profile-1.5-1.el8.noarch because of minimal installation
skipping libfabric-aws-1.11.2amzn1.1-1.el8.x86_64 because of minimal installation
skipping libfabric-aws-devel-1.11.2amzn1.1-1.el8.x86_64 because of minimal installation
skipping openmpi40-aws-4.1.1-1.el8.x86_64 because of minimal installation
ibacm-32.1-1.el8.x86_64 already installed
infiniband-diags-32.1-1.el8.x86_64 already installed
infiniband-diags-compat-32.1-1.el8.x86_64 already installed
libibumad-32.1-1.el8.x86_64 already installed
libibverbs-32.1-1.el8.x86_64 already installed
libibverbs-utils-32.1-1.el8.x86_64 already installed
librdmacm-32.1-1.el8.x86_64 already installed
librdmacm-utils-32.1-1.el8.x86_64 already installed
rdma-core-32.1-1.el8.x86_64 already installed
rdma-core-devel-32.1-1.el8.x86_64 already installed
Error: No matching Packages to list
Installing RPMS/CENT8/x86_64/efa-config-1.8-1.el8.noarch.rpm
Main config did not have a skip_missing_names_on_install attr. before setopt
Main config did not have a skip_missing_names_on_install attr. before setopt
Last metadata expiration check: 0:42:07 ago on Mon 14 Jun 2021 03:00:11 PM UTC.
Dependencies resolved.
========================================================================================
 Package                 Architecture  Version                Repository           Size
========================================================================================
Upgrading:
 efa                     x86_64        1.12.1-1.el8           @commandline         61 k
 efa-config              noarch        1.8-1.el8              @commandline         13 k
Installing dependencies:
 cmake                   x86_64        3.18.2-9.el8           appstream           9.8 M
 cmake-data              noarch        3.18.2-9.el8           appstream           1.6 M
 cmake-rpm-macros        noarch        3.18.2-9.el8           appstream            44 k
 libuv                   x86_64        1:1.40.0-1.el8         appstream           155 k

Transaction Summary
========================================================================================
Install  4 Packages
Upgrade  2 Packages

Total size: 12 M
Total download size: 12 M
Downloading Packages:
(1/4): cmake-rpm-macros-3.18.2-9.el8.noarch.rpm         5.7 MB/s |  44 kB     00:00
(2/4): libuv-1.40.0-1.el8.x86_64.rpm                     27 MB/s | 155 kB     00:00
(3/4): cmake-data-3.18.2-9.el8.noarch.rpm                42 MB/s | 1.6 MB     00:00
(4/4): cmake-3.18.2-9.el8.x86_64.rpm                     26 MB/s | 9.8 MB     00:00
----------------------------------------------------------------------------------------
Total                                                    18 MB/s |  12 MB     00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                1/1
  Running scriptlet: cmake-rpm-macros-3.18.2-9.el8.noarch                           1/1
  Installing       : cmake-rpm-macros-3.18.2-9.el8.noarch                           1/8
  Installing       : libuv-1:1.40.0-1.el8.x86_64                                    2/8
  Installing       : cmake-data-3.18.2-9.el8.noarch                                 3/8
  Installing       : cmake-3.18.2-9.el8.x86_64                                      4/8
  Upgrading        : efa-1.12.1-1.el8.x86_64                                        5/8
  Running scriptlet: efa-1.12.1-1.el8.x86_64                                        5/8

Creating symlink /var/lib/dkms/efa/1.12.1/source ->
                 /usr/src/efa-1.12.1

DKMS: add completed.
Error! echo
Your kernel headers for kernel 4.18.0-240.1.1.el8_3.x86_64 cannot be found at
/lib/modules/4.18.0-240.1.1.el8_3.x86_64/build or /lib/modules/4.18.0-240.1.1.el8_3.x86_64/source.
You can use the --kernelsourcedir option to tell DKMS where it's located.
Error! echo
Your kernel headers for kernel 4.18.0-240.1.1.el8_3.x86_64 cannot be found at
/lib/modules/4.18.0-240.1.1.el8_3.x86_64/build or /lib/modules/4.18.0-240.1.1.el8_3.x86_64/source.
You can use the --kernelsourcedir option to tell DKMS where it's located.

Kernel preparation unnecessary for this kernel.  Skipping...

Running the pre_build script:
/var/lib/dkms/efa/1.12.1/build/build /var/lib/dkms/efa/1.12.1/build
-- The C compiler identification is GNU 8.4.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Kernel directory - /lib/modules/4.18.0-305.3.1.el8.x86_64/build
-- Inspecting kernel
-- Inspecting kernel - done
-- Configuring done
-- Generating done
-- Build files have been written to: /var/lib/dkms/efa/1.12.1/build/build
/var/lib/dkms/efa/1.12.1/build

Building module:
cleaning build area...
cd build; 'make'.........
cleaning build area...

DKMS: build completed.

efa.ko.xz:
Running module version sanity check.
 - Original module
   - An original module was already stored during a previous install
 - Installation
   - Installing to /lib/modules/4.18.0-305.3.1.el8.x86_64/extra/

depmod.....
Job for systemd-modules-load.service failed because the control process exited with error code.
See "systemctl status systemd-modules-load.service" and "journalctl -xe" for details.

DKMS: install completed.
Error! echo
Your kernel headers for kernel kabi-current cannot be found at
/lib/modules/kabi-current/build or /lib/modules/kabi-current/source.
You can use the --kernelsourcedir option to tell DKMS where it's located.
Error! echo
Your kernel headers for kernel kabi-current cannot be found at
/lib/modules/kabi-current/build or /lib/modules/kabi-current/source.
You can use the --kernelsourcedir option to tell DKMS where it's located.
Error! echo
Your kernel headers for kernel kabi-rhel80 cannot be found at
/lib/modules/kabi-rhel80/build or /lib/modules/kabi-rhel80/source.
You can use the --kernelsourcedir option to tell DKMS where it's located.
Error! echo
Your kernel headers for kernel kabi-rhel80 cannot be found at
/lib/modules/kabi-rhel80/build or /lib/modules/kabi-rhel80/source.
You can use the --kernelsourcedir option to tell DKMS where it's located.
Error! echo
Your kernel headers for kernel kabi-rhel81 cannot be found at
/lib/modules/kabi-rhel81/build or /lib/modules/kabi-rhel81/source.
You can use the --kernelsourcedir option to tell DKMS where it's located.
Error! echo
Your kernel headers for kernel kabi-rhel81 cannot be found at
/lib/modules/kabi-rhel81/build or /lib/modules/kabi-rhel81/source.
You can use the --kernelsourcedir option to tell DKMS where it's located.
Error! echo
Your kernel headers for kernel kabi-rhel82 cannot be found at
/lib/modules/kabi-rhel82/build or /lib/modules/kabi-rhel82/source.
You can use the --kernelsourcedir option to tell DKMS where it's located.
Error! echo
Your kernel headers for kernel kabi-rhel82 cannot be found at
/lib/modules/kabi-rhel82/build or /lib/modules/kabi-rhel82/source.
You can use the --kernelsourcedir option to tell DKMS where it's located.
Error! echo
Your kernel headers for kernel kabi-rhel83 cannot be found at
/lib/modules/kabi-rhel83/build or /lib/modules/kabi-rhel83/source.
You can use the --kernelsourcedir option to tell DKMS where it's located.
Error! echo
Your kernel headers for kernel kabi-rhel83 cannot be found at
/lib/modules/kabi-rhel83/build or /lib/modules/kabi-rhel83/source.
You can use the --kernelsourcedir option to tell DKMS where it's located.
Error! echo
Your kernel headers for kernel kabi-rhel84 cannot be found at
/lib/modules/kabi-rhel84/build or /lib/modules/kabi-rhel84/source.
You can use the --kernelsourcedir option to tell DKMS where it's located.
Error! echo
Your kernel headers for kernel kabi-rhel84 cannot be found at
/lib/modules/kabi-rhel84/build or /lib/modules/kabi-rhel84/source.
You can use the --kernelsourcedir option to tell DKMS where it's located.
warning: %post(efa-1.12.1-1.el8.x86_64) scriptlet failed, exit status 1

Error in POSTIN scriptlet in rpm package efa
  Upgrading        : efa-config-1.8-1.el8.noarch                                    6/8
  Running scriptlet: efa-config-1.8-1.el8.noarch                                    6/8
  Running scriptlet: efa-1.10.2-1.el8.x86_64                                        7/8

-------- Uninstall Beginning --------
Module:  efa
Version: 1.10.2
Kernel:  4.18.0-305.3.1.el8.x86_64 (x86_64)
-------------------------------------

Status: This module version was INACTIVE for this kernel.
depmod....

DKMS: uninstall completed.

------------------------------
Deleting module version: 1.10.2
completely from the DKMS tree.
------------------------------
Done.

  Cleanup          : efa-1.10.2-1.el8.x86_64                                        7/8
  Running scriptlet: efa-config-1.7-1.el8.noarch                                    8/8
  Cleanup          : efa-config-1.7-1.el8.noarch                                    8/8
  Running scriptlet: efa-config-1.7-1.el8.noarch                                    8/8
  Verifying        : cmake-3.18.2-9.el8.x86_64                                      1/8
  Verifying        : cmake-data-3.18.2-9.el8.noarch                                 2/8
  Verifying        : cmake-rpm-macros-3.18.2-9.el8.noarch                           3/8
  Verifying        : libuv-1:1.40.0-1.el8.x86_64                                    4/8
  Verifying        : efa-config-1.8-1.el8.noarch                                    5/8
  Verifying        : efa-config-1.7-1.el8.noarch                                    6/8
  Verifying        : efa-1.12.1-1.el8.x86_64                                        7/8
  Verifying        : efa-1.10.2-1.el8.x86_64                                        8/8
Installed products updated.

Upgraded:
  efa-1.12.1-1.el8.x86_64                  efa-config-1.8-1.el8.noarch
Installed:
  cmake-3.18.2-9.el8.x86_64                     cmake-data-3.18.2-9.el8.noarch
  cmake-rpm-macros-3.18.2-9.el8.noarch          libuv-1:1.40.0-1.el8.x86_64

Complete!
Unloading EFA kernel module
Reloading EFA kernel module
Minimal installation does not include libfabric, skipping test.
===================================================
EFA installation complete.
- Please logout/login to complete the installation.
===================================================
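
The DKMS errors in the log above complain about missing `build` or `source` paths for several names (the old 4.18.0-240.1.1 kernel and the `kabi-*` entries). A minimal sketch for listing such entries, assuming those names correspond to directories under /lib/modules (the log does not confirm this); `entries_without_headers` is a hypothetical helper name, not part of DKMS or the EFA installer:

```shell
# List entries under a modules root (default /lib/modules) that have
# neither a "build" nor a "source" path, the two locations named in the
# DKMS error message. A dangling "build" symlink counts as missing,
# because [ -e ] follows symlinks.
entries_without_headers() {
    root=${1:-/lib/modules}
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue      # also skips an unmatched glob
        dir=${dir%/}
        if [ ! -e "$dir/build" ] && [ ! -e "$dir/source" ]; then
            basename "$dir"
        fi
    done
}

# Example:
#   entries_without_headers /lib/modules
```

Comparing that list with `uname -r` may help tell harmless leftovers from a problem with the running kernel.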

Thanks.

gal-pressman commented 3 years ago

Hi @afernandezody, Can you please share the AMI id so I can reproduce the issue?

Thanks.

afernandezody commented 3 years ago

ami-0598a2f18554d1972

gal-pressman commented 3 years ago

Thanks, will take a look.

wzamazon commented 3 years ago

@afernandezody

The first issue (with the regular install) is related to CentOS 8's recent upgrade to CentOS 8.4.

During this upgrade, CentOS 8 moved hwloc from 1.x to 2.x, which are not binary compatible with each other.

The Open MPI package included in the EFA installer is built against hwloc 1.x. Because it cannot find hwloc 1.x, the installation fails.

We are actively working to solve this issue.
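
To pull the unmet dependency out of a long installer log, something like the following could be used. It is only a sketch: `missing_provides` is a hypothetical helper, and it assumes the "nothing provides ... needed by ..." wording seen in the dnf error quoted in this thread.

```shell
# Print "capability <- package" for each unmet dependency found in
# dnf/yum output read from stdin. Relies on the message wording shown
# in the error above; other dnf versions may phrase it differently.
missing_provides() {
    sed -n 's/^ *- nothing provides \(.*\) needed by \(.*\)$/\1 <- \2/p'
}

# Example (the log file name is hypothetical):
#   missing_provides < efa-install.log
# To check whether any installed package still provides the old soname:
#   rpm -q --whatprovides 'libhwloc.so.5()(64bit)'
```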

afernandezody commented 3 years ago

Hi @wzamazon, That makes sense. I guess one option would be to retry without updating the system first (not sure I want to go that route), or to upgrade the drivers first and update the system afterwards (not sure if anything would break, but there's probably only one way to find out). P.S. Neither of my proposed workarounds worked, so it's back to square one.

gal-pressman commented 3 years ago

The second issue should now be fixed in the latest release v1.12.2.

I'm resolving this issue, please reopen if you have any more questions.

afernandezody commented 3 years ago

Hi @galpress, My (attempted) installations were using the tar file downloaded from https://efa-installer.amazonaws.com... I guess that the files at amzn-drivers-efa_linux_1.12.2 will be integrated into something like aws-efa-installer-1.12.2.tar.gz, which at this moment is empty. To install the EFA drivers by themselves, do I simply go to ./amzn-drivers-efa_linux_1.12.2/kernel/linux/efa and use cmake, or is there any other intermediate step that I should complete? Thanks.

gal-pressman commented 3 years ago

The next installer release should include this fix, so your existing workflow should work fine (that's the preferred way). If you wish to install the driver without the installer you can generate an rpm by running 'make rpm' in the kernel/linux/efa/rpm directory and installing it using yum/rpm.

afernandezody commented 3 years ago

Hi @galpress, I would also prefer to use the installer and would wait if the new release were available today or tomorrow morning. However, I'm in a bit of a hurry as I must prepare an AMI by the end of the week. I tried your suggestion but got the error:

sudo yum install efa-1.12.2-1.el8.src.rpm
Last metadata expiration check: 1:17:09 ago on Thu 17 Jun 2021 12:59:46 PM UTC.
Error: Will not install a source rpm package (efa-1.12.2-1.el8.src).

I also tried with dnf and with localinstall rather than install, but had no luck. Maybe it's because this is CentOS 8 and not ALinux2; not sure. Thanks.

gal-pressman commented 3 years ago

Can you please try to install the rpm, not the source rpm?

afernandezody commented 3 years ago

The rpm was in the x86_64 subdirectory (what was I thinking!). After running yum, the only change that I notice is the creation of the subdirectory /usr/src/efa-1.12.2 but nothing has changed in the /opt/amazon or /opt/amazon/efa subdirectories.
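
For anyone hitting the same mix-up, a small sketch that lists only the installable (binary) packages under a build directory and skips `*.src.rpm`. The helper name `binary_rpms` is made up for illustration, and the exact output layout of `make rpm` is assumed from what I saw here (source rpm at the top level, binary rpm under x86_64):

```shell
# List binary RPMs under a directory tree, skipping source packages
# (*.src.rpm), which yum/dnf refuse to install.
binary_rpms() {
    find "$1" -type f -name '*.rpm' ! -name '*.src.rpm' | sort
}

# Example, run from the directory that holds the 'make rpm' output:
#   binary_rpms .
#   sudo yum install "$(binary_rpms . | head -n 1)"
```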

gal-pressman commented 3 years ago

Right, those directories are installed by the EFA installer (which you didn't use in this case). If the rpm installation succeeded, your driver should now be updated. You can verify that by running 'modinfo efa' and making sure the version is 1.12.2g.
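
A minimal sketch of that check, in case it needs to be scripted. `module_version` is a hypothetical helper that reads modinfo text from stdin; it matches the whole field name so that fields such as `rhelversion:` and `srcversion:` are not picked up by mistake:

```shell
# Print the value of the exact "version:" field from modinfo output
# read on stdin, ignoring other fields whose names end in "version".
module_version() {
    awk '$1 == "version:" { print $2; exit }'
}

# Example on a host with the driver installed:
#   modinfo efa | module_version
# modinfo can also print a single field directly:
#   modinfo -F version efa
```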

afernandezody commented 3 years ago

You are right as the system is returning:

filename:       /lib/modules/4.18.0-305.3.1.el8.x86_64/extra/efa.ko.xz
description:    Elastic Fabric Adapter (EFA)
license:        Dual BSD/GPL
author:         Amazon.com, Inc. or its affiliates
softdep:        pre: ib_uverbs
version:        1.12.2g
rhelversion:    8.4
srcversion:     CECDE2333322F004E8B5352
alias:          pci:v00001D0Fd0000EFA1sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd0000EFA0sv*sd*bc*sc*i*
depends:        ib_core,ib_uverbs
name:           efa
vermagic:       4.18.0-305.3.1.el8.x86_64 SMP mod_unload modversions

The whole thing is to overcome the issue discussed at https://github.com/ofiwg/libfabric/issues/6332. Hopefully installing the new driver will suffice. Thanks.

gal-pressman commented 3 years ago

Glad to hear it worked! Though I'm not sure I understand how updating the driver is related to this issue?

afernandezody commented 3 years ago

Are you saying that the issue can only be fixed if the whole installer (v1.12) is run?

wzamazon commented 3 years ago

@afernandezody

Running the whole 1.12.x installer is ideal. Because that is not possible on CentOS 8 right now, you can install the libfabric-aws-xxx.rpm that comes with the installer, which should also fix the fork issue.

afernandezody commented 3 years ago

@wzamazon, I downloaded https://efa-installer.amazonaws.com/aws-efa-installer-1.12.1.tar.gz but the rpms in the ./aws-efa-installer/RPMS/CENT8/x86_64 subdirectory are libfabric-aws-1.11.2amzn1.1-1.el8.x86_64.rpm and libfabric-aws-devel-1.11.2amzn1.1-1.el8.x86_64.rpm. When I try to install the first one, the system states that it's already installed and makes no change (as far as I can see). Shouldn't the file be named something like libfabric-aws-1.12.1amznX.X-X.el8.x86_64.rpm? (It's the same for the other OSs.)

wzamazon commented 3 years ago

@afernandezody

The EFA installer contains multiple software packages: the efa kernel module, libfabric, Open MPI, and rdma-core. Each component has its own version, so the EFA installer version does not necessarily match the libfabric version.

If you already have libfabric-aws-1.11.2amzn1.1-1.el8.x86_64 installed on your machine, this should be enough to address the fork issue.
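
To see the per-component versions at a glance, the package file names can be split into their parts. This is only a sketch: `rpm_split` is a hypothetical helper and it assumes the conventional name-version-release.arch.rpm file name layout.

```shell
# Split an RPM file name of the form name-version-release.arch.rpm and
# print "name version release arch". Assumes that conventional layout.
rpm_split() {
    f=${1##*/}        # drop any leading directory
    f=${f%.rpm}       # drop the .rpm suffix
    arch=${f##*.}     # text after the last dot
    f=${f%.*}
    release=${f##*-}  # text after the last dash
    f=${f%-*}
    version=${f##*-}
    name=${f%-*}
    printf '%s %s %s %s\n' "$name" "$version" "$release" "$arch"
}

# Example, run in the installer's RPMS/CENT8/x86_64 directory:
#   for p in *.rpm; do rpm_split "$p"; done
```

For packages that are already installed, `rpm -q efa libfabric-aws openmpi40-aws rdma-core` reports the same information.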

afernandezody commented 3 years ago

OK. It will take me today and most of tomorrow until I have a configuration ready to test.

gal-pressman commented 3 years ago

@afernandezody any updates? Anything else needed from our side?

afernandezody commented 3 years ago

@galpress, Sorry, one issue led to another and, in the end, I decided to wait until the release of parallelcluster 2.11.0 to recheck everything and make any upgrades if necessary. I'm closing the thread (and will only reopen it if the issue reproduces with the newer AMI).