Closed afernandezody closed 3 years ago
Hi @afernandezody, Can you please share the AMI id so I can reproduce the issue?
Thanks.
ami-0598a2f18554d1972
Thanks, will take a look.
@afernandezody
The first issue (with the regular install) is related to CentOS 8's recent upgrade to CentOS 8.4.
During this upgrade, CentOS 8 upgraded hwloc from 1.x to 2.x, which are not binary compatible with each other.
The Open MPI package included in the EFA installer is built with hwloc 1.x; because it cannot find hwloc 1.x, the installation fails.
We are actively working to solve this issue.
Hi @wzamazon, That makes sense. I guess that retrying without updating first could be a solution (not sure if I want to follow that route), or maybe upgrading the drivers and then updating the system (not sure if anything would break, but there's probably only one way to find out). P.S. None of my proposed workarounds worked, so it's back to square one.
The second issue should now be fixed in the latest release v1.12.2.
I'm resolving this issue; please reopen it if you have any more questions.
Hi @galpress, My (attempted) installations were using the tar file downloaded from https://efa-installer.amazonaws.com... I guess that the files at amzn-drivers-efa_linux_1.12.2 will be integrated into something like aws-efa-installer-1.12.2.tar.gz, which at this moment is empty. To install the EFA drivers by themselves, do I simply go to ./amzn-drivers-efa_linux_1.12.2/kernel/linux/efa and use cmake, or is there any other intermediate step that I should complete? Thanks.
The next installer release should include this fix, so your existing workflow should work fine (that's the preferred way). If you wish to install the driver without the installer you can generate an rpm by running 'make rpm' in the kernel/linux/efa/rpm directory and installing it using yum/rpm.
Hi @galpress, I would also prefer to use the installer and would wait if the new release were available today or tomorrow morning. However, I'm in a bit of a hurry as I must prepare an AMI by the end of the week. I tried your suggestion but got the error:
sudo yum install efa-1.12.2-1.el8.src.rpm
Last metadata expiration check: 1:17:09 ago on Thu 17 Jun 2021 12:59:46 PM UTC.
Error: Will not install a source rpm package (efa-1.12.2-1.el8.src).
I also tried with dnf and using localinstall rather than install but same luck. Maybe it's because it's CentOS8 not ALinux2, not sure. Thanks.
Can you please try to install the rpm, not the source rpm?
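For anyone scripting this step, here is a minimal Python sketch of picking the binary rpm out of a build directory while skipping the source rpm. The helper name `find_binary_rpms` and the idea of searching the build directory recursively are assumptions for illustration, not part of the EFA driver repository; the only fact taken from this thread is that yum refuses to install `*.src.rpm` files.

```python
from pathlib import Path
from typing import List


def find_binary_rpms(build_dir: str) -> List[Path]:
    """Return every *.rpm under build_dir except source rpms (*.src.rpm).

    Hypothetical helper: the directory layout produced by the build is
    not assumed, so the search is recursive.
    """
    rpms = []
    for path in sorted(Path(build_dir).rglob("*.rpm")):
        # Source rpms cannot be installed directly with yum/dnf.
        if path.name.endswith(".src.rpm"):
            continue
        rpms.append(path)
    return rpms
```

The returned paths are the candidates to hand to yum or rpm; `build_dir` would be whatever directory the rpm build wrote its output to.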
The rpm was in the x86_64 subdirectory (what was I thinking!). After running yum, the only change that I notice is the creation of the subdirectory /usr/src/efa-1.12.2, but nothing has changed in the /opt/amazon or /opt/amazon/efa subdirectories.
Right, these are directories that are installed by the EFA installer (which you didn't use in this case). If the rpm installation passed, your driver should now be updated; you can verify that by running 'modinfo efa' and making sure the version is 1.12.2g.
You are right as the system is returning:
filename: /lib/modules/4.18.0-305.3.1.el8.x86_64/extra/efa.ko.xz
description: Elastic Fabric Adapter (EFA)
license: Dual BSD/GPL
author: Amazon.com, Inc. or its affiliates
softdep: pre: ib_uverbs
version: 1.12.2g
rhelversion: 8.4
srcversion: CECDE2333322F004E8B5352
alias: pci:v00001D0Fd0000EFA1sv*sd*bc*sc*i*
alias: pci:v00001D0Fd0000EFA0sv*sd*bc*sc*i*
depends: ib_core,ib_uverbs
name: efa
vermagic: 4.18.0-305.3.1.el8.x86_64 SMP mod_unload modversions
The whole thing is to overcome the issue discussed at https://github.com/ofiwg/libfabric/issues/6332. Hopefully installing the new driver will suffice. Thanks.
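The modinfo check above can also be scripted, which may be handy when baking an AMI. The following is a sketch only: the function names `parse_modinfo`, `efa_driver_version`, and `installed_efa_version` are hypothetical, and the parsing assumes the `field: value` layout shown in the output above.

```python
import subprocess
from typing import Dict, List, Optional


def parse_modinfo(text: str) -> Dict[str, List[str]]:
    """Parse `modinfo` output into {field: [values]}.

    Fields such as `alias` repeat, so every field maps to a list.
    """
    fields: Dict[str, List[str]] = {}
    for line in text.splitlines():
        # Split on the first colon only; values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


def efa_driver_version(text: str) -> Optional[str]:
    """Return the `version` field from modinfo output, or None if absent."""
    versions = parse_modinfo(text).get("version")
    return versions[0] if versions else None


def installed_efa_version() -> Optional[str]:
    """Run `modinfo efa` on the local machine and return its version."""
    result = subprocess.run(
        ["modinfo", "efa"], capture_output=True, text=True, check=True
    )
    return efa_driver_version(result.stdout)
```

On the instance in this thread, `installed_efa_version()` would be expected to return `1.12.2g`, matching the `version` line in the output above.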
Glad to hear it worked! Though I'm not sure I understand how updating the driver is related to this issue?
Are you saying that the issue can only be fixed if the whole installer (v1.12) is run?
@afernandezody
Running the whole installer 1.12.x is ideal. Because that is not possible on CentOS 8 right now, you can install the libfabric-aws-xxx.rpm that comes with the installer, which should also fix the fork issue.
@wzamazon,
I downloaded https://efa-installer.amazonaws.com/aws-efa-installer-1.12.1.tar.gz but the rpms in the ./aws-efa-installer/RPMS/CENT8/x86_64 subdirectory are libfabric-aws-1.11.2amzn1.1-1.el8.x86_64.rpm and libfabric-aws-devel-1.11.2amzn1.1-1.el8.x86_64.rpm. When I try to install the first one, the system states that it's already installed and makes no change (as far as I can see). Shouldn't the file be named something like libfabric-aws-1.12.1amznX.X-X.el8.x86_64.rpm? (It's the same for the other OSs.)
@afernandezody
The EFA installer contains multiple software packages: the efa kernel module, libfabric, Open MPI, and rdma-core. Each component has its own version, so the EFA installer version does not necessarily match the libfabric version.
If you already have libfabric-aws-1.11.2amzn1.1-1.el8.x86_64 installed on your machine, this should be enough to address the fork issue.
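To see each component's own version rather than the installer's, the rpm file names can be split into their parts. The sketch below assumes the conventional `name-version-release.arch.rpm` layout; `RpmName` and `parse_rpm_filename` are hypothetical names introduced here for illustration.

```python
from typing import NamedTuple


class RpmName(NamedTuple):
    name: str
    version: str
    release: str
    arch: str


def parse_rpm_filename(filename: str) -> RpmName:
    """Split `name-version-release.arch.rpm` into its four parts.

    Raises ValueError if the file name does not follow that layout.
    """
    if not filename.endswith(".rpm"):
        raise ValueError("not an rpm file name: " + filename)
    stem = filename[: -len(".rpm")]
    # The arch is the last dot-separated piece; release and version are
    # the last two dash-separated pieces of what remains.
    rest, arch = stem.rsplit(".", 1)
    rest, release = rest.rsplit("-", 1)
    name, version = rest.rsplit("-", 1)
    return RpmName(name, version, release, arch)
```

For the file discussed above, `parse_rpm_filename("libfabric-aws-1.11.2amzn1.1-1.el8.x86_64.rpm")` gives name `libfabric-aws` and version `1.11.2amzn1.1`, which is independent of the 1.12.1 installer version. An `arch` of `src` marks a source rpm such as `efa-1.12.2-1.el8.src.rpm`.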
OK. It will take me today and most of tomorrow until I have a configuration ready to test.
@afernandezody any updates? Anything else needed from our side?
@galpress, Sorry, one issue led to another and, in the end, I decided to wait until the release of parallelcluster 2.11.0 to recheck everything and make any upgrades if necessary. I'm closing the thread (and will only reopen it if the issue reproduces with the newer AMI).
Hello, I'm trying to update the EFA drivers to v1.12.0 or later. Building on an awsparallelcluster AMI (CentOS8 flavor) and after upgrading the kernels to 4.18.0-305.3.1.el8.x86_64, running the full installer returns the following error:
A minimal installation seems to look for parts of the old kernel, maybe because v1.10.2 was installed with the old kernels (?) (not sure what to make of some of the error messages).
Thanks.