aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

3.7.2: Lustre kmod modprobe breaks custom AMI based on RHEL8 #5913

Open nyetsche opened 9 months ago

nyetsche commented 9 months ago

My organization requires using RHEL8 (a supported OS) from the privately shared, Red Hat-licensed base. We then use pcluster build-image to make it ready for ParallelCluster.

The pcluster build-image task has recently started failing for us. The initial AMI starts with RHEL 8.8 (I also tried 8.7), but it is updated to RHEL 8.9 via the redhat-release RPM during the build:

EVENTS  1700589295187   Step UpdateOS   1700589294393
EVENTS  1700589295187   ExecuteBash: STARTED EXECUTION  1700589294395

[...]

EVENTS  1700589326128   Stdout:  redhat-release                           x86_64  8.9-0.1.el8                    rhel-8-baseos-rhui-rpms       45 k 1700589326002

That comes from the UpdateOS section of the playbook:

- name: UpdateOS
  action: ExecuteBash
  inputs:
    commands:
      - |
        set -v
        OS='{{ build.OperatingSystemName.outputs.stdout }}'
        PLATFORM='{{ build.PlatformName.outputs.stdout }}'

        if [[ ${!PLATFORM} == RHEL ]]; then
          yum -y update
[...]

The yum -y update brings every package to its most recent version, including redhat-release and kernel-*.

The failure occurs later, when the kernel_module 'lnet' resource runs: https://github.com/aws/aws-parallelcluster-cookbook/blob/v3.7.2/cookbooks/aws-parallelcluster-environment/resources/lustre/partial/_install_lustre_centos_redhat.rb#L36

EVENTS  1700590745016   Stdout: [2023-11-21T18:19:01+00:00] INFO: dnf_package[kmod-lustre-client, lustre-client, dracut] installed ["kmod-lustre-client", "lustre-client", nil] at ["0:2.12.8-1.fsx7.el8.x86_64", "0:2.12.8-1.fsx7.el8.x86_64", nil]    1700590741488
EVENTS  1700590745016   Stdout:       - install version 0:2.12.8-1.fsx7.el8.x86_64 of package kmod-lustre-client    1700590741488
EVENTS  1700590745016   Stdout:       - install version 0:2.12.8-1.fsx7.el8.x86_64 of package lustre-client 1700590741488
EVENTS  1700590745016   Stdout:     * kernel_module[lnet] action install[2023-11-21T18:19:04+00:00] INFO: Processing kernel_module[lnet] action install ((eval) line 36)    1700590744740
EVENTS  1700590745016   Stdout:       ================================================================================  1700590744770
EVENTS  1700590745016   Stdout:       Error executing action `install` on resource 'kernel_module[lnet]'    1700590744770
EVENTS  1700590745016   Stdout:       ================================================================================  1700590744770
EVENTS  1700590745016   Stdout:       Mixlib::ShellOut::ShellCommandFailed  1700590744770
EVENTS  1700590745016   Stdout:       ------------------------------------  1700590744770
EVENTS  1700590745016   Stdout:       Expected process to exit with [0], but received '1'   1700590744770
EVENTS  1700590745016   Stdout:       ---- Begin output of modprobe lnet ----   1700590744770
EVENTS  1700590745016   Stdout:       STDOUT:   1700590744770
EVENTS  1700590745016   Stdout:       STDERR: modprobe: FATAL: Module lnet not found in directory /lib/modules/4.18.0-513.5.1.el8_9.x86_64  1700590744770

That is, there's no lnet module in /lib/modules/4.18.0-513.5.1.el8_9.x86_64, the directory for the updated kernel.
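
One way to confirm the mismatch on the build instance is to list which kernel releases actually have an lnet module installed and compare that with the running kernel. This is a minimal sketch, not part of ParallelCluster or the cookbook: the function name kernels_with_module is hypothetical, and it assumes modules live under /lib/modules/<release>/ as .ko files, possibly compressed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ParallelCluster or the cookbook).
# Prints every kernel release under a modules root that contains the named module.
# Assumption: module files are named <module>.ko, optionally compressed (.ko.xz, .ko.gz).
kernels_with_module() {
  local module="$1"
  local root="${2:-/lib/modules}"
  local dir
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    # -print -quit stops at the first match; grep -q . checks that something was found.
    if find "$dir" -name "${module}.ko*" -print -quit 2>/dev/null | grep -q .; then
      basename "$dir"
    fi
  done
}
```

Running `kernels_with_module lnet` next to `uname -r` shows whether the running kernel is among the releases listed. Given the modprobe error above, it presumably would not be.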

The kernel compatibility matrix in this document https://docs.aws.amazon.com/fsx/latest/LustreGuide/install-lustre-client.html indeed doesn't mention 4.18.0-513, and the upstream at https://downloads.whamcloud.com/public/lustre/latest-2.12-release/el8/client/ doesn't include it either. So I realize this is actually a Lustre packaging issue, but I'm not sure how to get in touch with the FSx for Lustre team. Even so, it'd be great to have a workaround. Right now we can't use new AMIs for compute nodes.

I'm unsure of the best way forward here. Should redhat-release* and/or kernel-* be excluded from the update in the build-image process? Should errors from modprobe lnet be ignored?
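
For the first option, yum accepts --exclude patterns, so an update that leaves the kernel and release packages alone could look like the sketch below. It assumes the update is run by hand on the parent AMI (or in a custom build component) rather than by the built-in UpdateOS step, which this does not modify. The function name update_excluding and the DRY_RUN variable are hypothetical; the command is only printed unless DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: update all packages except those matching the given patterns.
# Assumption: run manually on the parent AMI; it does not change the pcluster UpdateOS step.
update_excluding() {
  local cmd=(yum -y update)
  local pattern
  for pattern in "$@"; do
    cmd+=("--exclude=${pattern}")
  done
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: show what would run without touching the system.
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Usage:
#   update_excluding 'kernel*' 'redhat-release*'            # print the command only
#   DRY_RUN=0 update_excluding 'kernel*' 'redhat-release*'  # actually run it
```

Holding the kernel back this way trades kernel security updates for Lustre module compatibility, so it is a stopgap rather than a fix.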

hgreebe commented 9 months ago

A workaround could be to skip the OS upgrade in the build-image process by setting this config option to false: https://docs.aws.amazon.com/parallelcluster/latest/ug/Build-v3.html#Build-v3-UpdateOsPackages
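
As a sketch of that setting, the snippet below writes a minimal build-image configuration with UpdateOsPackages disabled. The keys follow the Build section of the linked documentation; the instance type, AMI ID, file name, and image ID are placeholders, and write_image_config is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a minimal build-image config that skips OS package updates.
# The instance type and parent AMI ID are placeholders; substitute your own.
write_image_config() {
  local path="$1"
  cat > "$path" <<'EOF'
Build:
  InstanceType: c5.xlarge
  ParentImage: ami-0123456789abcdef0
  UpdateOsPackages:
    Enabled: false
EOF
}

# Usage (file name and image ID are placeholders):
#   write_image_config image-config.yaml
#   pcluster build-image --image-id rhel8-custom --image-configuration image-config.yaml
```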

coderforlife commented 8 months ago

@hgreebe The documentation says that option is false by default.

coderforlife commented 8 months ago

Can an option be added to the image builder to NOT include Lustre/FSx support at all? Many setups do not require it, and it would make it much easier to support many custom AMIs, as it is the biggest sticking point in version compatibility.

enrico-usai commented 7 months ago

Hi @coderforlife, you're correct: UpdateOsPackages is set to false by default.

@hgreebe suggested that @nyetsche set it to false because of this statement:

The initial AMI starts with RHEL 8.8 (I also tried 8.7), but it is updated to RHEL 8.9 via the redhat-release RPM during the build

The UpdateOS step is executed ONLY when UpdateOsPackages is set to true, so setting it to false should have solved the issue for @nyetsche.
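
To check which case applies to a given build, one can look for the UpdateOS step in the build log. This is a minimal sketch, assuming the log events were saved to a text file in the same form as the excerpts earlier in this thread; the function name update_os_step_ran and the file name are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the build log shows that the UpdateOS step ran.
# Assumption: the log is a text file containing lines such as
#   EVENTS  1700589295187   Step UpdateOS   1700589294393
update_os_step_ran() {
  local log_file="$1"
  grep -q 'Step UpdateOS' "$log_file"
}

# Usage (build.log is a placeholder file name):
#   if update_os_step_ran build.log; then
#     echo "UpdateOS ran: packages (possibly including the kernel) were updated"
#   else
#     echo "UpdateOS did not run"
#   fi
```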

In any case, we are tracking internally a feature to skip installing the FSx for Lustre drivers and to support updated kernels for which the client is not yet available.

Enrico