amazonlinux / amazon-linux-2023

Amazon Linux 2023
https://aws.amazon.com/linux/amazon-linux-2023/

[Bug] - Lustre Client not compatible with FsX for Lustre #723

Closed gmarchand closed 1 month ago

gmarchand commented 1 month ago

Describe the bug

I followed the documentation to install the Lustre client on AL2023: https://docs.aws.amazon.com/fsx/latest/LustreGuide/install-lustre-client.html

My instance configuration is:

Here is my user data:

#!/bin/bash -ex

echo "AWS Batch for FFMPEG : Mount FSx Lustre Cluster"

fsx_dnsname=%DNS_NAME%
fsx_mountname=%MOUNT_NAME%
fsx_mountpoint=%MOUNT_POINT%

echo "Linux Kernel:"
uname -r

# Check Amazon Linux version
if [ -f "/etc/os-release" ]; then
  # Parse version from os-release
  source /etc/os-release
  if [[ "${VERSION_ID}" == "2" ]]; then
    # Amazon Linux 2
    echo "Detected Amazon Linux 2, installing Lustre client"
    amazon-linux-extras install -y lustre
  elif [[ "${VERSION_ID}" =~ ^[2][0-9]*$ ]]; then
    # Amazon Linux 2023 or similar format (e.g., 21)
    echo "Detected Amazon Linux 2023 (or similar version), installing Lustre client"
    # Issue: https://github.com/amazonlinux/amazon-linux-2023/issues/397#issuecomment-1760177301
    while true; do
      dnf update --assumeyes && break
    done
    while true; do
      dnf install --quiet --assumeyes lustre-client && break
    done
  else
    echo "Unsupported Amazon Linux version for Lustre client"
  fi
else
  echo "Unsupported Amazon Linux version"
fi

mkdir -p "$fsx_mountpoint"
mount -t lustre -o relatime,flock "${fsx_dnsname}@tcp:/${fsx_mountname}" "${fsx_mountpoint}"

echo "AWS Batch for FFMPEG : Mount FSx Lustre Cluster : END"

Here is the system log with the error:

Booting `Amazon Linux (6.1.90-99.173.amzn2023.x86_64) 2023'

...

[   14.997261] cloud-init[2645]: AWS Batch for FFMPEG : Mount FSx Lustre Cluster

[   14.997416] cloud-init[2645]: + fsx_dnsname=fs-xxx.fsx.eu-west-1.amazonaws.com

[   14.997616] cloud-init[2645]: + fsx_mountname=vktahbev

[   14.997731] cloud-init[2645]: + fsx_mountpoint=/fsx-lustre

[   14.997892] cloud-init[2645]: + echo 'Linux Kernel:'

[   14.998060] cloud-init[2645]: Linux Kernel:

[   14.998217] cloud-init[2645]: + uname -r

[   14.998396] cloud-init[2645]: 6.1.90-99.173.amzn2023.x86_64

[   14.998543] cloud-init[2645]: + '[' -f /etc/os-release ']'

[   14.998703] cloud-init[2645]: + source /etc/os-release

[   14.998866] cloud-init[2645]: ++ NAME='Amazon Linux'

[   14.999025] cloud-init[2645]: ++ VERSION=2023

[   14.999185] cloud-init[2645]: ++ ID=amzn

[   14.999350] cloud-init[2645]: ++ ID_LIKE=fedora

[   14.999668] cloud-init[2645]: ++ VERSION_ID=2023

[   14.999833] cloud-init[2645]: ++ PLATFORM_ID=platform:al2023

[   14.999994] cloud-init[2645]: ++ PRETTY_NAME='Amazon Linux 2023.4.20240513'

[   15.000334] cloud-init[2645]: ++ ANSI_COLOR='0;33'

[   15.000491] cloud-init[2645]: ++ CPE_NAME=cpe:2.3:o:amazon:amazon_linux:2023

[   15.000651] cloud-init[2645]: ++ HOME_URL=https://aws.amazon.com/linux/amazon-linux-2023/

[   15.000813] cloud-init[2645]: ++ DOCUMENTATION_URL=https://docs.aws.amazon.com/linux/

[   15.000975] cloud-init[2645]: ++ SUPPORT_URL=https://aws.amazon.com/premiumsupport/

[   15.001136] cloud-init[2645]: ++ BUG_REPORT_URL=https://github.com/amazonlinux/amazon-linux-2023

[   15.001296] cloud-init[2645]: ++ VENDOR_NAME=AWS

[   15.001455] cloud-init[2645]: ++ VENDOR_URL=https://aws.amazon.com/

[   15.001617] cloud-init[2645]: ++ SUPPORT_END=2028-03-15

[   15.001779] cloud-init[2645]: + [[ 2023 == \2 ]]

[   15.001941] cloud-init[2645]: + [[ 2023 =~ ^[2][0-9]*$ ]]

[   15.002104] cloud-init[2645]: + echo 'Detected Amazon Linux 2023 (or similar version), installing Lustre client'

[   15.002264] cloud-init[2645]: Detected Amazon Linux 2023 (or similar version), installing Lustre client

[   15.002420] cloud-init[2645]: + true

[   15.002581] cloud-init[2645]: + dnf update --assumeyes

....

[   15.350260] cloud-init[2645]: + dnf install --quiet --assumeyes lustre-client

[   20.072614] zram_generator::config[3178]: zram0: system has too much memory (31376MB), limit is 800MB, ignoring.

[   20.933482] cloud-init[2645]: Installed:

...

[   20.935294] cloud-init[2645]:   lustre-client-2.15.3-3.amzn2023.x86_64

...

[   20.986875] cloud-init[2645]: + break

[   20.987034] cloud-init[2645]: + mkdir -p /fsx-lustre

[   20.987200] cloud-init[2645]: + mount -t lustre -o relatime,flock fs-xxxx.fsx.eu-west-1.amazonaws.com@tcp:/vktahbev /fsx-lustre

[   21.008314] LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4

[   21.010486] alg: No test for adler32 (adler32-zlib)

[   21.752844] Key type ._llcrypt registered

[   21.753280] Key type .llcrypt registered

[   21.820283] lnet: module is from the staging directory, the quality is unknown, you have been warned.

[   21.909973] obdclass: module is from the staging directory, the quality is unknown, you have been warned.

[   21.935389] Lustre: Lustre: Build Version: 2.15.3_114_gb61b66c_dirty

[   21.951993] ptlrpc: module is from the staging directory, the quality is unknown, you have been warned.

[   21.991248] ksocklnd: module is from the staging directory, the quality is unknown, you have been warned.

[   21.995922] LNet: Added LNI 10.0.185.34@tcp [8/256/0/180]

[   21.996488] LNet: Accept secure, port 988

[   22.030060] osc: module is from the staging directory, the quality is unknown, you have been warned.

[   22.054442] fld: module is from the staging directory, the quality is unknown, you have been warned.

[   22.061588] lov: module is from the staging directory, the quality is unknown, you have been warned.

[   22.069181] fid: module is from the staging directory, the quality is unknown, you have been warned.

[   22.098832] mdc: module is from the staging directory, the quality is unknown, you have been warned.

[   22.144466] lmv: module is from the staging directory, the quality is unknown, you have been warned.

[   22.180361] lustre: module is from the staging directory, the quality is unknown, you have been warned.

[   22.213582] mgc: module is from the staging directory, the quality is unknown, you have been warned.

[   22.219893] LNetError: 5006:0:(peer.c:2790:lnet_discovery_event_reply()) Multi-Rail state vanished from 10.0.60.206@tcp

[   22.222530] Lustre: Client version (2.15.3_114_gb61b66c_dirty). Server MGS version (2.10.5.0) is much older than client. Consider upgrading server

[   22.223813] LustreError: 16a-d: Server MGS version (2.10.5.0) refused connection from this client with an incompatible version (2.15.3_114_gb61b66c_dirty).  Client must be recompiled

[   22.225343] LustreError: 4956:0:(mgc_request.c:252:do_config_log_add()) MGC10.0.60.206@tcp: failed processing log, type 1: rc = -5

[   22.226482] LustreError: 4956:0:(client.c:1255:ptlrpc_import_delay_req()) @@@ IMP_CLOSED  req@00000000ef69d104 x1800396286722240/t0(0) o101->MGC10.0.60.206@tcp@10.0.60.206@tcp:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:QU/0/ffffffff rc 0/-1 job:''

[   22.228529] LustreError: 15c-8: MGC10.0.60.206@tcp: Confguration from log vktahbev-client failed from MGS -5. Communication error between node & MGS, a bad configuration, or other errors. See syslog for more info

[   22.230445] Lustre: Unmounted vktahbev-client

[   22.230954] LustreError: 4956:0:(super25.c:187:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -5

[   22.230568] cloud-init[2645]: mount.lustre: mount fs-0dee465b91dfa5c2d.fsx.eu-west-1.amazonaws.com@tcp:/vktahbev at /fsx-lustre failed: Input/output error

[   22.230689] cloud-init[2645]: Is the MGS running?

[   22.230861] cloud-init[2645]: Is the client a much newer or older version than the filesystem?

Moved to AL2

When I change only the AMI from AL2023 to AL2:

        ecs_amd64_ami = ec2.MachineImage.from_ssm_parameter(
            # "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended/image_id"
            "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id"
        )
        ecs_arm64_ami = ec2.MachineImage.from_ssm_parameter(
            # "/aws/service/ecs/optimized-ami/amazon-linux-2023/arm64/recommended/image_id"
            "/aws/service/ecs/optimized-ami/amazon-linux-2/arm64/recommended/image_id"
        )

it works:

Welcome to Amazon Linux 2!
...
[   13.567328] cloud-init[4606]: + echo 'AWS Batch for FFMPEG : Mount FSx Lustre Cluster'
[   13.567505] cloud-init[4606]: AWS Batch for FFMPEG : Mount FSx Lustre Cluster
[   13.567652] cloud-init[4606]: + fsx_dnsname=fs-xxxx.fsx.eu-west-1.amazonaws.com
[   13.567824] cloud-init[4606]: + fsx_mountname=vktahbev
[   13.568033] cloud-init[4606]: + fsx_mountpoint=/fsx-lustre
[   13.568236] cloud-init[4606]: + echo 'Linux Kernel:'
[   13.568422] cloud-init[4606]: Linux Kernel:
[   13.568587] cloud-init[4606]: + uname -r
[   13.568758] cloud-init[4606]: 4.14.343-260.564.amzn2.x86_64
[   13.568918] cloud-init[4606]: + '[' -f /etc/os-release ']'
[   13.569088] cloud-init[4606]: + source /etc/os-release
[   13.569263] cloud-init[4606]: ++ NAME='Amazon Linux'
[   13.569421] cloud-init[4606]: ++ VERSION=2
[   13.569590] cloud-init[4606]: ++ ID=amzn
[   13.569759] cloud-init[4606]: ++ ID_LIKE='centos rhel fedora'
[   13.569918] cloud-init[4606]: ++ VERSION_ID=2
[   13.570082] cloud-init[4606]: ++ PRETTY_NAME='Amazon Linux 2'
[   13.570244] cloud-init[4606]: ++ ANSI_COLOR='0;33'
[   13.570404] cloud-init[4606]: ++ CPE_NAME=cpe:2.3:o:amazon:amazon_linux:2
[   13.570573] cloud-init[4606]: ++ HOME_URL=https://amazonlinux.com/
[   13.570739] cloud-init[4606]: ++ SUPPORT_END=2025-06-30
[   13.570909] cloud-init[4606]: + [[ 2 == \2 ]]
[   13.571071] cloud-init[4606]: + echo 'Detected Amazon Linux 2, installing Lustre client'
[   13.571239] cloud-init[4606]: Detected Amazon Linux 2, installing Lustre client
[   13.571399] cloud-init[4606]: + amazon-linux-extras install -y lustre
...
[   31.084065] cloud-init[4606]: ================================================================================
[   31.084190] cloud-init[4606]: Package                Arch   Version                  Repository         Size
[   31.084358] cloud-init[4606]: ================================================================================
[   31.084513] cloud-init[4606]: Installing:
[   31.084678] cloud-init[4606]: lustre-client          x86_64 2.12.8-3.amzn2           amzn2extra-lustre 617 k
...
[   34.559723] cloud-init[4606]: Installed:
[   34.559854] cloud-init[4606]: lustre-client.x86_64 0:2.12.8-3.amzn2
...
[   35.763926] cloud-init[4606]: + mkdir -p /fsx-lustre
[   35.764273] cloud-init[4606]: + mount -t lustre -o relatime,flock fs-xxx.fsx.eu-west-1.amazonaws.com@tcp:/vktahbev /fsx-lustre
[   35.891904] LNet: HW NUMA nodes: 1, HW CPU cores: 16, npartitions: 4
[   35.895565] alg: No test for adler32 (adler32-zlib)
[   36.681936] lnet: module is from the staging directory, the quality is unknown, you have been warned.
[   36.715853] obdclass: module is from the staging directory, the quality is unknown, you have been warned.
[   36.728299] Lustre: Lustre: Build Version: 2.12.8_198_gde6dd89_dirty
[   36.841501] ptlrpc: module is from the staging directory, the quality is unknown, you have been warned.
[   36.861459] ksocklnd: module is from the staging directory, the quality is unknown, you have been warned.
[   36.870695] LNet: Added LNI 10.0.95.24@tcp [8/256/0/180]
[   36.874650] LNet: Accept secure, port 988
[   36.881111] fld: module is from the staging directory, the quality is unknown, you have been warned.
[   36.897029] lov: module is from the staging directory, the quality is unknown, you have been warned.
[   36.937974] osc: module is from the staging directory, the quality is unknown, you have been warned.
[   36.953777] fid: module is from the staging directory, the quality is unknown, you have been warned.
[   36.966733] mdc: module is from the staging directory, the quality is unknown, you have been warned.
[   36.981213] lmv: module is from the staging directory, the quality is unknown, you have been warned.
[   37.027445] lustre: module is from the staging directory, the quality is unknown, you have been warned.
[   37.043023] mgc: module is from the staging directory, the quality is unknown, you have been warned.
[  OK  ] Stopped Dynamically Generate Message Of The Day.
         Starting Dynamically Generate Message Of The Day...
[   37.225278] Lustre: Mounted vktahbev-client

tim-day-387 commented 1 month ago

From the system logs, it appears that you are using a 2.10 Lustre filesystem:

[   22.223813] LustreError: 16a-d: Server MGS version (2.10.5.0) refused connection from this client with an incompatible version (2.15.3_114_gb61b66c_dirty).  Client must be recompiled

The client included in the Amazon Linux 2023 kernel is version 2.15. The client included in AL2 is version 2.12. By default, the 2.15 client will not connect to 2.10 filesystems. It's recommended that you use a 2.12 or newer filesystem - or use a client older than 2.15 with your preexisting filesystem.

This doc explains more about Lustre client/server compatibility: https://docs.aws.amazon.com/fsx/latest/LustreGuide/lustre-client-matrix.html
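
To spot this kind of mismatch quickly, the two versions can be pulled straight out of the kernel log line quoted above. The following is a minimal sketch, not an official tool: the function names are hypothetical, and the rule it encodes (a 2.15 client needs a 2.12 or newer filesystem) is only the one stated in this comment, with the treatment of 2.11 being an assumption. The compatibility matrix linked above remains the authoritative source.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int]

# Matches the "LustreError: 16a-d" refusal line shown in the system log above.
_REFUSED = re.compile(
    r"Server MGS version \((?P<server>[\d.]+)\) refused connection from this "
    r"client with an incompatible version \((?P<client>[^)]+)\)"
)


def major_minor(version: str) -> Version:
    """Return (major, minor) from strings such as '2.15.3_114_gb61b66c_dirty'."""
    m = re.match(r"(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError(f"unrecognized Lustre version: {version!r}")
    return int(m.group(1)), int(m.group(2))


def parse_refusal(log_line: str) -> Optional[Tuple[Version, Version]]:
    """Return (client, server) versions from a refusal line, or None if absent."""
    m = _REFUSED.search(log_line)
    if m is None:
        return None
    return major_minor(m.group("client")), major_minor(m.group("server"))


def needs_newer_filesystem(client: Version, server: Version) -> bool:
    """Assumption taken from this thread only: a 2.15 client will not connect
    to a 2.10 filesystem, and 2.12 or newer is recommended for it."""
    return client == (2, 15) and server < (2, 12)
```

For the log line from the original report, `parse_refusal` yields a 2.15 client against a 2.10 server, which `needs_newer_filesystem` flags. A `None` result only means that this particular refusal message was not found; it does not rule out other causes of a failed mount.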

elsaco commented 1 month ago

@tim-day-387 even with a 2.15 filesystem, the AL2023 client won't connect:

[Fri May 31 00:54:28 2024] LustreError: 26110:0:(mgc_request.c:252:do_config_log_add()) MGC172.31.84.120@tcp: failed processing log, type 1: rc = -5
[Fri May 31 00:54:38 2024] LustreError: 26155:0:(mgc_request.c:612:do_requeue()) failed processing log: -5
[Fri May 31 00:55:00 2024] LustreError: 15c-8: MGC172.31.84.120@tcp: Confguration from log jxlwlxxv-client failed from MGS -5. Communication error between node & MGS, a bad configuration, or other errors. See syslog for more info
[Fri May 31 00:55:00 2024] Lustre: Unmounted jxlwlxxv-client
[Fri May 31 00:55:11 2024] LustreError: 26110:0:(super25.c:187:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -5

Client module info:

[ec2-user]~$ modinfo lustre
filename:       /lib/modules/6.1.91-99.172.amzn2023.x86_64/kernel/drivers/staging/lustrefsx/lustre/llite/lustre.ko
license:        GPL
version:        2.15.3_114_gb61b66c_dirty
description:    Lustre Client File System
author:         OpenSFS, Inc. <http://www.lustre.org/>
alias:          fs-lustre
srcversion:     EAFEFA74278150D832AF4C5
depends:        obdclass,ptlrpc,libcfs,lnet,lov,mdc,lmv
staging:        Y
retpoline:      Y
intree:         Y
name:           lustre
vermagic:       6.1.91-99.172.amzn2023.x86_64 SMP preempt mod_unload modversions
sig_id:         PKCS#7
signer:         Amazon Linux Kernel Signing Key

The FSx summary reports Lustre version 2.15, so it matches the client version.

elsaco commented 1 month ago

After launching another test instance that includes the default security group, mounting the FSx share works:

[Fri May 31 01:48:00 2024] libcfs: module is from the staging directory, the quality is unknown, you have been warned.
[Fri May 31 01:48:00 2024] LNet: HW NUMA nodes: 1, HW CPU cores: 1, npartitions: 1
[Fri May 31 01:48:00 2024] alg: No test for adler32 (adler32-zlib)
[Fri May 31 01:48:01 2024] Key type ._llcrypt registered
[Fri May 31 01:48:01 2024] Key type .llcrypt registered
[Fri May 31 01:48:01 2024] lnet: module is from the staging directory, the quality is unknown, you have been warned.
[Fri May 31 01:48:01 2024] obdclass: module is from the staging directory, the quality is unknown, you have been warned.
[Fri May 31 01:48:01 2024] Lustre: Lustre: Build Version: 2.15.3_114_gb61b66c_dirty
[Fri May 31 01:48:01 2024] ptlrpc: module is from the staging directory, the quality is unknown, you have been warned.
[Fri May 31 01:48:01 2024] ksocklnd: module is from the staging directory, the quality is unknown, you have been warned.
[Fri May 31 01:48:01 2024] LNet: Added LNI 172.91.17.8@tcp [8/256/0/180]
[Fri May 31 01:48:01 2024] LNet: Accept secure, port 988
[Fri May 31 01:48:01 2024] osc: module is from the staging directory, the quality is unknown, you have been warned.
[Fri May 31 01:48:01 2024] fld: module is from the staging directory, the quality is unknown, you have been warned.
[Fri May 31 01:48:01 2024] lov: module is from the staging directory, the quality is unknown, you have been warned.
[Fri May 31 01:48:01 2024] fid: module is from the staging directory, the quality is unknown, you have been warned.
[Fri May 31 01:48:01 2024] mdc: module is from the staging directory, the quality is unknown, you have been warned.
[Fri May 31 01:48:01 2024] lmv: module is from the staging directory, the quality is unknown, you have been warned.
[Fri May 31 01:48:01 2024] lustre: module is from the staging directory, the quality is unknown, you have been warned.
[Fri May 31 01:48:01 2024] mgc: module is from the staging directory, the quality is unknown, you have been warned.
[Fri May 31 01:48:01 2024] Lustre: jxlwlxxv: nosquash_nids is cleared
[Fri May 31 01:48:01 2024] Lustre: jxlwlxxv: root_squash is set to 0:0
[Fri May 31 01:48:02 2024] Lustre: Mounted jxlwlxxv-client

Lustre module version 2.15.3_114_gb61b66c_dirty used for testing.

@gmarchand it's a filesystem access issue!
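
A quick way to rule out this kind of access problem before digging into versions is to test whether the client can open a TCP connection to the filesystem at all. The sketch below is only illustrative: the function name is hypothetical, and the port is taken from the `LNet: Accept secure, port 988` line in the logs above. A successful connection shows that the port is reachable from the instance, not that the mount itself will succeed.

```python
import socket

# Port taken from the "LNet: Accept secure, port 988" log line above.
LNET_PORT = 988


def tcp_reachable(host: str, port: int = LNET_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    DNS failures, refused connections and timeouts all raise OSError
    subclasses, so every one of them is reported as "not reachable".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `tcp_reachable()` with the filesystem's DNS name from the instance that fails to mount gives a first hint: if it returns `False`, the security groups and network path are worth checking before anything else.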

stewartsmith commented 1 month ago

Looks like we can resolve this - feel free to reopen/comment if I'm wrong.