aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

PCluster 3.8 CREATE_FAILED using SharedStorage > FsxLustreSettings > FileSystemId #6353

Closed enlznep closed 3 months ago

enlznep commented 3 months ago

During pcluster create, I'm receiving the following error:

Running handlers complete
[2024-07-16T13:55:25+09:00] ERROR: Exception handlers complete
Infra Phase failed. 47 resources updated in 01 minutes 16 seconds
[2024-07-16T13:55:25+09:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/cinc-stacktrace.out
[2024-07-16T13:55:25+09:00] FATAL: ---------------------------------------------------------------------------------------
[2024-07-16T13:55:25+09:00] FATAL: PLEASE PROVIDE THE CONTENTS OF THE stacktrace.out FILE (above) IF YOU FILE A BUG REPORT
[2024-07-16T13:55:25+09:00] FATAL: ---------------------------------------------------------------------------------------
[2024-07-16T13:55:25+09:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: lustre[mount fsx] (aws-parallelcluster-environment::fsx line 33) had an error: Mixlib::ShellOut::ShellCommandFailed: mount[/scratch] (aws-parallelcluster-environment::fsx line 33) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '19'
---- Begin output of ["mount", "-t", "lustre", "-o", "defaults,_netdev,flock,user_xattr,noatime,noauto,x-systemd.automount", "fs-****.fsx.ap-northeast-1.amazonaws.com@tcp:/1234567", "/scratch"] ----
STDOUT:
STDERR: mount.lustre: mount fs-****.fsx.ap-northeast-1.amazonaws.com@tcp:/1234567 at /scratch failed: No such device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems
---- End output of ["mount", "-t", "lustre", "-o", "defaults,_netdev,flock,user_xattr,noatime,noauto,x-systemd.automount", "fs-****.fsx.ap-northeast-1.amazonaws.com@tcp:/1234567", "/scratch"] ----
Ran ["mount", "-t", "lustre", "-o", "defaults,_netdev,flock,user_xattr,noatime,noauto,x-systemd.automount", "fs-****.fsx.ap-northeast-1.amazonaws.com@tcp:/1234567", "/scratch"] returned 19


enlznep commented 3 months ago

Resolved by recreating the AMI without updating the kernel version.
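
For anyone hitting the same failure: the mount error above reports "No such device" (exit code 19, which matches ENODEV on Linux) and suggests checking /proc/filesystems. The following is a minimal diagnostic sketch, not part of ParallelCluster, that checks whether the running kernel has a registered `lustre` filesystem type. The function name `lustre_registered` is hypothetical, and the idea that a kernel update left the Lustre client module unavailable is an assumption based on the resolution above.

```python
import platform
from pathlib import Path


def lustre_registered(proc_filesystems_text: str) -> bool:
    """Return True if 'lustre' is listed as a filesystem type.

    Expects the content of /proc/filesystems, where each line is an
    optional 'nodev' flag followed by the filesystem type name.
    """
    for line in proc_filesystems_text.splitlines():
        fields = line.split()
        # The filesystem type is always the last field on the line.
        if fields and fields[-1] == "lustre":
            return True
    return False


if __name__ == "__main__":
    proc_file = Path("/proc/filesystems")
    print(f"Running kernel: {platform.release()}")
    if not proc_file.exists():
        print("/proc/filesystems not found (not a Linux host?)")
    elif lustre_registered(proc_file.read_text()):
        print("lustre is registered with the running kernel")
    else:
        # Assumption: the lustre client module is missing or not loaded
        # for this kernel, e.g. after a kernel update on the AMI.
        print("lustre is NOT registered; the module may not be loaded")
```

Running this on the head node (or on an instance launched from the custom AMI) before creating the cluster may show whether the AMI's kernel can mount Lustre at all. A negative result can also mean the module simply has not been loaded yet, so it is a hint rather than a definitive check.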