Xilinx / video-sdk

https://xilinx.github.io/video-sdk

Install Lustre Client fails on VT1 #85

Closed gmarchand closed 8 months ago

gmarchand commented 10 months ago

Description: The Lustre client cannot be installed on the AMD Xilinx Video SDK AMI with ECS support for VT1 Instances (AL2), although the same user data works on the Amazon ECS-Optimized Amazon Linux 2 (AL2) x86_64 AMI.

AMI used: AMD Xilinx Video SDK AMI with ECS support for VT1 Instances (AL2): https://aws.amazon.com/marketplace/pp/prodview-phvk6d4mq3hh6

User data used by the EC2 launch template:

#!/bin/bash -ex

exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1

uname -r

fsx_dnsname=%DNS_NAME%
fsx_mountname=%MOUNT_NAME%
fsx_mountpoint=%MOUNT_POINT%

amazon-linux-extras install -y lustre2.10
mkdir -p "$fsx_mountpoint"
mount -t lustre -o relatime,flock "${fsx_dnsname}@tcp:/${fsx_mountname}" "${fsx_mountpoint}"
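
Because the script runs under `bash -ex`, the first failing command aborts the whole user data run, so a failed install also means the mount is never attempted. If the failure were transient (an assumption; nothing in this report confirms that), a small retry wrapper around the install step could be sketched as below. `retry` is a hypothetical helper, not something shipped with either AMI.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to <attempts> times,
# sleeping <delay> seconds between failed attempts.
retry() {
  local attempts=$1 delay=$2
  shift 2
  local n=1
  # The command is the loop condition, so a failure here does not
  # trigger "set -e"; only the final "return 1" does.
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    n=$((n + 1))
  done
}
```

In the script above, the install line would then read `retry 5 15 amazon-linux-extras install -y lustre2.10`; the attempt count and delay are arbitrary illustration values.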

System logs:

[  120.666982] cloud-init[23528]: + exec
[  120.667287] cloud-init[23528]: ++ tee /var/log/user-data.log
[  120.668318] cloud-init[23528]: ++ logger -t user-data -s
<13>Nov 29 11:58:07 user-data: + uname -r
<13>Nov 29 11:58:07 user-data: 4.14.305-227.531.amzn2.x86_64
<13>Nov 29 11:58:07 user-data: + fsx_dnsname=fs-xxx.fsx.eu-west-1.amazonaws.com
<13>Nov 29 11:58:07 user-data: + fsx_mountname=xxx
<13>Nov 29 11:58:07 user-data: + fsx_mountpoint=/fsx-lustre
<13>Nov 29 11:58:07 user-data: + amazon-linux-extras install -y lustre2.10
<13>Nov 29 11:58:09 user-data: Loaded plugins: dkms-build-requires, priorities, update-motd, upgrade-helper
<13>Nov 29 11:58:09 user-data: Existing lock /var/run/yum.pid: another copy is running as pid 23715.
<13>Nov 29 11:58:09 user-data: Another app is currently holding the yum lock; waiting for it to exit...
<13>Nov 29 11:58:09 user-data:   The other application is: yum
<13>Nov 29 11:58:09 user-data:     Memory : 221 M RSS (437 MB VSZ)
<13>Nov 29 11:58:09 user-data:     Started: Wed Nov 29 11:58:08 2023 - 00:01 ago
<13>Nov 29 11:58:09 user-data:     State  : Running, pid: 23715
<13>Nov 29 11:58:11 user-data: Another app is currently holding the yum lock; waiting for it to exit...
<13>Nov 29 11:58:11 user-data:   The other application is: yum
<13>Nov 29 11:58:11 user-data:     Memory : 334 M RSS (550 MB VSZ)
<13>Nov 29 11:58:11 user-data:     Started: Wed Nov 29 11:58:08 2023 - 00:03 ago
<13>Nov 29 11:58:11 user-data:     State  : Running, pid: 23715
<13>Nov 29 11:58:13 user-data: Another app is currently holding the yum lock; waiting for it to exit...
<13>Nov 29 11:58:13 user-data:   The other application is: yum
<13>Nov 29 11:58:13 user-data:     Memory : 349 M RSS (566 MB VSZ)
<13>Nov 29 11:58:13 user-data:     Started: Wed Nov 29 11:58:08 2023 - 00:05 ago
<13>Nov 29 11:58:13 user-data:     State  : Running, pid: 23715
<13>Nov 29 11:58:15 user-data: Another app is currently holding the yum lock; waiting for it to exit...
<13>Nov 29 11:58:15 user-data:   The other application is: yum
<13>Nov 29 11:58:15 user-data:     Memory : 350 M RSS (566 MB VSZ)
<13>Nov 29 11:58:15 user-data:     Started: Wed Nov 29 11:58:08 2023 - 00:07 ago
<13>Nov 29 11:58:15 user-data:     State  : Running, pid: 23715
[  OK  ] Started Dynamically Generate Message Of The Day.
<13>Nov 29 11:58:17 user-data: Cleaning repos: amzn2-core amzn2extra-docker amzn2extra-ecs amzn2extra-epel
<13>Nov 29 11:58:17 user-data:               : amzn2extra-lustre2.10 epel
<13>Nov 29 11:58:17 user-data: 34 metadata files removed
<13>Nov 29 11:58:17 user-data: 12 sqlite files removed
<13>Nov 29 11:58:17 user-data: 0 metadata files removed
<13>Nov 29 11:58:17 user-data: Loaded plugins: dkms-build-requires, priorities, update-motd, upgrade-helper
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:  One of the configured repositories failed (Unknown),
<13>Nov 29 11:58:52 user-data:  and yum doesn't have enough cached data to continue. At this point the only
<13>Nov 29 11:58:52 user-data:  safe thing yum can do is fail. There are a few ways to work "fix" this:
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:      1. Contact the upstream for the repository and get them to fix the problem.
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:      2. Reconfigure the baseurl/etc. for the repository, to point to a working
<13>Nov 29 11:58:52 user-data:         upstream. This is most often useful if you are using a newer
<13>Nov 29 11:58:52 user-data:         distribution release than is supported by the repository (and the
<13>Nov 29 11:58:52 user-data:         packages for the previous distribution release still work).
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:      3. Run the command with the repository temporarily disabled
<13>Nov 29 11:58:52 user-data:             yum --disablerepo=<repoid> ...
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:      4. Disable the repository permanently, so yum won't use it by default. Yum
<13>Nov 29 11:58:52 user-data:         will then just ignore the repository until you permanently enable it
<13>Nov 29 11:58:52 user-data:         again or use --enablerepo for temporary usage:
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:             yum-config-manager --disable <repoid>
<13>Nov 29 11:58:52 user-data:         or
<13>Nov 29 11:58:52 user-data:             subscription-manager repos --disable=<repoid>
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:      5. Configure the failing repository to be skipped, if it is unavailable.
<13>Nov 29 11:58:52 user-data:         Note that yum will try to contact the repo. when it runs most commands,
<13>Nov 29 11:58:52 user-data:         so will have to try and fail each time (and thus. yum will be be much
<13>Nov 29 11:58:52 user-data:         slower). If it is a very temporary problem though, this is often a nice
<13>Nov 29 11:58:52 user-data:         compromise:
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data:             yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true
<13>Nov 29 11:58:52 user-data: 
<13>Nov 29 11:58:52 user-data: Cannot retrieve metalink for repository: epel/x86_64. Please verify its path and try again
<13>Nov 29 11:58:52 user-data: Installation failed. Check that you have permissions to install.
<13>Nov 29 11:58:52 user-data: Installing lustre-client
[  165.815024] cloud-init[23528]: Nov 29 11:58:52 cloud-init[23528]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-003 [13]
[  165.834238] cloud-init[23528]: Nov 29 11:58:52 cloud-init[23528]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
[  165.837475] cloud-init[23528]: Nov 29 11:58:52 cloud-init[23528]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
ci-info: no authorized ssh keys fingerprints found for user ec2-user.
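
The decisive line in this output is the final yum error, which names the repository whose metadata could not be fetched. A minimal sketch for pulling that repository id out of a captured log follows; `failing_repos` is a hypothetical helper name, and the pattern assumes the error text is worded exactly as in the log above.

```shell
#!/bin/bash
# Hypothetical helper: print the id of every repository for which yum
# reported "Cannot retrieve metalink". Reads the given files, or stdin
# when no file is passed.
failing_repos() {
  # Capture the text between "repository: " and the "/<arch>" suffix.
  sed -n 's|.*Cannot retrieve metalink for repository: \([^/ ]*\)/.*|\1|p' "$@" | sort -u
}
```

It can be pointed at the file the user data script writes, for example `failing_repos /var/log/user-data.log`; for the error line quoted above it prints `epel`. yum's own message suggests `skip_if_unavailable` for such a repository, but whether that is appropriate on this AMI is not established in this thread.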

The same user data works with this AMI: Amazon ECS-Optimized Amazon Linux 2 (AL2) x86_64 AMI: https://aws.amazon.com/marketplace/pp/prodview-do6i4ripwbhs2?sr=0-1&ref=beagle&applicationId=AWSMPContessa

NastoohX commented 8 months ago

Hi, sorry for the late reply. Looking at the provided logs, I am not able to correlate the installation issue with the VT1 AMI. To check whether the issue is caused by our packages, remove the SDK packages on a non-mission-critical system, as per https://xilinx.github.io/video-sdk/v3.0/getting_started_on_vt1.html#installing-the-sdk-on-an-existing-ami, Step 3. Once the SDK is removed, repeat your original installation. If it succeeds, try to reinstall the SDK by following the same link. If the installation is not successful, please provide the relevant logs. Cheers,

NastoohX commented 8 months ago

Hi, closing this ticket due to inactivity. Feel free to reopen it if needed. Cheers,

gmarchand commented 8 months ago

Hello @NastoohX, I found the issue: the ECS AMI needs to be upgraded. Here is the reference: https://github.com/aws/amazon-ecs-ami/pull/191

hifarhanali commented 1 month ago

any updates?