aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0
827 stars 312 forks

Pcluster build-image of Rocky 8.9 creating broken AMI #6397

Closed. jagga13 closed 3 weeks ago

jagga13 commented 1 month ago

Hello,

I am trying to build a new ParallelCluster AMI based on Rocky 8.9. The build-image process appears to complete successfully, but the generated AMI has a broken sshd. When I open a console session on it via SSM, I can see that the sshd service cannot start because the SSH host keys are missing:

[screenshot: sshd failing to start because the host keys are missing]

I have confirmed that the parent AMI I am using has a working SSH setup, and I can see the host keys in its /etc/ssh directory. It looks like something pcluster build-image does is removing them. This is a new issue that I have never experienced before. I would appreciate any guidance.
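For anyone comparing a parent AMI against a built AMI in the same way, here is a minimal Python sketch of the check described above. The function name `list_host_keys` is hypothetical, and it assumes the conventional OpenSSH naming `ssh_host_<type>_key` for private host keys; it is not part of ParallelCluster.

```python
from pathlib import Path


def list_host_keys(ssh_dir="/etc/ssh"):
    """Return the sorted names of private host key files found in ssh_dir."""
    directory = Path(ssh_dir)
    if not directory.is_dir():
        return []
    # Assumption: private host keys follow the usual ssh_host_<type>_key naming.
    # The matching public keys end in .pub, so this pattern skips them.
    return sorted(p.name for p in directory.glob("ssh_host_*_key") if p.is_file())
```

Running `print(list_host_keys())` on an instance launched from each AMI shows whether the keys exist on the parent and are absent on the built image. An empty list is consistent with the sshd failure reported here.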

hanwen-pcluste commented 3 weeks ago

Hi Jagga,

During ParallelCluster build-image, some files under /etc/ssh are cleaned up (code: https://github.com/aws/aws-parallelcluster-cookbook/blob/release-3.10/cookbooks/aws-parallelcluster-platform/files/ami_cleanup.sh). Can you add those files back and then try cluster creation?
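One possible way to add the host keys back is to regenerate them rather than copy them from the parent AMI. The sketch below is a hedged Python illustration, not something taken from the ParallelCluster cookbook: `ensure_host_keys` is a hypothetical helper, and it relies on OpenSSH's `ssh-keygen -A`, which creates any missing host keys of the default types in the system default location (normally /etc/ssh) and must run as root.

```python
import subprocess
from pathlib import Path


def ensure_host_keys(ssh_dir="/etc/ssh", run=subprocess.run):
    """Run `ssh-keygen -A` if ssh_dir has no private host keys.

    Returns True if key generation was invoked, False if keys already existed.
    """
    directory = Path(ssh_dir)
    if directory.is_dir() and any(directory.glob("ssh_host_*_key")):
        return False  # at least one host key is present, nothing to do
    # Assumption: ssh_dir is the system default location, because
    # `ssh-keygen -A` always writes to that default, not to ssh_dir.
    run(["ssh-keygen", "-A"], check=True)
    return True
```

After the keys are regenerated, the sshd service still has to be restarted (for example with `systemctl restart sshd`) before SSH logins will work. The `run` parameter exists only so the function can be exercised without root privileges.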

Meanwhile I will discuss the code with my co-workers.

Also, if you need further assistance, can you share the cluster configuration file (with sensitive information removed) and tell us which node has the error (e.g. head node, login node, ...)?

Thank you, Hanwen

jagga13 commented 3 weeks ago

Thanks for the response, but it looks like this issue was caused by some changes our pipeline was making to the cloud config. Once we fixed that, the pcluster build-image process seems to be working fine and we are able to get into the systems without issues.

Thanks again.