aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

Unable to create new pcluster AMI #6358

Closed jagga13 closed 3 months ago

jagga13 commented 3 months ago

Hello,

I am trying to create a new ParallelCluster AMI based on a custom AMI. The build keeps failing even though I have granted it full IAM access to S3, KMS, and SSM. Here are the details:

AWS ParallelCluster version 3.10.1

Image config:

Region: us-west-2
Build:
  InstanceType: c5.xlarge
  ParentImage: ami-XXXXX
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::{xxxxxxx}:policy/kms-full-access
      - Policy: arn:aws:iam::aws:policy/AmazonS3FullAccess
      - Policy: arn:aws:iam::aws:policy/AmazonSSMFullAccess
  SecurityGroupIds:
    - sg-XXXXX
  SubnetId: subnet-XXXXX
  UpdateOsPackages:
    Enabled: false

I see the following errors in CloudFormation:

The following resource(s) failed to create: [ParallelClusterImage]. 
Resource handler returned message: "Error occurred during operation 'Workflow Execution ID: 'wf-b867ea03-6bf2-4910-a834-a548aa0728d2' failed with reason: Unable to bootstrap TOE'." (RequestToken: 395ac732-4b6b-1535-c8a3-b3c7413fa788, HandlerErrorCode: GeneralServiceException)

I see the following errors in CloudWatch:

Started step ApplyBuildComponents with action ExecuteComponents
Sending command to instance to run
Running command (command id: 3ec02c84-29d5-444d-9cb1-a639c91ac362)
Waiting for command to complete (command id: 3ec02c84-29d5-444d-9cb1-a639c91ac362). Attempt number: 1.
Command failed (command id: 3ec02c84-29d5-444d-9cb1-a639c91ac362, state: Failed)

The EC2 builder instance seems to come up in a healthy state, but it is terminated after the failed step above. I also can't disable rollback on failure by passing in the option, possibly because the failure happens too early in the build process. Any help would be appreciated!

Thanks!

jagga13 commented 3 months ago

I see the following corresponding error in the ssm logs within the instance that might be a clue:

2024-07-18 01:31:08 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Got reply msg Id 95b23bb7-8d6b-45e2-825c-7ff1aedae581 for RunCommandResult aws.ssm.77068c10-b575-4d48-bd5b-69f726df3fdf.i-0ee71b5fa4a9c568b, starting reply thread
2024-07-18 01:31:08 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Got reply msg Id 4de7ad92-7487-42b1-8849-03c78b9e7c41 for RunCommandResult aws.ssm.77068c10-b575-4d48-bd5b-69f726df3fdf.i-0ee71b5fa4a9c568b, starting reply thread
2024-07-18 01:31:08 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] started reply processing - 4de7ad92-7487-42b1-8849-03c78b9e7c41
2024-07-18 01:31:08 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Sending reply {
  "additionalInfo": {
    "agent": {
      "lang": "en-US",
      "name": "amazon-ssm-agent",
      "os": "",
      "osver": "1",
      "ver": ""
    },
    "dateTime": "2024-07-18T01:31:08.161Z",
    "runId": "",
    "runtimeStatusCounts": {
      "Failed": 1
    }
  },
  "documentStatus": "Failed",
  "documentTraceOutput": "",
  "runtimeStatus": {
    "aws:runShellScript": {
      "status": "Failed",
      "code": 1,
      "name": "aws:runShellScript",
      "output": "Waiting for Cloud-init to initialize ...\nURL 'https://ec2imagebuilder-toe-us-west-2-prod.s3.us-west-2.amazonaws.com/bootstrap_scripts/bootstrap.sh' returned HTTP status '200'\n/var/lib/amazon/ssm/i-0ee71b5fa4a9c568b/document/orchestration/77068c10-b575-4d48-bd5b-69f726df3fdf/awsrunShellScript/0.awsrunShellScript/_script.sh: line 62: /tmp/imagebuilder/TaskOrchestratorAndExecutor/bootstrap.sh: Permission denied\n{\"failureMessage\":\"Unable to bootstrap TOE\"}\n\n----------ERROR-------\nfailed to run commands: exit status 1",
      "startDateTime": "2024-07-18T01:31:07.755Z",
      "endDateTime": "2024-07-18T01:31:08.160Z",
      "outputS3BucketName": "",
      "outputS3KeyPrefix": "",
      "stepName": "",
      "standardOutput": "Waiting for Cloud-init to initialize ...\nURL 'https://ec2imagebuilder-toe-us-west-2-prod.s3.us-west-2.amazonaws.com/bootstrap_scripts/bootstrap.sh' returned HTTP status '200'\n/var/lib/amazon/ssm/i-0ee71b5fa4a9c568b/document/orchestration/77068c10-b575-4d48-bd5b-69f726df3fdf/awsrunShellScript/0.awsrunShellScript/_script.sh: line 62: /tmp/imagebuilder/TaskOrchestratorAndExecutor/bootstrap.sh: Permission denied\n{\"failureMessage\":\"Unable to bootstrap TOE\"}\n",
      "standardError": "failed to run commands: exit status 1"
    }
  }
}
2024-07-18 01:31:08 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] successfully sent reply message id: 4de7ad92-7487-42b1-8849-03c78b9e7c41
2024-07-18 01:31:08 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] started reply processing - 95b23bb7-8d6b-45e2-825c-7ff1aedae581
2024-07-18 01:31:11 INFO [ssm-document-worker] [77068c10-b575-4d48-bd5b-69f726df3fdf] Stop the cloudwatchlogs publisher
2024-07-18 01:31:08 INFO [ssm-agent-worker] [MessageService] [MGSInteractor] Sending reply {
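
For anyone reading a similar log: the useful part of the reply above is buried in the `standardOutput` string. Here is a minimal sketch, assuming the reply JSON has been copied out of the SSM agent log into a string, that pulls the relevant lines out of each failed plugin. The function name `failure_lines` and the two marker strings are illustrative choices, not part of any AWS tooling; the field names (`runtimeStatus`, `status`, `standardOutput`) are the ones visible in the log above.

```python
import json

# Substrings worth surfacing; both appear in the log above (illustrative choice).
MARKERS = ("Permission denied", "failureMessage")


def failure_lines(reply_json):
    """Return (plugin_name, line) pairs for interesting lines of failed plugins."""
    reply = json.loads(reply_json)
    found = []
    for plugin, status in reply.get("runtimeStatus", {}).items():
        if status.get("status") != "Failed":
            continue  # only inspect plugins that reported a failure
        for line in status.get("standardOutput", "").splitlines():
            if any(marker in line for marker in MARKERS):
                found.append((plugin, line))
    return found
```

Run against the reply above, this would surface the `bootstrap.sh: Permission denied` line and the `Unable to bootstrap TOE` failure message without reading the whole payload.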

jagga13 commented 3 months ago

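Please disregard. This turned out to be a documented issue with /tmp being mounted with the noexec option, which is why /tmp/imagebuilder/TaskOrchestratorAndExecutor/bootstrap.sh failed with "Permission denied". After fixing /tmp, I was able to build the AMI successfully.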
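
In case it helps others check a parent AMI before starting a build, here is a minimal sketch of a noexec check. It assumes a Linux instance where the contents of /proc/mounts can be read and passed in as text; the helper names `mount_options` and `is_noexec` are hypothetical, not part of ParallelCluster or Image Builder.

```python
from typing import List, Optional


def mount_options(mounts_text: str, mount_point: str) -> Optional[List[str]]:
    """Return the options of the last mount entry for mount_point, or None."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # a later entry shadows an earlier one
    return options


def is_noexec(mounts_text: str, mount_point: str = "/tmp") -> bool:
    """True if mount_point is its own mount and carries the noexec option."""
    options = mount_options(mounts_text, mount_point)
    # If mount_point is not a separate mount it inherits from its parent
    # filesystem; this sketch does not follow that and returns False.
    return options is not None and "noexec" in options
```

On the instance this could be called as `is_noexec(open("/proc/mounts").read())`. It only detects the condition; how to change it depends on how the parent AMI sets up /tmp.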