aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

cluster update fails in 3.10.0, 3.9.3 #6339

Open snemir2 opened 2 months ago

snemir2 commented 2 months ago


Bug description and how to reproduce:

The cluster repeatedly fails to update; from the CloudFormation point of view the stack goes to "rollback complete". (Custom routines do not even appear to get called.)


snemir2 commented 2 months ago

might be related to https://github.com/aws/aws-parallelcluster/issues/6329

[screenshot attached]
snemir2 commented 2 months ago

A bit of an update: I rolled back the very same cluster config to PC 3.9.3 and ran the very same `pcluster update` successfully. The issue is clearly specific to PC 3.10.x.

hehe7318 commented 1 month ago

Hi snemir2,

Can you share your original and updated pcluster configuration yaml file? That could help us to reproduce the error.

might be related to https://github.com/aws/aws-parallelcluster/issues/6329

Seems not related.

Best regards, Xuanqi He

hehe7318 commented 1 month ago

Hi snemir2,

We have been investigating the issue with the failed update of your AWS ParallelCluster. Our initial findings from the cfn-init.log file suggest that a critical point of failure might be related to the portkey.service. The log shows:

              + systemctl restart portkey
              Warning: The unit file, source configuration file, or drop-ins of portkey.service changed on disk. Run 'systemctl daemon-reload' to reload units.
              Job for portkey.service failed because the control process exited with error code.
              See "systemctl status portkey.service" and "journalctl -xeu portkey.service" for details.
              + echo 'skipping portkey restart'
              skipping portkey restart

The failure to restart portkey.service could have impacted subsequent operations, such as accessing the S3 bucket, as seen in the error message below:

Error executing action `run` on resource 'bash[configure_portkey]'
Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of "bash"  ----
STDOUT: 
STDERR: + mkdir -p /etc/portkey
+ aws s3 ls s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/a2ai/branch/release-v4.0/portkey/
---- End output of "bash"  ----
Ran "bash"  returned 1
Failed to execute OnNodeUpdated script 1 s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v4.0/download_and_run_cookbook.sh, return code: 1.
CloudFormation signaled successfully with status FAILURE

This error might stem from:

The sequence of events suggests that the portkey.service restart failure may have been the root cause, leading to issues with accessing the S3 bucket. We recommend running systemctl daemon-reload to reload the unit files and then retrying the update process.
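The suggested recovery can be sketched as a small script; this is a minimal sketch assuming a systemd host, with the unit name taken from the log above (the helper name is mine, not part of ParallelCluster):

```python
# A minimal sketch of the suggested recovery: `systemctl daemon-reload` picks
# up the changed unit file that the warning complains about, after which the
# restart is retried. Assumes a systemd host; degrades gracefully elsewhere.
import shutil
import subprocess

def reload_and_restart(unit: str) -> bool:
    """Run `systemctl daemon-reload`, then restart `unit`; True on success."""
    if shutil.which("systemctl") is None:
        print(f"systemctl not found; would run: daemon-reload && restart {unit}")
        return False
    if subprocess.run(["systemctl", "daemon-reload"]).returncode != 0:
        return False
    return subprocess.run(["systemctl", "restart", unit]).returncode == 0

reload_and_restart("portkey.service")
```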

To assist us in further diagnosing the problem and providing a resolution, we kindly request the following:

These details will help us better understand the configuration and environment. We will continue our investigation and keep you informed of any progress.

Best regards, Xuanqi He

snemir2 commented 1 month ago

Hi @hehe7318 - thank you for looking into the issue. I am almost sure that the error you are referring to above is a consequence, not the source, of the problem (I took portkey out of the solution and had the same failure). As you can see from CloudFormation/CloudTrail, it (for some strange reason) tries to make a RunInstances API call, fails to re-provision the ENI on the head node, and fails. That causes the networking problems on the instance that you are seeing. FYI -- I am also seeing this problem on 3.9.3.

[screenshot attached]

This is the failed API call from CloudTrail:

{
    "eventVersion": "1.09",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROAZQUXECJHPLP3U6GDM:sergey@a2-ai.com",
        "arn": "arn:aws:sts::654225707598:assumed-role/AWSReservedSSO_AWSAdministratorAccess_6cb29b3b61ef9620/sergey@a2-ai.com",
        "accountId": "654225707598",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AROAZQUXECJHPLP3U6GDM",
                "arn": "arn:aws:iam::654225707598:role/aws-reserved/sso.amazonaws.com/us-east-2/AWSReservedSSO_AWSAdministratorAccess_6cb29b3b61ef9620",
                "accountId": "654225707598",
                "userName": "AWSReservedSSO_AWSAdministratorAccess_6cb29b3b61ef9620"
            },
            "attributes": {
                "creationDate": "2024-08-05T11:03:25Z",
                "mfaAuthenticated": "false"
            }
        },
        "invokedBy": "cloudformation.amazonaws.com"
    },
    "eventTime": "2024-08-05T11:04:28Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "RunInstances",
    "awsRegion": "us-east-2",
    "sourceIPAddress": "cloudformation.amazonaws.com",
    "userAgent": "cloudformation.amazonaws.com",
    "errorCode": "Client.InvalidNetworkInterface.InUse",
    "errorMessage": "Interface: [eni-07c94d1a9cd037f2f] in use.",
    "requestParameters": {
        "instancesSet": {
            "items": [
                {
                    "minCount": 1,
                    "maxCount": 1
                }
            ]
        },
        "blockDeviceMapping": {},
        "monitoring": {
            "enabled": false
        },
        "disableApiTermination": false,
        "disableApiStop": false,
        "clientToken": "14c7a9db-2c7d-217d-48d5-262ade2651c6",
        "ebsOptimized": false,
        "tagSpecificationSet": {
            "items": [
                {
                    "resourceType": "instance",
                    "tags": [
                        {
                            "key": "parallelcluster:version",
                            "value": "3.9.3"
                        },
                        {
                            "key": "aws:cloudformation:stack-name",
                            "value": "A2AiClustertesting"
                        },
                        {
                            "key": "aws:cloudformation:stack-id",
                            "value": "arn:aws:cloudformation:us-east-2:654225707598:stack/A2AiClustertesting/4a6dc1c0-50e3-11ef-9498-021a39b766eb"
                        },
                        {
                            "key": "A2AI:creator",
                            "value": "sergey"
                        },
                        {
                            "key": "parallelcluster:networking",
                            "value": "EFA=NONE"
                        },
                        {
                            "key": "parallelcluster:filesystem",
                            "value": "efs=1, multiebs=1, raid=0, fsx=0"
                        },
                        {
                            "key": "Name",
                            "value": "HeadNode"
                        },
                        {
                            "key": "map-migrated",
                            "value": "mig8KP4B19EMB"
                        },
                        {
                            "key": "A2AI:a2ai-cloud-version",
                            "value": "branch/release-v4.0"
                        },
                        {
                            "key": "parallelcluster:cluster-name",
                            "value": "A2AiClustertesting"
                        },
                        {
                            "key": "aws:cloudformation:logical-id",
                            "value": "HeadNode"
                        },
                        {
                            "key": "parallelcluster:node-type",
                            "value": "HeadNode"
                        },
                        {
                            "key": "parallelcluster:attributes",
                            "value": "ubuntu2204, slurm, 3.9.3, x86_64"
                        },
                        {
                            "key": "A2AI:a2ai-cloud-env",
                            "value": "dev"
                        }
                    ]
                }
            ]
        },
        "launchTemplate": {
            "launchTemplateId": "lt-093388f1a2c21ab1a",
            "version": "2"
        }
    },
    "responseElements": null,
    "requestID": "9ea4e084-bdb9-4d8c-a268-f388506ee1ea",
    "eventID": "3a98e509-4f97-4b03-a55f-704a1303439d",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "654225707598",
    "eventCategory": "Management"
}
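For anyone triaging the same failure, the relevant calls can be picked out of a CloudTrail export programmatically. A minimal sketch over already-parsed event dicts shaped like the one above (the helper name is mine, not an AWS API):

```python
# Hypothetical helper: filter CloudTrail events (parsed into dicts, shaped like
# the event above) down to RunInstances calls rejected because an ENI was in use.

def find_eni_in_use_failures(events):
    """Yield (eventTime, errorMessage) for each matching failed call."""
    for event in events:
        if (event.get("eventName") == "RunInstances"
                and event.get("errorCode") == "Client.InvalidNetworkInterface.InUse"):
            yield event["eventTime"], event["errorMessage"]

sample = {
    "eventName": "RunInstances",
    "eventTime": "2024-08-05T11:04:28Z",
    "errorCode": "Client.InvalidNetworkInterface.InUse",
    "errorMessage": "Interface: [eni-07c94d1a9cd037f2f] in use.",
}
print(list(find_eni_in_use_failures([sample])))
# [('2024-08-05T11:04:28Z', 'Interface: [eni-07c94d1a9cd037f2f] in use.')]
```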
francisreyes-tfs commented 1 month ago

Wow, I hit this issue as well. Simply adding a new InstanceType to a Slurm cluster queue triggers the creation of a new HeadNode in CloudFormation (ParallelCluster 3.10.0), and the error stems from CloudFormation wanting to create a new head node while the ENI is still attached to the old one. Why this triggers a new HeadNode resource I don't know, but from my own work with custom resources: when a CloudFormation custom resource returns a new physical ID on an update, that tells CloudFormation the resource needs to be replaced.
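The replacement semantics described above can be sketched in a few lines. This is a generic illustration of how CloudFormation interprets a custom resource's response, not ParallelCluster's actual code; the handler and resource names are hypothetical:

```python
# Sketch of CloudFormation custom-resource Update semantics: if the handler
# returns a *new* PhysicalResourceId, CloudFormation treats the resource as
# replaced (and later deletes the old one). Echoing the incoming ID back means
# "same resource, updated in place" -- no replacement is scheduled.

def handle_update(event: dict) -> dict:
    """Minimal Update handler that preserves the physical resource ID."""
    return {
        "Status": "SUCCESS",
        "PhysicalResourceId": event["PhysicalResourceId"],  # stable: no replace
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }

event = {
    "RequestType": "Update",
    "PhysicalResourceId": "head-node-config-v1",
    "StackId": "arn:aws:cloudformation:us-east-2:111111111111:stack/example",
    "RequestId": "req-1",
    "LogicalResourceId": "HeadNodeConfig",
}
print(handle_update(event)["PhysicalResourceId"])  # head-node-config-v1
```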

hanwen-pcluste commented 1 month ago

Apologies for the late reply.

Can you share your original and updated cluster configuration YAML file? That could help us to reproduce the error. I tried to add a new InstanceType and successfully updated my cluster.


samcofer commented 3 days ago

Hello! I'm seeing this issue as well when trying to update our custom AMI. Have there been any updates here? I'm happy to share our before and after configuration if that would be helpful.

This is on cluster version 3.9.3

HeadNode: 
  CustomActions: 
    OnNodeConfigured: 
      Script: "s3://OBSCURED/install-pwb-config.sh"
  Iam: 
    S3Access: 
      - BucketName: OBSCURED
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - Policy: arn:aws:iam::637485797898:policy/elbaccess-c5f9897
  InstanceType: t3.xlarge
  Networking: 
    SubnetId: subnet-0a937bc9f0c04ad8b
    AdditionalSecurityGroups: 
      - sg-09e341de4a0f2773e
  LocalStorage:
    RootVolume:
      Size: 120 
  SharedStorageType: Efs
  Ssh:
    KeyName: OBSCURED
Image: 
  Os: ubuntu2004
  CustomAmi: ami-043eba58b4b8131c6
Region: eu-west-1
Scheduling: 
  Scheduler: slurm
  SlurmSettings:
    EnableMemoryBasedScheduling: true
    Database:
      Uri: slurm-OBSCURED.eu-west-1.rds.amazonaws.com:3306
      UserName: slurm_db_admin
      PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
      DatabaseName: slurm    
  SlurmQueues: 

    - Name: interactive
      ComputeResources:
        - Name: rstudio 
          InstanceType: t3.xlarge
          MaxCount: 20 
          MinCount: 1
          Efa:
            Enabled: FALSE
      CustomSlurmSettings:
        OverSubscribe: FORCE:2
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds:
          - subnet-0a937bc9f0c04ad8b

    - Name: all 
      ComputeResources:
        - Name: rstudio 
          InstanceType: t3.xlarge
          MaxCount: 10
          MinCount: 0 
          Efa:
            Enabled: FALSE
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds:
          - subnet-0a937bc9f0c04ad8b

    - Name: gpu 
      ComputeResources:
        - Name: large
          InstanceType: p3.2xlarge
          MaxCount: 1
          MinCount: 0
          Efa:
            Enabled: FALSE
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds: 
          - subnet-0a937bc9f0c04ad8b

LoginNodes:
  Pools:
    - Name: login
      Count: 2 
      InstanceType: t3.xlarge
      Networking:
        AdditionalSecurityGroups: 
          - sg-09ca531e5331195f1
        SubnetIds: 
          - subnet-0a937bc9f0c04ad8b
      Ssh:
        KeyName: OBSCURED

DevSettings:
  Timeouts:
    HeadNodeBootstrapTimeout: 7200  # timeout in seconds
    ComputeNodeBootstrapTimeout: 7200  # timeout in seconds

SharedStorage:
  - MountDir: /home
    Name: home
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: SCRATCH_2
  - MountDir: /opt/rstudio
    Name: rstudio
    StorageType: Efs
  - MountDir: /opt/apps
    Name: appstack
    StorageType: Efs
    EfsSettings:
      FileSystemId: fs-OBSCURED

DirectoryService:
  DomainName: OBSCURED
  DomainAddr: ldap://OBSCURED
  PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
  DomainReadOnlyUser: cn=Administrator,cn=Users,dc=pwb,dc=posit,dc=co
  GenerateSshKeysForUsers: true
  AdditionalSssdConfigs: 
    override_homedir : /home/%u
    ldap_id_use_start_tls : false
    ldap_tls_reqcert : never
    ldap_auth_disable_tls_never_use_in_production : true

Tags:
  - Key: rs:environment
    Value: development
  - Key: rs:owner
    Value: OBSCURED 
  - Key: rs:project
    Value: solutions
  - Key: rs:subsystem
    Value: ukhsa

After Update:

HeadNode: 
  CustomActions: 
    OnNodeConfigured: 
      Script: "s3://OBSCURED/install-pwb-config.sh"
  Iam: 
    S3Access: 
      - BucketName: OBSCURED
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - Policy: arn:aws:iam::637485797898:policy/elbaccess-c5f9897
  InstanceType: t3.xlarge
  Networking: 
    SubnetId: subnet-0a937bc9f0c04ad8b
    AdditionalSecurityGroups: 
      - sg-09e341de4a0f2773e
  LocalStorage:
    RootVolume:
      Size: 120 
  SharedStorageType: Efs
  Ssh:
    KeyName: OBSCURED
Image: 
  Os: ubuntu2004
  CustomAmi: ami-092b5633346d89b54
Region: eu-west-1
Scheduling: 
  Scheduler: slurm
  SlurmSettings:
    EnableMemoryBasedScheduling: true
    Database:
      Uri: slurm-OBSCURED.eu-west-1.rds.amazonaws.com:3306
      UserName: slurm_db_admin
      PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
      DatabaseName: slurm    
  SlurmQueues: 

    - Name: interactive
      ComputeResources:
        - Name: rstudio 
          InstanceType: t3.xlarge
          MaxCount: 20 
          MinCount: 1
          Efa:
            Enabled: FALSE
      CustomSlurmSettings:
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds:
          - subnet-0a937bc9f0c04ad8b

    - Name: all 
      ComputeResources:
        - Name: rstudio 
          InstanceType: t3.xlarge
          MaxCount: 10
          MinCount: 0 
          Efa:
            Enabled: FALSE
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds:
          - subnet-0a937bc9f0c04ad8b

    - Name: gpu 
      ComputeResources:
        - Name: large
          InstanceType: p3.2xlarge
          MaxCount: 1
          MinCount: 0
          Efa:
            Enabled: FALSE
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds: 
          - subnet-0a937bc9f0c04ad8b

LoginNodes:
  Pools:
    - Name: login
      Count: 2 
      InstanceType: t3.xlarge
      Networking:
        AdditionalSecurityGroups: 
          - sg-09ca531e5331195f1
        SubnetIds: 
          - subnet-0a937bc9f0c04ad8b
      Ssh:
        KeyName: OBSCURED

DevSettings:
  Timeouts:
    HeadNodeBootstrapTimeout: 7200  # timeout in seconds
    ComputeNodeBootstrapTimeout: 7200  # timeout in seconds

SharedStorage:
  - MountDir: /home
    Name: home
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: SCRATCH_2
  - MountDir: /opt/rstudio
    Name: rstudio
    StorageType: Efs
  - MountDir: /opt/apps
    Name: appstack
    StorageType: Efs
    EfsSettings:
      FileSystemId: fs-OBSCURED

DirectoryService:
  DomainName: OBSCURED
  DomainAddr: ldap://OBSCURED
  PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
  DomainReadOnlyUser: cn=Administrator,cn=Users,dc=pwb,dc=posit,dc=co
  GenerateSshKeysForUsers: true
  AdditionalSssdConfigs: 
    override_homedir : /home/%u
    ldap_id_use_start_tls : false
    ldap_tls_reqcert : never
    ldap_auth_disable_tls_never_use_in_production : true

Tags:
  - Key: rs:environment
    Value: development
  - Key: rs:owner
    Value: OBSCURED 
  - Key: rs:project
    Value: solutions
  - Key: rs:subsystem
    Value: ukhsa