aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

cluster update fails in 3.10.0, 3.9.3 #6339

Open snemir2 opened 2 months ago

snemir2 commented 2 months ago


Bug description and how to reproduce:

The cluster repeatedly fails to update; from the CloudFormation point of view the stack goes to "rollback complete". (Custom routines do not even appear to get called.)


snemir2 commented 2 months ago

might be related to https://github.com/aws/aws-parallelcluster/issues/6329

[screenshot attached]
snemir2 commented 2 months ago

A bit of an update: I rolled back the very same cluster config to PC 3.9.3 and ran the very same `pcluster update` successfully. The issue is clearly specific to PC 3.10.x.

hehe7318 commented 1 month ago

Hi snemir2,

Can you share your original and updated pcluster configuration yaml file? That could help us to reproduce the error.

might be related to https://github.com/aws/aws-parallelcluster/issues/6329

Seems not related.

Best regards, Xuanqi He

hehe7318 commented 1 month ago

Hi snemir2,

We have been investigating the issue with the failed update of your AWS ParallelCluster. Our initial findings from the cfn-init.log file suggest that a critical point of failure might be related to the portkey.service. The log shows:

              + systemctl restart portkey
              Warning: The unit file, source configuration file, or drop-ins of portkey.service changed on disk. Run 'systemctl daemon-reload' to reload units.
              Job for portkey.service failed because the control process exited with error code.
              See "systemctl status portkey.service" and "journalctl -xeu portkey.service" for details.
              + echo 'skipping portkey restart'
              skipping portkey restart

The failure to restart portkey.service could have impacted subsequent operations, such as accessing the S3 bucket, as seen in the error message below:

Error executing action `run` on resource 'bash[configure_portkey]'
Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of "bash"  ----
STDOUT: 
STDERR: + mkdir -p /etc/portkey
+ aws s3 ls s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/a2ai/branch/release-v4.0/portkey/
---- End output of "bash"  ----
Ran "bash"  returned 1
Failed to execute OnNodeUpdated script 1 s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v4.0/download_and_run_cookbook.sh, return code: 1.
CloudFormation signaled successfully with status FAILURE

This error might stem from:

The sequence of events suggests that the portkey.service restart failure may have been the root cause, leading to issues with accessing the S3 bucket. We recommend running systemctl daemon-reload to reload the unit files and then retrying the update process.
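The suggested recovery can be sketched as a small script; this is a minimal sketch assuming a systemd host, with the unit name taken from the log above (the helper name is mine, not part of ParallelCluster):

```python
# A minimal sketch of the suggested recovery: `systemctl daemon-reload` picks
# up the changed unit file that the warning complains about, after which the
# restart is retried. Assumes a systemd host; degrades gracefully elsewhere.
import shutil
import subprocess

def reload_and_restart(unit: str) -> bool:
    """Run `systemctl daemon-reload`, then restart `unit`; True on success."""
    if shutil.which("systemctl") is None:
        print(f"systemctl not found; would run: daemon-reload && restart {unit}")
        return False
    if subprocess.run(["systemctl", "daemon-reload"]).returncode != 0:
        return False
    return subprocess.run(["systemctl", "restart", unit]).returncode == 0

reload_and_restart("portkey.service")
```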

To assist us in further diagnosing the problem and providing a resolution, we kindly request the following:

These details will help us better understand the configuration and environment. We will continue our investigation and keep you informed of any progress.

Best regards, Xuanqi He

snemir2 commented 1 month ago

Hi @hehe7318 - thank you for looking into the issue. I am almost sure that the error you are referring to above is a consequence, not the source, of the problem (I took portkey out of the solution and had the same failure). As you can see from CloudFormation/CloudTrail, it (for some strange reason) tries to make a RunInstances API call, fails to re-provision the ENI on the head node, and fails. That causes the networking problems on the instance that you are seeing. FYI -- I am also seeing this problem on 3.9.3.

[screenshot attached]

This is the failed API call from CloudTrail:

{
    "eventVersion": "1.09",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROAZQUXECJHPLP3U6GDM:sergey@a2-ai.com",
        "arn": "arn:aws:sts::654225707598:assumed-role/AWSReservedSSO_AWSAdministratorAccess_6cb29b3b61ef9620/sergey@a2-ai.com",
        "accountId": "654225707598",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AROAZQUXECJHPLP3U6GDM",
                "arn": "arn:aws:iam::654225707598:role/aws-reserved/sso.amazonaws.com/us-east-2/AWSReservedSSO_AWSAdministratorAccess_6cb29b3b61ef9620",
                "accountId": "654225707598",
                "userName": "AWSReservedSSO_AWSAdministratorAccess_6cb29b3b61ef9620"
            },
            "attributes": {
                "creationDate": "2024-08-05T11:03:25Z",
                "mfaAuthenticated": "false"
            }
        },
        "invokedBy": "cloudformation.amazonaws.com"
    },
    "eventTime": "2024-08-05T11:04:28Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "RunInstances",
    "awsRegion": "us-east-2",
    "sourceIPAddress": "cloudformation.amazonaws.com",
    "userAgent": "cloudformation.amazonaws.com",
    "errorCode": "Client.InvalidNetworkInterface.InUse",
    "errorMessage": "Interface: [eni-07c94d1a9cd037f2f] in use.",
    "requestParameters": {
        "instancesSet": {
            "items": [
                {
                    "minCount": 1,
                    "maxCount": 1
                }
            ]
        },
        "blockDeviceMapping": {},
        "monitoring": {
            "enabled": false
        },
        "disableApiTermination": false,
        "disableApiStop": false,
        "clientToken": "14c7a9db-2c7d-217d-48d5-262ade2651c6",
        "ebsOptimized": false,
        "tagSpecificationSet": {
            "items": [
                {
                    "resourceType": "instance",
                    "tags": [
                        {
                            "key": "parallelcluster:version",
                            "value": "3.9.3"
                        },
                        {
                            "key": "aws:cloudformation:stack-name",
                            "value": "A2AiClustertesting"
                        },
                        {
                            "key": "aws:cloudformation:stack-id",
                            "value": "arn:aws:cloudformation:us-east-2:654225707598:stack/A2AiClustertesting/4a6dc1c0-50e3-11ef-9498-021a39b766eb"
                        },
                        {
                            "key": "A2AI:creator",
                            "value": "sergey"
                        },
                        {
                            "key": "parallelcluster:networking",
                            "value": "EFA=NONE"
                        },
                        {
                            "key": "parallelcluster:filesystem",
                            "value": "efs=1, multiebs=1, raid=0, fsx=0"
                        },
                        {
                            "key": "Name",
                            "value": "HeadNode"
                        },
                        {
                            "key": "map-migrated",
                            "value": "mig8KP4B19EMB"
                        },
                        {
                            "key": "A2AI:a2ai-cloud-version",
                            "value": "branch/release-v4.0"
                        },
                        {
                            "key": "parallelcluster:cluster-name",
                            "value": "A2AiClustertesting"
                        },
                        {
                            "key": "aws:cloudformation:logical-id",
                            "value": "HeadNode"
                        },
                        {
                            "key": "parallelcluster:node-type",
                            "value": "HeadNode"
                        },
                        {
                            "key": "parallelcluster:attributes",
                            "value": "ubuntu2204, slurm, 3.9.3, x86_64"
                        },
                        {
                            "key": "A2AI:a2ai-cloud-env",
                            "value": "dev"
                        }
                    ]
                }
            ]
        },
        "launchTemplate": {
            "launchTemplateId": "lt-093388f1a2c21ab1a",
            "version": "2"
        }
    },
    "responseElements": null,
    "requestID": "9ea4e084-bdb9-4d8c-a268-f388506ee1ea",
    "eventID": "3a98e509-4f97-4b03-a55f-704a1303439d",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "654225707598",
    "eventCategory": "Management"
}
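For anyone triaging the same failure, the relevant calls can be picked out of a CloudTrail export programmatically. A minimal sketch over already-parsed event dicts shaped like the one above (the helper name is mine, not an AWS API):

```python
# Hypothetical helper: filter CloudTrail events (parsed into dicts, shaped like
# the event above) down to RunInstances calls rejected because an ENI was in use.

def find_eni_in_use_failures(events):
    """Yield (eventTime, errorMessage) for each matching failed call."""
    for event in events:
        if (event.get("eventName") == "RunInstances"
                and event.get("errorCode") == "Client.InvalidNetworkInterface.InUse"):
            yield event["eventTime"], event["errorMessage"]

sample = {
    "eventName": "RunInstances",
    "eventTime": "2024-08-05T11:04:28Z",
    "errorCode": "Client.InvalidNetworkInterface.InUse",
    "errorMessage": "Interface: [eni-07c94d1a9cd037f2f] in use.",
}
print(list(find_eni_in_use_failures([sample])))
# [('2024-08-05T11:04:28Z', 'Interface: [eni-07c94d1a9cd037f2f] in use.')]
```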
francisreyes-tfs commented 1 month ago

Wow, I hit this issue as well. Simply adding a new InstanceType to a Slurm cluster queue triggers the creation of a new HeadNode in CloudFormation (ParallelCluster 3.10.0), and the error stems from CloudFormation wanting to create a new head node while the ENI is still attached to the old one. Why this triggers a new HeadNode resource I don't know, but from my own work with custom resources: when a CloudFormation custom resource returns a new physical ID on an update, that tells CloudFormation the resource needs to be replaced.
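The replacement semantics described above can be sketched in a few lines. This is a generic illustration of how CloudFormation interprets a custom resource's response, not ParallelCluster's actual code; the handler and resource names are hypothetical:

```python
# Sketch of CloudFormation custom-resource Update semantics: if the handler
# returns a *new* PhysicalResourceId, CloudFormation treats the resource as
# replaced (and later deletes the old one). Echoing the incoming ID back means
# "same resource, updated in place" -- no replacement is scheduled.

def handle_update(event: dict) -> dict:
    """Minimal Update handler that preserves the physical resource ID."""
    return {
        "Status": "SUCCESS",
        "PhysicalResourceId": event["PhysicalResourceId"],  # stable: no replace
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }

event = {
    "RequestType": "Update",
    "PhysicalResourceId": "head-node-config-v1",
    "StackId": "arn:aws:cloudformation:us-east-2:111111111111:stack/example",
    "RequestId": "req-1",
    "LogicalResourceId": "HeadNodeConfig",
}
print(handle_update(event)["PhysicalResourceId"])  # head-node-config-v1
```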

hanwen-pcluste commented 1 month ago

Apologies for the late reply.

Can you share your original and updated cluster configuration YAML file? That could help us to reproduce the error. I tried to add a new InstanceType and successfully updated my cluster.


samcofer commented 3 days ago

Hello! I'm seeing this issue as well when trying to update our custom AMI. Have there been any updates here? I'm happy to share our before and after configuration if that would be helpful.

This is on cluster version 3.9.3

HeadNode: 
  CustomActions: 
    OnNodeConfigured: 
      Script: "s3://OBSCURED/install-pwb-config.sh"
  Iam: 
    S3Access: 
      - BucketName: OBSCURED
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - Policy: arn:aws:iam::637485797898:policy/elbaccess-c5f9897
  InstanceType: t3.xlarge
  Networking: 
    SubnetId: subnet-0a937bc9f0c04ad8b
    AdditionalSecurityGroups: 
      - sg-09e341de4a0f2773e
  LocalStorage:
    RootVolume:
      Size: 120 
  SharedStorageType: Efs
  Ssh:
    KeyName: OBSCURED
Image: 
  Os: ubuntu2004
  CustomAmi: ami-043eba58b4b8131c6
Region: eu-west-1
Scheduling: 
  Scheduler: slurm
  SlurmSettings:
    EnableMemoryBasedScheduling: true
    Database:
      Uri: slurm-OBSCURED.eu-west-1.rds.amazonaws.com:3306
      UserName: slurm_db_admin
      PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
      DatabaseName: slurm    
  SlurmQueues: 

    - Name: interactive
      ComputeResources:
        - Name: rstudio 
          InstanceType: t3.xlarge
          MaxCount: 20 
          MinCount: 1
          Efa:
            Enabled: FALSE
      CustomSlurmSettings:
        OverSubscribe: FORCE:2
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds:
          - subnet-0a937bc9f0c04ad8b

    - Name: all 
      ComputeResources:
        - Name: rstudio 
          InstanceType: t3.xlarge
          MaxCount: 10
          MinCount: 0 
          Efa:
            Enabled: FALSE
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds:
          - subnet-0a937bc9f0c04ad8b

    - Name: gpu 
      ComputeResources:
        - Name: large
          InstanceType: p3.2xlarge
          MaxCount: 1
          MinCount: 0
          Efa:
            Enabled: FALSE
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds: 
          - subnet-0a937bc9f0c04ad8b

LoginNodes:
  Pools:
    - Name: login
      Count: 2 
      InstanceType: t3.xlarge
      Networking:
        AdditionalSecurityGroups: 
          - sg-09ca531e5331195f1
        SubnetIds: 
          - subnet-0a937bc9f0c04ad8b
      Ssh:
        KeyName: OBSCURED

DevSettings:
  Timeouts:
    HeadNodeBootstrapTimeout: 7200  # timeout in seconds
    ComputeNodeBootstrapTimeout: 7200  # timeout in seconds

SharedStorage:
  - MountDir: /home
    Name: home
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: SCRATCH_2
  - MountDir: /opt/rstudio
    Name: rstudio
    StorageType: Efs
  - MountDir: /opt/apps
    Name: appstack
    StorageType: Efs
    EfsSettings:
      FileSystemId: fs-OBSCURED

DirectoryService:
  DomainName: OBSCURED
  DomainAddr: ldap://OBSCURED
  PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
  DomainReadOnlyUser: cn=Administrator,cn=Users,dc=pwb,dc=posit,dc=co
  GenerateSshKeysForUsers: true
  AdditionalSssdConfigs: 
    override_homedir : /home/%u
    ldap_id_use_start_tls : false
    ldap_tls_reqcert : never
    ldap_auth_disable_tls_never_use_in_production : true

Tags:
  - Key: rs:environment
    Value: development
  - Key: rs:owner
    Value: OBSCURED 
  - Key: rs:project
    Value: solutions
  - Key: rs:subsystem
    Value: ukhsa

After Update:

HeadNode: 
  CustomActions: 
    OnNodeConfigured: 
      Script: "s3://OBSCURED/install-pwb-config.sh"
  Iam: 
    S3Access: 
      - BucketName: OBSCURED
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - Policy: arn:aws:iam::637485797898:policy/elbaccess-c5f9897
  InstanceType: t3.xlarge
  Networking: 
    SubnetId: subnet-0a937bc9f0c04ad8b
    AdditionalSecurityGroups: 
      - sg-09e341de4a0f2773e
  LocalStorage:
    RootVolume:
      Size: 120 
  SharedStorageType: Efs
  Ssh:
    KeyName: OBSCURED
Image: 
  Os: ubuntu2004
  CustomAmi: ami-092b5633346d89b54
Region: eu-west-1
Scheduling: 
  Scheduler: slurm
  SlurmSettings:
    EnableMemoryBasedScheduling: true
    Database:
      Uri: slurm-OBSCURED.eu-west-1.rds.amazonaws.com:3306
      UserName: slurm_db_admin
      PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
      DatabaseName: slurm    
  SlurmQueues: 

    - Name: interactive
      ComputeResources:
        - Name: rstudio 
          InstanceType: t3.xlarge
          MaxCount: 20 
          MinCount: 1
          Efa:
            Enabled: FALSE
      CustomSlurmSettings:
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds:
          - subnet-0a937bc9f0c04ad8b

    - Name: all 
      ComputeResources:
        - Name: rstudio 
          InstanceType: t3.xlarge
          MaxCount: 10
          MinCount: 0 
          Efa:
            Enabled: FALSE
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds:
          - subnet-0a937bc9f0c04ad8b

    - Name: gpu 
      ComputeResources:
        - Name: large
          InstanceType: p3.2xlarge
          MaxCount: 1
          MinCount: 0
          Efa:
            Enabled: FALSE
      CustomActions:
        OnNodeConfigured:
          Script: "s3://OBSCURED/config-compute.sh"
      Iam:
        S3Access:
          - BucketName: OBSCURED
      Networking:
        PlacementGroup:
          Enabled: FALSE
        SubnetIds: 
          - subnet-0a937bc9f0c04ad8b

LoginNodes:
  Pools:
    - Name: login
      Count: 2 
      InstanceType: t3.xlarge
      Networking:
        AdditionalSecurityGroups: 
          - sg-09ca531e5331195f1
        SubnetIds: 
          - subnet-0a937bc9f0c04ad8b
      Ssh:
        KeyName: OBSCURED

DevSettings:
  Timeouts:
    HeadNodeBootstrapTimeout: 7200  # timeout in seconds
    ComputeNodeBootstrapTimeout: 7200  # timeout in seconds

SharedStorage:
  - MountDir: /home
    Name: home
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: SCRATCH_2
  - MountDir: /opt/rstudio
    Name: rstudio
    StorageType: Efs
  - MountDir: /opt/apps
    Name: appstack
    StorageType: Efs
    EfsSettings:
      FileSystemId: fs-OBSCURED

DirectoryService:
  DomainName: OBSCURED
  DomainAddr: ldap://OBSCURED
  PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
  DomainReadOnlyUser: cn=Administrator,cn=Users,dc=pwb,dc=posit,dc=co
  GenerateSshKeysForUsers: true
  AdditionalSssdConfigs: 
    override_homedir : /home/%u
    ldap_id_use_start_tls : false
    ldap_tls_reqcert : never
    ldap_auth_disable_tls_never_use_in_production : true

Tags:
  - Key: rs:environment
    Value: development
  - Key: rs:owner
    Value: OBSCURED 
  - Key: rs:project
    Value: solutions
  - Key: rs:subsystem
    Value: ukhsa