aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

Cluster Update Failure When Adding a New Slurm Queue #4286

Open joehellmersNOAA opened 1 year ago

joehellmersNOAA commented 1 year ago

Required Info:

Image:
  Os: alinux2

HeadNode:
  InstanceType: m5.xlarge
  Networking:
    SubnetId: subnet-0aa9d3bd709f86d50
    SecurityGroups:

Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Dns:
      DisableManagedDns: true
      UseEc2Hostnames: true
  SlurmQueues:

SharedStorage:

AdditionalPackages:
  IntelSoftware:
    IntelHpcPlatform: true

DirectoryService:
  DomainName: dc=ncisdev,dc=noaa
  DomainAddr: ldap://10.101.14.78,ldap://10.101.9.221
  PasswordSecretArn: arn:aws:secretsmanager:us-east-1:716453263077:secret:MicrosoftAD.Admin.Password-gDGZv6
  DomainReadOnlyUser: cn=adjoin,ou=service,ou=NCISDEV,dc=ncisdev,dc=noaa
  AdditionalSssdConfigs:
    ldap_auth_disable_tls_never_use_in_production: True

Tags:

 - Cluster name: oar-pcluster

- Output of `pcluster describe-cluster` command.

{
  "creationTime": "2022-06-30T19:35:47.455Z",
  "headNode": {
    "launchTime": "2022-06-30T19:38:44.000Z",
    "instanceId": "i-00c4711259d394ae3",
    "instanceType": "m5.xlarge",
    "state": "running",
    "privateIpAddress": "10.102.8.85"
  },
  "version": "3.1.4",
  "clusterConfiguration": {
    "url": "https://parallelcluster-e2ca1557272da85b-v1-do-not-delete.s3.amazonaws.com/parallelcluster/3.1.4/clusters/oar-pcluster-wtisscfpi6pepclv/configs/cluster-config.yaml?versionId=ZkTmb6RC_GHCxLLymscOGWLSeh0womE4&AWSAccessKeyId=AKIA2NT7RG3SULLIW55M&Signature=EPFIQnc8YB%2BsP3CunBz3WBPvLfY%3D&Expires=1661212485"
  },
  "tags": [
    {
      "value": "13051420fneea0147",
      "key": "noaa:taskerorderid"
    },
    {
      "value": "3.1.4",
      "key": "parallelcluster:version"
    },
    {
      "value": "noaa5006",
      "key": "noaa:fismaid"
    },
    {
      "value": "dev",
      "key": "noaa:environment"
    },
    {
      "value": "OAR WRF-Chem",
      "key": "noaa:application"
    },
    {
      "value": "nesdis",
      "key": "noaa:lineoffice"
    },
    {
      "value": "40-00",
      "key": "noaa:programoffice"
    }
  ],
  "cloudFormationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
  "clusterName": "oar-pcluster",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:716453263077:stack/oar-pcluster/d7077910-f8ab-11ec-aa98-0e38e0433449",
  "lastUpdatedTime": "2022-08-22T22:34:59.022Z",
  "region": "us-east-1",
  "clusterStatus": "UPDATE_FAILED"
}


 - [Optional] Arn of the cluster CloudFormation main stack:

**Bug description and how to reproduce:**
When I try to update the existing cluster with the configuration above, I get an error.

`Interface: [eni-0966644c5bb6b0347] in use. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidNetworkInterface.InUse; Request ID: 557b2408-e32b-4d37-adfd-a0332a8c0ce9; Proxy: null)`

The only thing that is changing from the original configuration is the addition of the Slurm queue oarm6id24xlarge.

**If you are reporting issues about scaling or job failure:**
We cannot work on issues without proper logs. We **STRONGLY** recommend following [this guide](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html#troubleshooting-v3-get-logs) and attach the complete cluster log archive with the ticket.

For issues with Slurm scheduler, please attach the following logs:
* From Head node: `/var/log/parallelcluster/clustermgtd`, `/var/log/parallelcluster/clusterstatusmgtd` (if version >= 3.2.0), `/var/log/parallelcluster/slurm_resume.log`, `/var/log/parallelcluster/slurm_suspend.log`, `/var/log/parallelcluster/slurm_fleet_status_manager.log` (if version >= 3.2.0) and `/var/log/slurmctld.log`.
* From Compute node: `/var/log/parallelcluster/computemgtd.log` and `/var/log/slurmd.log`.

**If you are reporting issues about cluster creation failure or node failure:**

If cluster creation fails, please re-execute the `create-cluster` action using the `--rollback-on-failure false` option.

We cannot work on issues without proper logs. We **STRONGLY** recommend following [this guide](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html#troubleshooting-v3-get-logs) and attach the complete cluster log archive with the ticket.

Please be sure to attach the following logs:
* From Head node: `/var/log/cloud-init.log`, `/var/log/cfn-init.log` and `/var/log/chef-client.log`
* From Compute node:  `/var/log/cloud-init-output.log`.

**Additional context:**
Any other context about the problem. E.g.:
 - CLI logs: `~/.parallelcluster/pcluster-cli.log`
 - Custom bootstrap scripts, if any
 - Screenshots, if useful.

EddyMM commented 1 year ago

Hello @joehellmersNOAA, based on the cluster configuration you've provided, the Scheduling section seems to have two SlurmQueues sections. The ParallelCluster configuration format requires the Scheduling section to have a single SlurmQueues section, under which a list of queues can be provided (see Scheduling-v3). Kindly try updating the cluster after removing the second SlurmQueues section and moving oarm6id24xlarge queue to the first one.

This should have been caught by validation. Could you confirm that you did not suppress validation when running the update? If you did not, then additional validation checks may be needed on our side.
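
As a rough illustration of the kind of check described above, the following Python sketch flags a key that appears twice inside the same mapping, such as a second `SlurmQueues` under `Scheduling`. It is not ParallelCluster's validation logic. The function name `find_duplicate_keys` and the sample configuration are hypothetical (the original two-section file is not shown in this thread), and the scanner assumes plain block-style YAML indented with spaces.

```python
import re

# Matches a simple "Key:" at the start of a de-indented line.
KEY_RE = re.compile(r"^([A-Za-z0-9_.-]+):(?:\s|$)")


def find_duplicate_keys(text):
    """Return (line_number, key) for every key repeated within one mapping.

    Assumption: block-style YAML with space indentation. Flow style,
    anchors, multi-line scalars and quoted keys are not handled.
    """
    duplicates = []
    stack = []  # one (indent, keys_seen) pair per currently open mapping
    for lineno, raw in enumerate(text.splitlines(), start=1):
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        is_item = stripped.startswith("- ")
        if is_item:
            # "- Name: x" opens a fresh mapping whose keys sit two columns in.
            indent += 2
            stripped = stripped[2:].lstrip(" ")
        # Close mappings deeper than this line; a new list item also closes
        # the previous item's mapping at the same depth.
        while stack and (
            stack[-1][0] > indent or (is_item and stack[-1][0] == indent)
        ):
            stack.pop()
        if not stack or stack[-1][0] < indent:
            stack.append((indent, set()))
        match = KEY_RE.match(stripped)
        if match:
            key = match.group(1)
            seen = stack[-1][1]
            if key in seen:
                duplicates.append((lineno, key))
            seen.add(key)
    return duplicates


# Hypothetical reduction of a config with two SlurmQueues sections.
CONFIG = """\
Scheduling:
  Scheduler: slurm
  SlurmQueues:
  - Name: oar
    ComputeResources:
    - Name: m5xlarge
      InstanceType: m5.xlarge
  SlurmQueues:
  - Name: oarm6id24xlarge
    ComputeResources:
    - Name: m6id24xlarge
      InstanceType: m6id.24xlarge
"""

for lineno, key in find_duplicate_keys(CONFIG):
    print(f"line {lineno}: duplicate key {key!r}")
```

On the sample above this reports the second `SlurmQueues` key on line 8. Repeated `Name` keys in different list items are not flagged, because each list item starts a new mapping.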

joehellmersNOAA commented 1 year ago

Thanks for the catch.

No, I did not suppress validations.

I'll try it again shortly and let you know.

BTW, why does the new queue need to be the first one in the list?

joehellmersNOAA commented 1 year ago

I made the following Scheduling section, and got the same error:

Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Dns:
      DisableManagedDns: true
      UseEc2Hostnames: true
  SlurmQueues:
  - Name: oarm6id24xlarge
    ComputeSettings:
      LocalStorage:
        RootVolume:
          Size: 500
    ComputeResources:
    - Name: m6id24xlarge
      InstanceType: m6id.24xlarge
      MinCount: 0
      MaxCount: 2
    Networking:
      SubnetIds:
      - subnet-0aa9d3bd709f86d50
      SecurityGroups:
      - sg-045290f659c3be158
    Iam:
      AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  - Name: oar
    ComputeResources:
    - Name: m5xlarge
      InstanceType: m5.xlarge
      MinCount: 2
      MaxCount: 10
    Networking:
      SubnetIds:
      - subnet-0aa9d3bd709f86d50
      SecurityGroups:
      - sg-045290f659c3be158
    Iam:
      AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

EddyMM commented 1 year ago

Hello @joehellmersNOAA,

I'm not yet able to reproduce the issue with the above configuration.

The network interface referenced in the error seems to be attached to the HeadNode. Normally, HeadNode attributes (such as the network interface) should not change when you update the compute nodes or queues.

Could you kindly share the ChangeSet printed in the terminal when you attempt running the cluster update? Also, can you confirm that no manual changes were made to the HeadNode Launch Template (AWSConsole>EC2>Launch templates>HeadNodeLaunchTemplate_*)? Thanks.

> BTW, why does the new queue need to be the first one in the list?

The new queue does not need to be the first one in the list. I believe you added it as the first one following my comment "Kindly try updating the cluster after removing the second SlurmQueues section and moving oarm6id24xlarge queue to the first one."? By "the first one" I was referring to the first SlurmQueues section, not the first position in the list of queues.

EddyMM commented 1 year ago

It would also be helpful to share the CLI logs located at ~/.parallelcluster/pcluster-cli.log* when the pcluster update failure occurs. This could help us reproduce the issue or identify the point of failure. Thanks.

joehellmersNOAA commented 1 year ago

pcluster-cli.log

Yes, I can confirm that the LaunchTemplate was not manually configured.

The ChangeSet is:

{
  "cluster": {
    "clusterName": "oar-pcluster",
    "cloudformationStackStatus": "UPDATE_IN_PROGRESS",
    "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:716453263077:stack/oar-pcluster/d7077910-f8ab-11ec-aa98-0e38e0433449",
    "region": "us-east-1",
    "version": "3.1.4",
    "clusterStatus": "UPDATE_IN_PROGRESS"
  },
  "validationMessages": [
    {
      "level": "WARNING",
      "type": "DomainAddrValidator",
      "message": "The use of the ldaps protocol is strongly encouraged for security reasons."
    }
  ],
  "changeSet": [
    {
      "parameter": "HeadNode.LocalStorage",
      "requestedValue": {
        "RootVolume": {
          "Size": 600,
          "VolumeType": "gp3"
        }
      },
      "currentValue": "-"
    },
    {
      "parameter": "Scheduling.SlurmQueues",
      "requestedValue": {
        "Name": "oarm6id24xlarge",
        "ComputeSettings": {
          "LocalStorage": {
            "RootVolume": {
              "Size": 500
            }
          }
        },
        "ComputeResources": [
          {
            "Name": "m6id24xlarge",
            "InstanceType": "m6id.24xlarge",
            "MinCount": 0,
            "MaxCount": 2
          }
        ],
        "Networking": {
          "SubnetIds": [
            "subnet-0aa9d3bd709f86d50"
          ],
          "SecurityGroups": [
            "sg-045290f659c3be158"
          ]
        },
        "Iam": {
          "AdditionalIamPolicies": [
            {
              "Policy": "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
            }
          ]
        }
      }
    }
  ]
}

EddyMM commented 1 year ago

Thanks for sending the logs.

Based on the ChangeSet, the LocalStorage properties of the HeadNode were updated:

      "parameter": "HeadNode.LocalStorage",
      "requestedValue": {
        "RootVolume": {
          "Size": 600,
          "VolumeType": "gp3"
        }
      },
      "currentValue": "-"

At this moment, ParallelCluster does not support updates to the HeadNode of an existing cluster.

In this case, CloudFormation tried to replace the existing HeadNode instance, but its network interface (eni-0966644c5bb6b0347) was still in use, hence the error:

Interface: [eni-0966644c5bb6b0347] in use. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidNetworkInterface.InUse; Request ID: 557b2408-e32b-4d37-adfd-a0332a8c0ce9; Proxy: null)

Was the HeadNode's LocalStorage property defined in the earlier versions of the cluster configuration?
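
The check described above can be sketched in Python as a quick scan of the update response for unintended HeadNode changes before looking at the CloudFormation events. This is only an illustration: the helper name `head_node_changes` is hypothetical, and the `changeSet` and `parameter` field names are taken from the response posted in this thread.

```python
import json


def head_node_changes(update_response):
    """Return the changeSet entries whose parameter sits under HeadNode."""
    return [
        entry
        for entry in update_response.get("changeSet", [])
        if entry.get("parameter", "").split(".")[0] == "HeadNode"
    ]


# Trimmed copy of the response posted in this thread.
RESPONSE = json.loads("""
{
  "changeSet": [
    {
      "parameter": "HeadNode.LocalStorage",
      "requestedValue": {"RootVolume": {"Size": 600, "VolumeType": "gp3"}},
      "currentValue": "-"
    },
    {
      "parameter": "Scheduling.SlurmQueues",
      "requestedValue": {"Name": "oarm6id24xlarge"}
    }
  ]
}
""")

for entry in head_node_changes(RESPONSE):
    print("HeadNode change requested:", entry["parameter"])
```

For the response above, only `HeadNode.LocalStorage` is reported, which matches the entry that triggered the replacement attempt.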

joehellmersNOAA commented 1 year ago

Yes, I had manually updated the storage on the HeadNode. I modified the configuration file to not include that specification, but I am still getting the same message. Attached are the changeset and log file for the latest attempt.

changeset2.log pcluster-cli2.log

demartinofra commented 1 year ago

I'm afraid this is happening because, when CloudFormation performs an UPDATE rollback, it does not remove the newer Launch Template versions but only resets the default version. Because of this, every new update will be recognized as changing the head node Launch Template. Can you try removing all versions except version 1 of the head node Launch Template (lt-0e1bdf08d1d3e53a4), either through the console or the AWS CLI?
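
A minimal Python sketch of that cleanup follows, assuming boto3 is installed and credentials are configured. The helper names `versions_to_delete` and `prune_launch_template` are hypothetical; the two EC2 calls used are `describe_launch_template_versions` and `delete_launch_template_versions`. EC2 does not allow deleting the default version, so the sketch assumes the version being kept is the current default.

```python
def versions_to_delete(version_numbers, keep=1):
    """Return every version except `keep`, as strings in ascending order."""
    return [str(v) for v in sorted(set(version_numbers)) if v != keep]


def prune_launch_template(launch_template_id, keep=1, apply=False, region_name=None):
    """List (and, if apply=True, delete) all versions other than `keep`."""
    import boto3  # third-party; imported lazily so the helper above has no dependency

    ec2 = boto3.client("ec2", region_name=region_name)
    numbers = []
    kwargs = {"LaunchTemplateId": launch_template_id}
    while True:
        # Results may be paginated, so follow NextToken until it runs out.
        page = ec2.describe_launch_template_versions(**kwargs)
        numbers += [v["VersionNumber"] for v in page["LaunchTemplateVersions"]]
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    doomed = versions_to_delete(numbers, keep)
    if doomed and apply:
        ec2.delete_launch_template_versions(
            LaunchTemplateId=launch_template_id, Versions=doomed
        )
    return doomed


if __name__ == "__main__":
    # Pure helper only; no AWS call is made here.
    print(versions_to_delete([1, 2, 3]))
```

Calling `prune_launch_template("lt-0e1bdf08d1d3e53a4", region_name="us-east-1")` with the default `apply=False` only returns the versions that would be removed, which gives a chance to review them first.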

joehellmersNOAA commented 1 year ago

That did the trick. Is this considered a bug?

chenwany commented 1 year ago

Hi @joehellmersNOAA, thanks for reporting the issue. Yes, an update that removes the root volume settings should be blocked by our validation logic. We are tracking this issue and its fix internally. I will keep the issue open to track the fix.

Thank you!