Open joehellmersNOAA opened 1 year ago
Hello @joehellmersNOAA,
Based on the Slurm configuration you've provided, the Scheduling section seems to have two SlurmQueues sections. The ParallelCluster configuration format requires the Scheduling section to have a single SlurmQueues section, under which a list of queues can be provided. (Scheduling-v3)
Kindly try updating the cluster after removing the second SlurmQueues section and moving oarm6id24xlarge queue to the first one.
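A repeated key such as a second SlurmQueues under Scheduling is easy to miss by eye, and some YAML parsers silently keep only the last occurrence. The sketch below is one rough way to spot it before running an update. It assumes a block-style configuration with space indentation; `find_duplicate_keys` is a hypothetical helper, not part of ParallelCluster, and it is a line-based scan rather than a real YAML parser.

```python
import re

# Matches "key:" lines, optionally introduced by a list dash ("- key:").
KEY_RE = re.compile(r"^(?P<indent> *)(?P<dash>- +)?(?P<key>[A-Za-z_][\w.-]*):(?: |$)")


def find_duplicate_keys(yaml_text):
    """Return (key, line_number) for keys repeated inside the same block mapping."""
    duplicates = []
    stack = []  # entries are (column, keys seen so far in that mapping)
    for line_no, line in enumerate(yaml_text.splitlines(), start=1):
        match = KEY_RE.match(line)
        if match is None:
            continue
        column = len(match["indent"])
        if match["dash"]:
            # A list item starts a fresh mapping at the column after the dash.
            column += len(match["dash"])
            while stack and stack[-1][0] >= column:
                stack.pop()
        else:
            # Leaving nested mappings: drop everything indented deeper.
            while stack and stack[-1][0] > column:
                stack.pop()
        if not stack or stack[-1][0] != column:
            stack.append((column, set()))
        seen = stack[-1][1]
        if match["key"] in seen:
            duplicates.append((match["key"], line_no))
        seen.add(match["key"])
    return duplicates
```

Calling `find_duplicate_keys(open("cluster-config.yaml").read())` (the file name is only an example) would report the second SlurmQueues line if the section appears twice under Scheduling, while repeated Name keys in separate list items are not flagged.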
This should have been caught by validation. Could you confirm that you did not suppress validation when running the update? If you did not, we may need to add a validation check for this case.
Thanks for the catch.
No, I did not suppress validations.
I'll try it again shortly and let you know.
BTW, why does the new queue need to be the first one in the list?
I made the following Scheduling section, and got the same error:
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Dns:
      DisableManagedDns: true
      UseEc2Hostnames: true
  SlurmQueues:
    - Name: oarm6id24xlarge
      ComputeSettings:
        LocalStorage:
          RootVolume:
            Size: 500
      ComputeResources:
        - Name: m6id24xlarge
          InstanceType: m6id.24xlarge
          MinCount: 0
          MaxCount: 2
      Networking:
        SubnetIds:
          - subnet-0aa9d3bd709f86d50
        SecurityGroups:
          - sg-045290f659c3be158
      Iam:
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    - Name: oar
      ComputeResources:
        - Name: m5xlarge
          InstanceType: m5.xlarge
          MinCount: 2
          MaxCount: 10
      Networking:
        SubnetIds:
          - subnet-0aa9d3bd709f86d50
        SecurityGroups:
          - sg-045290f659c3be158
      Iam:
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
Hello @joehellmersNOAA,
I'm not yet able to reproduce the issue with the above configuration.
The Network Interface referenced in the error seems to be attached to the HeadNode. Normally, HeadNode attributes (such as the network interface) should not be updated when making changes to the compute nodes or queues.
Could you kindly share the ChangeSet printed in the terminal when you attempt running the cluster update? Also, can you confirm that no manual changes were made to the HeadNode Launch Template (AWSConsole>EC2>Launch templates>HeadNodeLaunchTemplate_*)? Thanks.
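One way to see the ChangeSet without applying anything is a dry run. The sketch below builds such a command line; it assumes the ParallelCluster v3 CLI, where `pcluster update-cluster` accepts `--dryrun true`, and the helper names and the config file name in the comment are illustrative only. Whether the dry-run response lists every change for a given version is worth confirming against the CLI reference.

```python
import subprocess


def build_dryrun_command(cluster_name, config_path, region=None):
    """Build the argv for a validation-only `pcluster update-cluster` call."""
    command = [
        "pcluster", "update-cluster",
        "--cluster-name", cluster_name,
        "--cluster-configuration", config_path,
        "--dryrun", "true",  # validate the request only; no resources are changed
    ]
    if region:
        command += ["--region", region]
    return command


def run_dryrun(cluster_name, config_path, region=None):
    """Run the dry run and return whatever the CLI printed (stdout, else stderr)."""
    result = subprocess.run(
        build_dryrun_command(cluster_name, config_path, region),
        capture_output=True, text=True, check=False,
    )
    return result.stdout or result.stderr


# Example (assumed file name):
# print(run_dryrun("oar-pcluster", "cluster-config.yaml", "us-east-1"))
```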
Regarding "BTW, why does the new queue need to be the first one in the list?": the new queue does not need to be the first one in the list.
I believe you added it as the first one following my comment, "Kindly try updating the cluster after removing the second SlurmQueues section and moving oarm6id24xlarge queue to the first one." There, I was referring to the first SlurmQueues section, not to the first position in the list of queues.
It would also be helpful to share the CLI logs located at ~/.parallelcluster/pcluster-cli.log* from when the pcluster update failure occurs. This could help us reproduce the issue or identify the point of failure. Thanks.
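Because the log path ends in a wildcard, there may be several rotated files. A small sketch for listing them newest first, assuming the default ~/.parallelcluster location mentioned above; `find_cli_logs` is a hypothetical helper name.

```python
from pathlib import Path


def find_cli_logs(log_dir=None):
    """Return files matching pcluster-cli.log*, most recently modified first."""
    directory = Path(log_dir) if log_dir is not None else Path.home() / ".parallelcluster"
    return sorted(
        directory.glob("pcluster-cli.log*"),
        key=lambda path: path.stat().st_mtime,
        reverse=True,
    )
```

The first entry returned is the file most likely to contain the failed update.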
Attached: pcluster-cli.log
Yes, I can confirm that the LaunchTemplate was not manually configured.
The ChangeSet is:
{
  "cluster": {
    "clusterName": "oar-pcluster",
    "cloudformationStackStatus": "UPDATE_IN_PROGRESS",
    "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:716453263077:stack/oar-pcluster/d7077910-f8ab-11ec-aa98-0e38e0433449",
    "region": "us-east-1",
    "version": "3.1.4",
    "clusterStatus": "UPDATE_IN_PROGRESS"
  },
  "validationMessages": [
    {
      "level": "WARNING",
      "type": "DomainAddrValidator",
      "message": "The use of the ldaps protocol is strongly encouraged for security reasons."
    }
  ],
  "changeSet": [
    {
      "parameter": "HeadNode.LocalStorage",
      "requestedValue": {
        "RootVolume": {
          "Size": 600,
          "VolumeType": "gp3"
        }
      },
      "currentValue": "-"
    },
    {
      "parameter": "Scheduling.SlurmQueues",
      "requestedValue": {
        "Name": "oarm6id24xlarge",
        "ComputeSettings": {
          "LocalStorage": {
            "RootVolume": {
              "Size": 500
            }
          }
        },
        "ComputeResources": [
          {
            "Name": "m6id24xlarge",
            "InstanceType": "m6id.24xlarge",
            "MinCount": 0,
            "MaxCount": 2
          }
        ],
        "Networking": {
          "SubnetIds": [
            "subnet-0aa9d3bd709f86d50"
          ],
          "SecurityGroups": [
            "sg-045290f659c3be158"
          ]
        },
        "Iam": {
          "AdditionalIamPolicies": [
            {
              "Policy": "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
            }
          ]
        }
      }
    }
  ]
}
Thanks for sending the logs.
Based on the ChangeSet, the LocalStorage properties of the HeadNode were updated:
"parameter": "HeadNode.LocalStorage",
"requestedValue": {
  "RootVolume": {
    "Size": 600,
    "VolumeType": "gp3"
  }
},
"currentValue": "-"
At this moment, ParallelCluster does not support updates to the HeadNode of an existing cluster.
In this case, CloudFormation tried to replace the existing HeadNode instance, but the instance's NetworkInterface (eni-0966644c5bb6b0347) was still in use, hence the error:
Interface: [eni-0966644c5bb6b0347] in use. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidNetworkInterface.InUse; Request ID: 557b2408-e32b-4d37-adfd-a0332a8c0ce9; Proxy: null)
Was the HeadNode's LocalStorage property defined in the earlier versions of the cluster configuration?
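The check described above, looking for HeadNode entries in the ChangeSet, can be automated. This is a minimal sketch that assumes the update response has the JSON shape shown earlier in this thread (a top-level changeSet list whose entries carry a dotted parameter path); `head_node_changes` is a hypothetical helper name.

```python
def head_node_changes(update_response):
    """Return the changeSet entries whose parameter path starts at HeadNode."""
    changes = update_response.get("changeSet", [])
    return [
        change for change in changes
        if change.get("parameter", "").split(".")[0] == "HeadNode"
    ]
```

If the returned list is non-empty, the update touches the head node and, per the explanation above, is likely to fail on a running cluster.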
Yes, I had manually updated the storage on the HeadNode. I modified the configuration file to not include that specification, but it is still getting the same message. Attached are the changeset and log file for the latest attempt: changeset2.log, pcluster-cli2.log
I'm afraid this is happening because, when CloudFormation performs an update rollback, it does not remove the newer Launch Template versions; it only resets the default version. Because of this, every new update is detected as a change to the head node Launch Template. Can you try removing all versions except version 1 of the head node Launch Template (lt-0e1bdf08d1d3e53a4), either through the console or the AWS CLI?
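For anyone who prefers scripting this cleanup, here is a rough sketch using the EC2 API calls `describe_launch_template_versions` and `delete_launch_template_versions` through a boto3-style client passed in by the caller. It assumes version 1 is the one to keep, skips whichever version is currently the default (EC2 does not allow deleting it), and defaults to a dry run that only reports what would be removed. The function names are illustrative.

```python
def list_versions(ec2_client, launch_template_id):
    """Collect every version of the launch template, following NextToken pages."""
    versions = []
    kwargs = {"LaunchTemplateId": launch_template_id}
    while True:
        page = ec2_client.describe_launch_template_versions(**kwargs)
        versions.extend(page.get("LaunchTemplateVersions", []))
        token = page.get("NextToken")
        if not token:
            return versions
        kwargs["NextToken"] = token


def versions_to_delete(versions, keep=1):
    """Version numbers (as strings, ascending) other than `keep` and the default."""
    numbers = sorted(
        version["VersionNumber"] for version in versions
        if version["VersionNumber"] != keep and not version.get("DefaultVersion", False)
    )
    return [str(number) for number in numbers]


def prune_launch_template(ec2_client, launch_template_id, keep=1, dry_run=True):
    """Delete the extra versions unless dry_run is set; return the affected list."""
    doomed = versions_to_delete(list_versions(ec2_client, launch_template_id), keep)
    if doomed and not dry_run:
        ec2_client.delete_launch_template_versions(
            LaunchTemplateId=launch_template_id, Versions=doomed
        )
    return doomed


# Example (requires boto3 and credentials; not run here):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# print(prune_launch_template(ec2, "lt-0e1bdf08d1d3e53a4"))
```

Reviewing the dry-run output first, then calling again with `dry_run=False`, keeps the deletion a deliberate step.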
That did the trick. Is this considered a bug?
Hi @joehellmersNOAA, thanks for reporting the issue. Yes, an update that removes the root volume setting should be blocked by our validation logic. We are tracking the fix internally, and I will keep this issue open to track it.
Thank you!
Required Info:
Image:
  Os: alinux2
HeadNode:
  InstanceType: m5.xlarge
  Networking:
    SubnetId: subnet-0aa9d3bd709f86d50
    SecurityGroups:
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Dns:
      DisableManagedDns: true
      UseEc2Hostnames: true
  SlurmQueues:
SharedStorage:
AdditionalPackages:
  IntelSoftware:
    IntelHpcPlatform: true
DirectoryService:
  DomainName: dc=ncisdev,dc=noaa
  DomainAddr: ldap://10.101.14.78,ldap://10.101.9.221
  PasswordSecretArn: arn:aws:secretsmanager:us-east-1:716453263077:secret:MicrosoftAD.Admin.Password-gDGZv6
  DomainReadOnlyUser: cn=adjoin,ou=service,ou=NCISDEV,dc=ncisdev,dc=noaa
  AdditionalSssdConfigs:
    ldap_auth_disable_tls_never_use_in_production: True
Tags:
{
  "creationTime": "2022-06-30T19:35:47.455Z",
  "headNode": {
    "launchTime": "2022-06-30T19:38:44.000Z",
    "instanceId": "i-00c4711259d394ae3",
    "instanceType": "m5.xlarge",
    "state": "running",
    "privateIpAddress": "10.102.8.85"
  },
  "version": "3.1.4",
  "clusterConfiguration": {
    "url": "https://parallelcluster-e2ca1557272da85b-v1-do-not-delete.s3.amazonaws.com/parallelcluster/3.1.4/clusters/oar-pcluster-wtisscfpi6pepclv/configs/cluster-config.yaml?versionId=ZkTmb6RC_GHCxLLymscOGWLSeh0womE4&AWSAccessKeyId=AKIA2NT7RG3SULLIW55M&Signature=EPFIQnc8YB%2BsP3CunBz3WBPvLfY%3D&Expires=1661212485"
  },
  "tags": [
    { "value": "13051420fneea0147", "key": "noaa:taskerorderid" },
    { "value": "3.1.4", "key": "parallelcluster:version" },
    { "value": "noaa5006", "key": "noaa:fismaid" },
    { "value": "dev", "key": "noaa:environment" },
    { "value": "OAR WRF-Chem", "key": "noaa:application" },
    { "value": "nesdis", "key": "noaa:lineoffice" },
    { "value": "40-00", "key": "noaa:programoffice" }
  ],
  "cloudFormationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
  "clusterName": "oar-pcluster",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:716453263077:stack/oar-pcluster/d7077910-f8ab-11ec-aa98-0e38e0433449",
  "lastUpdatedTime": "2022-08-22T22:34:59.022Z",
  "region": "us-east-1",
  "clusterStatus": "UPDATE_FAILED"
}