snemir2 opened this issue 2 months ago
might be related to https://github.com/aws/aws-parallelcluster/issues/6329
A bit of an update: I rolled back the very same cluster config to ParallelCluster 3.9.3 and ran the very same `pcluster update` successfully. The issue is clearly specific to PC 3.10.x.
Hi snemir2,
Can you share your original and updated pcluster configuration yaml file? That could help us to reproduce the error.
might be related to https://github.com/aws/aws-parallelcluster/issues/6329
Seems not related.
Best regards, Xuanqi He
Hi snemir2,
We have been investigating the issue with the failed update of your AWS ParallelCluster. Our initial findings from the cfn-init.log file suggest that a critical point of failure might be related to portkey.service. The log shows:
+ systemctl restart portkey
Warning: The unit file, source configuration file, or drop-ins of portkey.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Job for portkey.service failed because the control process exited with error code.
See "systemctl status portkey.service" and "journalctl -xeu portkey.service" for details.
+ echo 'skipping portkey restart'
skipping portkey restart
The failure to restart portkey.service could have impacted subsequent operations, such as accessing the S3 bucket, as seen in the error message below:
Error executing action `run` on resource 'bash[configure_portkey]'
Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of "bash" ----
STDOUT:
STDERR: + mkdir -p /etc/portkey
+ aws s3 ls s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/a2ai/branch/release-v4.0/portkey/
---- End output of "bash" ----
Ran "bash" returned 1
Failed to execute OnNodeUpdated script 1 s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v4.0/download_and_run_cookbook.sh, return code: 1.
CloudFormation signaled successfully with status FAILURE
The sequence of events suggests that the portkey.service restart failure may have been the root cause, leading to the subsequent failure to access the S3 bucket. We recommend running systemctl daemon-reload to reload the unit files and then retrying the update process.
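A minimal sketch of that suggested recovery, assuming "portkey" is the name of the customer's own systemd unit (it is not part of ParallelCluster itself) and that you run this on the head node before retrying `pcluster update-cluster`:

```shell
# Reload systemd unit files and retry the failed portkey restart.
# Assumption: portkey.service is a custom unit installed on the head node.
if command -v systemctl >/dev/null 2>&1 \
   && systemctl list-unit-files 2>/dev/null | grep -q '^portkey\.service'; then
    sudo systemctl daemon-reload                  # pick up the changed unit file
    sudo systemctl restart portkey.service        # retry the restart that failed
    systemctl --no-pager status portkey.service   # confirm the unit is now active
else
    echo "portkey.service not found on this host; run this on the head node instead"
fi
checked=yes
```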
To assist us in further diagnosing the problem and providing a resolution, could you share more details about your configuration and environment? We will continue our investigation and keep you informed of any progress.
Best regards, Xuanqi He
Hi @hehe7318 - Thank you for looking into the issue. I am almost certain that the error you are referring to above is a consequence, not the source, of the problem (I took portkey out of the solution entirely and hit the same failure). As you can see from CloudFormation/CloudTrail, it (for some strange reason) makes a RunInstances API call, fails to re-provision the ENI on the head node, and fails. That causes the networking problems on the instance that you are seeing. FYI -- I am also seeing this problem on 3.9.3.
This is the failed api call from cloudtrail.
{
"eventVersion": "1.09",
"userIdentity": {
"type": "AssumedRole",
"principalId": "AROAZQUXECJHPLP3U6GDM:sergey@a2-ai.com",
"arn": "arn:aws:sts::654225707598:assumed-role/AWSReservedSSO_AWSAdministratorAccess_6cb29b3b61ef9620/sergey@a2-ai.com",
"accountId": "654225707598",
"sessionContext": {
"sessionIssuer": {
"type": "Role",
"principalId": "AROAZQUXECJHPLP3U6GDM",
"arn": "arn:aws:iam::654225707598:role/aws-reserved/sso.amazonaws.com/us-east-2/AWSReservedSSO_AWSAdministratorAccess_6cb29b3b61ef9620",
"accountId": "654225707598",
"userName": "AWSReservedSSO_AWSAdministratorAccess_6cb29b3b61ef9620"
},
"attributes": {
"creationDate": "2024-08-05T11:03:25Z",
"mfaAuthenticated": "false"
}
},
"invokedBy": "cloudformation.amazonaws.com"
},
"eventTime": "2024-08-05T11:04:28Z",
"eventSource": "ec2.amazonaws.com",
"eventName": "RunInstances",
"awsRegion": "us-east-2",
"sourceIPAddress": "cloudformation.amazonaws.com",
"userAgent": "cloudformation.amazonaws.com",
"errorCode": "Client.InvalidNetworkInterface.InUse",
"errorMessage": "Interface: [eni-07c94d1a9cd037f2f] in use.",
"requestParameters": {
"instancesSet": {
"items": [
{
"minCount": 1,
"maxCount": 1
}
]
},
"blockDeviceMapping": {},
"monitoring": {
"enabled": false
},
"disableApiTermination": false,
"disableApiStop": false,
"clientToken": "14c7a9db-2c7d-217d-48d5-262ade2651c6",
"ebsOptimized": false,
"tagSpecificationSet": {
"items": [
{
"resourceType": "instance",
"tags": [
{
"key": "parallelcluster:version",
"value": "3.9.3"
},
{
"key": "aws:cloudformation:stack-name",
"value": "A2AiClustertesting"
},
{
"key": "aws:cloudformation:stack-id",
"value": "arn:aws:cloudformation:us-east-2:654225707598:stack/A2AiClustertesting/4a6dc1c0-50e3-11ef-9498-021a39b766eb"
},
{
"key": "A2AI:creator",
"value": "sergey"
},
{
"key": "parallelcluster:networking",
"value": "EFA=NONE"
},
{
"key": "parallelcluster:filesystem",
"value": "efs=1, multiebs=1, raid=0, fsx=0"
},
{
"key": "Name",
"value": "HeadNode"
},
{
"key": "map-migrated",
"value": "mig8KP4B19EMB"
},
{
"key": "A2AI:a2ai-cloud-version",
"value": "branch/release-v4.0"
},
{
"key": "parallelcluster:cluster-name",
"value": "A2AiClustertesting"
},
{
"key": "aws:cloudformation:logical-id",
"value": "HeadNode"
},
{
"key": "parallelcluster:node-type",
"value": "HeadNode"
},
{
"key": "parallelcluster:attributes",
"value": "ubuntu2204, slurm, 3.9.3, x86_64"
},
{
"key": "A2AI:a2ai-cloud-env",
"value": "dev"
}
]
}
]
},
"launchTemplate": {
"launchTemplateId": "lt-093388f1a2c21ab1a",
"version": "2"
}
},
"responseElements": null,
"requestID": "9ea4e084-bdb9-4d8c-a268-f388506ee1ea",
"eventID": "3a98e509-4f97-4b03-a55f-704a1303439d",
"readOnly": false,
"eventType": "AwsApiCall",
"managementEvent": true,
"recipientAccountId": "654225707598",
"eventCategory": "Management"
}
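The Client.InvalidNetworkInterface.InUse error above can be cross-checked by asking EC2 which instance still holds the ENI. A hedged sketch, using the ENI id taken from the CloudTrail event (substitute your own id, region, and credentials):

```shell
# Show which instance the head-node ENI is still attached to.
# ENI id and region are taken from the CloudTrail event in this thread.
ENI_ID="eni-07c94d1a9cd037f2f"
if command -v aws >/dev/null 2>&1; then
    aws ec2 describe-network-interfaces \
        --network-interface-ids "$ENI_ID" \
        --region us-east-2 \
        --query 'NetworkInterfaces[0].Attachment.{Instance:InstanceId,Status:Status}' \
        --output table
else
    echo "aws CLI not installed; run this where your AWS credentials are configured"
fi
checked=yes
```

If the attachment status is `attached` to the old head node while CloudFormation is trying to launch a replacement, that matches the RunInstances failure seen above.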
Wow, I hit this issue as well. Simply adding a new InstanceType to a Slurm cluster queue triggers the creation of a new HeadNode in CloudFormation (ParallelCluster 3.10.0), and the error stems from CloudFormation wanting to create a new head node while the ENI is still attached to the old one. Why this triggers a new HeadNode resource, I don't know, but from my own work with custom resources: when a CloudFormation custom resource returns a new physical ID on update, that tells CloudFormation the resource needs to be replaced.
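One way to see whether an update would replace the head node before applying anything is the CLI's dry-run mode. A sketch, assuming the cluster name from this thread and a hypothetical `updated-config.yaml`:

```shell
# Preview the change set for an update without applying it.
# Cluster name is taken from this issue; the config filename is an assumption.
CLUSTER_NAME="A2AiClustertesting"
if command -v pcluster >/dev/null 2>&1; then
    pcluster update-cluster \
        --cluster-name "$CLUSTER_NAME" \
        --cluster-configuration updated-config.yaml \
        --dryrun true
else
    echo "pcluster CLI not installed"
fi
checked=yes
```

The returned change set may reveal whether the HeadNode resource is being modified in place or replaced.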
Apologies for the late reply.
Can you share your original and updated cluster configuration YAML file? That could help us to reproduce the error. I tried to add a new InstanceType and successfully updated my cluster.
Hello! I'm seeing this issue as well when trying to update our custom AMI. Have there been any updates here? I'm happy to share our before and after configuration if that would be helpful.
This is on cluster version 3.9.3
HeadNode:
CustomActions:
OnNodeConfigured:
Script: "s3://OBSCURED/install-pwb-config.sh"
Iam:
S3Access:
- BucketName: OBSCURED
AdditionalIamPolicies:
- Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
- Policy: arn:aws:iam::637485797898:policy/elbaccess-c5f9897
InstanceType: t3.xlarge
Networking:
SubnetId: subnet-0a937bc9f0c04ad8b
AdditionalSecurityGroups:
- sg-09e341de4a0f2773e
LocalStorage:
RootVolume:
Size: 120
SharedStorageType: Efs
Ssh:
KeyName: OBSCURED
Image:
Os: ubuntu2004
CustomAmi: ami-043eba58b4b8131c6
Region: eu-west-1
Scheduling:
Scheduler: slurm
SlurmSettings:
EnableMemoryBasedScheduling: true
Database:
Uri: slurm-OBSCURED.eu-west-1.rds.amazonaws.com:3306
UserName: slurm_db_admin
PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
DatabaseName: slurm
SlurmQueues:
- Name: interactive
ComputeResources:
- Name: rstudio
InstanceType: t3.xlarge
MaxCount: 20
MinCount: 1
Efa:
Enabled: FALSE
CustomSlurmSettings:
OverSubscribe: FORCE:2
CustomActions:
OnNodeConfigured:
Script: "s3://OBSCURED/config-compute.sh"
Iam:
S3Access:
- BucketName: OBSCURED
Networking:
PlacementGroup:
Enabled: FALSE
SubnetIds:
- subnet-0a937bc9f0c04ad8b
- Name: all
ComputeResources:
- Name: rstudio
InstanceType: t3.xlarge
MaxCount: 10
MinCount: 0
Efa:
Enabled: FALSE
CustomActions:
OnNodeConfigured:
Script: "s3://OBSCURED/config-compute.sh"
Iam:
S3Access:
- BucketName: OBSCURED
Networking:
PlacementGroup:
Enabled: FALSE
SubnetIds:
- subnet-0a937bc9f0c04ad8b
- Name: gpu
ComputeResources:
- Name: large
InstanceType: p3.2xlarge
MaxCount: 1
MinCount: 0
Efa:
Enabled: FALSE
CustomActions:
OnNodeConfigured:
Script: "s3://OBSCURED/config-compute.sh"
Iam:
S3Access:
- BucketName: OBSCURED
Networking:
PlacementGroup:
Enabled: FALSE
SubnetIds:
- subnet-0a937bc9f0c04ad8b
LoginNodes:
Pools:
- Name: login
Count: 2
InstanceType: t3.xlarge
Networking:
AdditionalSecurityGroups:
- sg-09ca531e5331195f1
SubnetIds:
- subnet-0a937bc9f0c04ad8b
Ssh:
KeyName: OBSCURED
DevSettings:
Timeouts:
HeadNodeBootstrapTimeout: 7200 # timeout in seconds
ComputeNodeBootstrapTimeout: 7200 # timeout in seconds
SharedStorage:
- MountDir: /home
Name: home
StorageType: FsxLustre
FsxLustreSettings:
StorageCapacity: 1200
DeploymentType: SCRATCH_2
- MountDir: /opt/rstudio
Name: rstudio
StorageType: Efs
- MountDir: /opt/apps
Name: appstack
StorageType: Efs
EfsSettings:
FileSystemId: fs-OBSCURED
DirectoryService:
DomainName: OBSCURED
DomainAddr: ldap://OBSCURED
PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
DomainReadOnlyUser: cn=Administrator,cn=Users,dc=pwb,dc=posit,dc=co
GenerateSshKeysForUsers: true
AdditionalSssdConfigs:
override_homedir : /home/%u
ldap_id_use_start_tls : false
ldap_tls_reqcert : never
ldap_auth_disable_tls_never_use_in_production : true
Tags:
- Key: rs:environment
Value: development
- Key: rs:owner
Value: OBSCURED
- Key: rs:project
Value: solutions
- Key: rs:subsystem
Value: ukhsa
After Update:
HeadNode:
CustomActions:
OnNodeConfigured:
Script: "s3://OBSCURED/install-pwb-config.sh"
Iam:
S3Access:
- BucketName: OBSCURED
AdditionalIamPolicies:
- Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
- Policy: arn:aws:iam::637485797898:policy/elbaccess-c5f9897
InstanceType: t3.xlarge
Networking:
SubnetId: subnet-0a937bc9f0c04ad8b
AdditionalSecurityGroups:
- sg-09e341de4a0f2773e
LocalStorage:
RootVolume:
Size: 120
SharedStorageType: Efs
Ssh:
KeyName: OBSCURED
Image:
Os: ubuntu2004
CustomAmi: ami-092b5633346d89b54
Region: eu-west-1
Scheduling:
Scheduler: slurm
SlurmSettings:
EnableMemoryBasedScheduling: true
Database:
Uri: slurm-OBSCURED.eu-west-1.rds.amazonaws.com:3306
UserName: slurm_db_admin
PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
DatabaseName: slurm
SlurmQueues:
- Name: interactive
ComputeResources:
- Name: rstudio
InstanceType: t3.xlarge
MaxCount: 20
MinCount: 1
Efa:
Enabled: FALSE
CustomSlurmSettings:
CustomActions:
OnNodeConfigured:
Script: "s3://OBSCURED/config-compute.sh"
Iam:
S3Access:
- BucketName: OBSCURED
Networking:
PlacementGroup:
Enabled: FALSE
SubnetIds:
- subnet-0a937bc9f0c04ad8b
- Name: all
ComputeResources:
- Name: rstudio
InstanceType: t3.xlarge
MaxCount: 10
MinCount: 0
Efa:
Enabled: FALSE
CustomActions:
OnNodeConfigured:
Script: "s3://OBSCURED/config-compute.sh"
Iam:
S3Access:
- BucketName: OBSCURED
Networking:
PlacementGroup:
Enabled: FALSE
SubnetIds:
- subnet-0a937bc9f0c04ad8b
- Name: gpu
ComputeResources:
- Name: large
InstanceType: p3.2xlarge
MaxCount: 1
MinCount: 0
Efa:
Enabled: FALSE
CustomActions:
OnNodeConfigured:
Script: "s3://OBSCURED/config-compute.sh"
Iam:
S3Access:
- BucketName: OBSCURED
Networking:
PlacementGroup:
Enabled: FALSE
SubnetIds:
- subnet-0a937bc9f0c04ad8b
LoginNodes:
Pools:
- Name: login
Count: 2
InstanceType: t3.xlarge
Networking:
AdditionalSecurityGroups:
- sg-09ca531e5331195f1
SubnetIds:
- subnet-0a937bc9f0c04ad8b
Ssh:
KeyName: OBSCURED
DevSettings:
Timeouts:
HeadNodeBootstrapTimeout: 7200 # timeout in seconds
ComputeNodeBootstrapTimeout: 7200 # timeout in seconds
SharedStorage:
- MountDir: /home
Name: home
StorageType: FsxLustre
FsxLustreSettings:
StorageCapacity: 1200
DeploymentType: SCRATCH_2
- MountDir: /opt/rstudio
Name: rstudio
StorageType: Efs
- MountDir: /opt/apps
Name: appstack
StorageType: Efs
EfsSettings:
FileSystemId: fs-OBSCURED
DirectoryService:
DomainName: OBSCURED
DomainAddr: ldap://OBSCURED
PasswordSecretArn: arn:aws:secretsmanager:eu-west-1:637485797898:secret:OBSCURED
DomainReadOnlyUser: cn=Administrator,cn=Users,dc=pwb,dc=posit,dc=co
GenerateSshKeysForUsers: true
AdditionalSssdConfigs:
override_homedir : /home/%u
ldap_id_use_start_tls : false
ldap_tls_reqcert : never
ldap_auth_disable_tls_never_use_in_production : true
Tags:
- Key: rs:environment
Value: development
- Key: rs:owner
Value: OBSCURED
- Key: rs:project
Value: solutions
- Key: rs:subsystem
Value: ukhsa
If you have an active AWS support contract, please open a case with AWS Premium Support team using the below documentation to report the issue: https://docs.aws.amazon.com/awssupport/latest/user/case-management.html
Before submitting a new issue, please search through open GitHub Issues and check out the troubleshooting documentation.
Please make sure to add the following data in order to facilitate the root cause detection.
Required Info:
- AWS ParallelCluster version [e.g. 3.1.1]: 3.10.0
- Full cluster configuration without any credentials or personal data.
- Cluster name: A2AiClustertesting
- Output of pcluster describe-cluster command.
- [Optional] ARN of the cluster CloudFormation main stack:
Bug description and how to reproduce: A clear and concise description of what the bug is and the steps to reproduce the behavior.
Cluster repeatedly fails to update and, from CloudFormation's point of view, goes to "ROLLBACK COMPLETE". (Custom routines do not even appear to get called.)
If you are reporting issues about scaling or job failure: We cannot work on issues without proper logs. We STRONGLY recommend following this guide and attaching the complete cluster log archive with the ticket.
For issues with the Slurm scheduler, please attach the following logs:
- /var/log/parallelcluster/clustermgtd
- /var/log/parallelcluster/clusterstatusmgtd (if version >= 3.2.0)
- /var/log/parallelcluster/slurm_resume.log
- /var/log/parallelcluster/slurm_suspend.log
- /var/log/parallelcluster/slurm_fleet_status_manager.log (if version >= 3.2.0)
- /var/log/slurmctld.log
- /var/log/parallelcluster/computemgtd.log
- /var/log/slurmd.log
If you are reporting issues about cluster creation failure or node failure:
If the cluster fails creation, please re-execute the create-cluster action using the --rollback-on-failure false option. We cannot work on issues without proper logs. We STRONGLY recommend following this guide and attaching the complete cluster log archive with the ticket.
Please be sure to attach the following logs:
- /var/log/cloud-init.log
- /var/log/cfn-init.log
- /var/log/chef-client.log (attached)
- /var/log/cloud-init-output.log: NA
logs.tgz <- head node logs
Additional context: Any other context about the problem. E.g.: ~/.parallelcluster/pcluster-cli.log