aws-samples / 1click-hpc

Deploy your HPC Cluster on AWS in 20min. with just 1-Click.
MIT No Attribution

LBInit issues #21

Open rvencu opened 2 years ago

rvencu commented 2 years ago

A while ago I ran into LBInit issues: when I delete a stack, LBInit usually fails to delete. The workaround is to wait a few more minutes and then retry the stack deletion, which then succeeds.

But today I started having problems with its creation as well. In the CloudWatch log I find this:

{
    "Status": "FAILED",
    "Reason": "See the details in CloudWatch Log Stream: 2022/06/22/[$LATEST]d883c58469a545d58f5fb61863e901ef",
    "PhysicalResourceId": "2022/06/22/[$LATEST]d883c58469a545d58f5fb61863e901ef",
    "StackId": "arn:aws:cloudformation:us-east-1:842865360552:stack/origtest/0cdfe300-f1fa-11ec-b068-121de38a7e19",
    "RequestId": "10fc583d-c908-41c1-af07-751ba3a4b563",
    "LogicalResourceId": "LBInit",
    "NoEcho": false,
    "Data": {
        "ClientErrorCode": "NoSuchEntity",
        "ClientErrorMessage": "The Server Certificate with name origtest-981587795.us-east-1.elb.amazonaws.com cannot be found."
    }
}

I have another HPC cluster active under a different name, which should not interfere with creating a second cluster in the same account. The error above still appears with everything set to AUTO.

rvencu commented 2 years ago

I started debugging the issue and found that none of the previous certificates had been deleted on rollback or stack deletion. I suspect I hit some kind of limit on the number of certificates, because saving a new certificate no longer worked.

After I cleaned up the old certificates, the LBInit creation succeeded.
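
For anyone hitting the same thing, here is a minimal sketch of how the cleanup could be scripted with boto3. It is not part of 1click-hpc. It assumes the leftover certificates are IAM server certificates named after the load balancer DNS name (as in the error message above, `origtest-981587795.us-east-1.elb.amazonaws.com`), that the name starts with the stack name, and that IAM's per-account quota on server certificates (20 by default, to my knowledge) is the limit being hit. The function names and the `my-active-cluster` stack name are hypothetical.

```python
def find_stale_certificates(iam_client, active_stack_names):
    """Return names of IAM server certificates that look like leftovers.

    Assumption: certificates created by LBInit are named after the ELB DNS
    name, i.e. "<stack-name>-<id>.<region>.elb.amazonaws.com".
    """
    stale = []
    paginator = iam_client.get_paginator("list_server_certificates")
    for page in paginator.paginate():
        for meta in page["ServerCertificateMetadataList"]:
            name = meta["ServerCertificateName"]
            # Skip certificates that do not follow the ELB naming pattern.
            if not name.endswith(".elb.amazonaws.com"):
                continue
            # Skip certificates that belong to a stack that is still in use.
            if any(name.startswith(stack + "-") for stack in active_stack_names):
                continue
            stale.append(name)
    return stale


def delete_certificates(iam_client, names, dry_run=True):
    """Delete the given server certificates; only print them when dry_run is set."""
    deleted = []
    for name in names:
        if dry_run:
            print(f"[dry run] would delete {name}")
            continue
        iam_client.delete_server_certificate(ServerCertificateName=name)
        deleted.append(name)
    return deleted


def main():
    # boto3 is imported here so the helpers above stay importable without it.
    import boto3

    iam = boto3.client("iam")
    # "my-active-cluster" is a placeholder for stacks that must be kept.
    stale = find_stale_certificates(iam, active_stack_names=["my-active-cluster"])
    delete_certificates(iam, stale, dry_run=True)
```

Calling `main()` with `dry_run=True` only prints the candidates. Review the list before switching to `dry_run=False`, since a certificate still attached to a running load balancer should not be removed.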

Of course, the error on LBInit deletion still needs to be addressed.

nicolaven commented 2 years ago

noted! thanks