ghost commented 3 years ago

Thanks for this module

What

I followed the README to deploy a cluster, but the deployment doesn't work.

I continually get the following message:

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve secrets from ssm: service call has been retried 5 time(s): RequestCanceled: request context canceled caused by: context deadli...

But I am unable to track down why, as the logs are also not available.

I made sure to provide the secretstring in SSM Parameter Store as outlined and ensured it matches the .tpl file, so I am a bit stumped.

Any help would be very welcome.

Thanks in advance for any help you can provide.

cbishop-elsevier commented 3 years ago

@slimdevl

A bit more information on the stage at which you received this error (terraform init, terraform plan, terraform apply, a post-apply ECS Service failure, etc.) would make it easier to give you troubleshooting suggestions.

Have you tried looking in CloudTrail Events for the API Request that failed?

terraform apply operation failures will generally supply the RequestID in the error stack.

You can then search through CloudTrail Events for the precise API call that failed.

See: https://docs.aws.amazon.com/awscloudtrail/latest/userguide/view-cloudtrail-events-console.html

Hope this helps!

cbishop-elsevier commented 3 years ago

@slimdevl

One other suggestion: I would first check the AWS IAM role that Terraform assumes (sts:AssumeRole) for the AWS profile you specify. Make sure it has all the privileges required for ECS Service deployments, any required trust relationships, and the ability to create service-linked roles and to pass (iam:PassRole) the ECS task execution roles.

Here are a couple of docs which may provide insight:

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/codedeploy_IAM_role.html

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_execution_IAM_role.html

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using-service-linked-roles.html

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html
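
For reference, a minimal sketch of what a task execution role that can read an SSM parameter might look like. This is not the module's actual configuration: the resource names and the `parameter_arn` variable are hypothetical, and a parameter encrypted with a customer-managed KMS key would also need `kms:Decrypt` on that key.

```hcl
# Hypothetical input: the full ARN of the SSM parameter the task reads.
variable "parameter_arn" {
  type = string
}

# Trust policy: lets ECS tasks assume the role.
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "task_execution" {
  name               = "example-task-execution" # placeholder name
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

# AWS managed policy covering image pulls from ECR and CloudWatch Logs writes.
resource "aws_iam_role_policy_attachment" "task_execution_managed" {
  role       = aws_iam_role.task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Extra permission so the ECS agent can resolve the secret at task start.
data "aws_iam_policy_document" "read_parameter" {
  statement {
    actions   = ["ssm:GetParameters"]
    resources = [var.parameter_arn]
  }
}

resource "aws_iam_role_policy" "read_parameter" {
  name   = "read-ssm-parameter" # placeholder name
  role   = aws_iam_role.task_execution.id
  policy = data.aws_iam_policy_document.read_parameter.json
}
```

Note that a missing permission usually surfaces as an access-denied error rather than a timeout, so this check mainly helps rule IAM out.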

Also, if you received this error during a terraform apply operation, you may be able to review the output values described in the following Terraform docs on aws_caller_identity. Outputs aren't rendered during terraform plan operations and aren't always available after a failed apply; it depends on where exactly in the execution the failure occurs:

https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/caller_identity

https://www.terraform.io/docs/language/values/outputs.html
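
As a rough illustration of the two docs above, the following sketch exposes the identity Terraform is running as. The output names are arbitrary choices, not something the module defines.

```hcl
# Looks up the AWS identity behind the credentials Terraform is using.
data "aws_caller_identity" "current" {}

# Account that the resources will be created in.
output "account_id" {
  value = data.aws_caller_identity.current.account_id
}

# ARN of the calling user or assumed role; useful for confirming
# that the expected profile and role are in effect.
output "caller_arn" {
  value = data.aws_caller_identity.current.arn
}
```

After a successful apply, `terraform output caller_arn` prints the value.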

Regards,

Chris Bishop

cbishop-elsevier commented 3 years ago

@slimdevl

Something else I just realized while encountering a similar issue myself: the type of issue you reported is quite often associated with timeout or health check failures for services such as DynamoDB, Route53, Service Discovery, and SSM.

Take a look at the following resources and add retry counts and timeouts where you can (if they are not already set):

https://github.com/aws-samples/serverless-jenkins-on-aws-fargate/blob/main/example/bootstrap/main.tf#L39-L56

https://github.com/aws-samples/serverless-jenkins-on-aws-fargate/blob/main/modules/jenkins_platform/ecs.tf#L119-L144

https://github.com/aws-samples/serverless-jenkins-on-aws-fargate/blob/main/modules/jenkins_platform/jenkins_image.tf#L35-L47

https://github.com/aws-samples/serverless-jenkins-on-aws-fargate/blob/main/modules/jenkins_platform/templates/jenkins-controller.json.tpl#L34-L39

You might also consider adding depends_on to any module resource that depends on another resource tied to a service known to be prone to API timeouts. However, I would avoid depends_on unless you have no other viable option, preferring to let Terraform do what it does best: work out the ordering from the references between resources.

See: https://www.terraform.io/docs/language/meta-arguments/depends_on.html
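
A minimal sketch of an explicit dependency, assuming a case where the task reads an SSM parameter by name only (for example, from a rendered template), so Terraform cannot infer the ordering from a reference. All names, variables, and the parameter path are hypothetical and not taken from the module.

```hcl
variable "cluster_arn" {
  type = string
}

variable "task_definition_arn" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

variable "admin_password" {
  type      = string
  sensitive = true
}

resource "aws_ssm_parameter" "admin_password" {
  name  = "/example/admin-password" # placeholder path
  type  = "SecureString"
  value = var.admin_password
}

resource "aws_ecs_service" "example" {
  name            = "example"
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets = var.private_subnet_ids
  }

  # Nothing above references the parameter, so the ordering is stated
  # explicitly: create the parameter before the service starts tasks.
  depends_on = [aws_ssm_parameter.admin_password]
}
```

Where a direct reference such as `aws_ssm_parameter.admin_password.arn` can be used instead, Terraform derives the same ordering without `depends_on`.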

Anyway, I hope one of my suggestions here helps you resolve your issue!

I am also starting evaluation of Jenkins on ECS Fargate to determine if such an implementation will be viable in our use case, so I had hoped sharing my experience might save you some time and effort! 🙂

Cheers!

Chris Bishop

aMfM9E2 commented 3 years ago

@slimdevl

This doesn't seem like a well-packaged AWS solution; previously, everything from the AWS blog was a one-button build. I hit the same issue. There seems to have been a change in network behavior in ECS platform version 1.4; details can be found on Stack Overflow. In my case it worked after I created a VPC with two private subnets routed to a NAT gateway and two public subnets routed to an internet gateway (IGW).

apogorielov commented 3 years ago

Hi @aMfM9E2,

My apologies for the inconvenience. Unfortunately, in order to accommodate use cases with pre-existing network infrastructure (e.g. when a VPC is connected to a corporate network and provisioned beforehand), we had to make VPC creation a prerequisite. Based on your comment, I have updated the README to be more explicit and indicate that the private subnets need outbound access through a NAT gateway.

Hi @slimdevl,

As @aMfM9E2 correctly mentioned, this behavior can happen when the private subnets do not have internet access and therefore cannot reach the AWS APIs to pull information such as ECR images and secrets. Please make sure that the private subnets route outbound traffic through a NAT gateway and let us know if the issue persists.
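
The network layout described above could be sketched roughly as follows. This is an illustration only, not the module's prerequisite VPC: the CIDR ranges and resource names are arbitrary, a single NAT gateway is shared by both private subnets to keep it short, and `domain = "vpc"` on `aws_eip` assumes AWS provider v5 or later (older versions use `vpc = true`).

```hcl
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "example" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_internet_gateway" "example" {
  vpc_id = aws_vpc.example.id
}

# Two public subnets (10.0.0.0/24 and 10.0.1.0/24).
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.example.id
  cidr_block              = cidrsubnet(aws_vpc.example.cidr_block, 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
}

# Two private subnets (10.0.10.0/24 and 10.0.11.0/24) for the Fargate tasks.
resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.example.id
  cidr_block        = cidrsubnet(aws_vpc.example.cidr_block, 8, count.index + 10)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}

resource "aws_eip" "nat" {
  domain = "vpc"
}

# The NAT gateway itself lives in a public subnet.
resource "aws_nat_gateway" "example" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id
  depends_on    = [aws_internet_gateway.example]
}

# Public subnets: default route to the internet gateway.
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.example.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.example.id
  }
}

# Private subnets: default route to the NAT gateway, which gives tasks
# outbound access to the ECR and SSM APIs.
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.example.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.example.id
  }
}

resource "aws_route_table_association" "public" {
  count          = 2
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count          = 2
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}
```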

ghost commented 3 years ago

Thanks @apogorielov and @aMfM9E2, I will give your suggestions a try.

apogorielov commented 3 years ago

Closing this issue since the module was successfully tested as part of the last PR.

aws-samples / serverless-jenkins-on-aws-fargate

example unable to access SSM jenkins-pwd #7
