DevelopingSpace / starchart

A self-serve tool for managing custom domains and certificates
MIT License

Fail earlier with DNS issues #368

Closed humphd closed 1 year ago

humphd commented 1 year ago

I tried creating a DNS record on staging, and it's failing due to what appears to be a config/permissions error:

starchart_mycustomdomain-dev.1.spbo7h7vjgwx@cudm-mgmt01dv.dcm.senecacollege.ca    | {"$fault":"client","$metadata":{"attempts":1,"httpStatusCode":403,"requestId":"a250980e-b074-41da-85e4-08cc5a7d9e66","totalRetryDelay":0},"Code":"AccessDenied","Type":"Sender","level":"warn","message":"User: arn:aws:iam::REDACTED is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::REDACTED because no identity-based policy allows the route53:ChangeResourceRecordSets action","name":"AccessDenied","timestamp":"2023-03-17T15:28:22.328Z"}
starchart_mycustomdomain-dev.1.spbo7h7vjgwx@cudm-mgmt01dv.dcm.senecacollege.ca    | {"$fault":"client","$metadata":{"attempts":1,"httpStatusCode":403,"requestId":"cc2ff934-ea00-4035-aa38-55c86e1b43bb","totalRetryDelay":0},"Code":"AccessDenied","Type":"Sender","level":"warn","message":"User: arn:aws:iam::REDACTED is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::REDACTED because no identity-based policy allows the route53:ChangeResourceRecordSets action","name":"AccessDenied","timestamp":"2023-03-17T15:28:37.589Z"}
starchart_mycustomdomain-dev.1.spbo7h7vjgwx@cudm-mgmt01dv.dcm.senecacollege.ca    | {"$fault":"client","$metadata":{"attempts":1,"httpStatusCode":403,"requestId":"99b035ba-1c17-4518-89db-54b69a2c63a3","totalRetryDelay":0},"Code":"AccessDenied","Type":"Sender","level":"warn","message":"User: arn:aws:iam::REDACTED is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::REDACTED because no identity-based policy allows the route53:ChangeResourceRecordSets action","name":"AccessDenied","timestamp":"2023-03-17T15:29:07.734Z"}
starchart_mycustomdomain-dev.1.spbo7h7vjgwx@cudm-mgmt01dv.dcm.senecacollege.ca    | {"$fault":"client","$metadata":{"attempts":1,"httpStatusCode":403,"requestId":"aa4a0070-fa9f-4b43-8d99-8c73d7e87fe6","totalRetryDelay":0},"Code":"AccessDenied","Type":"Sender","level":"warn","message":"User: arn:aws:iam::REDACTED is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::REDACTED because no identity-based policy allows the route53:ChangeResourceRecordSets action","name":"AccessDenied","timestamp":"2023-03-17T15:30:08.063Z"}
starchart_mycustomdomain-dev.1.spbo7h7vjgwx@cudm-mgmt01dv.dcm.senecacollege.ca    | {"$fault":"client","$metadata":{"attempts":1,"httpStatusCode":403,"requestId":"d1b53823-f8b8-4765-9fe0-b05c7c2c83b0","totalRetryDelay":0},"Code":"AccessDenied","Type":"Sender","level":"warn","message":"User: arn:aws:iam::REDACTED is not authorized to perform: route53:ChangeResourceRecordSets on resource: arn:aws:route53:::REDACTED because no identity-based policy allows the route53:ChangeResourceRecordSets action","name":"AccessDenied","timestamp":"2023-03-17T15:32:08.322Z"}

However, it never reaches the error state on the server or in the UI:

Screenshot 2023-03-17 at 11 49 37 AM

I'm discussing the config/permissions with ITS, but I wanted to talk about the failure case in our DNS flow. These 5 errors happen pretty quickly (within about four minutes, from 15:28:22 to 15:32:08 in the logs above), and then nothing happens.

I think we should fail sooner. In this case, no amount of waiting is going to fix the problem.

If we are throttled by AWS, we'll get this:

Five requests per second per AWS account per Region. If you submit more than five requests per second in a Region, Resolver returns an HTTP 400 error (Bad request). The response header also includes a Code element with a value of Throttling and a Message element with a value of Rate exceeded.

If we get throttled, retrying later is good. If a call to AWS is failing for a reason that waiting can't change, like the AccessDenied above, retrying forever isn't likely to fix things.
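To make the throttled vs. failing distinction concrete, here is a minimal sketch of a classifier over the error shape seen in the logs above. The helper name `isRetryableAwsError`, the `AwsLikeError` interface, and the list of throttling codes are all assumptions for illustration, not existing starchart or AWS SDK code, and the list would need checking against what Route 53 actually returns.

```typescript
// Subset of the fields visible in the logged errors above.
interface AwsLikeError {
  name?: string;
  Code?: string;
  $fault?: 'client' | 'server';
  $metadata?: { httpStatusCode?: number };
}

// Assumed list of codes that mean "slow down and try again later".
// 'Throttling' comes from the quoted text; the others are guesses to verify.
const THROTTLING_CODES = new Set([
  'Throttling',
  'ThrottlingException',
  'PriorRequestNotComplete',
]);

// Hypothetical helper: true when waiting and retrying could plausibly help.
function isRetryableAwsError(err: AwsLikeError): boolean {
  const code = err.Code ?? err.name ?? '';
  if (THROTTLING_CODES.has(code)) {
    return true;
  }
  // Server-side faults (5xx) are treated as transient in this sketch.
  if (err.$fault === 'server') {
    return true;
  }
  const status = err.$metadata?.httpStatusCode;
  if (status !== undefined && status >= 500) {
    return true;
  }
  // Everything else (e.g. AccessDenied, 403) is treated as permanent.
  return false;
}
```

With this, the AccessDenied error from the logs would be classified as not retryable, while a `Throttling` response would still go through the normal retry/backoff path.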

cc @Genne23v, @dadolhay

Genne23v commented 1 year ago

Can we throw UnrecoverableError based on the error type from AWS? (I'm not sure which AWS errors should trigger it, though.) Then we can stop the workflow without reducing the waiting time.
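A rough sketch of what that could look like, assuming the workflow runs on a job queue such as BullMQ, which exports a class named `UnrecoverableError`. To keep the snippet self-contained it defines a local stand-in class instead of importing the real one. The helper `toQueueError` and the list of fatal codes are hypothetical, and which AWS errors belong in that list is exactly the open question.

```typescript
// Local stand-in so the sketch runs on its own. In a BullMQ worker this
// would be imported from the queue library instead (assumption).
class UnrecoverableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'UnrecoverableError';
  }
}

// Hypothetical set of AWS error codes that no amount of waiting will fix.
// 'AccessDenied' is the one from the logs; the others are candidates to verify.
const FATAL_AWS_CODES = new Set(['AccessDenied', 'InvalidInput', 'NoSuchHostedZone']);

// Hypothetical helper: convert fatal AWS errors into UnrecoverableError and
// pass everything else through so the normal retry/backoff still applies.
function toQueueError(err: unknown): Error {
  const e = err as { name?: string; Code?: string; message?: string };
  const code = e?.Code ?? e?.name ?? 'UnknownError';
  if (FATAL_AWS_CODES.has(code)) {
    return new UnrecoverableError(`${code}: ${e.message ?? 'no message'}`);
  }
  return err instanceof Error ? err : new Error(String(e?.message ?? code));
}
```

A worker would wrap its AWS call in `try`/`catch` and `throw toQueueError(err)` from the `catch`. Permanent failures then stop the workflow right away, while throttling errors keep the existing delays.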