aws / chalice

Python Serverless Microframework for AWS
Apache License 2.0
10.65k stars 1.01k forks source link

Race condition in CF template for custom domain mapping #1735

Open chkothe opened 3 years ago

chkothe commented 3 years ago

The way the api_gateway_custom_domain feature is represented in the CFN template generated by chalice package is currently prone to a race condition, but it’s relatively easy to fix.

Issue Currently, the ApiCustomDomainMapping resource (I guess that’s of type AWS::ApiGateway::BasePathMapping) doesn’t wait until the RestAPI’s Stage resource has been created -- however, if that happens, its creation fails (as in the screenshot below).

reproduction

That stage resource isn’t explicit in the template generated by Chalice because it’s only injected by the Serverless transform, and it’s then named RestAPI<stagename>Stage (in my screenshot it’s called RestAPIv1Stage because my stage is called v1); that's as per SAM docs.

Fortunately the RestAPI (the AWS::Serverless::Api) has an attribute that conveniently serves to help with exactly that issue, namely Stage (i.e. RestAPI.Stage). That expression gets replaced by the SAM transform by the actual logical name of the stage resource that it injects. That is, the !Ref RestAPI.Stage becomes !Ref RestAPIv1Stage if the stage were called v1 (also mentioned on the same SAM doc page where it says Referenceable property). And finally the !Ref RestAPIv1Stage resolves to the name of the stage (here v1), which is documented here under Return Values. So the technique comes down to using !Ref RestAPI.Stage instead of the literal stage name (v1 for me), which creates an explicit dependency of the base path mapping on the stage resource, and thus delays its creation until the stage has been created. That technique is also described in this git issue comment. With that fix, the stack creation succeeds as in the below screenshot.

CF-dependencies

At least one person commented that it the old (broken) behavior be non-deterministic, although in our stack the old version seems to fail pretty much every time (the new version always succeeds).

I made a PR that implements that, it’s basically 1 line and another line in the unit test. (for the record, one could also use a DependsOn and I had tried that too and it works too, but it’s addressed more concisely by simply referencing the .Stage attribute).

Edit: Concise rewrite of the issue description so it’s easier to follow.

chkothe commented 3 years ago

Btw I cleaned up the issue description above, should now be much clearer

MasterNayru commented 3 years ago

@chkothe Did you test this change on a stack that had been created with the current behaviour and then try applying an update after your change had been applied? I have hit this exact same issue when dealing with raw CloudFormation templates (not SAM) but I would have thought that creating an explicit stage with the same name would cause stack updates to fail if they go from the old behaviour to this behaviour (with the stage in it)

chkothe commented 3 years ago

Hey, yes, we did (going from chalice 1.23 to this PR).

So if you apply this fix to an existing stack, what it does is that it replaces a literal string (in a base path mapping resource) by a reference to a stage resource, which in turn resolves to the same string as before.

And the stage resource that's referenced has already been an implicit part of the SAM template all along, so no physical resource gets added, removed, or updated by this. Actually, when you apply the change, the deployed base path mapping wouldn’t even be updated either, since nothing in the physical resource actually changed (same string as before) -- the only thing that changes is the dependency order that CFN is required to adhere to.

Therefore, if I was worried about anything, it’d be that CFN gives its trademark 'error' that “nothing actually changed” when doing a deploy with the fix applied. Now in the context of chalice, the version update would incur some other material changes in the stack (namely the lambda code package or the dependencies layer image), so that particular hiccup wouldn't happen.

MasterNayru commented 3 years ago

@chkothe Right, so there is a Stage resource being created but it's just not being made an explicit dependency. Looks like a pretty straightforward improvement to me, then.

The issue I was referring to happens in the case where you create a deployment with a string name but don't have a Stage resource in the stack, but then later try to use a Stage resource with the same name as whatever was set before. It seems like the SAM resource is doing the right thing in that regard, though.

jonathanabila commented 3 years ago

@chkothe thank you for the solution! <3

Any updates on the revision of this ticket? This was a serious problem for me and after using the code in PR everything worked as expected.

It would be awesome if we could include this fix in the new release - it's small and probably quick to review.

Update:

After applying this solution, I realized that there are two stages in my Gateway API instead of just one: image

This wasn't supposed to happen, right @chkothe?

chkothe commented 3 years ago

@jonathanabila sure np. That issue is most likely unrelated (the PR doesn't add or rename any resources), e.g., could be a config change while iterating on the deployment.

jonathanabila commented 3 years ago

@jonathanabila sure np. That issue is most likely unrelated (the PR doesn't add or rename any resources), e.g., could be a config change while iterating on the deployment.

Ok - I'll try again in another environment - thanks again!

jonathanabila commented 3 years ago

@jonathanabila sure np. That issue is most likely unrelated (the PR doesn't add or rename any resources), e.g., could be a config change while iterating on the deployment.

Hi @chkothe I did another deployment to create a new environment and got the same error: image

Do you have any information I can give you to help with the debugging process?

chkothe commented 3 years ago

Hi Johnathan, if you want to be 120% sure that it's not caused by this PR, you might also try a different way that comes down to the same end result (delaying the creation of the custom domain mapping until after the stage has been created): https://github.com/chkothe/chalice/commit/b464c7f9b3826494ae7ac63ce7463bced20f546e

... that's the commit for an alternative PR I had made with the same goal (namely fixing the resource creation order), which works by adding a DependsOn clause to the CF template instead of a cross-reference. Again, none of them should affect naming in any way, only the order in which resources are created, but surely there's a lot going on at deploy time and the latter definitely makes it more clear what's changing (although unfortunately it's not the prettiest depends-on due to the way SAM names its resources). So if you were to use that, and yet you still get mystery stages appearing every now and then in your CF stack then it'd prove pretty conclusively that it's caused by some other glitch (whatever that may be).

Nath-P commented 2 years ago

Is there a reason chkothe 's commit can't be merged? I hit this issue today. (deploying using cdk) I can work around by deploying the stack first without the custom mapping and then adding it later but its a pain

AlexFromXD commented 1 year ago

any update of this ?

Ankushpandey-ti commented 1 year ago

any update on this? I am facing same issue can't deploy chalice using cdk, getting this error

Resource handler returned message: "Invalid stage identifier specified (Service: ApiGateway, Status Code: 400, Request ID: b3ecd546-1030-4f67-bd99-b32bd282c82c)" (RequestToken: dda5727b-6185-e7db-25c2-a9210135ff7a, HandlerErrorCode: InvalidRequest)
isaacsancheza commented 11 months ago

A work around when using CDK.

from typing import Any, cast

import aws_cdk as cdk
from aws_cdk import aws_apigateway as apigatewayv1
from aws_cdk import aws_apigatewayv2 as apigateway
from constructs import Construct
from chalice.cdk import Chalice

class ChaliceNestedStack(cdk.NestedStack):
    def __init__(
            self, 
            scope: Construct, 
            construct_id: str, 
            /, 
            *, 
            source_dir: str = 'chalice',
            stage_config: dict[str, Any],
        ) -> None:
        super().__init__(scope, construct_id)

        self.app = Chalice(
            self, 
            'App', 
            source_dir=source_dir,
            stage_config=stage_config,
        )

        rest_api = cast(apigateway.CfnApi, self.app.get_resource('RestAPI'))
        custom_domain = cast(apigateway.CfnDomainName, self.app.get_resource('ApiGatewayCustomDomain'))
        base_path_mapping = cast(apigatewayv1.CfnBasePathMapping, self.app.get_resource('ApiGatewayCustomDomainMapping'))

        custom_domain.add_dependency(rest_api)        
        base_path_mapping.add_dependency(custom_domain)
empeekdev commented 9 months ago

Almost 3 years and no fix?