aws / aws-appsync-community

The AWS AppSync community
https://aws.amazon.com/appsync
Apache License 2.0
506 stars 32 forks

AppSync detached resolvers #146

Open alextriaca opened 3 years ago

alextriaca commented 3 years ago

It is possible for resolvers to become detached when an incorrect schema is deployed, and correcting the schema does not reattach the resolvers. In addition, this issue can ripple out to affect resolvers on other fields.

Reproduction Steps

The most localised reproduction of this can be done with a single field schema and resolver. Given the following schema and API: Schema:

type Query {
    ping: String
}

API (In CDK):

from aws_cdk import core, aws_appsync

class GraphqlTestStack(core.Stack):
    def __init__(self, scope: core.Construct, **kwargs) -> None:
        super().__init__(scope, "GraphqlTestStack", **kwargs)

        api = aws_appsync.GraphqlApi(
            self,
            "test_api-",
            name="test_api",
            schema=aws_appsync.Schema.from_asset("resources/schema.graphql"),
        )

        api.add_none_data_source("ping").create_resolver(
            type_name="Query",
            field_name="ping",
            request_mapping_template=aws_appsync.MappingTemplate.from_string(
                '{"version": "2018-05-29"}'
            ),
            response_mapping_template=aws_appsync.MappingTemplate.from_string(
                '$util.toJson("pong")'
            ),
        )

If you then deploy this stack, it will deploy successfully, with a query of ping returning pong. If you then deploy the following schema, you will end up with a resolver that points to nothing:

type Query {
}

If you then realise your mistake, correct the schema by adding the ping field back, and redeploy, the schema will correct itself and the resolver will be present, but the resolver will not be attached and a query to ping will return null.

Reproduction Steps (a more sinister presentation)

We can extend the previous example to add three fields to our Query like so:

Schema:

type Query {
    ping1: String
    ping2: String
    ping3: String
}

API (In CDK):

from aws_cdk import core, aws_appsync

class GraphqlTestStack(core.Stack):
    def __init__(self, scope: core.Construct, **kwargs) -> None:
        super().__init__(scope, "GraphqlTestStack", **kwargs)

        api = aws_appsync.GraphqlApi(
            self,
            "test_api-",
            name="test_api",
            schema=aws_appsync.Schema.from_asset("resources/schema.graphql"),
        )

        api.add_none_data_source("ping1").create_resolver(
            type_name="Query",
            field_name="ping1",
            request_mapping_template=aws_appsync.MappingTemplate.from_string('{"version": "2018-05-29"}'),
            response_mapping_template=aws_appsync.MappingTemplate.from_string('$util.toJson("pong1")'),
        )
        api.add_none_data_source("ping2").create_resolver(
            type_name="Query",
            field_name="ping2",
            request_mapping_template=aws_appsync.MappingTemplate.from_string('{"version": "2018-05-29"}'),
            response_mapping_template=aws_appsync.MappingTemplate.from_string('$util.toJson("pong2")'),
        )
        api.add_none_data_source("ping3").create_resolver(
            type_name="Query",
            field_name="ping3",
            request_mapping_template=aws_appsync.MappingTemplate.from_string('{"version": "2018-05-29"}'),
            response_mapping_template=aws_appsync.MappingTemplate.from_string('$util.toJson("pong3")'),
        )

If you then deploy a broken version of the schema:

type Query {
    ping1: String
    ping2: String this_is_a_typo
    ping3: String
}

This will deploy successfully; however, the resulting schema will be empty:

type Query {

}

If you then correct this mistake and deploy the original schema, the schema will be corrected and all three resolvers will be deployed; however, all three will be detached.


This is :bug: Bug Report

alextriaca commented 3 years ago

I'm not sure I follow how this can be expected behaviour. There are two issues I see here:

  1. You cannot deploy a resolver for a non-existent field; you will simply get a CloudFormation error. However, you can delete the schema from underneath the resolver without the same error being raised (example one above). A resolver requires you to specify the specific Type and Field it connects to, so mapping between the two is always possible.
  2. It is possible to deploy an "incorrect" schema which passes the deployment validation but does not deploy correctly (example two above). This can then result in multiple resolvers detaching for no reason (and, as you stated, they will not reattach once the schema is corrected).

ansman commented 3 years ago

I agree with @alextriaca. This cannot be expected behavior as we have never manually attached the resolvers in the first place.

The flow is something like this:

  1. An initial schema is deployed
  2. CDK creates the resolvers AND attaches them for us
  3. A new, invalid, schema is deployed
  4. CDK detaches the resolvers
  5. A new, valid, schema is deployed
  6. CDK does NOT reattach the resolvers (which I would argue it should)
alextriaca commented 3 years ago

I've found a better example showing that there is some subtle bug here rather than this being intended functionality. The resolvers below each have a timestamp embedded in their ID. Each time this stack is deployed, it will deploy new resolvers, attach them to the API, and then delete the old ones.

Schema:

type Query {
    ping1: String
    ping2: String
    ping3: String
}

API (In CDK):

from aws_cdk import core, aws_appsync
from datetime import datetime

class GraphqlTestStack(core.Stack):
    def __init__(self, scope: core.Construct, **kwargs) -> None:
        super().__init__(scope, "GraphqlTestStack", **kwargs)

        api = aws_appsync.GraphqlApi(
            self,
            "test_api-",
            name="test_api",
            schema=aws_appsync.Schema.from_asset("resources/schema.graphql"),
        )

        api.add_none_data_source(f"ping1_{int(datetime.now().timestamp())}").create_resolver(
            type_name="Query",
            field_name="ping1",
            request_mapping_template=aws_appsync.MappingTemplate.from_string('{"version": "2018-05-29"}'),
            response_mapping_template=aws_appsync.MappingTemplate.from_string('$util.toJson("pong1")'),
        )
        api.add_none_data_source(f"ping2_{int(datetime.now().timestamp())}").create_resolver(
            type_name="Query",
            field_name="ping2",
            request_mapping_template=aws_appsync.MappingTemplate.from_string('{"version": "2018-05-29"}'),
            response_mapping_template=aws_appsync.MappingTemplate.from_string('$util.toJson("pong2")'),
        )
        api.add_none_data_source(f"ping3_{int(datetime.now().timestamp())}").create_resolver(
            type_name="Query",
            field_name="ping3",
            request_mapping_template=aws_appsync.MappingTemplate.from_string('{"version": "2018-05-29"}'),
            response_mapping_template=aws_appsync.MappingTemplate.from_string('$util.toJson("pong3")'),
        )

Where this breaks is if you then deploy a broken schema (as below): you will now get a deploy error (resolver 2 will fail to attach), and this will trigger a rollback. After the rollback, none of the resolvers will be attached. This is definitely unintended.

type Query {
    ping1: String
    ping2: String this_is_a_typo
    ping3: String
}
ansman commented 3 years ago

I would love to get some attention on this. All of our resolvers were just detached during a deployment, and deploying again did not reattach them. I had to comment out the resolver part, deploy, uncomment the resolvers, and deploy again to get them reattached.

shaneyuan18 commented 3 years ago

We identified a bug in the schema validation process, and we are working on a fix so that it is not possible to deploy an incorrect schema.

ansman commented 3 years ago

I don’t think that will fix the root issue. I got this without deploying an invalid schema: I changed the name of some Lambdas and data sources, which resulted in all affected resolvers being detached.

alextriaca commented 3 years ago

I can also confirm I've seen the same behaviour as @ansman. The issue I had with reporting it was that I couldn't reliably reproduce it. It's exactly as @ansman says: when renaming a resolver/function/schema/field/something, you can end up with a detached resolver. I've really struggled to reproduce this reliably though. I do however think it's more closely related to how the resolvers fail to reattach than to an invalid schema causing this to occur. @shaneyuan18, super excited to hear about that schema validation change though; it will definitely increase the speed of failures being detected.

shaneyuan18 commented 3 years ago

If I understand correctly, I think there are 2 different root causes to resolvers being detached.

Issue 1: as @alextriaca mentioned earlier in the thread with the invalid schema update. This should be fixed once we have the schema validation work ready.

Issue 2: @ansman mentioned that resolvers were detached due to a data source update? I think this is a different issue, and its root cause is different from the one above.

If the above statement is correct, we can open a new Github issue on Issue 2, and discuss error reproducing steps and related items there. Thanks.

alextriaca commented 3 years ago

@shaneyuan18 sounds like a plan. There is definitely something more subtle here on issue 2 and a separate ticket to discuss it makes sense 👍

akuma12 commented 3 years ago

I think we've experienced an issue similar to Issue 2 in @shaneyuan18's comment above. Due to an update to CDK's AppSync libraries, our resolver logical resource IDs changed. On deployment, this caused the creation of new resolvers and the cleanup deletion of the old ones. I believe the new resolvers were created fine and attached to the fields, but when CloudFormation deleted the old resolvers, it also deleted the field-to-resolver mapping, leaving the fields without an attached resolver. We haven't reproduced it yet, but if we do I'll post the results here, or in a new ticket if it's warranted.

twrichards commented 3 years ago

Somewhat inspired by @Dave-McCraw's comment on a related issue, I have also added a workaround in our CDK code which computes an MD5 checksum of the schema and includes it as a comment in the resolver templates, since our CD will only run the CloudFormation step if the CloudFormation template has changed (a timestamp would trigger this on every run and slow our deploys, when in reality we don't change the schema that often).

I can't share the PR at the moment, but here's the implementation if it's useful...

    const gqlSchema = appsync.Schema.fromAsset(
      join(__dirname, "../shared/graphql/schema.graphql")
    );
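Stripped to its core, the workaround is just string manipulation. Here is a minimal Python sketch of the same idea (the function name and comment format are my own, not from the PR above): prepend the schema's MD5 digest as a VTL comment so the mapping template string, and hence the CloudFormation resolver resource, changes whenever the schema changes.

```python
import hashlib

def fingerprint_template(template: str, schema_text: str) -> str:
    """Prepend the schema's MD5 digest as a VTL comment.

    When the schema changes, the template string changes too, which
    forces CloudFormation to update (and hence re-attach) the resolver.
    The `##` line is a VTL comment, so it is ignored at runtime.
    """
    digest = hashlib.md5(schema_text.encode("utf-8")).hexdigest()
    return f"## schema md5: {digest}\n{template}"

schema = "type Query {\n    ping: String\n}\n"
template = '$util.toJson("pong")'
print(fingerprint_template(template, schema))
```

The resulting string would then be passed to `MappingTemplate.from_string` as usual.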
danielblignaut commented 3 years ago

I know this is not a productive comment, but the broader issue of resolvers becoming detached (irrespective of the cause) really makes AppSync unbearable on large schema deployments. Please can some time be invested at a CloudFormation or AppSync team level to resolve this. One would expect that on a valid CloudFormation template deployment, where you have a list of resolvers matching GraphQL types and fields, those resolvers are attached, and that on an invalid/failed deployment, the state of the AppSync app returns to where it was before.

I know there is complexity to solving this, but anything short of that is an unreliable product for the part of a tech stack that needs to be the most robust and failure-proof: your API. The fact that one can have a successful API-layer deployment, with resolvers specified in the CloudFormation template, only to perform an API query and find out that a resolver is not attached to the AppSync API is a big problem.

This is a topic that's been brought up on:

@ahmadizm or any other team members, can we bring more attention to this? I'm happy to invest my time working to help resolve this (whether it's our own use case examples or suggestions on how to resolve it)

bigkraig commented 3 years ago

Unfortunately there are a lot of sharp edges with CloudFormation & AppSync, and they don't appear likely to be resolved anytime soon.

We have renamed our Query/Mutation types to Query#/Mutation# and will increment # whenever we are likely to run into one of these problems. It causes a minor outage on the fields, but it's a better workaround than having an unknown number of detached resolvers.

Some other related issues:

https://github.com/aws-amplify/amplify-cli/issues/682

https://github.com/aws/aws-cdk/issues/13269

jpignata commented 2 years ago

Heya. AppSync team member here. I just wanted to check in on this issue. While the subject of this issue is still very much a sharp edge (i.e., the resolver just goes away, and your stack has no idea), we have worked to reduce some of the friction around managing resolvers. I wanted to check in with folks to understand what pain you're experiencing around this topic so we can focus our efforts. I'd appreciate it if you could post and re-articulate the friction or wishlist items here, or shoot me an email at pignataj at amazon.com. Thank you!

elishevarom commented 2 years ago

I just had the same issue. I accidentally deployed an invalid schema and my resolver got detached. Redeploying with a valid schema did not reattach the resolver. It happened in the test environment, so it doesn't really matter right now, but it would have been a problem in the prod environment. (It also took me a while to pinpoint the issue because there was no error, just a null response.) I commented out my resolver, deployed, and commented it back in. I agree that we shouldn't have to do the above steps; the resolver should be reattached automatically.

bigkraig commented 2 years ago

@jpignata We have moved away from AppSync due to this and other performance related issues.

jpignata commented 2 years ago

Gotcha, @bigkraig. Thanks for letting me know. If you have any interest in chatting further, I'd love to hear more.

@elishevarom Makes sense, and I agree. Looks like some folks have workarounds where they version resolvers and the schema, and increment each time. We can do better than this.

mille-printemps commented 2 years ago

I came across this issue. Is there any update?

@twrichards , can I ask you if your workaround, i.e. inserting a hash value into response mapping templates as a comment, is still working? I thought that this would be the most feasible and simplest way to work around this issue at this point if it was still working.

twrichards commented 2 years ago

> I came across this issue. Is there any update?
>
> @twrichards , can I ask you if your workaround, i.e. inserting a hash value into response mapping templates as a comment, is still working? I thought that this would be the most feasible and simplest way to work around this issue at this point if it was still working.

@mille-printemps Yes, definitely still working. I'm not aware of whether the bug has been fixed though.

mille-printemps commented 2 years ago

@twrichards , Thank you for the prompt reply. I will use this way to work around this issue.

I hope that this issue is fixed soon...

jeremycare commented 1 year ago

@jpignata Any update on this? We are facing the issue right now..

jpignata commented 1 year ago

@jeremycare can you possibly shoot me an email? I’d like to look at your issue specifically with our team to see what’s up. pignataj at amazon dot com.

trevorbennett commented 1 year ago

My team just encountered this issue, or some variant of it. We had three people pulling their hair out trying to solve it; the ultimate solution was to nuke everything underneath the GraphQL API itself (resolvers, data sources, and function configurations).

We commented out all those resources, deployed via sam, uncommented them, redeployed, and it worked like a charm.

jeremycare commented 1 year ago

We also ran into that yesterday and had to do the exact same thing. @jpignata it might be a regression on your side

alharris-at commented 1 year ago

One workaround I've seen work here is to explicitly model dependencies in CDK/CFN between all resolvers and the AppSync Schema object (not the API), and likewise between all data sources and the Schema object.

At least, that mitigation worked when I was encountering issues where updating my schema and some resolvers left the untouched resolvers detached. Based on the comments here about needing to explicitly rename all resolvers, I suspect something similar may be happening in your use-cases.
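In raw CloudFormation terms, that mitigation can be sketched as the fragment below (an illustration, not from this thread: the logical names are hypothetical, and an `AWS::AppSync::GraphQLApi` resource named `Api` is assumed elsewhere in the template). Each data source and resolver declares an explicit `DependsOn` on the schema resource, so CloudFormation always finishes the schema update before creating or updating them:

```yaml
Resources:
  ApiSchema:
    Type: AWS::AppSync::GraphQLSchema
    Properties:
      ApiId: !GetAtt Api.ApiId
      Definition: |
        type Query {
            ping: String
        }

  PingDataSource:
    Type: AWS::AppSync::DataSource
    DependsOn: ApiSchema          # ordered after every schema update
    Properties:
      ApiId: !GetAtt Api.ApiId
      Name: ping
      Type: NONE

  PingResolver:
    Type: AWS::AppSync::Resolver
    DependsOn: ApiSchema          # never touched before the schema settles
    Properties:
      ApiId: !GetAtt Api.ApiId
      TypeName: Query
      FieldName: ping
      DataSourceName: !GetAtt PingDataSource.Name
      RequestMappingTemplate: '{"version": "2018-05-29"}'
      ResponseMappingTemplate: '$util.toJson("pong")'
```

In CDK, the same ordering can be expressed by calling `node.add_dependency(...)` on the resolver and data source constructs, pointing at the construct that owns the schema resource.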