ardatan / graphql-mesh

🕸️ GraphQL Federation Framework for any API services such as REST, OpenAPI, Swagger, SOAP, gRPC and more...
https://the-guild.dev/graphql/mesh
MIT License
3.28k stars 342 forks source link

gRPC handler timeout #5295

Open matthewdfleming opened 1 year ago

matthewdfleming commented 1 year ago

Currently when I'm using the generated SDK I set a timeout of 4000, so 4 seconds. What is happening though is that the connection seems to go into la la land for a full 10 minutes. The response from the sdk call doesn't come back until then.

sources:
  - name: GRPC
    handler:
      grpc:
        endpoint: "{env.GRPC_API_URL}"
        requestTimeout: 4000
        useHTTPS: true
        source: sources/authentication.proto

I know this isn't enough to troubleshoot and I'm willing to step through it myself but when I tried this I had no idea which library was responsible for actually making the gRPC call (not when generating the graphql schema but when using it).

Anyone point me in the right direction?

matthewdfleming commented 1 year ago

@ardatan or maybe @Urigo can you maybe point me in the right direction to figure out/debug where the connection requestTimeout ends up and which library is actually invoking the gRPC call?

matthewdfleming commented 1 year ago

So what I'm doing is that I have a Nestjs server using the graphql-mesh generated clients to call a gRPC endpoint..

Nestjs --> graphql-mesh client -1-> gRPC server.

What I'm seeing is a 1 CANCELLED: Call response after some time. The calls work for a while but then suddenly seize up. The CANCELLED response only comes back like 15 minutes later. And then after that, no more calls work. Any ideas?

ardatan commented 1 year ago

Maybe you can share a reproduction on CodeSandbox or StackBlitz.

matthewdfleming commented 1 year ago

@ardatan yeah I'm really trying to do that but the error seems to happen after some period of time (probably based upon some condition that is unknown to me now). I'm trying to lock down how to even reproduce the problem consistently in the first place. Am I right that the @grpc-js dependency is the library actually making the gRPC calls to the remote server?

ardatan commented 1 year ago

We use the official gRPC client for gRPC requests; https://github.com/Urigo/graphql-mesh/blob/master/packages/handlers/grpc/package.json#L42

matthewdfleming commented 1 year ago

@ardatan Here's what I've found... So when we call getMeshSDK(), mesh essentially creates the schema and executor to make the calls. For the gRPC source, this creates an gRPC Client which then creates a Http2SubchannelConnector which has a http2.ClientHttp2Session. That client/connector/session is used for all of the calls to the meshSdk because the sdk is memoized and not re-created on every call (which makes sense).

So what happens (not exactly sure when) is that the underlying http2 session gets cancelled at some point. Once that happens that connection/session is never going to work again (it's canceled) and a new connection/session would need to be established for the client in order to make new calls.

Annoyingly this condition isn't a fail-fast type of deal either, so the execute-start call doesn't have a corresponding execute-end until after there's some sort of timeout that happens (like 15 minutes!). Once the execute-end comes back we see 1 CANCELLED: Call cancelled as the error.

So it looks like something is amiss in either how this package is using the grpc-js client (e.g. shouldn't memoize, not responding to a channel state change like CANCELLED appropriately, etc) or there's something internally off in the client itself. On the latter, I made sure I'm at the latest version of the grpc-js client (which as of now is 1.8.13).

I'd love to be able to reproduce this consistently but don't have a ton of time to dig into this right now. Thoughts on this?

matthewdfleming commented 1 year ago

So in the interim, I've changed the NestJS service class (which uses the api) to not be a singleton. In NestJS parlance that means changing the service to be REQUEST scoped rather than DEFAULT. I'm going to test that out now but my guess is that it'll work fine but maybe a little less efficiently.

matthewdfleming commented 1 year ago

So even creating a new wrapper class on each request didn't help as the Http2SubchannelConnector (and session) is only still instantiated once.

ardatan commented 1 year ago

I see. Still I need a reproduction in order to help you :)

matthewdfleming commented 1 year ago

@ardatan Can you check out the log I posted here to see what you can make of it? In particular it was unexpected to me that there were two client creations calls during the GrpcHandler.visit part of things.. My hunch is that somehow the channels are smashing against each other and I was expecting only a single Client to be created.

Edit (explanation of multiple Clients)

So the graphql-mesh implementation creates a ServiceClient for each Service definition inside the proto files. And each ServiceClient has its own http2 connection underneath.

That makes sense to me.. maybe it's a problem with the 'retry' logic in the grpc-js lib or something when the connections are not already initialized. I can try to 'connect' prior to using the other service calls.. via the connectivityState Queries that get injected into the graphql document via the processor.

matthewdfleming commented 1 year ago

@ardatan So I was able to get to the bottom (theoretically) about the issue. It seems like we have an unstable server environment that's going up and down. My currently hypothesis is that mesh does not seem to be setting the 'deadline' in the grpc-js client despite me specifying a 'requestTimeout' in the grpc handler file. I see this when I enable grpc-js trace logs (set environment variables GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG):

| D 2023-04-10T18:29:54.115Z | channel | (2) dns:<<snip>> createResolvingCall [3948] method="GetClientToken", deadline=Infinity
| D 2023-04-10T18:29:54.115Z | resolving_call | [3948] Created
| D 2023-04-10T18:29:54.115Z | resolving_call | [3948] Deadline: Infinity
| D 2023-04-10T18:29:54.115Z | resolving_call | [3948] start called
| D 2023-04-10T18:29:54.115Z | resolving_call | [3948] startRead called

So I think this is ultimately the issue I'm seeing.

matthewdfleming commented 1 year ago

I can send up a PR but I figured it out.. essentially gRPC's client constructor takes an additional argument which specifies a CallOptions argument. A quick hack to make that work from the programmatic calling is to patch utils.ts to include that extra argument as an additional callFnArgument.

When you call the sdk you do so by passing a second object with a deadline present (a future Date).. e.g.

      const { whatever } = await api.theMethodNeedingATimeout_query({
        // method input
      },
      {
        deadline: Date.now() + 4000       // a 4000ms timeout on this one call
      }
    );
matthewdfleming commented 1 year ago

@ardatan So enabling deadlines definitely helped with individual request stability. So each request has a defined timeout and won't stay open for minutes (mine were staying open for 15 minutes). So even with that in place, the http2 session and connections do not clean themselves up when the deadline is hit. For connection cleanup to happen (more frequently), I've enabled grpc keepalives. The following patch has both things in there:

diff --git a/cjs/index.js b/cjs/index.js
index 4c68c4bfc17ccf509488f03d69a23b7ffc1f35b4..3304bcee3923968d9e3110380883e621f3211ff5 100644
--- a/cjs/index.js
+++ b/cjs/index.js
@@ -326,7 +326,11 @@ ${rootJsonEntries
             if (typeof ServiceClient !== 'function') {
                 throw new Error(`Object at path ${objPath} is not a Service constructor`);
             }
-            const client = new ServiceClient((_a = string_interpolation_1.stringInterpolator.parse(this.config.endpoint, { env: cross_helpers_1.process.env })) !== null && _a !== void 0 ? _a : this.config.endpoint, creds);
+            const client = new ServiceClient((_a = string_interpolation_1.stringInterpolator.parse(this.config.endpoint, { env: cross_helpers_1.process.env })) !== null && _a !== void 0 ? _a : this.config.endpoint, creds,  {
+                "grpc.keepalive_time_ms": 10000,
+                "grpc.keepalive_timeout_ms": 1500,
+                "grpc.keepalive_permit_without_calls": 1
+            });
             const subId = this.pubsub.subscribe('destroy', () => {
                 client.close();
                 this.pubsub.unsubscribe(subId);
diff --git a/cjs/utils.js b/cjs/utils.js
index ab5fac2e258980859de0454860603dfb77005576..0d87ed45eb0f74baa64286c1cf26ee00a46c7707 100644
--- a/cjs/utils.js
+++ b/cjs/utils.js
@@ -69,6 +69,9 @@ function addMetaDataToCall(callFn, input, resolverData, metaData, isResponseStre
         }
         callFnArguments.push(meta);
     }
+    if (resolverData && resolverData.context) {
+        callFnArguments.push({ ...resolverData.context });
+    }
     return new Promise((resolve, reject) => {
         const call = callFn(...callFnArguments, (error, response) => {
             if (error) {

These are both simple patches and are not the ideal implementation for both a GraphiQL/web client vs using the typescript generated client. The KeepAlive and Deadline settings would ideally be configurable via the meshrc.yaml file and work in both calling contexts.