GoogleCloudPlatform / esp-v2

A service proxy that provides API management capabilities using Google Service Infrastructure.
https://cloud.google.com/endpoints/
Apache License 2.0
272 stars 171 forks source link

Websocket connections closes after sending first message to server #704

Open submarines-and opened 2 years ago

submarines-and commented 2 years ago

Also logged in GCP support portal as case id: 29925315

We are having a strange issue with web sockets and esp-v2. Basically, new websocket connections closes after sending first message to server with status wasClean false.

image

Response from websocket connection looks fine:

image image image

Our setup: Browser --> Cloud endpoints (with espv2) --> Cloud run --> Socket.io through NestJs on NodeJs

If we remove Cloud endpoints and go directly to cloud run, everything works fine. Browser --> Cloud run --> Nodejs

My top suspicions: 1) There is a timeout being hit for streams.E.g. max_stream_duration or similar. I am using defaults for all timeout values; maybe there is another value I can increase? https://cloud.google.com/endpoints/docs/openapi/specify-esp-v2-startup-options#timeout

This issue seems similar if this is the case: https://github.com/projectcontour/contour/issues/3692

2) Maybe we have misunderstood what is required in the openapi specification. Currently, we have a /get endpoint for the web sockets. If we remove this, no requests come through, but maybe there is a magic way to configure this.

image

Is this a bug in esp-v2 or is there maybe some flag we are missing to get it to work?

submarines-and commented 2 years ago

On closer inspection, it seems like the connection is closing on first message sent to server.

Reproduction (swap web socket-url to one using cloud endpoints vs not). Works when bypassing

var socketEndpoint = "wss://websocket-url.com" let socket = new WebSocket(socketEndpoint); socket.onopen = async (message) => console.log('CONNECTED', message); socket.onmessage = (message) => message.data === "2" ? socket.send(${3}) : console.log(message); //ping socket.onclose = (message) => console.log('CLOSE', message) socket.onerror = (ex) => console.log('ERROR', ex);

WAIT AND THEN SEND - CLOSES INSTANTLY socket.send(40${JSON.stringify({ message: "test", })});

qiwzhang commented 2 years ago

How did you deploy ESPv2? I assume it is in Cloud Run, Is following your setup?

websocket client ->   ESPv2 (deployed in Cloud Run)  ->  Webscoket server (deployed in Cloud Run)

We have an integration test to test websocket. Of course it is all in localhost as

websocket-client   ->  ESPv2   -> websocket echo server 

websocket server just echos whatever message it receives. client send 5 messages, and verified it received 5 echoed messages.

I just checked our code we dont set max_stream_duration

So I don't know, could you turn on debug by adding --enable_debug flag in ESPv2 and get its log for us. It may provide more clues.

submarines-and commented 2 years ago

I did some more testing (using a different socket library) and can see that the connection in fact closes as soon as the SERVER sends a message. Maybe the gateway is considering the connection finished when it sends data?

1) That setup is correct. We are using the deploy script found in the official GCP tutorial

2) I added the debug flag, but I am not really sure what to look for. Is there I way I can share the logs privately (because they contain quite a bit of internal information)

qiwzhang commented 2 years ago

Since you deployed ESPv2 in Cloud Run, you can get the logs from Cloud Console, => Cloud Run => logs.

you can email it directly to me at qiwzhang@google.com

submarines-and commented 2 years ago

Logs emailed (and uploaded to GCP support case)

submarines-and commented 2 years ago

I also tried using server side events as a substitute, but they also do not work (close after 15 seconds), while working fine if espv2 is bypassed.

In envoy docs is says route timeout needs to be disabled: https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/timeouts#route-timeouts

But the only options I have in espv2 isdeadline in openapi-spec and http_request_timeout_s in deploy options. I have set both to 300 and it still closes after 15s.

Is this unsupported, or is there another flag to set? I know this is unrelated to this issue, but if SSE works, I can use that instead and this issue does not need to be investigated.

nareddyt commented 2 years ago

15s seems suspiciously similar to our OpenAPI default deadline: https://cloud.google.com/endpoints/docs/openapi/openapi-extensions#deadline

I have set both to 300 and it still closes after 15s.

After updating the OpenAPI spec and using gcloud services deploy, did you make sure to rebuild and redeploy the ESPv2 container onto Cloud Run?

nareddyt commented 2 years ago

Here is the logic where we update Envoy route's timeout and idle_timeout based on the deadline configuration

https://github.com/GoogleCloudPlatform/esp-v2/blob/a8d79f086947cce47695d8e49fc25d6f8c5c3bb1/src/go/configinfo/service_info.go#L586-L628

We have a specific case for method.IsStreaming to disable timeout completely.

My suspicion is how we determine if method.IsStreaming is set, maybe we are setting it to false for your method. Is there any way to specify in OpenAPI spec if your method is using server side events? If not, then ESPv2 will not know which paths in your OpenAPI spec are unary vs server side streaming. It will not disable the regular timeout field.

For now, let's try to debug why regular websockets are not working based on the logs you sent @qiwzhang .

qiwzhang commented 2 years ago

Here is the timeline from the ESP logs I attached here https://b.corp.google.com/issues/235315519#comment5.

05:21:28.019: request come with Upgrade

05:21:28.033: server replied with websocket accept

05:21:28.092 upstream connection reset

upstream connection reset is less than 1 second after accepted.

So I don't feel it is our default deadline issue.

On Mon, Jun 13, 2022 at 7:12 AM Teju Nareddy @.***> wrote:

Here is the logic where we update Envoy route's timeout and idle_timeout based on the deadline configuration

https://github.com/GoogleCloudPlatform/esp-v2/blob/a8d79f086947cce47695d8e49fc25d6f8c5c3bb1/src/go/configinfo/service_info.go#L586-L628

We have a specific case for method.IsStreaming to disable timeout completely.

My suspicion is how we determine if method.IsStreaming is set, maybe we are setting it to false for your method. Is there any way to specify in OpenAPI spec if your method is using server side events? If not, then ESPv2 will not know which paths in your OpenAPI spec are unary vs server side streaming. It will not disable the regular timeout field.

For now, let's try to debug why regular websockets are not working based on the logs you sent @qiwzhang https://github.com/qiwzhang .

— Reply to this email directly, view it on GitHub https://github.com/GoogleCloudPlatform/esp-v2/issues/704#issuecomment-1153972274, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACZ6TBZ2PSVHKEV5KR67Q7LVO46WFANCNFSM5YBTOUBQ . You are receiving this because you were mentioned.Message ID: @.***>

maroux commented 1 year ago

Running into the same issue. Was this ever resolved @submarines-and @qiwzhang

submarines-and commented 1 year ago

We switched architecture

1) Switched web sockets for "Server side events". 2) Replaced cloud endpoints with a custom gateway running in cloud run