envoyproxy / gateway

Manages Envoy Proxy as a Standalone or Kubernetes-based Application Gateway
https://gateway.envoyproxy.io
Apache License 2.0

Intermittent 'OAuth Flow Failure' potentially caused by reuse of old connections in pool #3178

Open coro opened 4 months ago

coro commented 4 months ago

Description:

We are frequently seeing routes hosted by the Gateway fail with a 401, returning the error "OAuth flow failed.". Looking at the logs, this seems to occur when Envoy decides to reuse an existing long-lived connection to our OIDC issuer. Failures seem to correlate with Envoy picking connections that are on the order of 50 minutes old.

Repro steps: Using the following SecurityPolicy:

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: grafana-public
  namespace: default
spec:
  oidc:
    clientID: <ID>
    clientSecret:
      group: ""
      kind: Secret
      name: oauth-secret
    logoutPath: /logout
    provider:
      issuer: https://oauth.id.jumpcloud.com
    redirectURL: https://grafana.my.website/oauth2/callback
    scopes:
    - openid
    - email
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: grafana-public

Intermittently, requests to this route that involve an OAuth flow via our IdP result in this error message. Refreshing the page resolves the issue.

Environment:
Gateway: v1.0.0 installed via Helm
Envoy: envoyproxy/envoy:distroless-v1.29.2

Logs: Filtering debug logs based on the ConnectionId shows Envoy trying to reuse a connection from about 50 minutes earlier and immediately getting a remote close:

2024-04-11T11:15:24.284912467Z stderr F [2024-04-11 11:15:24.284][13][debug][connection] [./source/common/network/connection_impl.h:98] [Tags: "ConnectionId":"631"] current connecting state: true
2024-04-11T11:15:24.284917407Z stderr F [2024-04-11 11:15:24.284][13][debug][client] [source/common/http/codec_client.cc:57] [Tags: "ConnectionId":"631"] connecting
2024-04-11T11:15:24.284924487Z stderr F [2024-04-11 11:15:24.284][13][debug][connection] [source/common/network/connection_impl.cc:1009] [Tags: "ConnectionId":"631"] connecting to 104.18.6.220:443
2024-04-11T11:15:24.284981748Z stderr F [2024-04-11 11:15:24.284][13][debug][connection] [source/common/network/connection_impl.cc:1028] [Tags: "ConnectionId":"631"] connection in progress
2024-04-11T11:15:24.2864077Z stderr F [2024-04-11 11:15:24.286][13][debug][connection] [source/common/network/connection_impl.cc:746] [Tags: "ConnectionId":"631"] connected
2024-04-11T11:15:24.291450813Z stderr F [2024-04-11 11:15:24.291][13][debug][client] [source/common/http/codec_client.cc:88] [Tags: "ConnectionId":"631"] connected
2024-04-11T11:15:24.291462863Z stderr F [2024-04-11 11:15:24.291][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:328] [Tags: "ConnectionId":"631"] attaching to next stream
2024-04-11T11:15:24.291465283Z stderr F [2024-04-11 11:15:24.291][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:182] [Tags: "ConnectionId":"631"] creating stream
2024-04-11T11:15:24.291498664Z stderr F [2024-04-11 11:15:24.291][13][debug][client] [source/common/http/codec_client.cc:141] [Tags: "ConnectionId":"631"] encode complete
2024-04-11T11:15:24.727530651Z stderr F [2024-04-11 11:15:24.727][13][debug][client] [source/common/http/codec_client.cc:128] [Tags: "ConnectionId":"631"] response complete
2024-04-11T11:15:24.727776477Z stderr F [2024-04-11 11:15:24.727][13][debug][pool] [source/common/http/http1/conn_pool.cc:53] [Tags: "ConnectionId":"631"] response complete
2024-04-11T11:15:24.727778997Z stderr F [2024-04-11 11:15:24.727][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:215] [Tags: "ConnectionId":"631"] destroying stream: 0 remaining
2024-04-11T12:05:31.600070064Z stderr F [2024-04-11 12:05:31.599][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:265] [Tags: "ConnectionId":"631"] using existing fully connected connection
2024-04-11T12:05:31.600072024Z stderr F [2024-04-11 12:05:31.599][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:182] [Tags: "ConnectionId":"631"] creating stream
2024-04-11T12:05:31.600075984Z stderr F [2024-04-11 12:05:31.599][13][debug][client] [source/common/http/codec_client.cc:141] [Tags: "ConnectionId":"631"] encode complete
2024-04-11T12:05:31.600379448Z stderr F [2024-04-11 12:05:31.600][13][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:241] [Tags: "ConnectionId":"631"] remote address:104.18.6.220:443,TLS_error:|33554536:system library:OPENSSL_internal:Connection reset by peer:TLS_error_end
2024-04-11T12:05:31.600384448Z stderr F [2024-04-11 12:05:31.600][13][debug][connection] [source/common/network/connection_impl.cc:714] [Tags: "ConnectionId":"631"] remote close
2024-04-11T12:05:31.600386439Z stderr F [2024-04-11 12:05:31.600][13][debug][connection] [source/common/network/connection_impl.cc:278] [Tags: "ConnectionId":"631"] closing socket: 0
2024-04-11T12:05:31.600413999Z stderr F [2024-04-11 12:05:31.600][13][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:329] [Tags: "ConnectionId":"631"] SSL shutdown: rc=-1
2024-04-11T12:05:31.600416589Z stderr F [2024-04-11 12:05:31.600][13][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:241] [Tags: "ConnectionId":"631"] remote address:104.18.6.220:443,TLS_error:|33554536:system library:OPENSSL_internal:Connection reset by peer:TLS_error_end|33554464:system library:OPENSSL_internal:Broken pipe:TLS_error_end
2024-04-11T12:05:31.600437819Z stderr F [2024-04-11 12:05:31.600][13][debug][client] [source/common/http/codec_client.cc:107] [Tags: "ConnectionId":"631"] disconnect. resetting 1 pending requests
2024-04-11T12:05:31.600440799Z stderr F [2024-04-11 12:05:31.600][13][debug][client] [source/common/http/codec_client.cc:158] [Tags: "ConnectionId":"631"] request reset
2024-04-11T12:05:31.600880165Z stderr F [2024-04-11 12:05:31.600][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:495] [Tags: "ConnectionId":"631"] client disconnected, failure reason: TLS_error:|33554536:system library:OPENSSL_internal:Connection reset by peer:TLS_error_end|33554464:system library:OPENSSL_internal:Broken pipe:TLS_error_end
2024-04-11T12:05:31.600927146Z stderr F [2024-04-11 12:05:31.600][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:215] [Tags: "ConnectionId":"631"] destroying stream: 0 remaining
arkodg commented 4 months ago

cc @zhaohuabing

zhaohuabing commented 4 months ago

Looks like the IdP closed the connection after 50 minutes, and sometimes Envoy didn't realize this when it initiated the remote call to the IdP (likely a race condition between the server sending an RST and Envoy reusing a connection from its pool).

Unfortunately, the Envoy OAuth2 filter does not support retries when it fails to connect to the IdP. As a workaround, increasing the TCP socket timeout on the IdP side would help reduce the chance of failures.

Blocked by upstream issue: https://github.com/envoyproxy/envoy/issues/33572

ikewabo commented 4 months ago

Having the same issue with the Okta IdP. Okta OIDC error: OAuth flow failed
GET http://xxxxxx.ddns.net/oauth2/callback?code=Tz0xxxxxxxxNKvHnLEg5MqOXyCRvByfwMCxxxxxxxxxxk&state=http%3A%2F%2Fxxxxxxx.ddns.net%2F 401 (Unauthorized)

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: oidc-example
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: backend
  oidc:
    provider:
      issuer: "https://xxxxxx.okta.com/oauth2/default"
    redirectURL: http://xxxxxxx.ddns.net/oauth2/callback
    clientID: "0xxxxxxxXoexxxxx7"
    clientSecret:
      name: "eg-secret"
    logoutPath: /logout

Any luck on this?

zhaohuabing commented 4 months ago

@ikewabo @coro how often does this happen? what's the percentage of failed requests?

coro commented 4 months ago

@ikewabo @coro how often does this happen? what's the percentage of failed requests?

Looking at the average for last week, roughly 1/3 of OAuth requests are hitting this error for us.

ikewabo commented 4 months ago

@ikewabo @coro how often does this happen? what's the percentage of failed requests?

Happens 100% of the time. I thought it was my Deployment; however, I switched to a basic Nginx backend and the issue persists.

zhaohuabing commented 4 months ago

Neither of these sounds like a connection race issue, since the failure percentage is too high.

ikewabo commented 3 months ago

Neither of these sounds like a connection race issue, since the failure percentage is too high.

You may be right. I didn't attribute this to anything - just seeking help. Note: my Envoy Gateway routes successfully without the SecurityPolicy; the error happens only when the SecurityPolicy is introduced. I read somewhere that HTTPS is mandatory for the SecurityPolicy to take effect. How true is this requirement?

Have you ever had success in a non-https setup?

coro commented 3 months ago

@zhaohuabing Just circling back on this - we've reached out to Jumpcloud to see if there's anything on their end which would confirm the suspicions we mentioned above about timeouts. Interestingly, I've done some further packet capture and it seems like Jumpcloud may be rejecting all OAuth messages on long-lived connections - I'm seeing some connections being sent TCP RSTs after just 7 minutes, not just connections older than 30 minutes. The longest connection I can see from the packet capture that isn't sent a RST is only 100 seconds old.

Another interesting observation is that Jumpcloud logs two successful OAuth authentications whenever this failure occurs, as opposed to just logging one where the OAuth flow actually succeeds. These two authentications are 2-3 seconds apart.

Is there anything we can enable / configure in the OAuth filter to help diagnose this further?

Additionally, it would be great to get more options to configure the OAuth filter, e.g. enabling refresh token support. At the moment the access token generated expires every hour, forcing a refresh in the browser to continue using the site we are exposing.

zhaohuabing commented 3 months ago

Neither of these sounds like a connection race issue, since the failure percentage is too high.

You may be right. I didn't attribute this to anything - just seeking help. Note: my Envoy Gateway routes successfully without the SecurityPolicy; the error happens only when the SecurityPolicy is introduced. I read somewhere that HTTPS is mandatory for the SecurityPolicy to take effect. How true is this requirement?

Have you ever had success in a non-https setup?

The OIDC e2e tests succeed with an HTTP setup. However, most browsers don't allow secure cookies over HTTP connections, and many OIDC providers don't allow HTTP callback URLs in production.

zhaohuabing commented 3 months ago

Is there anything we can enable / configure in the OAuth filter to help diagnose this further?

The Envoy debug log may help; it can be turned on with a customized EnvoyProxy CR: https://gateway.envoyproxy.io/latest/tasks/operations/customize-envoyproxy/
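
For example, something like this (a minimal sketch, assuming the EnvoyProxy CR's logging.level field; the name and namespace are placeholders, and the resource still needs to be referenced from your GatewayClass via spec.parametersRef as described in the linked task):

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: debug-logging            # placeholder name
  namespace: envoy-gateway-system
spec:
  logging:
    level:
      default: debug             # raise all Envoy components to debug level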

Additionally, it would be great to get more options to configure the OAuth filter, e.g. enabling refresh token support. At the moment the access token generated expires every hour, forcing a refresh in the browser to continue using the site we are exposing.

That would be a great addition to the current OIDC functionality. @coro, would you please raise an issue to track it?

coro commented 2 months ago

@zhaohuabing Just to let you know where I've got up to on this - JumpCloud have confirmed that the keep-alive timeout on connections to the OAuth token endpoint on their end is 60s. Looking at the packet capture they don't send any connection close messages after this 60s is elapsed, only a TCP reset once a message on a connection older than this comes through. This explains why Envoy attempts to reuse this connection - from Envoy's point of view the connection is indeed alive.

My expectation is that if either Envoy did a retry on token fetches, or JumpCloud sent a connection close once the keep-alive timeout expired, then we wouldn't hit this issue. Does that match your understanding?

zhaohuabing commented 2 months ago

@coro Thanks for digging into this. I believe that Envoy OAuth2 filter should retry on token fetches.

This is tracked in this issue: https://github.com/envoyproxy/envoy/issues/33572. If no one picks this up, I'll try to address it when I get some cycles.

arkodg commented 2 months ago

will enabling TCP keepalives in the cluster config work? https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/http_uri.proto#config-core-v3-httpuri Worth trying it out using an EnvoyPatchPolicy.
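
Roughly the raw Envoy cluster fields involved (a sketch based on envoy.config.cluster.v3.Cluster's upstream_connection_options and core.v3.TcpKeepalive; the values are just examples, not recommendations):

upstream_connection_options:
  tcp_keepalive:
    keepalive_probes: 3     # probes before the connection is declared dead
    keepalive_time: 45      # seconds of idle time before the first probe; keep below the IdP's idle timeout
    keepalive_interval: 10  # seconds between probes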

zhaohuabing commented 2 months ago

will enabling TCP keepalives in the cluster config work? https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/http_uri.proto#config-core-v3-httpuri Worth trying it out using an EnvoyPatchPolicy.

Brilliant! It should solve this problem.

arkodg commented 2 months ago

probably also worth looking at the cluster metrics to see what is happening

coro commented 2 months ago

I'd prefer to understand what is happening on the Envoy side, and why it's reusing the connection even after receiving an RST, so I'd prefer if we collected cluster metrics and logs at debug/trace level before we proposed a fix

@arkodg Just seen this on the linked PR. If you're referring to this issue, then I'd like to clarify: Envoy isn't reusing any connections that have received a TCP RST. The issue is that the TCP RST is only sent by the OIDC server once Envoy attempts to reuse an old connection, which could be any amount of time after the OIDC server's connection TTL. I believe if the OIDC server sent a connection close as soon as it considered the connection past the TTL then Envoy wouldn't use it for OIDC and we wouldn't see this issue. But instead the server only sends RSTs at the point of OAuth flow.

I've attached debug logs at the top of this issue which cover a time where we hit this.

In terms of cluster metrics, all our Gateways have prometheus metrics collected. What particular metrics would be of interest to you? Separately, I will try the EnvoyPatchPolicy and let you know the results. Thanks for your help so far!

zhaohuabing commented 2 months ago

@coro I think the metrics @arkodg mentioned in the PR are envoy_cluster_upstream_cx_destroy_local_with_active_rq and upstream_cx_destroy_remote_with_active_rq.

https://karlstoney.com/istio-503s-ucs-and-tcp-fun-times/
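
If it helps, a rough sketch of a Prometheus alerting rule to surface those counters (assuming the stats are scraped from Envoy's Prometheus endpoint, where they appear with the envoy_ prefix and an envoy_cluster_name label; the group and alert names are placeholders):

groups:
  - name: oauth-upstream-connections
    rules:
      - alert: UpstreamClosedConnectionWithActiveRequest
        # Fires when an upstream (e.g. the OIDC token endpoint cluster) closes
        # a connection while Envoy still has a request in flight on it.
        expr: increase(envoy_cluster_upstream_cx_destroy_remote_with_active_rq[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Cluster {{ $labels.envoy_cluster_name }} dropped a connection with an active request"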

There is still a question for me: why did the server not send the FIN/RST when closing the connection?

Even if we add the tcpKeepalive in that PR, the keepalive timeout at the OIDC server (60s) is shorter than the default keepalive interval (75s), so a default tcpKeepalive won't solve your problem.

Maybe you could ask JumpCloud if it's possible to make it longer?

https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/address.proto#envoy-v3-api-msg-config-core-v3-tcpkeepalive

keepalive_interval (UInt32Value) The number of seconds between keep-alive probes. Default is to use the OS level configuration (unless overridden, Linux defaults to 75s.)

coro commented 2 months ago

@zhaohuabing @arkodg Thanks for that link, that's a fascinating read!

I've attached a screenshot of the metrics we observed during an instance of this error here, which will hopefully clarify what's happening from Envoy's perspective: https://github.com/envoyproxy/gateway/pull/3568#issuecomment-2176482113

I am not sure why their server is not sending an early FIN/RST, rather than only when a connection older than their configured TTL is used. I have confirmed with their engineering team that this is the behaviour of their server, and I don't believe we'll be able to change it.

That said, I implemented the EnvoyPatchPolicy a week ago and have not seen any instance of this problem since:

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyPatchPolicy
metadata:
  name: tcp-timeout-patch
  namespace: kube-system
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: cluster-lb-public
    namespace: kube-system
  type: JSONPatch
  jsonPatches:
    - type: "type.googleapis.com/envoy.config.cluster.v3.Cluster"
      name: "oauth_id_jumpcloud_com_443"
      operation:
        op: add
        path: "/upstream_connection_options"
        value:
          tcp_keepalive:
            keepalive_probes: 1
            keepalive_time: 59 # less than jumpcloud's 60s timeout

As far as this issue is concerned, I'm happy for this to be closed as we've got something that works for us pretty reliably now. That said, feel free to keep this open if you'd like any more metrics or logs from us, or to discuss any other solutions. You've been very responsive and helpful with us which I greatly appreciate, and if there's anything more that would be useful to you from our system let me know!

zhaohuabing commented 2 months ago

@coro Glad to hear that you solved this with EnvoyPatchPolicy. Let's keep this issue open and solve this upstream by enabling retries in the OAuth filter.

arkodg commented 2 months ago

thanks for testing this out @coro, glad the workaround of keeping the connection alive using TCP keepalives works

benediktwerner commented 1 month ago

We seem to be running into the same issue, seeing quite a high number of OAuth failures showing up as envoy_cluster_upstream_cx_destroy_remote_with_active_rq.

It looks like in our case the issue is caused by AWS Network Load Balancers in front of the OAuth upstream, which have an idle timeout of 350s and of course can't properly close the HTTP connection since they operate at the TCP level. Arguably they could still close the TCP connection more gracefully, but they don't, it doesn't seem to be configurable, and it works as documented by AWS. They only send an RST packet once you try to send anything on the "closed" connection, which means the first request Envoy sends on such an expired connection immediately fails.

We'll try setting up a keepalive, but it would definitely be nice to have a solution for this that doesn't require a patch (especially since it seems like we'll be able to get rid of all our other patches with Envoy Gateway v1.1).

Seems like the upstream issue for OAuth retries got closed due to inactivity; are there still any plans to implement something like that?

Though I guess setting up a keep-alive is probably the better solution in our case, even if OAuth retries are a more general way to solve these kinds of issues by default, since retries don't require knowing any specific idle timeouts.

So would it make sense to add a tcpKeepalive option to the OIDC config, maybe on OIDCProvider? If that sounds reasonable, I can create a PR for it. Something like the sketch below is what I have in mind.
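
(Purely a hypothetical shape to illustrate the proposal; none of the tcpKeepalive fields under provider exist in the current API, and the issuer/client values are placeholders.)

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: oidc-example
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: backend
  oidc:
    provider:
      issuer: "https://idp.example.com"
      # Hypothetical field, not part of today's API: keepalive settings for
      # the cluster Envoy creates for the provider's token endpoint.
      tcpKeepalive:
        probes: 1
        idleTime: 300s   # tuned below the upstream/LB idle timeout (e.g. the NLB's 350s)
    clientID: "example-client-id"
    clientSecret:
      name: "eg-secret"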

arkodg commented 1 month ago

hoping https://github.com/envoyproxy/gateway/issues/3069 can also accommodate cluster level settings for the oauth2 token endpoint

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days.