Kong / openresty-patches

Moved to https://github.com/Kong/kong-build-tools
Apache License 2.0

feat(patch) added upstream keepalive enhancement #40

Closed: jeremyjpj0916 closed this 5 years ago

jeremyjpj0916 commented 5 years ago

Opening for discussion and potential future merge if Kong believes the value is there.

Firstly, what does this fix?
https://discuss.konghq.com/t/routing-issues-when-sending-to-multiple-kubernetes-pods/1880/7
https://github.com/Kong/kong/issues/4001
https://github.com/Kong/kong/issues/3788
https://github.com/Kong/kong/issues/2835
https://trac.nginx.org/nginx/ticket/1340

Initial testing of it is showing positive results: https://discuss.konghq.com/t/building-kong-from-src-idea-and-current-error/3699/4

We will be doing some follow-up work in our specific architecture (I am confident the remaining issue is specific to our setup), but from the upstream perspective the patch seems to be working as intended, fixing the problem explained in the earlier issues and below.


Simple explanation: In today's world of complex routing, NGINX has failed to produce upstream keepalive behavior that supports all types and forms of TLS connection management for services sharing an edge on the same ip:port pair (think load-balanced services, cloud platforms). But now Kong can.


Technical explanation: When routing to multiple TLS upstreams, you can connect to a variety of edges that lead onward to app servers. A common setup uses one of these:

  1. Passthrough - The edge honors the SNI from the first inbound TLS handshake, and any keepalive traffic over that same connection lands on the same app server the edge selected during that first handshake. This is the type of routing that causes NGINX trouble.

  2. Re-encrypt - The edge terminates TLS, evaluates the Host header, and determines which app server to route to, then re-encrypts the transaction before sending it on to the app server.

  3. Edge - The edge terminates TLS and routes to the app server over HTTP based on the Host header.

The snag Kong faces, due to the underlying NGINX upstream keepalive limitations, is this: when Kong hosts an assortment of upstreams requiring a TLS connection, the following can happen:

App server 1 = myapp1.company.com: Kong -> Edge 10.96.130.74:443 (ip:port) -> 10.22.138.72:443 App server 1 (passthrough edge)

App server 2 = myapp2.company.com: Kong -> Edge 10.96.130.74:443 (ip:port) -> 10.11.145.35:443 App server 2 (any TLS edge type)

With current NGINX behavior, if App server 1 is called first, it will take all traffic meant for App server 2.

Kong establishes a keepalive connection to an upstream app behind passthrough TLS; the connection is established and routed for TLS handshaking via SNI. Now say there is a totally different upstream application (App server 2) hosted behind this same edge; it could be passthrough, edge, or re-encrypt protected, as long as it is a TLS-exposed application reached through the same edge.

The problem is that Kong has already established a keepalive TLS-encrypted connection to the edge over this ip:port pair. At this point the edge will take traffic from Kong intended for App server 2 and send it across the old passthrough TLS connection that was kept alive to App server 1. In a nutshell, the first passthrough route called on an edge swallows up all TLS traffic destined for that same edge, regardless of which app server the transaction was meant for. Kong sets the appropriate Host header regardless, but standard passthrough edge routing is governed by the SNI of the first handshake, which produces this undesired behavior; realistically the edge can't be any smarter, which puts the pressure for a fix on Kong/NGINX.


This patch, written by Thibault himself, changes the behavior by making NGINX "smarter": connections are managed not simply by ip:port pair, but with the SNI (hostname) taken into account as well. A separate keepalive connection can now be maintained per SNI, so multiple applications can receive proper TLS traffic through the same edge (ip:port) without the first passthrough application invoked taking all the TLS traffic destined for that edge.
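
To make the new reuse rule concrete, here is a conceptual sketch in Lua of the check the patch performs (the real implementation is the C diff further down; this is only an illustration):

    -- Conceptual sketch only: with pool_ssl_name=on, an idle upstream connection
    -- is reused when the socket address matches AND, for TLS connections, the SNI
    -- it was originally opened with matches the new request.
    local function can_reuse(cached, request)
        if cached.ip ~= request.ip or cached.port ~= request.port then
            return false
        end
        -- plain-text connections carry no SNI and keep the old ip:port-only behavior
        return cached.sni == nil or cached.sni == request.sni
    end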

To enable the patch's different upstream keepalive behavior, you must configure it like so; note the new optional pool_ssl_name=on argument that triggers the patch's changed connection management:

keepalive ${{UPSTREAM_KEEPALIVE}} pool_ssl_name=on;

I open this PR not because I necessarily think anything will change anytime soon (although it would be welcomed by us, as this is one of our two biggest issues right now), but to kick off the dialog and gather opinions. Obviously this was not implemented in the past for a reason, so maybe some light can be shed on the why here. The design being an optional, non-breaking argument gives me hope it could be worked in. I also selfishly would like to see it included at some point, because I would much rather have Kong ensure the patch stays safe from NGINX version to NGINX version than rely on my own untrained eyes 😆. Hopefully the NGINX upstream keepalive code is not changed too often, though!

jeremyjpj0916 commented 5 years ago

Regarding my earlier comment that we were facing something specific to our architecture: that was indeed the case. In our dev sandbox there was an issue with client RSTs; the tl;dr is that it is not due to this change but to the LB in front of Kong. More here: https://discuss.konghq.com/t/building-kong-from-src-idea-and-current-error/3699/7?u=jeremyjpj0916

So far results continue to be promising; upstream TLS connection management is behaving as desired. We will likely roll this patch out to production on 7/1 as well, as stage continues to show positive results.

Some info on keepalive vs non keepalive in our testing:

Keepalive vs. no-keepalive performance difference: these tests show a 93% reduction in latency when routing to a backend exposed over TLS, using a 1-thread, 1-connection wrk:

With keepalive (5 tests):

  # wrk -t 1 -c 1 -d 60 https://gateway.company.com/ha/passthrough/kongdev/test
Running 1m test @ https://gateway.company.com/ha/passthrough/kongdev/test
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    22.29ms   77.13ms   1.01s    96.10%
    Req/Sec   108.60     25.24   131.00     91.33%
  6285 requests in 1.00m, 1.74MB read
Requests/sec:    104.71
Transfer/sec:     29.76KB
# wrk -t 1 -c 1 -d 60 https://gateway.company.com/ha/passthrough/kongdev/test
Running 1m test @ https://gateway.company.com/ha/passthrough/kongdev/test
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    17.66ms   73.63ms   1.01s    98.32%
    Req/Sec   109.43     12.10   131.00     90.51%
  6447 requests in 1.00m, 1.79MB read
Requests/sec:    107.28
Transfer/sec:     30.49KB
# wrk -t 1 -c 1 -d 60 https://gateway.company.com/ha/passthrough/kongdev/test
Running 1m test @ https://gateway.company.com/ha/passthrough/kongdev/test
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.34ms    7.32ms 209.37ms   98.35%
    Req/Sec   112.50      9.74   131.00     83.28%
  6711 requests in 1.00m, 1.86MB read
Requests/sec:    111.85
Transfer/sec:     31.79KB
# wrk -t 1 -c 1 -d 60 https://gateway.company.com/ha/passthrough/kongdev/test
Running 1m test @ https://gateway.company.com/ha/passthrough/kongdev/test
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.93ms   23.84ms 461.87ms   99.00%
    Req/Sec   112.81     10.88   131.00     83.19%
  6701 requests in 1.00m, 1.86MB read
Requests/sec:    111.65
Transfer/sec:     31.73KB
# wrk -t 1 -c 1 -d 60 https://gateway.company.com/ha/passthrough/kongdev/test
Running 1m test @ https://gateway.company.com/ha/passthrough/kongdev/test
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.14ms   12.28ms 220.40ms   98.47%
    Req/Sec   110.47     13.50   131.00     89.09%
  6578 requests in 1.00m, 1.83MB read
Requests/sec:    109.55
Transfer/sec:     31.13KB

Keepalive disabled (5 tests):

# wrk -t 1 -c 1 -d 60 https://gateway.company.com/ha/passthrough/kongdev/test
Running 1m test @ https://gateway.company.com/ha/passthrough/kongdev/test
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   205.89ms  367.14ms   1.81s    85.56%
    Req/Sec    45.32     18.92    70.00     74.58%
  1704 requests in 1.00m, 482.96KB read
Requests/sec:     28.39
Transfer/sec:      8.05KB
# wrk -t 1 -c 1 -d 60 https://gateway.company.com/ha/passthrough/kongdev/test
Running 1m test @ https://gateway.company.com/ha/passthrough/kongdev/test
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   216.57ms  366.44ms   1.74s    84.39%
    Req/Sec    44.21     19.47    70.00     74.67%
  1405 requests in 1.00m, 399.04KB read
  Socket errors: connect 0, read 0, write 0, timeout 1
Requests/sec:     23.41
Transfer/sec:      6.65KB
# wrk -t 1 -c 1 -d 60 https://gateway.company.com/ha/passthrough/kongdev/test
Running 1m test @ https://gateway.company.com/ha/passthrough/kongdev/test
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   206.19ms  375.95ms   1.83s    85.09%
    Req/Sec    65.92     34.28   161.00     64.61%
  2430 requests in 1.00m, 1.93MB read
  Non-2xx or 3xx responses: 1558
Requests/sec:     40.43
Transfer/sec:     32.89KB
# wrk -t 1 -c 1 -d 60 https://gateway.company.com/ha/passthrough/kongdev/test
Running 1m test @ https://gateway.company.com/ha/passthrough/kongdev/test
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   145.99ms  304.76ms   1.70s    86.53%
    Req/Sec    79.53     32.53   131.00     76.16%
  3498 requests in 1.00m, 2.28MB read
  Non-2xx or 3xx responses: 2723
Requests/sec:     58.26
Transfer/sec:     38.88KB
# wrk -t 1 -c 1 -d 60 https://gateway.company.com/ha/passthrough/kongdev/test
Running 1m test @ https://gateway.company.com/ha/passthrough/kongdev/test
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   250.09ms  418.64ms   1.86s    84.28%
    Req/Sec    43.34     18.16    70.00     78.78%
  1434 requests in 1.00m, 407.31KB read
Requests/sec:     23.87
Transfer/sec:      6.78KB
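
As a quick cross-check of the 93% figure, here is the arithmetic over the per-run mean latencies reported above (plain Lua, illustrative only):

    -- Mean of the per-run average latencies reported above (ms).
    local keepalive    = { 22.29, 17.66, 9.34, 10.93, 10.14 }
    local no_keepalive = { 205.89, 216.57, 206.19, 145.99, 250.09 }

    local function mean(t)
        local sum = 0
        for _, v in ipairs(t) do sum = sum + v end
        return sum / #t
    end

    local ka, noka = mean(keepalive), mean(no_keepalive)
    print(string.format("%.1fms vs %.1fms -> %.0f%% reduction",
                        ka, noka, (1 - ka / noka) * 100))
    -- prints: 14.1ms vs 204.9ms -> 93% reduction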
jeremyjpj0916 commented 5 years ago

@thibaultcha We actually confirmed a slight issue with your patch logic. SSL connections to the backend do reuse a keepalive connection, and ephemeral ports in TIME_WAIT stay low (20-30). When the upstream is plain HTTP, however, I can still trigger port exhaustion over http://, and TIME_WAIT skyrocketed to 26,855. Might there be something in the logic that makes non-SSL upstreams ineligible for keepalive?

We confirmed this by doing some high-volume testing against a fast-http server stood up on a VM, with TLS exposed on port 443 and plain HTTP on 8080.

rsbrisci commented 5 years ago

Just to add - we did test our HTTP/8080 upstream from a Kong instance not running your patches @thibaultcha - and confirmed that Keepalive functions correctly without the patches.

Also, since I don't think I've done it yet - THANK YOU SO MUCH FOR THIS PATCH!! Issues or no, it's still an amazing improvement on a few different fronts.

thibaultcha commented 5 years ago

I think I know what the issue is - should be an easy fix if I am right, but I will likely only have time to address it on Thursday. Thanks for reporting it!

jeremyjpj0916 commented 5 years ago

Glad to hear you think it's an easy change; we will await the updated patch in the PR itself or as a comment here and give it another test. Appreciate the update!

thibaultcha commented 5 years ago

Had some time to look at this today. I could not reproduce the behavior you described, but nonetheless, I updated the patch with the fix I had in mind (plus a couple of minor tweaks).

Also keep in mind that the pool size value given to the keepalive directive covers both plain-text and TLS keepalive connections, so make sure that this pool is large enough for your use case. And consider that, due to SNI cardinality, the pool may fill up much faster than it used to before the patch (e.g. a single HTTPS upstream 10.0.0.1:443 will now take up to n slots in the keepalive pool, n being the number of services it serves, aka n SNIs). So where before the pool held up to ip * ports distinct entries, we now have up to ip * ports * snis.

Let me know how this version of the patch goes with regards to those plain text keepalive connections.

From c47af9e7517f8bc7331539eb3f919581c432798d Mon Sep 17 00:00:00 2001
From: Thibault Charbonnier <thibaultcha@me.com>
Date: Tue, 2 Oct 2018 14:53:21 -0700
Subject: [PATCH] Upstream keepalive: add pool_ssl_name option

This patch adds an option to the 'keepalive' directive of the upstream
keepalive module. This option is only valid when Nginx is compiled with
SSL support. When enabled as such:

    keepalive pool_ssl_name=on;

The upstream keepalive module stores the SNI (Server Name Indication)
for each upstream connection made over SSL, and ensures that connection
reuse can only take effect when using the original SNI the connection
was originally opened with.

In practice, this enables use cases such as:

    proxy_ssl_server_name on;

    upstream my_upstream {
        server ...;

        keepalive 60 pool_ssl_name=on;
    }

    server {
        listen 8000;

        location / {
            proxy_ssl_name     $http_host;
            proxy_http_version 1.1;
            proxy_set_header   Host $http_host;
            proxy_set_header   Connection '';
            proxy_pass         https://my_upstream;
        }
    }

Where the use of 'pool_ssl_name=on' ensures that two proxied requests with
a different Host header are only made over connections matching the
requested SNI.

The 'connections' parameter of the 'keepalive' directive still
determines the maximum number of upstream idle connections when this
option is enabled, regardless of their SNI.
---
 .../ngx_http_upstream_keepalive_module.c      | 64 +++++++++++++++++++
 src/http/ngx_http_upstream.c                  | 48 ++++++++++----
 2 files changed, 100 insertions(+), 12 deletions(-)

diff --git a/src/http/modules/ngx_http_upstream_keepalive_module.c b/src/http/modules/ngx_http_upstream_keepalive_module.c
index 0048e6bc..a761f4c8 100644
--- a/src/http/modules/ngx_http_upstream_keepalive_module.c
+++ b/src/http/modules/ngx_http_upstream_keepalive_module.c
@@ -19,6 +19,10 @@ typedef struct {
     ngx_http_upstream_init_pt          original_init_upstream;
     ngx_http_upstream_init_peer_pt     original_init_peer;

+#if (NGX_HTTP_SSL)
+    ngx_flag_t                         pool_ssl_name;
+#endif
+
 } ngx_http_upstream_keepalive_srv_conf_t;

@@ -31,6 +35,10 @@ typedef struct {
     socklen_t                          socklen;
     ngx_sockaddr_t                     sockaddr;

+#if (NGX_HTTP_SSL)
+    uint32_t                           ssl_name_hash;
+#endif
+
 } ngx_http_upstream_keepalive_cache_t;

@@ -47,6 +55,7 @@ typedef struct {
 #if (NGX_HTTP_SSL)
     ngx_event_set_peer_session_pt      original_set_session;
     ngx_event_save_peer_session_pt     original_save_session;
+    uint32_t                           ssl_name_hash;
 #endif

 } ngx_http_upstream_keepalive_peer_data_t;
@@ -78,7 +87,11 @@ static char *ngx_http_upstream_keepalive(ngx_conf_t *cf, ngx_command_t *cmd,
 static ngx_command_t  ngx_http_upstream_keepalive_commands[] = {

     { ngx_string("keepalive"),
+#if (NGX_HTTP_SSL)
+      NGX_HTTP_UPS_CONF|NGX_CONF_TAKE1|NGX_CONF_TAKE2,
+#else
       NGX_HTTP_UPS_CONF|NGX_CONF_TAKE1,
+#endif
       ngx_http_upstream_keepalive,
       NGX_HTTP_SRV_CONF_OFFSET,
       0,
@@ -194,6 +207,14 @@ ngx_http_upstream_init_keepalive_peer(ngx_http_request_t *r,
     r->upstream->peer.free = ngx_http_upstream_free_keepalive_peer;

 #if (NGX_HTTP_SSL)
+    if (kcf->pool_ssl_name && r->upstream->ssl)  {
+        kp->ssl_name_hash = ngx_crc32_short(r->upstream->ssl_name.data,
+                                            r->upstream->ssl_name.len);
+
+    } else {
+        kp->ssl_name_hash = 0;
+    }
+
     kp->original_set_session = r->upstream->peer.set_session;
     kp->original_save_session = r->upstream->peer.save_session;
     r->upstream->peer.set_session = ngx_http_upstream_keepalive_set_session;
@@ -240,10 +261,22 @@ ngx_http_upstream_get_keepalive_peer(ngx_peer_connection_t *pc, void *data)
                          item->socklen, pc->socklen)
             == 0)
         {
+
+#if (NGX_HTTP_SSL)
+            if (item->ssl_name_hash == 0
+                || item->ssl_name_hash == kp->ssl_name_hash)
+            {
+#endif
+
             ngx_queue_remove(q);
             ngx_queue_insert_head(&kp->conf->free, q);

             goto found;
+
+#if (NGX_HTTP_SSL)
+            }
+#endif
+
         }
     }

@@ -359,6 +392,10 @@ ngx_http_upstream_free_keepalive_peer(ngx_peer_connection_t *pc, void *data,
     item->socklen = pc->socklen;
     ngx_memcpy(&item->sockaddr, pc->sockaddr, pc->socklen);

+#if (NGX_HTTP_SSL)
+    item->ssl_name_hash = kp->ssl_name_hash;
+#endif
+
     if (c->read->ready) {
         ngx_http_upstream_keepalive_close_handler(c->read);
     }
@@ -485,6 +522,10 @@ ngx_http_upstream_keepalive_create_conf(ngx_conf_t *cf)
      *     conf->max_cached = 0;
      */

+#if (NGX_HTTP_SSL)
+    conf->pool_ssl_name = NGX_CONF_UNSET;
+#endif
+
     return conf;
 }

@@ -525,5 +566,28 @@ ngx_http_upstream_keepalive(ngx_conf_t *cf, ngx_command_t *cmd, void *conf)

     uscf->peer.init_upstream = ngx_http_upstream_init_keepalive;

+#if (NGX_HTTP_SSL)
+    kcf->pool_ssl_name = 0;
+
+    if (cf->args->nelts == 3) {
+        if (ngx_strncmp(value[2].data, "pool_ssl_name=", 14) == 0) {
+            if (ngx_strcmp(&value[2].data[14], "on") == 0) {
+                kcf->pool_ssl_name = 1;
+                goto done;
+
+            } else if (ngx_strcmp(&value[2].data[14], "off") == 0) {
+                goto done;
+            }
+        }
+
+        ngx_conf_log_error(NGX_LOG_EMERG, cf, 0,
+                           "invalid value \"%V\" in \"%V\" directive",
+                           &value[2], &cmd->name);
+        return NGX_CONF_ERROR;
+    }
+
+done:
+#endif
+
     return NGX_CONF_OK;
 }
diff --git a/src/http/ngx_http_upstream.c b/src/http/ngx_http_upstream.c
index 2ea521b0..a0518ed2 100644
--- a/src/http/ngx_http_upstream.c
+++ b/src/http/ngx_http_upstream.c
@@ -185,7 +185,9 @@ static void ngx_http_upstream_ssl_init_connection(ngx_http_request_t *,
 static void ngx_http_upstream_ssl_handshake_handler(ngx_connection_t *c);
 static void ngx_http_upstream_ssl_handshake(ngx_http_request_t *,
     ngx_http_upstream_t *u, ngx_connection_t *c);
-static ngx_int_t ngx_http_upstream_ssl_name(ngx_http_request_t *r,
+static ngx_int_t ngx_http_upstream_set_ssl_name(ngx_http_request_t *r,
+    ngx_http_upstream_t *u);
+static ngx_int_t ngx_http_upstream_ssl_name_conn(ngx_http_request_t *r,
     ngx_http_upstream_t *u, ngx_connection_t *c);
 #endif

@@ -776,6 +778,12 @@ found:

 #if (NGX_HTTP_SSL)
     u->ssl_name = uscf->host;
+
+    if (u->ssl && ngx_http_upstream_set_ssl_name(r, u) != NGX_OK) {
+        ngx_http_upstream_finalize_request(r, u,
+                                           NGX_HTTP_INTERNAL_SERVER_ERROR);
+        return;
+    }
 #endif

     if (uscf->peer.init(r, uscf) != NGX_OK) {
@@ -1649,7 +1657,7 @@ ngx_http_upstream_ssl_init_connection(ngx_http_request_t *r,
     u->output.sendfile = 0;

     if (u->conf->ssl_server_name || u->conf->ssl_verify) {
-        if (ngx_http_upstream_ssl_name(r, u, c) != NGX_OK) {
+        if (ngx_http_upstream_ssl_name_conn(r, u, c) != NGX_OK) {
             ngx_http_upstream_finalize_request(r, u,
                                                NGX_HTTP_INTERNAL_SERVER_ERROR);
             return;
@@ -1765,12 +1773,15 @@ failed:

 static ngx_int_t
-ngx_http_upstream_ssl_name(ngx_http_request_t *r, ngx_http_upstream_t *u,
-    ngx_connection_t *c)
+ngx_http_upstream_set_ssl_name(ngx_http_request_t *r, ngx_http_upstream_t *u)
 {
     u_char     *p, *last;
     ngx_str_t   name;

+    if (!u->conf->ssl_server_name && !u->conf->ssl_verify) {
+        return NGX_OK;
+    }
+
     if (u->conf->ssl_name) {
         if (ngx_http_complex_value(r, u->conf->ssl_name, &name) != NGX_OK) {
             return NGX_ERROR;
@@ -1806,20 +1817,37 @@ ngx_http_upstream_ssl_name(ngx_http_request_t *r, ngx_http_upstream_t *u,
         name.len = p - name.data;
     }

+done:
+
+    u->ssl_name = name;
+
+    return NGX_OK;
+}
+
+
+static ngx_int_t
+ngx_http_upstream_ssl_name_conn(ngx_http_request_t *r, ngx_http_upstream_t *u,
+    ngx_connection_t *c)
+{
+#ifdef SSL_CTRL_SET_TLSEXT_HOSTNAME
+
+    u_char     *p;
+    ngx_str_t   name;
+
     if (!u->conf->ssl_server_name) {
-        goto done;
+        return NGX_OK;
     }

-#ifdef SSL_CTRL_SET_TLSEXT_HOSTNAME
+    name = u->ssl_name;

     /* as per RFC 6066, literal IPv4 and IPv6 addresses are not permitted */

     if (name.len == 0 || *name.data == '[') {
-        goto done;
+        return NGX_OK;
     }

     if (ngx_inet_addr(name.data, name.len) != INADDR_NONE) {
-        goto done;
+        return NGX_OK;
     }

     /*
@@ -1850,10 +1878,6 @@ ngx_http_upstream_ssl_name(ngx_http_request_t *r, ngx_http_upstream_t *u,

 #endif

-done:
-
-    u->ssl_name = name;
-
     return NGX_OK;
 }

-- 
2.20.1
jeremyjpj0916 commented 5 years ago

@thibaultcha appreciate the prompt updates! Will be giving this a go tonight and into tomorrow and will share the results as soon as we have them. Thanks again.

We have been running an upstream keepalive pool size of 60 in our test environments for this, which as I understand it is the number of idle keepalive connections allowed at any moment in time. That seems reasonable for our current number of unique ip:port:sni combinations called during a given hour or so.

jeremyjpj0916 commented 5 years ago

I was too excited, so I tested it over the last 45 minutes. Well @thibaultcha, you one-shotted it 🔥. It is working perfectly now at 1.1k TPS, with a 10-thread, 10-connection wrk client against our dev box for 10 minutes:

/ $ netstat -a | grep "TIME_WAIT" | wc -l
26
/ $ netstat -a | grep "TIME_WAIT" | wc -l
26
/ $ netstat -a | grep "TIME_WAIT" | wc -l
27
/ $ netstat -a | grep "TIME_WAIT" | wc -l
27
/ $ netstat -a | grep "TIME_WAIT" | wc -l
24
/ $ netstat -a | grep "TIME_WAIT" | wc -l
22
/ $ netstat -a | grep "ESTABLISHED" | wc -l
50
/ $ netstat -a | grep "ESTABLISHED" | wc -l
50
/ $ netstat -a | grep "ESTABLISHED" | wc -l
49
/ $ netstat -a | grep "ESTABLISHED" | wc -l
51
/ $ netstat -a | grep "TIME_WAIT" | wc -l
25
/ $ netstat -a | grep "TIME_WAIT" | wc -l
25
/ $

Will continue testing more tomorrow to 100% confirm but I think this one does it!

jeremyjpj0916 commented 5 years ago

Update: 100% confirmed the new patch provided in the comment fixed it for http and https traffic. Will update my PR accordingly.

thibaultcha commented 5 years ago

@jeremyjpj0916 The patch will ship in a different shape to this repository (and it will also ship in OpenResty, eventually - maybe in the upcoming 1.17 release).

thibaultcha commented 5 years ago

@jeremyjpj0916 And thanks for testing the revised patch, of course!

jeremyjpj0916 commented 5 years ago

@thibaultcha Would you prefer me to close this PR then as you have the ammo needed, or just leave it open until Kong has time to pull the trigger on the additional option in the standard build?

thibaultcha commented 5 years ago

I don't have a preference, this PR is fine for communication/visibility purposes for the moment :)

jeremyjpj0916 commented 5 years ago

Wanted to make one more awesome note here: we deployed this to production last night, and it has really given us a boost on the p99 latency we were seeing in Kong, likely caused by upstream TLS handshakes and reconnects:

[chart: blue = TPS through the gateway, red = Kong gateway latency]

What you see here is TPS through the gateway in blue, with red being Kong gateway latency. Before, we were seeing 200+ ms spikes of gateway latency as traffic increased during the day; now it holds around 30 ms after the keepalive patch. What a big win!

thibaultcha commented 5 years ago

@jeremyjpj0916 Awesome! Thanks for sharing your results (and congrats/thank you on the production deployment as well :clap:). Reiterating that we'll ship this patch in upcoming OpenResty releases, and in coming Kong releases as well in the meantime. Cheers!

jeremyjpj0916 commented 5 years ago

Dropping these full C files in as a reference here, for anyone who needs the additional logic discussed in this topic for Kong 1.3 / OpenResty 1.15.8.1 / NGINX 1.15.8:

nginx-1.15.8-ngx_http_upstream.c.txt nginx-1.15.8-ngx_http_upstream_keepalive_module.c.txt

-Jeremy

thibaultcha commented 5 years ago

I just submitted a patch to the nginx-devel mailing list which addresses this issue in a more holistic way: http://mailman.nginx.org/pipermail/nginx-devel/2019-August/012580.html.

It allows for solving other use-cases and comes with a few bells and whistles. Additionally, it can be leveraged by OpenResty to be dynamically configurable in balancer_by_lua_* (I already have a working PoC of the Lua API locally, implemented as ngx_balancer.set_upstream_pool_key(key)).

While I truly believe the patch to be both safe and a beneficial step forward, I doubt that the patch will be accepted by the NGINX maintainers. If not, then we'll port a minimal version of it to OpenResty's maintained NGINX patch and will implement the above Lua API on top of it.
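
For illustration only, here is a hypothetical sketch of how such a Lua API might look inside balancer_by_lua_*; set_upstream_pool_key is just the PoC name mentioned above, not a shipped function, and the peer values are placeholders:

    -- Hypothetical balancer_by_lua_block body (PoC API, not shipped).
    local balancer = require "ngx.balancer"

    local ip, port, sni = "10.96.130.74", 443, "myapp1.company.com"  -- placeholders

    local ok, err = balancer.set_current_peer(ip, port)
    if not ok then
        ngx.log(ngx.ERR, "failed to set the current peer: ", err)
        return ngx.exit(500)
    end

    -- key the upstream keepalive pool by ip:port:sni instead of just ip:port
    balancer.set_upstream_pool_key(ip .. ":" .. port .. ":" .. sni)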

jeremyjpj0916 commented 5 years ago

@thibaultcha Really glad to hear that! Any timeline you would give for how long you will wait on the NGINX team to respond before deciding to carry the torch from OpenResty land? 1-2 months?

Also, I was pondering the effects of how we are running this keepalive patch right now with Kong's LB functionality. I know y'all added support for active health checks to include the FQDN along with ip:port, so active checks can target different backends behind, say, a given F5 fronting a cloud cluster. That seems to be working even when going to the same ip:port for two different services' SNIs.

The health checker uses https://github.com/ledgetech/lua-resty-http, so I am wondering whether there is any risk around keepalive connections for health checks being pooled by ip:port. It seems no set_keepalive is called, which may have been intentional or a stroke of good luck (a good thing if this HTTP client does not pool by SNI + ip + port; if it could pool by SNI, then keepalive for health checking could be added later as a performance improvement when Kong manages those connections). https://github.com/Kong/lua-resty-healthcheck/blob/master/lib/resty/healthcheck.lua#L757

The reverse proxying itself will still use proxy_pass and NGINX's keepalive management with the patch, so that should be fine from that perspective.

Again, really impressive work; thanks for putting this together. I will try to put your patch in front of people who have strong connections with F5/NGINX to see if traction can be made.

thibaultcha commented 5 years ago

Any timeline you would give to how long you will wait on NGINX team to respond vs deciding to carry the torch from OpenResty land?

Not too long; I am already developing the OpenResty bits, and intend on including them in the upcoming 1.17.x.1 release. The NGINX maintainers will comment on the patch in a matter of days at most, I have no doubts.

If any issue arises around connection pools from the cosocket API, it can be addressed by fixes on our end (i.e. by including the SNI in the connection pool key option) and will mostly be due to mis-usage of the cosocket API. In other words, the cosocket API already has a connection pool key mechanism like the one proposed by the above patch.
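
For illustration, here is a minimal sketch of the existing cosocket pool key mechanism with the SNI included in the key (host, port, and SNI values are placeholders):

    -- Key the cosocket idle-connection pool by ip:port:sni so two services behind
    -- the same edge ip:port never share a TLS keepalive connection.
    local sock = ngx.socket.tcp()
    local host, port, sni = "10.96.130.74", 443, "myapp1.company.com"  -- placeholders

    local ok, err = sock:connect(host, port, { pool = host .. ":" .. port .. ":" .. sni })
    if not ok then
        ngx.log(ngx.ERR, "connect failed: ", err)
        return
    end

    local session, err = sock:sslhandshake(nil, sni, true)
    if not session then
        ngx.log(ngx.ERR, "TLS handshake failed: ", err)
        return
    end

    -- ... send the health check / request here ...

    sock:setkeepalive(60000, 60)  -- idle connection returns to the SNI-specific pool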

jeremyjpj0916 commented 5 years ago

Glad to hear on the time-frame, I always dig Kong's aggressive timelines 🚀 .

Also good news on the cosocket API; sounds like all systems go on fixing this use case as a whole. Forwarded your patch to people who can hopefully help push this onto NGINX's plate. It's an interesting routing case, and I am glad you had a fix in mind and we could validate it in the wild over here. Enjoy the long weekend.

jeremyjpj0916 commented 5 years ago

@thibaultcha Welp, looks like you were right. Their reply:

No, thank you.  The issues as observed in the tickets linked 
should be resolved by using distinct upstream blocks instead.

At least you tried? 😄 Sounds like OpenResty or Kong patching is the only way to go here. Probably best in OpenResty, because I think this is an important routing feature in modern architectures, and projects other than just Kong can benefit from it if it lives at the OpenResty platform layer.

thibaultcha commented 5 years ago

@jeremyjpj0916 Indeed. This is already ongoing.

thibaultcha commented 5 years ago

@jeremyjpj0916 And yeah, at least we tried, even though I was already pretty sure that the answer would be no... If you ask me, this truly is a pity, given the patch fixed a number of known issues with no breaking behavior change, all while aligning with typical NGINX idioms. Anyway. At this point, I am re-implementing the keepalive pooling mechanism straight into ngx_http_lua instead of patching NGINX. This way will actually be a lot more maintainable and much more dynamic. Their loss.

jeremyjpj0916 commented 5 years ago

@thibaultcha Yeah, I don't think the answer of "make a ton of separate upstream blocks" is really a great solution, and you put a lot of effort into explaining the patch and the code changes around it, only to get such a terse reply with little reasoning and no suggestions for modifications that would get it approved and merged. Say someone has TLS ingress where the SNI on the handshake dictates routing to a cloud cluster fronting 500 various micro-services, and they want keepalive. Do they expect folks to maintain a static nginx.conf with 500 separate upstream blocks defining those individual SNIs for reverse proxying? That feels extremely messy and bulky. +1 for OpenResty though, and for all apps leveraging it as you said: a great feature being introduced that NGINX won't be taking advantage of.

Thanks again for your diligence!

thibaultcha commented 5 years ago

@jeremyjpj0916 Here is the solution I am proposing for solving this limitation in OpenResty: https://github.com/openresty/lua-nginx-module/pull/1600. Not only does it address the issue at hand, it also means that now we'll be able to granularly customize upstream keepalive connection pools on a per-service basis, and even per-connection :)

jeremyjpj0916 commented 5 years ago

@thibaultcha Hopefully the group at OpenResty agrees the changes are worthwhile. I certainly see this as a huge routing-pattern win and a positive move for flexibility. You being one of the maintainers means there is already one person at the table who has seen the benefits first hand and is well versed in the issue 😀.

Essentially it seems the added opts for a pool will enable pooling by ip:port:sni rather than by ip:port, which is the existing standard, as well as a pool_size, which can help in tuning how many keepalive connections need to be maintained with a given backend.

Seems so from glancing here https://github.com/openresty/lua-resty-core/pull/276/files#diff-c6d3d61f52132e153660e7832e95b88aR367

We would personally want to enable the SNI pooling by default on every HTTPS transaction, so we will probably figure out a way to tune that into our current Kong template when the time rolls around. But the built-in granularity will certainly help other use cases besides our own, I am sure!

Appreciate the status update!

thibaultcha commented 5 years ago

@jeremyjpj0916 Yeah, I discussed implementing this change with agentzh and we are in agreement over it, no problem on that front.

As for the Kong side of things, it would be Kong's responsibility to create and use appropriate connection pools, and we'll make sure everything is tight there (e.g. considering the SNI, but also the TLS client cert in recent Kong versions supporting upstream mTLS). The only options that should be configurable from a user's perspective would be the pool size, idle timeout, and max requests settings. How does that sound to you?

jeremyjpj0916 commented 5 years ago

@thibaultcha So is the idea that if the scheme is https, Kong will automatically account for the SNI in the connection pooling as safe practice? Or is the thinking to expose the option to pool per service on a case-by-case basis? (I think the latter is what you originally envisioned, but the:

it would be Kong's responsibility to create and use appropriate connection pools, and we'll make sure everything is tight there (e.g. considering the SNI, but also the TLS client cert in recent Kong versions supporting upstream mTLS)

comment gave me a second pause.) While I think there is no harm in pooling by SNI for any https-scheme backend, a user may not want to, which is why I sadly think it has to be user driven rather than "well, this will certainly work for you"; for instance, it may not be necessary when all their https:// services actually do live on different ip:port pairs 😆.

Thinking about it from my perspective, I am just trying to work through how we will go from:

  upstream kong_upstream {
      server 0.0.0.1;

      balancer_by_lua_block {
          Kong.balancer()
      }

> if upstream_keepalive > 0 then
    #keepalive_timeout 60s; DEFAULT
    #keepalive_requests 100; DEFAULT
    keepalive_requests 10000;
    keepalive ${{UPSTREAM_KEEPALIVE}} pool_ssl_name=on;
> end
}

(I am aware we can do these keepalive configs per service already too, but sometimes it's nice just to set a global and forget it, eh 😃.)

To being able to support it in the new Kong, however that will look, going back to leveraging your Bintray source.

Which will mean adjusting the template, because Kong will then support it granularly per service, and the pool_ssl_name=on config will be a thing of the past. That granularity will be great for existing consumers, but to be safe I would likely want to enable it globally by default for all HTTPS services. The way I see it playing out: we upgrade, the pooling becomes optional on all services, but during runtime we want to make sure we don't break things even temporarily. So my thought is that in the upstream block I will want a way to override the per-service config logic, so that if the scheme is https Kong just assumes I want pooling by SNI when deciding which keepalive connection to establish and reuse. Then, once I am on the new upgraded Kong with that global override in place (for the time being), I can write a snippet of code to edit all existing service resources with the proper updated settings that give me the keepalive behavior we run now.

Does that concern make sense? For most Kong users it won't be this complicated, since it is new territory and not a breaking change for them, but I am in a unique spot here. Maybe I am overcomplicating things and there is a simpler alternative, but I am not seeing it in the late hours out here in SC. I still have slight concerns, even with my current idea, about taking active proxy traffic and publishing changes in a loop to all service resources 😆; router rebuild PTSD, man. I suppose a change to a service does not initiate a full router rebuild though, but rather a cluster event invalidation on the service id in question to pick up the new config change.

thibaultcha commented 5 years ago

@jeremyjpj0916 The way I envision this for now would be the following:

  1. Kong should implement new global upstream keepalive configuration properties (kong.conf + env variables), namely to configure the pool size, idle timeout, and max requests attributes. This has been requested by several users and customers including yourself already and it makes sense.
  2. When Kong connects to any upstream service via HTTPS, it must pool by <ip>:<port>:<sni?>:<client_cert?> (attributes marked with ? are only included in the pool key when relevant for a given connection; see the sketch after this list). If all of one's upstream services are deployed on different <ip>:<port> tuples as you mention, then this strategy should still result in the same number of keepalive pools (i.e. the <sni> component will be the same for all connections to a given service, e.g. <ip-A>:<port-A>:<service-a.com> and <ip-B>:<port-B>:<service-b.com>: two pools for two services). Do you see a different use-case here?
  3. Finally, the Service entity of Kong should receive new attributes to customize the pool size, idle timeout, and max requests values, if a user wishes to override the aforementioned global settings.
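
A tiny conceptual illustration of the pool key described in point 2 (not actual Kong code; optional components are only appended when relevant):

    -- Conceptual pool-key construction for point 2 above.
    local function upstream_pool_key(ip, port, sni, client_cert_digest)
        local key = ip .. ":" .. port
        if sni then
            key = key .. ":" .. sni
        end
        if client_cert_digest then
            key = key .. ":" .. client_cert_digest
        end
        return key
    end

    -- Two HTTPS services on distinct edges still yield one pool each:
    -- upstream_pool_key("10.0.0.1", 443, "service-a.com") --> "10.0.0.1:443:service-a.com"
    -- upstream_pool_key("10.0.0.2", 443, "service-b.com") --> "10.0.0.2:443:service-b.com"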

In your case, specifying the new global upstream keepalive properties and updating the nginx.conf template to remove all keepalive* directives will be all you should need when upgrading Kong.

Then once I am on the new upgraded Kong with global override in place(for the time being), I can then write a snippit of code to edit all existing service resources to have the proper updated settings that will give me the right keepalive behavior we run now.

Am I understanding correctly here that you mean updating specific Services to customize their keepalive attributes individually? Or did you mean something else?

Also a side note: ideally, updating those attributes for a Service should not trigger a router rebuild. We'll see if we can do something about this when implementing them.

In light of these new developments, I am happy to continue this conversation for as long as you wish, please just make sure to eventually close the PR, thanks!

jeremyjpj0916 commented 5 years ago

@thibaultcha

You nailed it: the global config option covers our use case exactly as you stated (unless the point I bring up below on migrations causes an issue). If we can do that, then we won't need to override any existing service-level resources, and Kong will behave exactly as it does now for us. Really glad the global option is being considered here as well as the granularity (best of both worlds, imo).

I won't need to write a script to update Kong service resources in the global scenario, as long as the upgrade does not default all services to settings that would prevent the global override pooling from taking effect (your point 3 above). Could the fields for existing services just start out as nil post-migration, which would let the default NGINX behavior (or my global overrides) on pooling take effect? That seems like the best scenario to me.

Looking forward to not building Kong from scratch 😀, although it has been fun to dig into that side of things while this solution has been in the works.

Is it proper to just add a "Fixes #" reference to your OpenResty PR pointing at this GitHub issue, so that when it is merged it auto-closes this one?

Sounds like this will be a Kong 1.5 thing or so?

thibaultcha commented 5 years ago

Here is a tangentially related feature request for a global timeout configuration property which pretty much falls under the same umbrella: https://github.com/Kong/kong/issues/5061

Could the fields for existing services just start out as nil post migration

Yes, that's the idea :+1:

Is it proper to just add to your Openresty PR a Fixes : # value pointing to this github issue so when merged it auto closes this one?

I much prefer keeping OpenResty and Kong issues/PRs clearly separated from each other, if possible.

Sounds like this will be a Kong 1.5 thing or so?

More like Kong 2.0 or 2.1. Still working on OpenResty 1.17.x.1 for the time being :)

jeremyjpj0916 commented 5 years ago

@thibaultcha gotcha, this answers all my questions. Closing this up then and will follow up when it comes around. Thanks!