Update reverse proxy recommendations for Argus and Colossus to support HTTP/2

As established in https://github.com/Joystream/joystream/issues/3808, we should update the documentation of Argus and Colossus nodes to include recommended reverse-proxy setup that supports HTTP/2 connections.

Note that there is already a section that includes an example Nginx configuration in the Argus docs: https://github.com/Joystream/joystream/blob/master/distributor-node/docs/node/index.md#example-nginx-configuration, but it's for HTTP 1.1

Scope of work:

Research possible reverse-proxy settings for at least Nginx and Caddy (+ possibly other reverse-proxy servers) related to HTTP/2
Run some benchmarks to establish the benefits of switching from HTTP 1.1 to HTTP/2 for reverse-proxy connections for both Colossus and Argus, using different reverse-proxy servers and configurations
Come up with a set of optimal, recommended reverse-proxy settings for the nodes, including request/connection limits that would mitigate potential DoS attacks
Update the documentation of both Argus and Colossus w/ the recommended reverse-proxy configuration for HTTP/2

Benchmarking:

Setup

For the purpose of benchmarking the minimum recommended setup is:

Machine A: running joystream node + query node + storage node
Machine B: running joystream node + query node + distributor node
Machine C: running the benchmarks

Provided that:

There is non-zero latency between A-B and B-C (the machines need to be geographically spread out)
Joystream nodes on machine A and B form a single blockchain
Query node on machines A and B connects to a local joystream node and distributor/storage node connect to a local query node

Storage and distribution buckets needs to be set up. In order to do this, following commands can be executed (either on Machine A or Machine B):

export COLOSSUS_1_URL="https://your-storage-node-public-url.com"
export DISTRIBUTOR_1_URL="https://your-distributor-node-public-url.com"

yarn workspace network-tests run-test-scenario initStorageAndDistribution

Now, provided you're already running local joystream and query-node via docker-compose on both machine A and machine B, you can start storage/distributor nodes via:

Machine A:

docker-compose up -d colossus-1

Machine B:

docker-compose up -d distributor-1

In order to do the benchmarking you'll need some mock data objects. To generate them, execute the following command on either machine A or B:

cd distributor-node
./bin/run dev:batchUpload -B 0 -C 10 -S 100 -b static:council -e https://your-storage-node-public-url.com -y

Browser benchmarking

In order to do some benchmarking within a browser, a simple html page can be created:

<html>
    <body>
        <div></div>
        <script>
            let content = ''
            for(i=0; i<100; ++i) {
                content += `<img src="https://your-distributor-node-public-url.com/api/v1/assets/${i}" />`
            }
            document.getElementsByTagName('div')[0].innerHTML = content;
            let start = Date.now();
            window.onload = () => alert(`Loading took: ${Date.now() - start}`);
        </script>
    </body>
</html>

This is just an example containing a script that insert 100 <img> elements into the page (forcing the browser to request 100 different assets from the distributor node) and measure the time it took to load the page.

Remember that the browser cache needs to be disabled or cleared each time after loading the page to prevent client-side caching from affecting the results.

Remember that distributor node also has it's own cache, so on the first page load it will fetch all the assets from the storage node, but on the second page load it will serve the assets from it's own cache. Both ways of serving the assets should be benchmarked (in order to test non-cached cases, either the Argus cache needs to be cleaned, or different assets requested each time).

I recommend to run multiple tests of each scenario and calculate the average/median result, for example:

5x fetch non-cached assets from the distributor node with Client --http/2--> Argus --http/2--> Colossus
5x fetch non-cached assets from the distributor node with Client --http/2--> Argus --http/1.1--> Colossus
5x fetch non-cached assets from the distributor node with Client --http/1.1--> Argus --http/2--> Colossus
5x fetch non-cached assets from the distributor node with Client --http/1.1--> Argus --http/1.1--> Colossus
5x fetch cached assets from the distributor-node with Client --http/1.1--> Argus
5x fetch cached assets from the distributor-node with Client --http/2 --> Argus

Calculate the averages/medians and compare the results.

Command line benchmarking

Tools like h2load can be used to perform benchmarking outside of the browser context:

sudo apt-get install nghttp2-client
URLS=`
for ASSET in {0..1000}
do
    echo "https://your-distributor-node-public-url.com/api/v1/assets/${ASSET}"
done
`
h2load -n10000 -c10 -m20 ${URLS}       # HTTP/2 used by default
h2load -n10000 -c10 -m20 --h1 ${URLS}  # --h1 forces HTTP/1.1

In this example we're requesting the same 1000 assets (0-999) from distributor node over 10 separate connections, using --max-concurrent-streams=20 (for HTTP/1.1 max concurrent streams specifies the number of pipelining requests in-flight), first using http/2 and then http/1.1.

This is a great write up. I propose we make modification to colossus and argus to run in a mode where they do not need to communicate with either a chain rpc endpoint or a query node, by abstracting the communication layer with those services, making it possible to mock that state. This greatly reduces the complexity of deploying a setup for this bench-marking exercise and other future general tests we want to add perform.

Recommendation

Do not use http/2 on edge (reverse_proxy/load balancer) when the upstream hosts only support http/1 (current version of node express.js framework we use for all our server applications). It introduces latency and unusual request bursts to upstreams.

For caddy the server protocols option must be set to only use h1

# Caddyfile global section
{
        servers {
                # Disable http v2 and v3
                protocols h1
        }
}

Example usage https://github.com/mnaamani/storage-production/blob/main/Caddyfile

I found an article that articulates this very well and describes exactly the behavior we see in Gleev and more recently in Pioneer (when we switched QueryNode to use caddy as loadbalancer)

“On closer inspection, we realized that the average number of requests was the same as usual, but the actual flow of requests had become spikier. Instead of a steady flow of requests, we were getting short bursts of a lot of requests. Although we had overprovisioned capacity based on previous traffic patterns, it wasn’t enough to deal with the new request spikes, so responses to requests were delayed and timed out.”

source: https://www.lucidchart.com/techblog/2019/04/10/why-turning-on-http2-was-a-mistake/

Why?

We recently switched our setup for the public query-node endpoint from a simple load balancer (which only supports http/1) to using Caddy as a reverse proxy (caddy by default supports http 1, 2 and 3). There was no immediate noticeable problems. Although there we several complaints that pioneer was usually slow.

After a bug was found in the query-node processor that required patching and re-deploying, several Distributors and Storage providers switched from using their own query-nodes to using the public endpoint while they updated their query-nodes. This made the latency even worse.

After a week long investigation looking at multiple layers in the stack (the graphql-server, connection pooling to postgres database, database query analysis, proxy configuration and health checks config, tweaking machine specs, tuning kernel parameters) I was not able to pinpoint the root cause.

The main problem I was see was that the database was getting large spikes in queries every few seconds, and the load balancer had on average of 1000 concurrent requests.

I finally decided to switch away from the new setup. Changed cloud provider, geographic location, load balancer (that only supports http/1) Everthing started working fine. There was a much lower number of concurrent requests, and latency was gone.

At first I attributed it to better hosting provider and data-center and didn't think much of the fact that the http protocol change could have had a major impact.

Its unlikely that at the moment I changed the setup that a large number of users/clients requests just went away. I'm 99% confident however that using http2 on the edge was root cause of latency. And this applies to all public endpoints including Storage, Distribution, Query-Node, Orion, Faucet.

I will conduct a benchmark test to confirm, but I believe updating all reverse proxy setups (at least the ones using that Caddy) to only use http/1 could prove very impactful.

Initial test run of tests I did seem to actually work against my recommendation.

k6 run --vus 5 --duration 20s bench.js 

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: bench.js
     output: -

  scenarios: (100.00%) 1 scenario, 5 max VUs, 50s max duration (incl. graceful stop):
           * default: 5 looping VUs for 20s (gracefulStop: 30s)

     ✓ status is 200
     ✓ protocol is HTTP/1
     ✗ protocol is HTTP/2
      ↳  0% — ✓ 0 / ✗ 461

     checks.........................: 66.66% ✓ 922       ✗ 461
     data_received..................: 2.0 MB 97 kB/s
     data_sent......................: 249 kB 12 kB/s
     http_req_blocked...............: avg=4.3ms    min=2µs      med=5µs      max=433.81ms p(90)=9µs      p(95)=10µs    
     http_req_connecting............: avg=2.09ms   min=0s       med=0s       max=212.42ms p(90)=0s       p(95)=0s      
     http_req_duration..............: avg=213.77ms min=153.02ms med=220.01ms max=626.99ms p(90)=239.06ms p(95)=253.06ms
       { expected_response:true }...: avg=213.77ms min=153.02ms med=220.01ms max=626.99ms p(90)=239.06ms p(95)=253.06ms
     http_req_failed................: 0.00%  ✓ 0         ✗ 461
     http_req_receiving.............: avg=346.85µs min=31µs     med=118µs    max=3.61ms   p(90)=1.34ms   p(95)=1.84ms  
     http_req_sending...............: avg=34.54µs  min=11µs     med=31µs     max=201µs    p(90)=50µs     p(95)=58µs    
     http_req_tls_handshaking.......: avg=2.18ms   min=0s       med=0s       max=220.26ms p(90)=0s       p(95)=0s      
     http_req_waiting...............: avg=213.39ms min=152.85ms med=219.08ms max=626.8ms  p(90)=238.5ms  p(95)=252.9ms 
     http_reqs......................: 461    22.806487/s
     iteration_duration.............: avg=218.51ms min=153.46ms med=220.41ms max=1.01s    p(90)=239.43ms p(95)=253.49ms
     iterations.....................: 461    22.806487/s
     vus............................: 5      min=5       max=5
     vus_max........................: 5      min=5       max=5

running (20.2s), 0/5 VUs, 461 complete and 0 interrupted iterations
default ✓ [======================================] 5 VUs  20s

k6 run --vus 5 --duration 20s bench.js 

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: bench.js
     output: -

  scenarios: (100.00%) 1 scenario, 5 max VUs, 50s max duration (incl. graceful stop):
           * default: 5 looping VUs for 20s (gracefulStop: 30s)

     ✓ status is 200
     ✗ protocol is HTTP/1
      ↳  0% — ✓ 0 / ✗ 490
     ✓ protocol is HTTP/2

     checks.........................: 66.66% ✓ 980       ✗ 490
     data_received..................: 2.0 MB 98 kB/s
     data_sent......................: 220 kB 11 kB/s
     http_req_blocked...............: avg=4.18ms   min=0s       med=1µs      max=597.61ms p(90)=2µs      p(95)=2µs     
     http_req_connecting............: avg=2.03ms   min=0s       med=0s       max=302.14ms p(90)=0s       p(95)=0s      
     http_req_duration..............: avg=200.65ms min=152.69ms med=212.79ms max=479.15ms p(90)=247.44ms p(95)=276.15ms
       { expected_response:true }...: avg=200.65ms min=152.69ms med=212.79ms max=479.15ms p(90)=247.44ms p(95)=276.15ms
     http_req_failed................: 0.00%  ✓ 0         ✗ 490
     http_req_receiving.............: avg=692.61µs min=36µs     med=161.5µs  max=4.34ms   p(90)=2.74ms   p(95)=3.22ms  
     http_req_sending...............: avg=152.72µs min=52µs     med=136.5µs  max=1.82ms   p(90)=217µs    p(95)=249.09µs
     http_req_tls_handshaking.......: avg=2.13ms   min=0s       med=0s       max=294.09ms p(90)=0s       p(95)=0s      
     http_req_waiting...............: avg=199.8ms  min=152.48ms med=212.53ms max=477.27ms p(90)=246.88ms p(95)=274.89ms
     http_reqs......................: 490    24.266284/s
     iteration_duration.............: avg=205.27ms min=153.14ms med=213.15ms max=960.8ms  p(90)=248.02ms p(95)=276.49ms
     iterations.....................: 490    24.266284/s
     vus............................: 5      min=5       max=5
     vus_max........................: 5      min=5       max=5

running (20.2s), 0/5 VUs, 490 complete and 0 interrupted iterations
default ✓ [======================================] 5 VUs  20s

However when I switched back to old query-node setup with only http v1 active, the latency issue started appearing again. So now I'm suspecting issue with caddy itself or even possibly the datacenter where the nodes are located.

...back to the drawing board

I conducted further benchmarking tests (a bit more aggressive with queries that are more 'expensive') and was able to cause even the larger cluster suffer.

With open-telemetry collector working better on new setup I was able to gather enough traces/spans to find the underlying problem.

Briefly, the graphql-server only has a connection pool of size 10. When some complex queries are being processed they consume all the connections in the pool, and until they complete all other queries are blocked. This causes normally fast queries also to take a long time to complete.

Will have to investigate why warthog/type-orm is behaving this way for some queries, as other queries despite performing many SQL queries, do so with a single connection.

There are some other issues..

traces show a graphql.resolve span taking several seconds (processing large dataset returned from sql query) during which I think the database connection is not returned to the pool.
when a client closes POST request to graphql-server, the graphql-server doesn't seem to terminate processing the graphql query. (I remember seeing a lot of errors in the caddy log about it trying to write to a closed connection. That log now makes sense)

There are a few more but will not list them here. The slowest queries are ones that are just fetching a huge dataset and not doing any paging. As far as I know our apps have generally well designed queries, so most of these slow ones are coming possible from 3rd-party apps. They are not very frequent, just expensive so caching wouldn't help.

For our general purpose public endpoint we need to just optimize and scale to meet the demands. Horizontally scaling is the best option: Multiple graphql server instances on each host (to use more CPU cores and with each instance we get an additional 10 connections to the postgres DB) behind load balancer.

Although it is simpler to increase pool size and just run a single graphql-server it is likely a single GraphQL query will consume all the connections.

We had this setup in the past in a k8 cluster and I remember that was one of the best deployments with respect to latency. There are some complications with getting this to work correctly with current version of query-node, because the processor can only be configured to make a 'processor sate update' http request to a single endpoint (the one graphql-server). Given that we want to expose some state on processor to be able to monitor it remotely, it might make sense to have graphql-server poll the processor, or better yet make the processor state endpoint public so it can be consumed by pioneer or checkly. For now I will have to make do with a workaround.

I'm also exploring enabling redis caching in type-orm to avoid the server using db connections which would work well, but having problems with setting host/port to function correctly. Documentation is a bit light also in this area so will not use it in production until we evaluate more thoroughly.

With AWS Load Balancer working well now, I will try to enable HTTP/2 on it and see if it causes any issues and report back. I Do believe in general it is not a problem, perhaps caddy's behavior or configuration of health checks was not ideal. Or maybe its just not as performant.

With AWS Load Balancer working well now, I will try to enable HTTP/2 on it

Enabling it unfortunately caused many clients to disconnect and accessing query-node from browser showed:

ERR_HTTP2_PROTOCOL_ERROR

Screen Shot 2023-09-06 at 15 04 39

Enabling it unfortunately caused many clients to disconnect and accessing query-node from browser showed:

As it turns out the AWS Load Balancer's behaviour with HTTP/2 enabled was trying to perform a HTTP/2 Cleartext connection to the upstream (which our graphql-server does not support yet).

ref:

Joystream / joystream