ecourreges-orange opened this issue 1 year ago
that is concerning, yes. A few questions:

* do you see it slowing down concurrent requests in your benchmarks?
* is the backend connection in HTTP 1 or HTTP/2? With HTTP/2 a lot of concurrent requests can be multiplexed over 1 connection

so far we have used hyper's HTTP client, which has an automatic connection pool, which should have been able to detect the need for more connections. We are not really satisfied with the lack of control there though, and are considering taking over that task from the client
> that is concerning, yes. A few questions:
>
> * do you see it slowing down concurrent requests in your benchmarks?

Indeed, after a client concurrency of 10+ there is no performance gain and the CPU is capped at 13% for 150 req/s in my case, whereas with a home-made Node.js gateway I reach >300 req/s, and there the CPU is my bottleneck. Preliminary results show that the Rust router is 5-10x more CPU-efficient than my Node.js version, but this connection bottleneck is unfortunately so far a showstopper.

> * is the backend connection in HTTP 1 or HTTP/2? With HTTP/2 a lot of concurrent requests can be multiplexed over 1 connection

For now it's HTTP/1 over https. I will try to see if HTTP/2 is supported by my chain of go-wrk -> kong -> tomcat.

> so far we have used hyper's HTTP client, which has an automatic connection pool, which should have been able to detect the need for more connections. We are not really satisfied with the lack of control there though, and are considering taking over that task from the client

Maybe you need to create the client with `with_pool_config` and set the max idle connections to more than the default of 0 or 1? If it works, then a global env var associated with it would be great. It's worth a try, and/or worth improving hyper, before resorting to re-implementing. https://users.rust-lang.org/t/hyper-connection-pool-example/5315

Thanks!
> Maybe you need to create the client with `with_pool_config` and set the max idle connections to more than 0 or 1 which is the default? If it works, then a global env var associated to it would be great. It's worth a try and/or improve hyper before resorting to re-implementing. https://users.rust-lang.org/t/hyper-connection-pool-example/5315

This message is from 2016; that pool has since been integrated directly into hyper's HTTP client, so this is the code that's apparently not parallelizing the requests. I'll look into why it's not working.
@ecourreges-orange we looked into it and the router is using the connection pool, starting multiple backend connections when there are concurrent clients. It is not a 1:1 correspondence (ex: I saw 412 backend connections for 1000 clients in one test) but it should be enough to get some parallelism.

Could you describe a bit more your setup? To see all of the established connections you can use your `netstat` command, or alternatively `ss -pot state established | grep router | grep "<ip:port of the subgraph>"`.
> * what are you using as client, and for the nodejs alternative (is it a gateway or a handrolled http proxy)?

The client is go-wrk with concurrency 1 to 80. The comparison is with a Node.js gateway with code assembled by us from these examples.
> * are you doing multiple times the same query, or different queries on each request?

Always the same query on each request.
> * the output of your `netstat` command. Alternatively, you can use `ss -pot state established | grep router | grep "<ip:port of the subgraph>"` to see all of the established connections

Only one connection is established at all times:

```
~/workspace/poc_nboa/Client_Simulator/go-wrk (graphql)$ ss -pot state established | grep 10.118.44.10
0 0 10.8.0.2:44108 10.118.44.10:https users:(("router",pid=53378,fd=34)) timer:(keepalive,29sec,0)
```
> * what is the latency of the subgraph?

Around 50ms, because it is at least 35ms away on the network.
Concurrency 10:

```
~/workspace/poc_nboa/Client_Simulator/go-wrk (graphql)$ ./go-wrk -no-vr -M POST -H="Authorization: Bearer $access_token" -H='Accept: application/json' -H='Content-Type: application/json' -body='{ "query": "query { fromCID(cid:\"9042393705\"){ score{FDDFQN{rank score}} }}"}' http://localhost:4000/graphql -d 10
Running 10s test @ http://localhost:4000/graphql
  10 goroutine(s) running concurrently
1432 requests in 10.036281715s, 183.20KB read
Requests/sec:               142.68
Transfer/sec:               18.25KB
Avg Req Time:               70.085766ms
Fastest Request:            49.043791ms
Slowest Request:            480.874469ms
Number of Errors:           0
Number of Connect timeouts: 0
```
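As a sanity check on the numbers above, a small sketch using Little's law (throughput ≈ concurrency / average latency; the figures are copied from the go-wrk output, the law itself is the only assumption):

```rust
// Little's law: in a closed-loop benchmark, throughput ≈ concurrency / avg latency.
// With 10 goroutines and ~70ms average request time, ~143 req/s is the ceiling,
// which matches the measured 142.68 req/s: the clients are latency-bound.
fn expected_rps(concurrency: f64, avg_latency_s: f64) -> f64 {
    concurrency / avg_latency_s
}

fn main() {
    println!("predicted ceiling: {:.0} req/s", expected_rps(10.0, 0.070));
}
```

The same arithmetic on the concurrency-40 run below (40 / 0.242s ≈ 165 req/s) shows that the extra clients mostly added queueing latency rather than throughput, which is consistent with the requests being serialized over a single backend connection.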
Concurrency 40:

```
~/workspace/poc_nboa/Client_Simulator/go-wrk (graphql)$ ./go-wrk -c 40 -no-vr -M POST -H="Authorization: Bearer $access_token" -H='Accept: application/json' -H='Content-Type: application/json' -body='{ "query": "query { fromCID(cid:\"9042393705\"){ score{FDDFQN{rank score}} }}"}' http://localhost:4000/graphql -d 10
Running 10s test @ http://localhost:4000/graphql
  40 goroutine(s) running concurrently
1672 requests in 10.112006354s, 213.90KB read
Requests/sec:               165.35
Transfer/sec:               21.15KB
Avg Req Time:               241.914027ms
Fastest Request:            187.099328ms
Slowest Request:            581.516919ms
Number of Errors:           0
Number of Connect timeouts: 0
```
Directly against the subgraph with concurrency 40, the go-wrk tool does generate many simultaneous connections:

```
/workspace/poc_nboa/Client_Simulator/go-wrk (graphql)$ ss -pot state established | grep https
0 0 10.8.0.2:55628 10.117.28.10:https users:(("go-wrk",pid=56079,fd=10)) timer:(keepalive,12sec,0)
0 0 10.8.0.2:55602 10.117.28.10:https users:(("go-wrk",pid=56079,fd=8)) timer:(keepalive,12sec,0)
0 0 10.8.0.2:55654 10.117.28.10:https users:(("go-wrk",pid=56079,fd=14)) timer:(keepalive,12sec,0)
0 0 10.8.0.2:55748 10.117.28.10:https users:(("go-wrk",pid=56079,fd=25)) timer:(keepalive,12sec,0)
0 0 10.8.0.2:55768 10.117.28.10:https users:(("go-wrk",pid=56079,fd=29)) timer:(keepalive,12sec,0)
0 939 10.8.0.2:55646 10.117.28.10:https users:(("go-wrk",pid=56079,fd=12)) timer:(on,204ms,0)
0 0 10.8.0.2:55812 10.117.28.10:https users:(("go-wrk",pid=56079,fd=36)) timer:(keepalive,12sec,0)
0 0 10.8.0.2:55786 10.117.28.10:https users:(("go-wrk",pid=56079,fd=32)) timer:(keepalive,12sec,0)
0 0 10.8.0.2:55720 10.117.28.10:https users:(("go-wrk",pid=56079,fd=21)) timer:(keepalive,12sec,0)
0 0 10.8.0.2:55750 10.117.28.10:https users:(("go-wrk",pid=56079,fd=27)) timer:(keepalive,12sec,0)
0 0 10.8.0.2:55824 10.117.28.10:https users:(("go-wrk",pid=56079,fd=37)) timer:(keepalive,12sec,0)
0 0 10.8.0.2:55856 10.117.28.10:https users:(("go-wrk",pid=56079,fd=40)) timer:(keepalive,12sec,0)
0 0 10.8.0.2:55800 10.117.28.10:https users:(("go-wrk",pid=56079,fd=34)) timer:(keepalive,12sec,0)
0 0 10.8.0.2:55766 10.117.28.10:https users:(("go-wrk",pid=56079,fd=28)) timer:(keepalive,12sec,0)
0 0 10.8.0.2:55676 10.117.28.10:https users:(("go-wrk",pid=56079,fd=17)) timer:(keepalive,12sec,0)
0 0 10.8.0.2:55862 10.117.28.10:https users:(("go-wrk",pid=56079,fd=43)) timer:(keepalive,12sec,0)
...
```
And it goes faster:

```
Running 10s test @ https://poc-nboa.rec.api.hbx.geo.infra.ftgroup/valkeygg/v1/graphql
  40 goroutine(s) running concurrently
3907 requests in 10.040215678s, 3.68MB read
Requests/sec:               389.14
Transfer/sec:               375.79KB
Avg Req Time:               102.792072ms
Fastest Request:            47.581664ms
Slowest Request:            510.22633ms
Number of Errors:           0
Number of Connect timeouts: 0
```
> * is the router compiled in debug or release mode?

A release docker image:

```dockerfile
FROM ghcr.io/apollographql/router:v1.2.1 as router
```

And also from a `cargo run` of the latest version I get the same behavior of only one connection. Note that there are no errors in the router console:

```
~/workspace/router (dev)$ cargo run -- --supergraph=../graphqlrouter/fedefilm.graphql --config=../graphqlrouter/router.yaml
    Finished dev [unoptimized + debuginfo] target(s) in 0.40s
     Running `target/debug/router --supergraph=../graphqlrouter/fedefilm.graphql --config=../graphqlrouter/router.yaml`
2022-11-24T15:13:19.340520Z  INFO Apollo Router v1.4.0 // (c) Apollo Graph, Inc. // Licensed as ELv2 (https://go.apollo.dev/elv2)
2022-11-24T15:13:19.340631Z  INFO Anonymous usage data is gathered to inform Apollo product development. See https://go.apollo.dev/o/privacy for more info.
2022-11-24T15:13:19.784616Z  INFO healthcheck endpoint exposed at http://127.0.0.1:8088/health
2022-11-24T15:13:19.785463Z  INFO GraphQL endpoint exposed at http://0.0.0.0:4000/graphql 🚀
```
If my Access Token expires, then I get errors for the subgraph on the router console.
I tried this in the router code, to no avail:

```diff
--- a/apollo-router/src/services/subgraph_service.rs
+++ b/apollo-router/src/services/subgraph_service.rs
@@ -84,7 +84,7 @@ impl SubgraphService {
         Self {
             client: ServiceBuilder::new()
                 .layer(DecompressionLayer::new())
-                .service(hyper::Client::builder().build(connector)),
+                .service(hyper::Client::builder().pool_max_idle_per_host(10).build(connector)),
             service: Arc::new(service.into()),
         }
     }
```
My router config:

```yaml
---
supergraph:
  listen: 0.0.0.0:4000
  path: /graphql
headers:
  all:
    request:
      - propagate:
          matching: ^authorization.*
include_subgraph_errors:
  all: true
  subgraphs:
    allofilm: true
    consofilm: true
telemetry:
  tracing:
    trace_config:
      service_name: "router"
      service_namespace: "apollo"
      # Optional. Either a float between 0 and 1 or 'always_on' or 'always_off'
      sampler: 1.0
      # Optional. Use a parent based sampler. This enables remote spans to help decide whether a span is sampled or not.
      # https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#parentbased
      parent_based_sampler: true
      # Optional limits
      max_attributes_per_event: 10
      max_attributes_per_link: 10
      max_attributes_per_span: 10
      max_events_per_span: 10
      max_links_per_span: 10
    propagation:
      trace_context: true
    zipkin:
      endpoint: http://mycollector.mydomain.com:9411/api/v2/spans
```
Response from the subgraph:

```json
{"data":{"fromCID":{"score":{"FDDFQN":null}}},"extensions":{"tracing":{"version":1,"startTime":"2022-11-24T15:27:28.214377Z","endTime":"2022-11-24T15:27:28.252514Z","duration":38140752,"parsing":{"startOffset":116378,"duration":102577},"validation":{"startOffset":241902,"duration":120115},"execution":{"resolvers":[{"path":["fromCID"],"parentType":"Query","returnType":"CID","fieldName":"fromCID","startOffset":303700,"duration":427734},{"path":["fromCID","score"],"parentType":"CID","returnType":"Score","fieldName":"score","startOffset":762149,"duration":36637789},{"path":["fromCID","score","FDDFQN"],"parentType":"Score","returnType":"Scoring","fieldName":"FDDFQN","startOffset":37456428,"duration":622952}]}}}}
```

Response from the router:

```json
{"data":{"fromCID":{"score":{"FDDFQN":null}}}}
```
I will try disabling telemetry to see if there is any improvement.
> is the connection between the router and subgraph in HTTP 1 or 2?

go-wrk to the router is HTTP/2 or HTTP/1, and router to subgraph is HTTP/1. The subgraph is not configured for HTTP/2, and I am not sure that our gateway would enable it. Same results whatever the setting on go-wrk.
In debug logging the router seems to always be reusing the idle connection:

```json
{"timestamp":"2022-11-24T15:59:51.890134Z","level":"DEBUG","message":"reuse idle connection for (\"https\", myserver.mydomain.com)","target":"hyper::client::pool","filename":"/home/ecourreges/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.14.23/src/client/pool.rs","line_number":250,"span":{"name":"subgraph_request"},"spans":[{"trace_id":"007e38be9468c1e59601b198694f3c29","name":"request"},{"name":"supergraph"},{"name":"execution"},{"name":"fetch"},{"name":"subgraph"},{"name":"subgraph_request"}]}
```
I see 🤔 a few more questions, if you don't mind:

AFAIK the hyper options like `pool_max_idle_per_host` won't be of any help here. I'll have to dig deeper into the connection pool and see what happens. Thankfully, it looks like we'll soon get more control over the client behaviour: https://seanmonstar.com/post/701008919383932928/hyper-polish-period

Also, could you try to remove this line? https://github.com/apollographql/router/blob/f6e0ef8563161b79e4430a860dc18cba1e407798/apollo-router/src/services/subgraph_service.rs#L83
The idea I want to check here is that hyper's connection pool has a special case for HTTP/2: it will not start multiple concurrent connections if there's already one in the connecting state. And if a connection is already established, there will be a race between getting a connection from the pool and establishing a new one. It might be that, with the latency, establishing a new TLS connection (3 full roundtrips) always loses to getting a connection from the pool.
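The race described above can be sketched numerically. A minimal sketch, assuming a full TLS 1.2 handshake over TCP costs roughly 3 round trips before the first request byte, while a pool checkout costs none:

```rust
// With ~35ms of network distance to the subgraph, a fresh TLS connection pays
// roughly 3 round trips (TCP SYN/ACK plus two for the TLS handshake) before
// the first request can be sent, so an idle pooled connection wins the race
// every time.
fn time_to_first_request_ms(rtt_ms: f64, round_trips: u32) -> f64 {
    rtt_ms * round_trips as f64
}

fn main() {
    let new_conn = time_to_first_request_ms(35.0, 3); // ~105ms of setup
    let pooled = time_to_first_request_ms(35.0, 0);   // checkout is ~free
    assert!(pooled < new_conn);
    println!("new connection setup: {new_conn}ms, pooled: {pooled}ms");
}
```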
Wow! Great job, that line did the trick! So in http/1 there is no possible race and it parallelizes properly. I guess you now have an issue to file to hyper ;-) Thanks a lot! I had disabled telemetry with no improvement, and now reactivated it with no problems either. I will now be able to resume some testing with a release compiled version, knowing that there might still be an improvement when http/2 can be re-enabled and the race avoided.
```
Running 10s test @ http://localhost:4000/graphql
  40 goroutine(s) running concurrently
3049 requests in 10.064905637s, 390.06KB read
Requests/sec:               302.93
Transfer/sec:               38.75KB
Avg Req Time:               132.042054ms
Fastest Request:            51.185108ms
Slowest Request:            409.331539ms
Number of Errors:           0
Number of Connect timeouts: 0
```

```
~/workspace/poc_nboa/Client_Simulator/go-wrk (graphql)$ ss -pot state established | grep https
0 0 192.168.17.128:51972 10.117.30.10:https users:(("router",pid=60815,fd=103)) timer:(keepalive,58sec,0)
0 0 192.168.17.128:52196 10.117.30.10:https users:(("router",pid=60815,fd=88)) timer:(keepalive,58sec,0)
0 0 192.168.17.128:52310 10.117.30.10:https users:(("router",pid=60815,fd=65)) timer:(keepalive,59sec,0)
0 0 192.168.17.128:52432 10.117.30.10:https users:(("router",pid=60815,fd=74)) timer:(keepalive,1min,0)
0 0 192.168.17.128:52286 10.117.30.10:https users:(("router",pid=60815,fd=25)) timer:(keepalive,59sec,0)
0 0 192.168.17.128:51938 10.117.30.10:https users:(("router",pid=60815,fd=97)) timer:(keepalive,58sec,0)
0 0 192.168.17.128:51822 10.117.30.10:https users:(("router",pid=60815,fd=27)) timer:(keepalive,58sec,0)
718 0 192.168.17.128:52080 10.117.30.10:https users:(("router",pid=60815,fd=66)) timer:(keepalive,58sec,0)
0 0 192.168.17.128:52390 10.117.30.10:https users:(("router",pid=60815,fd=29)) timer:(keepalive,59sec,0)
0 0 192.168.17.128:52422 10.117.30.10:https users:(("router",pid=60815,fd=71)) timer:(keepalive,1min,0)
0 0 192.168.17.128:52374 10.117.30.10:https users:(("router",pid=60815,fd=75)) timer:(keepalive,59sec,0)
0 0 192.168.17.128:52164 10.117.30.10:https users:(("router",pid=60815,fd=81)) timer:(keepalive,58sec,0)
0 0 192.168.17.128:52262 10.117.30.10:https users:(("router",pid=60815,fd=95)) timer:(keepalive,58sec,0)
0 0 192.168.17.128:52030 10.117.30.10:https users:(("router",pid=60815,fd=80)) timer:(keepalive,58sec,0)
0 0 192.168.17.128:52116 10.117.30.10:https users:(("router",pid=60815,fd=77)) timer:(keepalive,58sec,0)
0 0 192.168.17.128:52012 10.117.30.10:https users:(("router",pid=60815,fd=78)) timer:(keepalive,58sec,0)
0 0 192.168.17.128:52184 10.117.30.10:https users:(("router",pid=60815,fd=85)) timer:(keepalive,58sec,0)
0 0 192.168.17.128:52272 10.117.30.10:https users:(("router",pid=60815,fd=99)) timer:(keepalive,58sec,0)
```
awesome! I think I'll add an option to toggle http/2 support while we're investigating hyper
I have difficulties reproducing it, even by introducing delays in the connector and on packet reads.
Are you absolutely sure the subgraph connection is not done in HTTP/2? I'm asking because with a 50ms latency, on one connection we would only be able to do 20 RPS, and here it's around 150 RPS, so this feels like there's either pipelining (but I'm not sure hyper's client supports it) or HTTP/2 multiplexing. You can look at the version by printing `response.version()` from https://github.com/apollographql/router/blob/f6e0ef8563161b79e4430a860dc18cba1e407798/apollo-router/src/services/subgraph_service.rs#L171
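The 20 RPS figure follows from a single serialized HTTP/1 connection carrying one request per round trip, giving a per-connection ceiling of 1/latency. A minimal sketch of that arithmetic:

```rust
// One in-flight request per connection → the per-connection ceiling is 1/latency.
// At 50ms that is 20 req/s, so ~150 req/s observed through one socket implies
// either pipelining or HTTP/2 multiplexing on that connection.
fn max_rps_single_conn(latency_s: f64) -> f64 {
    1.0 / latency_s
}

fn main() {
    println!("ceiling at 50ms: {} req/s", max_rps_single_conn(0.050));
}
```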
You are right, especially since disabling HTTP/2 fixes the problem. I should have corrected my previous message: the requests to the subgraph were in HTTP/2, but what got me confused is that I have 3 legs:
alright, that makes more sense then. And that matches an issue I've seen elsewhere: while in theory an HTTP/2 session can handle a lot of concurrent streams and get all the available bandwidth, since it is TCP under the hood, any delay or packet retransmission affects all the streams. In those cases the solution would be to open multiple HTTP/2 connections (which hyper's pool disallows for now), move back to HTTP 1 with one connection per concurrent request, or move to HTTP 3, which does not suffer from the same retransmission issues.
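A toy model of the head-of-line blocking described above (the 200ms figure is an assumed, typical minimum TCP retransmission timeout; real TCP recovery is more complex than this sketch):

```rust
// With N streams multiplexed over one TCP connection, a single retransmission
// stalls every stream; with N separate HTTP/1 connections it stalls only one,
// so the expected per-request penalty is N times smaller.
fn expected_penalty_ms(streams: u32, retransmit_ms: f64, multiplexed: bool) -> f64 {
    if multiplexed {
        retransmit_ms // every stream waits out the full delay
    } else {
        retransmit_ms / streams as f64 // on average only 1 in N requests is hit
    }
}

fn main() {
    println!("HTTP/2, 40 streams on 1 conn: {}ms", expected_penalty_ms(40, 200.0, true));
    println!("HTTP/1, 40 connections:       {}ms", expected_penalty_ms(40, 200.0, false));
}
```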
Bumping this up as this is a problem for us too and we'd been forced to disable http2 for the same reasons as described in this ticket for quite some time now.
Related: https://github.com/apollographql/router/issues/1956
> Bumping this up as this is a problem for us too and we'd been forced to disable http2 for the same reasons as described in this ticket for quite some time now.

@akhilesh-godi How did you disable HTTP/2 to the subgraphs in the router? Is there a configuration setting?
Unfortunately, it looks like Apollo Router has no way to disable HTTP/2 on the client side. The only option I can think of is to make the subgraph endpoints insecure with an `http` scheme and hope that Apollo Router's client library doesn't attempt HTTP/2 over an insecure connection. It shouldn't, but we'll see.
> Unfortunately, it looks like Apollo Router has no way to disable HTTP/2 on the client side. The only option I can think of is to make the subgraph endpoints insecure with an `http` scheme and hope that Apollo Router's client library doesn't attempt HTTP/2 over an insecure connection. It shouldn't, but we'll see.
As you can see from the comments above, you can fix this yourself if you are willing to recompile, having disabled this line: https://github.com/apollographql/router/blob/dev/apollo-router/src/services/subgraph_service.rs#L135
I believe @Geal should commit with this line disabled, or make it configurable with a default of HTTP/1, until the issue with the underlying lib is fixed. As it is a one-liner, and the solution was found by @Geal, it is not my role to submit a PR.
instead of disabling it entirely, we will look at ways to balance subgraph calls
I have confirmed with certainty that the Apollo Router HTTP/2-to-subgraph implementation isn't working optimally. We recently transitioned from Apollo Server to Apollo Router and our metrics showed a regression in latency. It surprised us, so I set out to determine why. Upon finding this issue, I looked at our configuration and the network connections out of our Apollo Router instances. Indeed, there was only one, regardless of the number of requests handled. Our subgraphs are all endpoints on Google Cloud internal load balancers and there is no way to disable HTTP/2 on them. The documentation states that this must be disabled on the client side (which makes sense). Instead of abandoning Apollo Router, I decided to force HTTP/1.1 by simply making the subgraph endpoints insecure with `http` and hoped that the library Router was using would adhere to HTTP/2 being only available over `https`. I can confirm that over `http` (insecure), HTTP/2 indeed isn't used. When I did this, the performance issues went away.

Our setup is the following:

* 40 pods which connect to sub-graphs all within the cluster.
* `http` instead of `https`, which causes Apollo Router to not use HTTP/2.

There is something unique about our system in that we have GraphQL requests with high latency variation, from milliseconds to greater than 60s. This high variance is probably what causes us to see this performance issue more than others. In order to replicate the behaviour I suggest you use https://github.com/Shopify/toxiproxy in front of the subgraph endpoints and manipulate the latency (i.e. https://github.com/Shopify/toxiproxy#latency).

We are deciding how to proceed, but one option at this time is to fork Apollo Router, disable HTTP/2 in the source code, and build a custom version for our use case. We aren't Rust developers, but there appears to be a single location that needs to be updated to remove HTTP/2 support.
Bumping this as we are also seeing this issue.
I know it is not ideal, but considering that this is a fairly fundamental issue (in my opinion the router cannot be used at scale as is) and is a few months old with a known workaround, wouldn't it make sense to temporarily allow http2 to be disabled by some configuration value instead of having a bunch of people make their own forks?
this is my next task so the fix should come soon. I'll add an option for that along with looking into better load balancing
there is now an option to deactivate HTTP/2. I'm reopening this issue though, as I'm still looking for a proper fix
@marc-barry apologies for the delay - we had forked and compiled it ourself after disabling http2.
> there is now an option to deactivate HTTP/2. I'm reopening this issue though, as I'm still looking for a proper fix
@Geal The linked MR has `experimental_enable_http2`. It might be a wording confusion, but is http2 disabled by default, or do you need to add this `experimental_enable_http2` config as true or false to disable http2?
You have to set it to `false` to disable http2; it's enabled by default.
It's enabled by default to keep the same default behaviour. Suddenly deactivating http/2 for all users wouldn't work well, and the intent here is to fix it properly soon
@Geal if the router is communicating with the subgraphs using only `http` (insecure) connections, is there a need to set `experimental_enable_http2` to false?
@bashiru98 you should not need it in that case
@Geal I looked at the documentation of Apollo Router and could not locate anything on how to use the new disable option that you introduced in https://github.com/apollographql/router/pull/2621. I am new to Rust and to Apollo Router, so this could also be a novice question.
@pp05
```yaml
traffic_shaping:
  all:
    experimental_enable_http2: false # to disable for all subgraphs
  subgraphs: # Rules applied to requests from the router to individual subgraphs
    products:
      experimental_enable_http2: false # only for the "products" subgraph
```
Thank you
@Geal I have been tracking this for a while now and just wanted to check whether you have managed to find a workaround to allow enabling HTTP/2 connections. Is there an issue logged with hyper? I also found a PR from you to use tower as an alternative library; has this gone any further into testing/performance testing etc.?
@Geal it appears that the `experimental_enable_http2` option has been removed and now throws a configuration error as of the more recent apollo-router releases? Is the issue here fixed, or has this been moved? I still see the option in https://github.com/apollographql/router/blob/d506c4b1a6629c692dacdc86614441e9c77e03ec/docs/shared/router-common-config.mdx#L29 and there appears to have been no mention in the release notes.
the option has been moved in https://github.com/apollographql/router/blob/dev/CHANGELOG.md#http2-cleartext-protocol-h2c-support-for-subgraph-connections-issue-3535 from:

```yaml
experimental_enable_http2: false
```

to:

```yaml
experimental_http2: disable # possible options: enable, disable, http2only
```
The configuration upgrade mechanism was supposed to call it out, but there were some issues in the way it could rewrite an option of that shape.
For a proper fix of this issue, we have to wait for the 1.0 release of the hyper HTTP client (which is in 1.0.0-rc.4 right now)
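Assuming the renamed option keeps the same `traffic_shaping` placement as the old `experimental_enable_http2` setting shown earlier in this thread (worth verifying against the changelog), the migration would look like:

```yaml
traffic_shaping:
  all:
    experimental_http2: disable # was: experimental_enable_http2: false
```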
Is your feature request related to a problem? Please describe.
I am benching the router with a backend that is a bit far away on the network, so I need parallelism of calls to it, and I noticed that the router only establishes one https connection to the destination. A `netstat -laputen` easily shows that established client input connections on port 4000 scale with the number of simultaneous clients, while backend connections to port 443 stay constantly at exactly 1, whether there is traffic or not.

Describe the solution you'd like
Make it clear whether backend connection parallelism is configurable or not. If it's currently not, it needs to be coded, as the goal of coding it in Rust is to support higher throughput, even for requests to one backend.
I am looking into the code and learning a bit of Rust on the way ;-)
Thanks.