Closed by bbassingthwaite 9 months ago
Great, thanks for your report @bbassingthwaite.
Yep, we know that sync.Map could be optimized, but we hadn't tested it in such a case: sendLocalReply plus heavy CPU usage. And I totally agree with you, including the test results and the optimization idea.
Further, I have some ideas about the implementation. Are you interested in working on this optimization?
@doujiang24 Yes, that sounds good to me. Is this something you want to take on, or have time to do? Thanks!
@bbassingthwaite Sorry, I'm busy with other issues. If you're interested, you're welcome to file a PR; I'd be happy to review. Thanks~
Title: Golang HTTP filter performance bottleneck
Description:
I've been doing performance tests of an HTTP Go plugin that issues a local reply.
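A minimal sketch of such a filter, assuming Envoy's contrib Go HTTP filter API (the exact SendLocalReply parameter list varies between Envoy releases, and the factory/registration wiring is omitted, so treat this as illustrative):

```go
package main

import (
	"github.com/envoyproxy/envoy/contrib/golang/filters/http/source/go/pkg/api"
)

// localReplyFilter answers every request directly from the filter,
// analogous to the direct-response route used as the baseline below.
type localReplyFilter struct {
	api.PassThroughStreamFilter
	callbacks api.FilterCallbackHandler // set by the filter factory (omitted)
}

func (f *localReplyFilter) DecodeHeaders(header api.RequestHeaderMap, endStream bool) api.StatusType {
	// Arguments: status code, body, extra headers, gRPC status (-1 = none),
	// details. The exact signature differs slightly across SDK versions.
	f.callbacks.SendLocalReply(200, "hello\n", nil, -1, "")
	return api.LocalReply
}

// Envoy Go plugins build as a shared library, so main is unused.
func main() {}
```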
I've been comparing this to an HTTP route that has no Go plugin and is configured to send a direct response. The Envoy server is running on a 32-core AMD EPYC 7302P, and I am running nighthawk on a separate machine with identical specs.

When running a load test with the Go plugin, I get 210,000 req/sec, while tests against the route with a direct response get over 400,000 req/sec. Envoy CPU usage in the first test is around 70%, while the second test saturates all cores.

Digging into the problem further, it seems that the use of a sync.Map to hold requests is causing a large amount of lock contention between the 32 worker threads. I would also guess that as the worker count scales up, the problem will only get worse.
In an effort to prove this theory, I made a change to create an array of sync.Map sized by the number of cores, sharding requests into buckets with uintptr(unsafe.Pointer(key)) % coreCount. Running the same performance test with this change gets me closer to 288,000 req/sec. To reduce lock contention even further, I set coreCount=4096 and reran the test, getting 321,000 req/sec, closer to the performance without the Go plugin and a 52% increase from baseline. A sketch of the sharding change is below.
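A minimal sketch of that change (names like shardCount, storeRequest, and loadRequest are mine, not the plugin's actual identifiers; the key is the per-request C pointer Envoy hands to Go):

```go
package main

import (
	"fmt"
	"sync"
	"unsafe"
)

// 32 buckets (the core count) in the first run, 4096 in the second.
const shardCount = 4096

var shards [shardCount]sync.Map

// shardFor spreads requests across independent sync.Maps so the worker
// threads rarely contend on the same internal lock.
func shardFor(key unsafe.Pointer) *sync.Map {
	return &shards[uintptr(key)%shardCount]
}

// storeRequest and loadRequest stand in for the Store/Load calls that
// previously hit one global sync.Map.
func storeRequest(key unsafe.Pointer, req interface{}) {
	shardFor(key).Store(key, req)
}

func loadRequest(key unsafe.Pointer) (interface{}, bool) {
	return shardFor(key).Load(key)
}

func main() {
	req := new(int) // stand-in for the per-request C pointer
	key := unsafe.Pointer(req)
	storeRequest(key, "per-request state")
	if v, ok := loadRequest(key); ok {
		fmt.Println(v)
	}
}
```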
IMO, it would be ideal for us to have no locks in this path at all, and there are a few ways of doing this:
1) Envoy informs the Go plugin at initialization of the number of workers, which would allow us to initialize an array of either sync.Map or possibly even a plain map without a lock, since there shouldn't be concurrent writes once each worker gets its own map. This would also require calls from Envoy to Go to pass the worker's ID along, so Go can retrieve the map for that worker. This could be the ideal scenario, since we can dedicate a map to each worker and not worry about lock contention between workers.

2) The second proposal would be similar but would put more onus on the Go plugin to coordinate initialization. Each Envoy worker thread would need to make a call to the Go plugin to perform initialization and would receive an identifier. Subsequent calls from Envoy to Go would need to pass this identifier along in order to identify the worker thread and therefore its dedicated maps.
I believe option 1 would be preferred, since it lets Envoy and Go share a single identifier instead of creating a second identifier and maintaining a mapping between the two. A sketch of what option 1 could look like on the Go side follows.
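This sketch assumes two hypothetical hooks that do not exist today: Envoy reporting its worker count when the plugin initializes, and each Envoy-to-Go call carrying the calling worker's ID:

```go
package main

import "fmt"

// One plain map per Envoy worker thread. A worker only ever touches its
// own map, so no lock is needed.
var perWorkerMaps []map[uintptr]interface{}

// OnPluginInit is a hypothetical hook: Envoy would pass the worker count
// (its --concurrency value) when the plugin is initialized.
func OnPluginInit(workerCount int) {
	perWorkerMaps = make([]map[uintptr]interface{}, workerCount)
	for i := range perWorkerMaps {
		perWorkerMaps[i] = make(map[uintptr]interface{})
	}
}

// storeRequest and loadRequest assume every Envoy-to-Go call also carries
// the calling worker's ID (the hypothetical second half of option 1).
func storeRequest(workerID int, key uintptr, req interface{}) {
	perWorkerMaps[workerID][key] = req
}

func loadRequest(workerID int, key uintptr) (interface{}, bool) {
	req, ok := perWorkerMaps[workerID][key]
	return req, ok
}

func main() {
	OnPluginInit(4) // e.g. envoy --concurrency 4
	storeRequest(2, 0xc000123400, "per-request state")
	if v, ok := loadRequest(2, 0xc000123400); ok {
		fmt.Println(v)
	}
}
```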
I believe this change would get us even closer to the 400,000 req/sec mark and would also allow Envoy users with even higher core counts to see better performance.

cc @doujiang24
Thanks!