Closed by bbassingthwaite 9 months ago
Great, thanks for your report @bbassingthwaite.
Yep, we know that sync.Map could be optimized, but we hadn't tested it in such a case: sendLocalReply plus heavy CPU usage. And I totally agree with you, including the test results and the optimization idea.
Further, I have some ideas about the implementation. Are you interested in working on this optimization?
@doujiang24 Yes, that sounds good to me. Is this something you want to take on, or have time to do? Thanks!
@bbassingthwaite Sorry, I'm busy with other issues. If you're interested, you're welcome to file a PR; I'd be happy to review. Thanks~
Title: Golang HTTP filter performance bottleneck
Description:
I've been doing performance tests of an HTTP Go plugin that issues a local reply.
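A minimal sketch of such a filter, assuming Envoy's contrib Go HTTP filter API (the exact SendLocalReply parameter list varies between Envoy releases, and the factory/registration wiring is omitted, so treat this as illustrative):

```go
package main

import (
	"github.com/envoyproxy/envoy/contrib/golang/filters/http/source/go/pkg/api"
)

// localReplyFilter answers every request directly from the filter,
// analogous to the direct-response route used as the baseline below.
type localReplyFilter struct {
	api.PassThroughStreamFilter
	callbacks api.FilterCallbackHandler // set by the filter factory (omitted)
}

func (f *localReplyFilter) DecodeHeaders(header api.RequestHeaderMap, endStream bool) api.StatusType {
	// Arguments: status code, body, extra headers, gRPC status (-1 = none),
	// details. The exact signature differs slightly across SDK versions.
	f.callbacks.SendLocalReply(200, "hello\n", nil, -1, "")
	return api.LocalReply
}

// Envoy Go plugins build as a shared library, so main is unused.
func main() {}
```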
I've been comparing this to an HTTP route that has no Go plugin and is configured to send a direct response. The Envoy server is running on a 32-core AMD EPYC 7302P, and I am running nighthawk on a separate machine with identical specs.

When running a load test with the Go plugin, I get 210,000 req/sec, while tests against the route with a direct response get over 400,000 req/sec. Envoy CPU usage in the first test is around 70%, while the second test saturates all cores.

Digging into the problem further, it seems that the use of a sync.Map to hold requests is causing a large amount of lock contention between the 32 worker threads. I would also guess that as the worker count scales up, the problem will only get worse.
In an effort to prove this theory, I made a change to create an array of sync.Map sized by the number of cores, sharding requests into buckets with uintptr(unsafe.Pointer(key)) % coreCount. Running the same performance test with this change gets me closer to 288,000 req/sec. To reduce lock contention even further, I set coreCount=4096 and reran the test, getting 321,000 req/sec, closer to the performance without the Go plugin and a 52% increase from baseline. A sketch of the sharding change is below.
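A minimal sketch of that change (names like shardCount, storeRequest, and loadRequest are mine, not the plugin's actual identifiers; the key is the per-request C pointer Envoy hands to Go):

```go
package main

import (
	"fmt"
	"sync"
	"unsafe"
)

// 32 buckets (the core count) in the first run, 4096 in the second.
const shardCount = 4096

var shards [shardCount]sync.Map

// shardFor spreads requests across independent sync.Maps so the worker
// threads rarely contend on the same internal lock.
func shardFor(key unsafe.Pointer) *sync.Map {
	return &shards[uintptr(key)%shardCount]
}

// storeRequest and loadRequest stand in for the Store/Load calls that
// previously hit one global sync.Map.
func storeRequest(key unsafe.Pointer, req interface{}) {
	shardFor(key).Store(key, req)
}

func loadRequest(key unsafe.Pointer) (interface{}, bool) {
	return shardFor(key).Load(key)
}

func main() {
	req := new(int) // stand-in for the per-request C pointer
	key := unsafe.Pointer(req)
	storeRequest(key, "per-request state")
	if v, ok := loadRequest(key); ok {
		fmt.Println(v)
	}
}
```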
IMO, it would be ideal for us to have no locks in this path at all, and there are a few ways of doing this:
1) Envoy informs the Go plugin at initialization of the number of workers, which would allow us to initialize an array of either sync.Map or possibly even a plain map without a lock, since there shouldn't be concurrent writes once each worker gets its own map. This would also require calls from Envoy to Go to pass the worker's ID along, so Go can retrieve the map for that worker. This could be the ideal scenario, since we can dedicate a map to each worker and not worry about lock contention between workers.

2) The second proposal would be similar but would put more onus on the Go plugin to coordinate initialization. Each Envoy worker thread would need to make a call to the Go plugin to perform initialization and would receive an identifier. Subsequent calls from Envoy to Go would need to pass this identifier along in order to identify the worker thread and therefore its dedicated maps.
I believe option 1 would be preferred, since it lets Envoy and Go share a single identifier instead of creating a second identifier and maintaining a mapping between the two. A sketch of what option 1 could look like on the Go side follows.
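This sketch assumes two hypothetical hooks that do not exist today: Envoy reporting its worker count when the plugin initializes, and each Envoy-to-Go call carrying the calling worker's ID:

```go
package main

import "fmt"

// One plain map per Envoy worker thread. A worker only ever touches its
// own map, so no lock is needed.
var perWorkerMaps []map[uintptr]interface{}

// OnPluginInit is a hypothetical hook: Envoy would pass the worker count
// (its --concurrency value) when the plugin is initialized.
func OnPluginInit(workerCount int) {
	perWorkerMaps = make([]map[uintptr]interface{}, workerCount)
	for i := range perWorkerMaps {
		perWorkerMaps[i] = make(map[uintptr]interface{})
	}
}

// storeRequest and loadRequest assume every Envoy-to-Go call also carries
// the calling worker's ID (the hypothetical second half of option 1).
func storeRequest(workerID int, key uintptr, req interface{}) {
	perWorkerMaps[workerID][key] = req
}

func loadRequest(workerID int, key uintptr) (interface{}, bool) {
	req, ok := perWorkerMaps[workerID][key]
	return req, ok
}

func main() {
	OnPluginInit(4) // e.g. envoy --concurrency 4
	storeRequest(2, 0xc000123400, "per-request state")
	if v, ok := loadRequest(2, 0xc000123400); ok {
		fmt.Println(v)
	}
}
```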
I believe this change would get us even closer to the 400,000 req/sec mark and would also allow Envoy users with even higher core counts to see better performance.

cc @doujiang24
Thanks!