Closed: kiranisaac closed this issue 5 years ago.
Update: increasing per_connection_buffer_limit_bytes fixed the issue. I still think Envoy shouldn't segfault in this case.
cc @lizan.
Looks like a flow control issue? cc @alyssawilk
At first glance: we're getting data from upstream, we go over the limit, send an error response, and call doDeferredStreamDestroy, which triggers router cleanup and deletes the upstream request while we're still under the stack of the upstream request. That seems pretty sketchy from a lifetime perspective. I'll have to poke around tomorrow and see what changed and/or if I can reproduce it. I know we have tests where we go over the limit and fail, and I'm not sure what's different between this setup and those tests such that they pass and this doesn't.
Oh, I think I know what's happening: it happens when we sendLocalReply after receiving response headers for a gRPC response.
When the router decodes headers, it latches a pointer to the upstream headers, then hands off ownership to the HCM. If the HCM does sendLocalReply after that, the sendLocalReply is_grpc branch creates gRPC-compatible response headers, overwriting the headers the HCM has and thus invalidating the router filter's pointer.
We don't have a response-data-too-large test which has both access logs (accessing the headers after they were destroyed) and gRPC (doing the header overwrite) but I'll get started on one.
@mattklein123 we could fix either by updating the headers in place for the gRPC response path, or by making response headers accessible via callbacks and not having the router latch. I feel like the latter is more future-proof. Thoughts?
Yeah, I've wanted to get rid of all header latching for quite some time and have headers/trailers only be accessible via callbacks, like the body. It's more work, but definitely the much better change IMO.
@alyssawilk / @mattklein123: For now we have per_connection_buffer_limit_bytes set to 10MB (in both the listeners and clusters sections). Is this the right workaround? Is it a bad idea (performance/memory)? Do we need to set it in both places?
Additionally, does this set the buffer size or is it just an upper bound on it? Our payload size depends on the page size requested by the user, so we don't know in advance how much buffer is required.
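For what it's worth, here is a minimal sketch of where that field lives in the config (the names, ports, and the 10 MiB value are illustrative, not taken from the original config). As I understand it, the listener-level setting caps buffers on downstream connections and the cluster-level setting caps buffers on upstream connections, and in both cases it is a soft upper bound the buffers can grow to rather than a preallocated size:

```yaml
static_resources:
  listeners:
  - name: ingress_http                              # illustrative name
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    # Soft cap on downstream (client-side) connection read/write buffers.
    per_connection_buffer_limit_bytes: 10485760     # 10 MiB
    filter_chains:
    - filters: []    # envoy.http_connection_manager (with the transcoder) elided from this sketch
  clusters:
  - name: grpc_backend                              # illustrative name
    connect_timeout: 1s
    type: STATIC
    http2_protocol_options: {}                      # gRPC upstream speaks HTTP/2
    # Soft cap on upstream (backend-side) connection buffers.
    per_connection_buffer_limit_bytes: 10485760     # 10 MiB
    hosts:
    - socket_address: { address: 127.0.0.1, port_value: 9000 }
```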
Having a large per connection buffer limit basically makes it easier for your server to OOM, either accidentally by sending many large responses or intentionally if malicious clients try to DoS you.
If you're hitting the response too large callback it means you're using buffering filters (where headers will not be sent to the users until Envoy has received and buffered the whole response) and some of your responses are too large, so if the bug were fixed you'd be failing to proxy your own responses. So figuring out how large your responses actually are and capping the buffers there may be optimal.
The gRPC-JSON transcoder is a buffering filter: if the transcoded gRPC call is a unary call, it will buffer until the upstream sends the whole response and trailers.
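For context, a rough sketch of how that filter sits in the HTTP filter chain (the descriptor path and service name below are placeholders, and the fields shown are from the v2 config that v1.10 uses):

```yaml
http_filters:
- name: envoy.grpc_json_transcoder
  config:
    proto_descriptor: "/etc/envoy/service_descriptor.pb"   # placeholder descriptor set
    services: ["example.MyService"]                         # placeholder gRPC service name
    # For a unary method the transcoder buffers the entire upstream gRPC
    # response (through the trailers) before writing the JSON response downstream.
- name: envoy.router
```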
Title: Envoy is segfaulting
Description: Envoy is segfaulting. I am sending HTTP/1 GET requests. Envoy has gRPC transcoding set up. The backend service is a gRPC service.
I am running envoyproxy/envoy-alpine-debug:v1.10.0 within a Docker container on a VM.
Repro steps:
Envoy config here