envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Envoy Segfaults #6744

Closed: kiranisaac closed this issue 5 years ago

kiranisaac commented 5 years ago

Title: Envoy is segfaulting

Description: Envoy is segfaulting. I am sending HTTP/1 GET requests. Envoy has gRPC-JSON transcoding set up. The backend service is a gRPC service.

I am running envoyproxy/envoy-alpine-debug:v1.10.0 in a Docker container on a VM.

Repro steps:


[2019-04-29 19:13:28.720][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #29: [0x7f9f0deefcf8] 
[2019-04-29 19:13:28.720][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #28: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::_FUN() [0xfa90e5] 
[2019-04-29 19:13:28.717][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #27: Envoy::Server::WorkerImpl::threadRoutine() [0xa8d10e] 
[2019-04-29 19:13:28.715][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #26: Envoy::Event::DispatcherImpl::run() [0xa9226a] 
[2019-04-29 19:13:28.712][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #25: event_base_loop [0xdca5c8] 
[2019-04-29 19:13:28.709][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #24: event_process_active [0xdc89bf] 
[2019-04-29 19:13:28.706][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #23: event_process_active_single_queue.isra.31 [0xdc8489] 
[2019-04-29 19:13:28.703][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #22: Envoy::Event::FileEventImpl::assignEvents()::{lambda()#1}::_FUN() [0xa92bea] 
[2019-04-29 19:13:28.701][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #21: Envoy::Network::ConnectionImpl::onFileEvent() [0xa98f2a] 
[2019-04-29 19:13:28.698][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #20: Envoy::Network::ConnectionImpl::onReadReady() [0xa9881a] 
[2019-04-29 19:13:28.695][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #19: Envoy::Network::FilterManagerImpl::onContinueReading() [0xa9bfac] 
[2019-04-29 19:13:28.692][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #18: Envoy::Http::CodecClient::CodecReadFilter::onData() [0xbf0f9d] 
[2019-04-29 19:13:28.690][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #17: Envoy::Http::CodecClient::onData() [0xbf0dc1] 
[2019-04-29 19:13:28.687][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #16: Envoy::Http::Http2::ConnectionImpl::dispatch() [0xc5d6a7] 
[2019-04-29 19:13:28.684][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #15: nghttp2_session_mem_recv [0xc66a5b] 
[2019-04-29 19:13:28.681][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #14: nghttp2_session_on_data_received [0xc61f20] 
[2019-04-29 19:13:28.679][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #13: Envoy::Http::Http2::ConnectionImpl::onFrameReceived() [0xc5c4a1] 
[2019-04-29 19:13:28.676][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #12: Envoy::Http::ConnectionManagerImpl::ActiveStream::encodeData() [0xc4c36b] 
[2019-04-29 19:13:28.673][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #11: Envoy::Http::ConnectionManagerImpl::ActiveStreamFilterBase::commonHandleAfterDataCallback() [0xc499aa] 
[2019-04-29 19:13:28.670][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #10: Envoy::Http::ConnectionManagerImpl::ActiveStreamFilterBase::commonHandleBufferData() [0xc40170] 
[2019-04-29 19:13:28.668][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #9: Envoy::Http::ConnectionManagerImpl::ActiveStreamEncoderFilter::responseDataTooLarge() [0xc40347] 
[2019-04-29 19:13:28.665][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #8: Envoy::Http::ConnectionManagerImpl::doEndStream() [0xc3fa40] 
[2019-04-29 19:13:28.662][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #7: Envoy::Http::ConnectionManagerImpl::doDeferredStreamDestroy() [0xc3f91e] 
[2019-04-29 19:13:28.659][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #6: Envoy::Router::Filter::cleanup() [0xc02c9f] 
[2019-04-29 19:13:28.657][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #5: Envoy::Router::Filter::UpstreamRequest::~UpstreamRequest() [0xc0b601] 
[2019-04-29 19:13:28.654][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #4: Envoy::Router::Filter::UpstreamRequest::~UpstreamRequest() [0xc0b48f] 
[2019-04-29 19:13:28.651][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #3: Envoy::Extensions::AccessLoggers::File::FileAccessLog::log() [0xe3bc85] 
[2019-04-29 19:13:28.648][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #2: _ZNK5Envoy9AccessLog13FormatterImpl6formatB5cxx11ERKNS_4Http9HeaderMapES5_S5_RKNS_10StreamInfo10StreamInfoE [0xd5945b] 
[2019-04-29 19:13:28.645][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:73] #1: _ZNK5Envoy9AccessLog23ResponseHeaderFormatter6formatB5cxx11ERKNS_4Http9HeaderMapES5_S5_RKNS_10StreamInfo10StreamInfoE [0xd5aa08] 
[2019-04-29 19:13:28.642][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #0: [0x7f9f0defa4b0] 
[2019-04-29 19:13:28.642][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:69] Backtrace (use tools/stack_decode.py to get line numbers): 
[2019-04-29 19:13:28.642][21][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:81] Caught Segmentation fault, suspect faulting address 0x35d0150 
[2019-04-29 19:13:27.809][1][info][config] [source/server/listener_manager_impl.cc:1006] all dependencies initialized. starting workers 

Envoy config here


# specify an admin endpoint for Envoy
admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 10000 }

static_resources:
  listeners:
  - name: listener1
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
      # We want Envoy to support SSL.
      - tls_context:
              common_tls_context:
                alpn_protocols: h2,http/1.1
                tls_certificates:
                  # certificate chain - these could be self-signed certificates
                  - certificate_chain:
                      filename: /certs/server.crt
                    private_key:
                      filename: /certs/server.key
        # The filters we want to add
        filters:
          - name: envoy.http_connection_manager
            config:
              stat_prefix: grpc_json
              codec_type: AUTO
              # idle timeout for the connection from the google load balancer
              idle_timeout: 720s
              # drain timeout for the connection from the google load balancer
              drain_timeout: 140s
              # stream idle timeout for the connection from the google load balancer
              stream_idle_timeout: 720s
              server_name: myserver
              route_config:
                name: local_route
                virtual_hosts:
                - name: local_service
                  domains: ["*"]
                  routes:
                  - match:
                      prefix: "/"
                    route:
                      # this cluster name should be defined in the clusters section
                      cluster: my_service
                      timeout: 110s
                      retry_policy:
                        # retry only on connection failures
                        retry_on: connect-failure,refused-stream
                        num_retries: 2
                  - match:
                      # this is the gRPC route. gRPC requests are actually HTTP/2 requests to the
                      # fully qualified name of the service.
                      # This should match the name of the service in the proto file
                      prefix: "/com.my.service/"
                    route:
                      timeout: 110s
                       # this cluster name should be defined in the clusters section
                      cluster: my_service
                      retry_policy:
                        # retry only on connection failures
                        retry_on: connect-failure,refused-stream
                        num_retries: 2
                # we would like to remove all three headers, but Envoy won't remove the grpc-status header
                response_headers_to_remove: ["grpc-encoding", "grpc-accept-encoding"]
              http_filters:
              # configure Envoy to respond to health checks
              - name: envoy.health_check
                config:
                    # we want Envoy to handle the health check
                    pass_through_mode: false
                    cluster_min_healthy_percentages:
                      # we require 100% of cluster 'my_service' to be healthy
                      my_service:
                        value: 100
                    headers:
                    - name: ":path"
                      exact_match: "/health"
              # we want Envoy to accept JSON bodies and support REST endpoints
              - name: envoy.grpc_json_transcoder
                config:
                  # Provide the descriptor file generated by protobuf
                  proto_descriptor: "/etc/envoy/pcs-model.desc"
                  # Which services Envoy needs to support
                  services: ["com.my.service"]
                  print_options:
                    add_whitespace: true
                    always_print_primitive_fields: false
                    always_print_enums_as_ints: false
                    preserve_proto_field_names: true

              - name: envoy.router
                config:
                    suppress_envoy_headers: true
                    upstream_log:
                        name: envoy.file_access_log
                        filter:
                            not_health_check_filter: {}
                        config:
                            path: /host/tmp/request_log.txt

  # backends backing our envoy
  clusters:
  # this must match the cluster name provided in the routes section
  - name: my_service
    connect_timeout: 1.25s
    # static route
    type: static
    http2_protocol_options: {}
    # localhost is our cluster
    hosts:
    - socket_address:
        address: 127.0.0.1
        port_value: 8000
    # configure health check for the cluster
    health_checks:
    # health check section deleted

stats_sinks:
  - name: "envoy.statsd"
    config:
      address:
        socket_address:
          protocol: "UDP"
          address: "127.0.0.1"
          port_value: "8125"

stats_flush_interval:
  seconds: 60

watchdog:
  kill_timeout: 90s
  multikill_timeout: 90s
kiranisaac commented 5 years ago

Update: increasing per_connection_buffer_limit_bytes fixed the issue. I still think Envoy shouldn't segfault in this case.
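
For reference, a minimal sketch of this workaround, trimmed to the relevant fields of the config above (the 10MB value is the one mentioned later in this thread; it is an example, not a recommendation):

# sketch only: per_connection_buffer_limit_bytes can be raised on both the
# listener (downstream connections) and the cluster (upstream connections);
# the default for each is 1MiB
static_resources:
  listeners:
  - name: listener1
    per_connection_buffer_limit_bytes: 10485760  # 10 MiB, example value
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    # ... filter_chains as in the original config ...
  clusters:
  - name: my_service
    per_connection_buffer_limit_bytes: 10485760  # 10 MiB, example value
    connect_timeout: 1.25s
    type: static
    # ... rest as in the original config ...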

mattklein123 commented 5 years ago

cc @lizan.

lizan commented 5 years ago

Looks like a flow control issue? cc @alyssawilk

alyssawilk commented 5 years ago

At first glance: we're getting data from upstream, we go over the limit, send an error response, and call doDeferredStreamDestroy, which triggers router cleanup and deletes the upstream request while we're still under the stack of the upstream request. That seems pretty sketchy from a lifetime perspective. I'll have to poke around tomorrow and see what changed and/or if I can reproduce. I know we have tests where we go over the limit and fail, and I'm not sure what's different between this setup and those tests such that they pass and this doesn't.

alyssawilk commented 5 years ago

Oh, I think I know what happens, and it happens when we sendLocalReply after receiving response headers for a gRPC response.

When the router decodes headers, it latches a pointer to the upstream headers, then hands off ownership to the HCM. If the HCM does sendLocalReply after that, the sendLocalReply is_grpc branch creates gRPC-compatible response headers, overwriting the headers the HCM has and thus invalidating the router filter's pointer.

We don't have a response-data-too-large test which has both access logs (accessing the headers after they were destroyed) and gRPC (doing the header overwrite), but I'll get started on one.
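
For context on the access-log side of the trace: the config above attaches an upstream_log to envoy.router, and the crash lands in ResponseHeaderFormatter::format, i.e. while the access log formatter is reading response headers. Below is a hedged sketch of a logger config that exercises that path; the format string is illustrative rather than the reporter's (their config sets no format, and Envoy's default format also reads response headers via %RESP(...)%), and the not_health_check_filter from the original config is omitted.

              - name: envoy.router
                config:
                    upstream_log:
                        name: envoy.file_access_log
                        config:
                            path: /host/tmp/request_log.txt
                            # any %RESP(...)% operator makes the formatter read response
                            # headers at log time; in this crash the router's latched
                            # header pointer is already invalid by then
                            format: "%START_TIME% %RESPONSE_CODE% %RESP(grpc-status)%\n"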

@mattklein123 we could fix either by updating the headers in place for the gRPC response path, or by making response headers accessible via callbacks and not having the router latch. I feel like the latter is more future-proof. Thoughts?

mattklein123 commented 5 years ago

@mattklein123 we could fix either by updating the headers in place for the gRPC response path, or by making response headers accessible via callbacks and not having the router latch. I feel like the latter is more future-proof. Thoughts?

Yeah I've wanted to get rid of all headers latching for quite some time and have headers/trailers only be accessible via callbacks like body. More work but definitely the much better change IMO.

kiranisaac commented 5 years ago

@alyssawilk / @mattklein123 : For now we have per_connection_buffer_limit_bytes set to 10MB (in both the listeners and clusters sections). Is this the right workaround? Is it a bad idea (performance/memory)? Do we need to set it in both places?

Additionally, is this setting the actual buffer size or just an upper bound on it? Our payload size will depend on the page size requested by the user, so we don't know in advance how much buffer is required.

alyssawilk commented 5 years ago

Having a large per connection buffer limit basically makes it easier for your server to OOM, either accidentally by sending many large responses or intentionally if malicious clients try to DoS you.

If you're hitting the response-too-large callback it means you're using buffering filters (where headers will not be sent to users until Envoy has received and buffered the whole response) and some of your responses are too large, so if the bug were fixed you'd be failing to proxy your own responses. So figuring out how large your responses actually are and capping the buffers there may be optimal.

lizan commented 5 years ago

The gRPC-JSON transcoder is a buffering filter: if the transcoded gRPC call is a unary call, it will buffer until the upstream sends the whole response and trailers.
