envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Spurious segmentation fault in flushMetricsToSinks -> MetricSnapshotImpl #10179

Open jmuia opened 4 years ago

jmuia commented 4 years ago

Title: Spurious segmentation fault in flushMetricsToSinks -> MetricSnapshotImpl

Description: Envoy occasionally crashes when flushing metrics, which is unexpected.

Repro steps: Unsure how to reproduce. It is seen spuriously (usually fewer than 10 times per day across the production fleet).

It has been observed in:

Config:

# stats configuration
stats_flush_interval: 30s
stats_sinks:
  - name: "envoy.dog_statsd"
    config:
      address:
        socket_address:
          address: "127.0.0.1"
          port_value: 8200
stats_config:
  stats_tags:
    - tag_name: "envoy_dst_zone"
      regex: 'cluster\.(?:[^.]+\.)?zone\.[-a-z0-9]+\.(([-a-z0-9]+)?\.)'
    - tag_name: "envoy_src_zone"
      regex: 'cluster\.(?:[^.]+\.)?zone\.(([-a-z0-9]+)?\.)'
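For reference, here is a small stand-alone check (a hedged illustration with a made-up stat name, not Envoy's actual extraction code) of what the envoy_dst_zone regex above captures, assuming the usual tag-extraction semantics where the first capture group is stripped from the stat name and the nested second group becomes the tag value:

// Hypothetical stat name and stand-alone regex check; illustration only.
#include <iostream>
#include <regex>
#include <string>

int main() {
  // Hypothetical zone-aware cluster stat name; not taken from the issue.
  const std::string stat = "cluster.service_a.zone.us-east-1a.us-east-1b.upstream_rq_total";
  const std::regex dst_zone(R"(cluster\.(?:[^.]+\.)?zone\.[-a-z0-9]+\.(([-a-z0-9]+)?\.))");

  std::smatch m;
  if (std::regex_search(stat, m, dst_zone)) {
    std::cout << "removed from name: " << m[1] << "\n";  // prints "us-east-1b."
    std::cout << "tag value:         " << m[2] << "\n";  // prints "us-east-1b"
  }
  return 0;
}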

Call Stack: From the 1.12.2 build described above:

[2020-02-26 18:12:18.680][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:83] Caught Segmentation fault, suspect faulting address 0x0
[2020-02-26 18:12:18.681][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:70] Backtrace (use tools/stack_decode.py to get line numbers):
[2020-02-26 18:12:18.681][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:71] Envoy version: c3c7d68a4844af3cb125330ec6527243571aa822/1.12.2.2/Clean/RELEASE/BoringSSL
[2020-02-26 18:12:18.682][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #0: __restore_rt [0x7f54b8ab6390]
[2020-02-26 18:12:18.688][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #1: Envoy::Server::MetricSnapshotImpl::MetricSnapshotImpl() [0x55fe11a57415]
[2020-02-26 18:12:18.690][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #2: Envoy::Server::InstanceUtil::flushMetricsToSinks() [0x55fe11a57b32]
[2020-02-26 18:12:18.693][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #3: Envoy::Server::InstanceImpl::flushStatsInternal() [0x55fe11a57d7f]
[2020-02-26 18:12:18.696][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #4: Envoy::Stats::ThreadLocalStoreImpl::mergeInternal() [0x55fe11ac41b5]
[2020-02-26 18:12:18.699][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #5: std::_Function_handler<>::_M_invoke() [0x55fe11ac422b]
[2020-02-26 18:12:18.702][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #6: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x55fe11a9ae23]
[2020-02-26 18:12:18.705][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #7: event_process_active_single_queue.isra.33 [0x55fe11db34b0]
[2020-02-26 18:12:18.707][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #8: event_base_loop [0x55fe11db3b7f]
[2020-02-26 18:12:18.710][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #9: Envoy::Server::InstanceImpl::run() [0x55fe11a60eba]
[2020-02-26 18:12:18.713][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #10: Envoy::MainCommonBase::run() [0x55fe1144abfc]
[2020-02-26 18:12:18.716][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #11: main [0x55fe11412453]
[2020-02-26 18:12:18.716][2563][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #12: __libc_start_main [0x7f54b86fb830]
mattklein123 commented 4 years ago

cc @jmarantz @ramaraochavali. There must be some timing issue in the histogram merging code?

ramaraochavali commented 4 years ago

I think it is not the histogram merge code. If I am reading this correctly, it is actually failing in the completion callback https://github.com/envoyproxy/envoy/blob/master/source/server/server.cc#L182, which is flushStatsInternal. Are you thinking it could have been caused by some timing issue with when the completion callback is called?
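To make that sequence concrete, below is a minimal, compilable sketch of the merge-then-flush flow the backtrace frames map onto. It is not the actual Envoy source; Store, MetricSnapshot, and Sink are simplified stand-ins, and the frame references in the comments come from the backtrace above.

#include <cstdint>
#include <functional>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Counter { std::string name; uint64_t value = 0; };
struct Histogram { std::string name; std::vector<double> merged_quantiles; };

// Stand-in for the stats store: owns the metrics and merges per-thread histogram data.
struct Store {
  std::vector<std::shared_ptr<Counter>> counters;
  std::vector<std::shared_ptr<Histogram>> histograms;

  // Roughly ThreadLocalStoreImpl::mergeHistograms(): merge, then run the completion
  // callback (in real Envoy the merge is first posted across the worker dispatchers).
  void mergeHistograms(const std::function<void()>& merge_complete_cb) {
    for (auto& h : histograms) {
      h->merged_quantiles = {0.5, 0.95, 0.99};  // pretend merge
    }
    merge_complete_cb();  // ~ frames #4/#5 in the backtrace
  }
};

// Stand-in for MetricSnapshotImpl: copies metrics out of the store at flush time.
// If anything it touches has already been freed, this constructor is where the
// segfault would show up (frame #1 above).
struct MetricSnapshot {
  explicit MetricSnapshot(const Store& store) {
    for (const auto& c : store.counters) counters.push_back(*c);
    for (const auto& h : store.histograms) histograms.push_back(*h);
  }
  std::vector<Counter> counters;
  std::vector<Histogram> histograms;
};

struct Sink {
  void flush(const MetricSnapshot& snapshot) {
    std::cout << "flushing " << snapshot.counters.size() << " counters and "
              << snapshot.histograms.size() << " histograms\n";
  }
};

// ~ InstanceUtil::flushMetricsToSinks() (frame #2): build one snapshot, hand it to every sink.
void flushMetricsToSinks(std::vector<Sink>& sinks, const Store& store) {
  MetricSnapshot snapshot(store);
  for (auto& sink : sinks) {
    sink.flush(snapshot);
  }
}

int main() {
  Store store;
  auto counter = std::make_shared<Counter>();
  counter->name = "cluster.upstream_rq_total";
  counter->value = 1;
  store.counters.push_back(counter);

  auto histogram = std::make_shared<Histogram>();
  histogram->name = "cluster.upstream_rq_time";
  store.histograms.push_back(histogram);

  std::vector<Sink> sinks{Sink{}};

  // ~ InstanceImpl::flushStatsInternal() (frame #3): merge histograms, then flush
  // to the sinks from inside the merge-completion callback.
  store.mergeHistograms([&]() { flushMetricsToSinks(sinks, store); });
  return 0;
}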

mattklein123 commented 4 years ago

Yeah there must be some type of lifetime issue during the merge/update sequence. I haven't looked in detail though.

junr03 commented 4 years ago

@mattklein123 and I were chatting earlier, and he made me aware of this issue. We are seeing a crash in Envoy Mobile due to an assertion in HistogramStatisticsImpl::refresh: https://github.com/lyft/envoy-mobile/issues/688.

@ramaraochavali let me know if you think these might be related. Given that most people deploying server-side compile assertions out, but we (Envoy Mobile) don't, it might be the case that this issue would hit an assertion before segfaulting if assertions were compiled in.

As I mentioned in https://github.com/lyft/envoy-mobile/issues/688#issuecomment-597747745, I can't quite reason why either of the two assertions on refresh would ever be hit. Let me know if you have any thoughts.
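To illustrate the point about assertions being compiled out (a minimal sketch, not Envoy's actual ASSERT macro): in a typical optimized server build the check disappears entirely, so a bad pointer surfaces later as a segfault, while a build with assertions enabled (like Envoy Mobile's) aborts at the check with a readable message instead.

#include <cstdio>
#include <cstdlib>

#ifdef NDEBUG
#define MY_ASSERT(cond) ((void)0)  // release-style build: the check disappears entirely
#else
#define MY_ASSERT(cond)                                                      \
  do {                                                                       \
    if (!(cond)) {                                                           \
      std::fprintf(stderr, "assert failure: %s\n", #cond);                   \
      std::abort();                                                          \
    }                                                                        \
  } while (0)
#endif

int main() {
  const double* merged_histogram_data = nullptr;  // hypothetical pointer that should never be null
  MY_ASSERT(merged_histogram_data != nullptr);    // assert-enabled build: aborts here, loudly
  // In a build where the check above compiles away, later code that dereferences the
  // pointer would crash with a plain segfault instead, which matches what this issue sees.
  (void)merged_histogram_data;  // silence unused-variable warnings when the assert is compiled out
  return 0;
}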

ramaraochavali commented 4 years ago

@junr03 To me this one looks a little different, i.e. it seems to be happening after the histogram merge is done, when we actually flush the stats somewhere around here: https://github.com/envoyproxy/envoy/blob/master/source/server/server.cc#L167

> As I mentioned in lyft/envoy-mobile#688 (comment), I can't quite reason why either of the two assertions on refresh would ever be hit. Let me know if you have any thoughts.

Hmm.. AFAICT, this definitely should not happen for quantiles. They are fixed and should always be the same, unless hist_approx_quantile is somehow messing up the data passed in during a previous merge? In other places we always iterate by supported_quantiles, so that may not be creating the issue. Also, was there an upgrade of libcircllhist in the recent past?
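A hedged sketch of the pattern being described here (stand-in names and values, not the actual HistogramStatisticsImpl code or the real libcircllhist API): the supported quantiles are a fixed list, the computed-values array is sized from that list up front, and each refresh just fills values in, so the two sizes should never be able to diverge.

#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for hist_approx_quantile(): fills one output value per input quantile.
void approx_quantiles(const std::vector<double>& quantiles_in, std::vector<double>& values_out) {
  for (std::size_t i = 0; i < quantiles_in.size(); ++i) {
    values_out[i] = quantiles_in[i] * 100.0;  // fake data, just to be deterministic
  }
}

struct HistogramStatistics {
  // Fixed for the lifetime of the process (illustrative values).
  const std::vector<double> supported_quantiles{0.0, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.999, 1.0};
  std::vector<double> computed_quantiles = std::vector<double>(supported_quantiles.size(), 0.0);

  void refresh() {
    // The kind of invariant an assert-enabled build would check before filling values in.
    assert(supported_quantiles.size() == computed_quantiles.size());
    approx_quantiles(supported_quantiles, computed_quantiles);
  }
};

int main() {
  HistogramStatistics stats;
  stats.refresh();  // should always be safe if the sizes really are fixed
  return 0;
}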

junr03 commented 4 years ago

@ramaraochavali I am responding in https://github.com/lyft/envoy-mobile/issues/688 so as not to pollute this issue with Envoy Mobile details.