Open jmuia opened 4 years ago
cc @jmarantz @ramaraochavali there must be some timing issue in the histogram merging code?
I think it is not histogram merge code, If I am reading this correctly It is actually failing at completion callback https://github.com/envoyproxy/envoy/blob/master/source/server/server.cc#L182 which is flushStatsInternal
. Are you thinking it could have been caused by some timing issue with when the completion callback is called?
Yeah there must be some type of lifetime issue during the merge/update sequence. I haven't looked in detail though.
@mattklein123 and I were chatting earlier, and he made me aware of this issue. We are seeing a crash thanks to an assertion on HistogramStatisticsImpl::refresh
https://github.com/lyft/envoy-mobile/issues/688.
@ramaraochavali let me know if you think these might be related. Given that most people deploying server-side compile assertions out, but we (envoy mobile) don't it might be the case that this issue would hit an assertion before segfaulting if assertions were compiled.
As I mentioned in https://github.com/lyft/envoy-mobile/issues/688#issuecomment-597747745 I can't quite reason why either of the two assertions on refresh would ever be hit. lmk if you have any thoughts.
@junr03 To me this one looks little different i.e. it seems to be happening after histogram merge is done and we actually flush the stats some where here https://github.com/envoyproxy/envoy/blob/master/source/server/server.cc#L167
As I mentioned in lyft/envoy-mobile#688 (comment) I can't quite reason why either of the two assertions on refresh would ever be hit. lmk if you have any thoughts.
Hmm.. AFAICT, this should not definitely happen for Quantiles. They are fixed and should be same always unless hist_approx_quantile
is some how messing up with the data passed in, during previous merge? At other places we always iterate by supported_quantiles so may not be creating issue? and was there an upgrade of libcircllhist
in the recent past?
@ramaraochavali I am responding in https://github.com/lyft/envoy-mobile/issues/688 to not pollute this with Envoy Mobile details.
Title: Spurious segmentation fault in flushMetricsToSinks -> MetricSnapshotImpl
Description: Envoy occasionally crashes when flushing metrics, which is unexpected.
Repro steps: Unsure how to reproduce. It is seen spuriously (usually <10 times per day in production fleet).
It has been observed in:
1.11.2
+ https://github.com/envoyproxy/envoy/pull/9121 + internal patch1.12.2
+ https://github.com/envoyproxy/envoy/pull/9121 + https://github.com/envoyproxy/envoy/pull/9509Config:
Call Stack: From the 1.12.2 build described above: