envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Reduce `warning - grpc Config Stream closed: 14` noise #14591

Closed · mandarjog closed this issue 3 years ago

mandarjog commented 3 years ago

When an XDS server terminates its connection to Envoy, Envoy logs the event at warning level.

https://github.com/envoyproxy/envoy/blob/28e8d7711b91d6fa3b76827441d4285a6cd75717/source/common/config/grpc_stream.h#L100
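
For context, the close callback at that line logs every stream termination at warning level, regardless of the gRPC status code. A paraphrased sketch (member and parameter names are approximate, not the exact source):

```cpp
// Paraphrased from GrpcStream::onRemoteClose() in grpc_stream.h; names approximate.
// Every stream close is logged at warning level, whatever the gRPC status is.
void onRemoteClose(Grpc::Status::GrpcStatus status, const std::string& message) override {
  ENVOY_LOG(warning, "{} gRPC config stream closed: {}, {}", service_method_.name(), status,
            message);
  // ... the stream is then torn down and a retry is scheduled via the backoff strategy.
}
```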

This results in an error message like the following (14 == Unavailable):

warning envoy config StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream

or 5 == NotFound

warning envoy config StreamAggregatedResources gRPC config stream closed: 5, Traffic Director configuration was not found for network "VPC_NAME" in project "PROJECT_NUMBER".
..
warning envoy config StreamAggregatedResources gRPC config stream closed: 5, Requested entity was not found.
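
For reference, the numeric codes in these messages (and in the log excerpts below) are standard gRPC status codes; a minimal decoder, purely for readability:

```cpp
#include <cstdint>
#include <string>

// Standard gRPC status codes that appear in the log excerpts in this issue.
std::string grpcStatusName(uint64_t code) {
  switch (code) {
  case 0:  return "OK";                // graceful close, e.g. periodic rebalance
  case 4:  return "DEADLINE_EXCEEDED";
  case 5:  return "NOT_FOUND";
  case 13: return "INTERNAL";
  case 14: return "UNAVAILABLE";
  default: return "code " + std::to_string(code);
  }
}
```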

Problem: at least some of these statuses (e.g. 14, Unavailable) are retryable and do not require user intervention unless the failures are repeated. Management servers (like istiod) reset the stream every 10-30 minutes to even out load distribution, and each reset causes a warning to be emitted in the Envoy logs. In the vast majority of cases this warning is not actionable.

We have seen customers alarmed by needless "Warnings" in proxy logs. Customers take warnings seriously.

Possible solutions

  1. Drop the disconnect log down to INFO, so that users who run the proxy at WARN are not alarmed by these messages.
  2. Be more sophisticated. grpc_stream already uses a backoff_strategy, so if Envoy still cannot connect after a full backoff cycle, that deserves a warning. This means the status and message captured on the first failure need to be stored along with a timestamp, and all of that information emitted if the error persists past a backoff cycle.
  3. Classify the gRPC status: some statuses could always be warnings, while others follow (1)/(2) (sketched below).
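
To illustrate option (3), here is a minimal, self-contained sketch of a status classifier. The specific code-to-level mapping is only a guess at what "retryable" might mean, roughly following the retriable set mentioned later in this thread (DeadlineExceeded, Internal, Unavailable); it is not Envoy policy.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical classification for option (3): choose a log level from the gRPC
// status of a closed ADS stream. The mapping is illustrative, not Envoy policy.
enum class LogLevel { Info, Warning };

LogLevel levelForCloseStatus(uint64_t grpc_status) {
  switch (grpc_status) {
  case 0:  // OK: graceful close (e.g. periodic rebalance by the management server)
  case 4:  // DEADLINE_EXCEEDED: transient, retried automatically with backoff
  case 13: // INTERNAL
  case 14: // UNAVAILABLE
    return LogLevel::Info;
  default: // e.g. 5 NOT_FOUND, 7 PERMISSION_DENIED: likely needs user action
    return LogLevel::Warning;
  }
}

int main() {
  for (uint64_t status : {0, 5, 13, 14}) {
    std::cout << "status " << status << " -> "
              << (levelForCloseStatus(status) == LogLevel::Warning ? "warning" : "info") << "\n";
  }
}
```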

@kyessenov @PiotrSikora

Example output from istio-proxy:1.9-pre

2021-01-08T23:31:25.648672Z info    sds resource:default pushed key/cert pair to proxy
2021-01-08T23:31:25.648717Z info    sds Dynamic push for secret default
2021-01-08T23:37:44.608472Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 13,
2021-01-09T00:09:59.313367Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 13,
2021-01-09T00:41:35.746643Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 13,
2021-01-09T01:08:54.651912Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 13,
2021-01-09T01:41:32.345960Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 13,
2021-01-09T02:09:29.151835Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 13,
2021-01-09T02:37:37.909292Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 13,
2021-01-09T03:06:46.651828Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 13,
2021-01-09T03:39:07.444015Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 13,
2021-01-09T04:12:15.233814Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 13,
2021-01-09T04:37:55.879321Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 13,
2021-01-09T04:37:56.188695Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
2021-01-09T04:37:57.703562Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: local reset
2021-01-09T04:37:59.232159Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: local reset
2021-01-09T04:38:02.711652Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: local reset
2021-01-09T05:06:05.443284Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 13,

istio-proxy-1.9-pre

2021-01-10T02:39:56.243603Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 0,
2021-01-10T02:39:56.536558Z info    xdsproxy    Envoy ADS stream established
2021-01-10T02:39:56.536715Z info    xdsproxy    connecting to upstream XDS server: istiod-latest.istio-system.svc:15012
2021-01-10T03:08:27.842505Z info    xdsproxy    disconnected from XDS server: istiod-latest.istio-system.svc:15012
2021-01-10T03:08:27.842644Z warn    xdsproxy    unregister stream
2021-01-10T03:08:27.843154Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 0,
2021-01-10T03:08:28.186165Z info    xdsproxy    Envoy ADS stream established
2021-01-10T03:08:28.186337Z info    xdsproxy    connecting to upstream XDS server: istiod-latest.istio-system.svc:15012
2021-01-10T03:36:42.072427Z info    xdsproxy    disconnected from XDS server: istiod-latest.istio-system.svc:15012
2021-01-10T03:36:42.072524Z warn    xdsproxy    unregister stream
2021-01-10T03:36:42.073011Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 0,
mattklein123 commented 3 years ago

+1, I've seen this also and it's confusing. I think the log line as-is should actually be debug; it's not very useful. I would suggest some more intelligent logging that would be a warning, with a more useful error message. Something like your (3) above. cc @ramaraochavali, who also might have an opinion on this and is familiar with this code.

ramaraochavali commented 3 years ago

+1 (3) seems like a good option

PiotrSikora commented 3 years ago

While I agree that the current message is quite confusing to the users, I think the fact that we reconnected to a control plane (possibly even a different instance than before) should be visible in logs at the default log level.

In the past, we saw errors that were the result of a reconnect to the control plane (e.g. when a hash was non-deterministic). Those would have been extremely hard to figure out if the reconnect weren't visible in the logs and easily correlated with the errors, so that noise provides a lot of value.

kyessenov commented 3 years ago

I think we need some control plane identifier. The transport address (IP) is also hard to debug.

jessicayuen commented 3 years ago

These warning logs are also causing a lot of noise in our current system. Any concerns if we move them to a debug/info level log for now, while the classification strategy is made more sophisticated?

tbarrella commented 3 years ago

I'm thinking of retrying #14616 and would like to align on the logic first.

The overall proposed approach is as follows.

Here is a ranked, exhaustive decision-tree table:

| Close status | Log level | Log message | Saved close status |
| --- | --- | --- | --- |
| Ok | debug | default | clear |
| Urgent | warn | default | clear |
| Different from saved close status | warn | previous since | set |
| New (nothing is saved) | debug | default | set |
| Same as previous (<30 s) | debug | default | keep |
| Same as previous (>30 s) | warn | since | keep |

Possible log messages:

| Type | Message |
| --- | --- |
| default | `gRPC config stream closed: {status}, {message}` |
| since | `gRPC config stream closed since {duration}ms ago: {status}, {message}` |
| previous since | `gRPC config stream closed: {status}, {message} (previously {close_status}, {close_message} since {duration}ms ago)` |
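
To make the table concrete, here is a self-contained sketch that encodes the same decision tree. The class name `CloseStatusTracker` and its interface are hypothetical (not an Envoy API), and urgency is passed in as a flag for simplicity rather than derived from the status.

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical sketch of the decision tree in the table above; illustrative only.
class CloseStatusTracker {
public:
  using Clock = std::chrono::steady_clock;

  // Decides what to log for a stream close. Returns the message; `out_warn` tells
  // the caller whether to log it at warning (true) or debug (false) level.
  std::string onStreamClosed(uint64_t status, const std::string& message, bool urgent,
                             bool& out_warn) {
    const auto now = Clock::now();
    if (status == 0) { // Ok: graceful close, clear any saved failure state.
      saved_.reset();
      out_warn = false;
      return defaultMessage(status, message);
    }
    if (urgent) { // Urgent status: warn immediately, clear saved state.
      saved_.reset();
      out_warn = true;
      return defaultMessage(status, message);
    }
    if (saved_ && saved_->status != status) { // Different from the saved close status.
      const std::string msg = previousSinceMessage(status, message, *saved_, now);
      saved_ = Saved{status, message, now};
      out_warn = true;
      return msg;
    }
    if (!saved_) { // New failure (nothing saved): remember it, stay quiet for now.
      saved_ = Saved{status, message, now};
      out_warn = false;
      return defaultMessage(status, message);
    }
    // Same status as before: escalate only once it has persisted for 30s.
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::milliseconds>(now - saved_->first_seen);
    out_warn = elapsed >= std::chrono::seconds(30);
    return out_warn ? sinceMessage(status, message, elapsed) : defaultMessage(status, message);
  }

private:
  struct Saved {
    uint64_t status;
    std::string message;
    Clock::time_point first_seen;
  };

  static std::string defaultMessage(uint64_t status, const std::string& message) {
    std::ostringstream os;
    os << "gRPC config stream closed: " << status << ", " << message;
    return os.str();
  }

  static std::string sinceMessage(uint64_t status, const std::string& message,
                                  std::chrono::milliseconds since) {
    std::ostringstream os;
    os << "gRPC config stream closed since " << since.count() << "ms ago: " << status << ", "
       << message;
    return os.str();
  }

  static std::string previousSinceMessage(uint64_t status, const std::string& message,
                                          const Saved& prev, Clock::time_point now) {
    const auto since =
        std::chrono::duration_cast<std::chrono::milliseconds>(now - prev.first_seen);
    std::ostringstream os;
    os << "gRPC config stream closed: " << status << ", " << message << " (previously "
       << prev.status << ", " << prev.message << " since " << since.count() << "ms ago)";
    return os.str();
  }

  std::optional<Saved> saved_;
};

int main() {
  CloseStatusTracker tracker;
  bool warn = false;
  std::cout << tracker.onStreamClosed(14, "no healthy upstream", /*urgent=*/false, warn)
            << (warn ? " [warn]" : " [debug]") << "\n";
}
```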

@mattklein123 @ramaraochavali any thoughts?

ramaraochavali commented 3 years ago

This looks good to me.

mattklein123 commented 3 years ago

At a high level this looks like the right track to me. I have a couple of questions:

1) Can you clarify urgent vs. non-urgent statuses? I'm not completely clear on that.

2) > If the status is non-urgent and has been returned for more than 30s, add "since ..." to the log message (and use warn log level)

Perhaps this should just be logged every 30s vs. repeatedly after 30s? Is that what you mean?

mandarjog commented 3 years ago

In this model, the most frequent “gRPC config stream closed” message will appear at debug.

I think retryable vs. non-retryable, or actionable vs. non-actionable, is the classification. I don’t think “urgent” is what we are trying to classify.

tbarrella commented 3 years ago

Thank you!

  1. mandarjog clarified this
  2. I think at first it would be simpler and not much worse to log on each failed retry after 30s. With the current implementation of JitteredExponentialBackOffStrategy, that should be on average every 15s after the initial 30s passes (rough sketch below). Otherwise, either we'd be managing two timestamps (when the first failure happened and when we last warned), or the time delta in the message wouldn't include the full duration since the first failure.
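
For intuition on the "average every 15s" figure, here is a rough simulation of a jittered exponential backoff. It assumes full jitter in [0, interval) with the interval doubling from an assumed 500 ms initial value up to a 30 s cap; this is an approximation for illustration, not JitteredExponentialBackOffStrategy's exact implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <random>

int main() {
  // Rough model of a jittered exponential backoff (assumed: full jitter in
  // [0, interval), interval doubling from 500 ms up to a 30 s cap).
  std::mt19937_64 rng{42};
  uint64_t interval_ms = 500;             // assumed initial retry interval
  const uint64_t max_interval_ms = 30000; // assumed cap

  uint64_t saturated_total_ms = 0, saturated_retries = 0;
  for (int retry = 0; retry < 10000; ++retry) {
    const uint64_t delay = std::uniform_int_distribution<uint64_t>(0, interval_ms - 1)(rng);
    if (interval_ms == max_interval_ms) { // once capped, the mean delay is ~cap/2 ~= 15 s
      saturated_total_ms += delay;
      ++saturated_retries;
    }
    interval_ms = std::min(interval_ms * 2, max_interval_ms);
  }
  std::cout << "mean retry delay once the interval is capped: "
            << saturated_total_ms / saturated_retries << " ms\n";
}
```
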
mattklein123 commented 3 years ago

re: (1) I'm still confused. Can you clarify which gRPC statuses are in which category? re: (2) does the backoff need to be configured? Or is it always on?

tbarrella commented 3 years ago
  1. DeadlineExceeded, Internal, and Unavailable are considered retriable
  2. Right now it's always on and not configurable
mattklein123 commented 3 years ago

OK sounds good.