Closed mandarjog closed 3 years ago
+1, I've seen this also and it's confusing. I think the log line as is should actually be debug, it's not very useful. I would suggest some more intelligent logging that would be a warning and also with a more useful error message. Something like your (3) above. cc @ramaraochavali who also might have an opinion on this and is familiar with this code.
+1 (3) seems like a good option
While I agree that the current message is quite confusing to the users, I think the fact that we reconnected to a control plane (possibly even a different instance than before) should be visible in logs at the default log level.
In the past, we saw errors that were a result of a reconnect to the control plane (e.g. when hash was non-deterministic), which would be extremely hard to figure out if the reconnect wasn't visible in the logs and easily correlated with those errors, so that noise provides a lot of value.
I think we need some control plane identifier. The transport address (IP) is also hard to debug.
These warning logs are also causing a lot of noise in our current system. Any concerns if we move it to a debug/info level log for now while making the classification strategy more sophisticated?
I'm thinking of retrying #14616 and would like to align on the logic first.
The overall proposed approach is:

- If the close status is `Ok`, use debug log level.
- If the status is non-urgent (`DeadlineExceeded`/`Internal`/`Unavailable`) and different from the last non-urgent status, add "(previously ... since ...)" to the log message (and use warn log level).
- If the status is non-urgent and has been returned for more than 30s, add "since ..." to the log message (and use warn log level).

Here is a ranked exhaustive decision tree table:

| Close status | Log level | Log message | Saved close status |
|---|---|---|---|
| Ok | debug | default | clear |
| Urgent | warn | default | clear |
| Different from saved close status | warn | previous since | set |
| New (nothing is saved) | debug | default | set |
| Same as previous (<30s) | debug | default | keep |
| Same as previous (>30s) | warn | since | keep |
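The decision table above can be sketched as a single classification function. This is an illustrative Python sketch, not Envoy's implementation (which would live in C++ near `grpc_stream.h`); the function name and the tuple-based return value are assumptions for illustration.

```python
# "Non-urgent" (retriable) statuses per the proposal above.
RETRIABLE = {"DeadlineExceeded", "Internal", "Unavailable"}

def classify(status, saved, now):
    """Return (log_level, message_type, action_on_saved_status) per the
    decision tree table. `saved` is None or a (status, first_seen_ts) tuple
    of the previously saved close status; `now` is seconds."""
    if status == "Ok":
        return ("debug", "default", "clear")
    if status not in RETRIABLE:              # urgent status
        return ("warn", "default", "clear")
    if saved is None:                        # new: nothing is saved
        return ("debug", "default", "set")
    saved_status, first_seen = saved
    if status != saved_status:               # different from saved close status
        return ("warn", "previous since", "set")
    if now - first_seen < 30:                # same as previous, <30s
        return ("debug", "default", "keep")
    return ("warn", "since", "keep")         # same as previous, >30s
```

The key property is that the common, transient case (a retriable status that just appeared, or reappeared within 30s) stays at debug, while anything persistent, urgent, or changing escalates to warn.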
Possible log messages:

| Type | Message |
|---|---|
| default | gRPC config stream closed: {status}, {message} |
| since | gRPC config stream closed since {duration}ms ago: {status}, {message} |
| previous since | gRPC config stream closed: {status}, {message} (previously {close_status}, {close_message} since {duration}ms ago) |
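Rendered directly from the message templates in the table above, the three variants would look like this (function names are hypothetical; the placeholders `{status}`, `{message}`, `{duration}` are taken verbatim from the table):

```python
def default_msg(status, message):
    # "default" variant from the table
    return f"gRPC config stream closed: {status}, {message}"

def since_msg(status, message, duration_ms):
    # "since" variant: the same failure has persisted for duration_ms
    return f"gRPC config stream closed since {duration_ms}ms ago: {status}, {message}"

def previous_since_msg(status, message, prev_status, prev_message, duration_ms):
    # "previous since" variant: the close status changed; show the saved one too
    return (f"gRPC config stream closed: {status}, {message} "
            f"(previously {prev_status}, {prev_message} since {duration_ms}ms ago)")
```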
@mattklein123 @ramaraochavali any thoughts?
This looks good to me.
At a high level this looks like the right track to me. I have a couple of questions:

1) Can you clarify urgent vs. non-urgent statuses? I'm not completely clear on that.
2)
> If the status is non-urgent and has been returned for more than 30s, add "since ..." to the log message (and use warn log level)

Perhaps this should just be logged every 30s vs. repeatedly after 30s? Is that what you mean?
In this model the most frequent "gRPC stream closed" message will appear at debug.

I think retryable vs. non-retryable, or actionable vs. non-actionable, is the classification. I don't think "urgent" is what we are trying to classify.
Thank you!
re: (2) Because of the `JitteredExponentialBackOffStrategy`, that should be on average every 15s after the initial 30s passes. Otherwise, either we'd be managing two timestamps (when the first failure happened, and when we last warned), or the time delta in the message wouldn't include the full duration since the first failure happened.

re: (1) I'm still confused. Can you clarify which gRPC statuses are in which category? re: (2) does the backoff need to be configured? Or is it always on?
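The "on average every 15s" claim can be checked with a simplified model of a jittered exponential backoff: assuming (as in Envoy's `JitteredExponentialBackOffStrategy`, simplified here) that each retry's ceiling doubles up to a cap and the actual delay is jittered uniformly below the ceiling, delays at a 30s cap average ~15s. This sketch is illustrative only; the function name and parameters are not Envoy's API.

```python
import random

def jittered_backoff_delays(base_ms, max_ms, n, rng=random):
    # Simplified jittered exponential backoff: each retry's ceiling doubles
    # (capped at max_ms) and the actual delay is drawn uniformly from
    # [0, ceiling). Once the ceiling hits max_ms, delays average max_ms / 2,
    # i.e. ~15s for a 30s cap.
    delays, ceiling = [], float(base_ms)
    for _ in range(n):
        delays.append(rng.uniform(0, ceiling))
        ceiling = min(ceiling * 2, max_ms)
    return delays
```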
`DeadlineExceeded`, `Internal`, and `Unavailable` are considered retriable.

OK sounds good.
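For reference, the numeric gRPC status codes behind the names used in this thread (these code-to-name mappings come from the gRPC specification; the `is_retriable` helper is a hypothetical name mirroring the classification above):

```python
# gRPC status codes per the gRPC spec, covering the statuses discussed here.
GRPC_STATUS = {
    0: "OK",
    4: "DEADLINE_EXCEEDED",
    5: "NOT_FOUND",
    13: "INTERNAL",
    14: "UNAVAILABLE",
}

# DeadlineExceeded, Internal, Unavailable -- the "retriable" set above.
RETRIABLE_CODES = {4, 13, 14}

def is_retriable(code: int) -> bool:
    return code in RETRIABLE_CODES
```

This explains the codes from the issue body: `14 == Unavailable` is retriable, while `5 == NotFound` is not.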
When an XDS server terminates connection to Envoy, Envoy logs at a warning level.
https://github.com/envoyproxy/envoy/blob/28e8d7711b91d6fa3b76827441d4285a6cd75717/source/common/config/grpc_stream.h#L100
This results in an error message like the following, with a status such as `14 == Unavailable` or `5 == NotFound`.
**Problem**

At least some of the statuses (e.g. `14 == Unavailable`) are retryable and do not require user intervention unless there are repeated failures. Management servers (like istiod) issue a reset every 10-30 minutes to get even load distribution, which causes a warning message to be emitted in the Envoy logs. In the vast majority of cases this warning is not actionable.

We have seen customers alarmed by needless "warnings" in proxy logs. Customers take warnings seriously.
**Possible solutions**

1. Change the log level of this message to `INFO`, so that when users run the proxy with `WARN` they are not alarmed by these messages.
3. `grpc_stream` already uses a `backoff_strategy`, which means that if after a full backoff cycle Envoy cannot connect, then it should be a warning. This also means that the status and message captured on the first failure need to be stored along with their timestamp. Emit all that info if the error persists after a backoff cycle.

@kyessenov @PiotrSikora
Example output from istio-proxy:1.9-pre
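The backoff-based solution discussed in this thread (store the first failure's status, message, and timestamp; escalate to a warning only if the failure persists past a full backoff cycle) could be sketched like this. The class name, the 30s cycle length, and the level/message return shape are all assumptions for illustration, not Envoy's actual design.

```python
class StreamFailureTracker:
    """Sketch: remember the first failure and only warn once the stream has
    been failing longer than one full (hypothetical) backoff cycle."""

    def __init__(self, backoff_cycle_s=30.0):
        self.backoff_cycle_s = backoff_cycle_s
        self.first = None  # (status, message, timestamp) of the first failure

    def on_stream_closed(self, status, message, now):
        # Returns (log_level, log_message) for this close event.
        if self.first is None:
            self.first = (status, message, now)
            return ("info", f"gRPC config stream closed: {status}, {message}")
        s0, m0, t0 = self.first
        if now - t0 >= self.backoff_cycle_s:
            return ("warn",
                    f"gRPC config stream closed: {status}, {message} "
                    f"(failing since first error {s0}, {m0}, {now - t0:.0f}s ago)")
        return ("info", f"gRPC config stream closed: {status}, {message}")

    def on_stream_established(self):
        self.first = None  # connection recovered; clear the saved failure
```

A periodic istiod reset every 10-30 minutes would only ever hit the `info` path here, since the stream reconnects well before a cycle elapses.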