This PR proposes the following change in behavior:
Any stream errors in send and recv are retried by the client.
Any stream creation errors are retried by the client.
A malformed or unsupported type URL causes the response channel to be closed, and the orchestrator then closes the downstream watches. The work to have the orchestrator close the watches when the response channel is closed was done earlier. This is not expected to cause a thundering herd issue.
The retry either eventually succeeds or keeps incrementing a stat, so we can add alarms on that stat.
TTL expiry can still cancel the top-level context, at which point the retry stops. The cancel flow has been demonstrated in the integration tests.
gRPC concepts on which the implementation relies:
gRPC takes care of connection management. Once the connection is established, the client needs to do no work to reconnect after backend closures, to create new connections on GOAWAY frames, or to handle any other connection-related scenario. The connection is picked back up when the backend starts responding.
Streams are ephemeral. Even when the connection is closed via context cancellation, stream creation fails with gRPC status code 14 (UNAVAILABLE), the same as when the backend is unavailable. We make sure to stop attempting stream creation when the context is cancelled.
Retrying stream creation on failure is not expensive, since stream creation is based on connection characteristics. We do intend to add backoff functionality, but that is planned for a separate PR.
Once a stream hits an error, it is aborted and cannot be reused. In our case, both recv and send have to use the same stream, so they need to coordinate to bail out together.