Ambiguity around which connection errors are retriable

paddybyers commented 10 months ago

A recent issue in ably-cocoa highlighted the problem that a specific TLS-related connection error response was being treated as fatal instead of retriable. If this error occurs in a connection attempt, the connection would then move to failed.

However, there are valid arguments for some TLS failures not being retried. Suppose for example that a platform only supports TLS 1.1, and the Ably endpoint only admits TLS 1.2+, then you're going to get a TLS handshake error. If you keep retrying in that situation, then it's still never going to succeed in connecting. So there might be an argument to detect that specific case, and not attempt retrying; however, the risk with that is that you also exclude things that you really do want to retry, which is the case we have in the issue linked. So we have to err on the side of retrying in any situation where the possible underlying cause is ambiguous, and then we have to try to eliminate the possibility of things like protocol incompatibility by other means.

Similar arbitrary decisions as to what is retried are made in ably-java. The spec is currently ambiguous about what errors should be retried. The language of RTN14d is also quite confusing. It attempts to be prescriptive about what’s a recoverable error “(ie a network failure, a timeout such as RTN14c, or a disconnected response, other than a token failure RTN14b)“, whilst not really describing what should be considered a “network failure”.

sync-by-unito[bot] commented 10 months ago

➤ Automation for Jira commented:

The link to the corresponding Jira issue is https://ably.atlassian.net/browse/SDK-3927

paddybyers commented 10 months ago

Internal discussion: https://github.com/ably/realtime/issues/5283

paddybyers commented 10 months ago

cc @lawrence-forooghian

SimonWoolf commented 10 months ago

The language of RTN14d is also quite confusing. It attempts to be prescriptive about what’s a recoverable error “(ie a network failure, a timeout such as RTN14c, or a disconnected response, other than a token failure RTN14b)“, whilst not really describing what should be considered a “network failure”.

This should definitely be improved, yes. The spec is clearer about what failures should be retried to a fallback host: RSC15l, including RSC15l1: "host unresolvable or unreachable". I'd have thought TLS issues definitely fall under that. (That's a spec item about the rest api but is incorporated into the realtime api by RTN17f, which adds DISCONNECTED with 5xx).

The set of failures that should be retried to a fallback host is by definition a subset of the set of failures that should be retried. So perhaps RTN14d should reference RTN17f, and additionally list any conditions that should result in a retry but without changing the endpoint endpoint (which I guess think is just 4xx disconnecteds?)

paddybyers commented 10 months ago

So we're clear that the philosophy intended in the spec, perhaps able to be improved, is correct:

If Ably specifically gives you an error response, and that error implies that you shouldn't retry, then you respect that. In every other situation, you retry. So:

any error response that is not from an Ably realtime endpoint (which is determined by having an Ably error code, either in the response body or a header, or is a ProtocolMessage containing an error), is retriable;
an error response that is from an Ably endpoint:
- is a token error if it has a code in the token range, and should be retried after a token renewal sequence;
- otherwise, if it has a 4xx status code, is non-retriable
- otherwise, is retriable.

This will mean that some connection problems (such as the TLS protocol version incompatibility example above) will give rise to indefinite doomed retries, but this is an accepted penalty for giving ourselves the best chance of connecting after transient errors.

There are also some responses from non-realtime elements of the infrastructure that we want clients to respect, such as rate-limiting responses from the WAF, but there might be others. We need to decide how to address these in the spec.

lawrence-forooghian commented 10 months ago

@paddybyers I think I mentioned in our Slack chat that it wasn't clear to me what it meant by "an error response" in the context of a Realtime client using a WebSocket transport. Does it imply that the client needs to pay attention to the response code and headers received in the WebSocket handshake?

lawrence-forooghian commented 10 months ago

(@paddybyers when I mentioned this to you, you rephrased the logic in terms of WebSocket API close and error events.)

paddybyers commented 10 months ago

it wasn't clear to me what it meant by "an error response" in the context of a Realtime client using a WebSocket transport. Does it imply that the client needs to pay attention to the response code and headers received in the WebSocket handshake?

I attempted to reflect that in what I wrote above, "or is a ProtocolMessage containing an error".

If realtime is trying to tell you something, and you're a ws connection, then it does it by sending a ProtocolMessage.

If you get a ws-level event (close, error), without having received a prior ProtocolMessage, then you have to treat that as a retriable error.

A ProtocolMessage whose action is error, and that does not have a channel, is treated as a connection-level error indication, and the rules above apply, ie if msg.error.statusCode is 4xx then it's non-retriable.

A ProtocolMessage whose action is error and does have a channel, is treated as a channel-level error indication.

A ProtoclMessage whose action is disconnected is by definition retriable.

lawrence-forooghian commented 10 months ago

A ProtocolMessage whose action is error, and that does not have a channel, is treated as a connection-level error indication, and the rules above apply, ie if msg.error.statusCode is 4xx then it's non-retriable.

A ProtocolMessage whose action is error and does have a channel, is treated as a channel-level error indication.

A ProtoclMessage whose action is disconnected is by definition retriable.

We already have spec points that describe expected behaviour upon receiving ERROR and DISCONNECTED ProtocolMessages. Do you believe that the spec needs expanding in this sense beyond what's already there?

paddybyers commented 10 months ago

We already have spec points that describe expected behaviour upon receiving ERROR and DISCONNECTED ProtocolMessages. Do you believe that the spec needs expanding in this sense beyond what's already there?

Without trawling through the detail now, I think the answer is:

the handling of disconnections and declined connections, in the case that there is a ProtocolMessage, is well defined;
the possible ambiguity of what do do with an error or close event, when there was no ProtocolMessage, is what we need to check

ably / specification

Ambiguity around which connection errors are retriable #171