centrifugal / centrifugo

Scalable real-time messaging server in a language-agnostic way. Self-hosted alternative to Pubnub, Pusher, Ably. Set up once and forever.
https://centrifugal.dev
Apache License 2.0
8.4k stars 594 forks source link

Audit error sources and identify ones which are really fatal. #55

Closed banks closed 7 years ago

banks commented 8 years ago

Since my comment a few days ago, centrifugo now disconnects with reconnect: false for almost any error that occurs while processing a message. The vast majority of these are not fatal errors and should not prevent the client from retrying.

For example, now I my client connects and the call to subscribe fails due to a temporary redis networking glitch, the client is permanently disconnected.

When trying to figure out how different errors should behave I realise that there are 3 different error scopes:

  1. session level errors
    • errors that should/will never be recovered from by this client in this state
  2. connection/transport level errors
    • physical: websocket is closed/fails
    • connection protocol issue
    • unsupported protocol version (should be fatal)
  3. individual api call failure
    • bad request (don't retry API call, connection OK)
    • transient backend failure (retry API call, connection OK*)

In case of 3, the retry option actually applies to API call not whole connection in most cases. I put * next to connection OK in last case because it's arguable that you should disconnect them here. The benefit of disconnect is that a network partition where one centrifugo host is unable to toal to backend but others are will recover itself. But in general it means disconnect/reconnect overhead for no reason if the connection failure was just temporary. Could have hybrid that keeps track of how many successive API calls failed on a connection and disconnect after 2 or 3 consecutive ones? That's out of scope anyway for now current behaviour is OK as long as we have defined ways to handle each type and tell the difference.

Finally, one option would be to return distinct error codes for each type of error and let client (ours or end-user) choose to retry or not.

That might make some sense, however the downside is that there will be long-lived clients out there and you risk having them retry indefinitely when you introduce a new fatal error into server later (or vice-versa) because they don't know.

So I like the current model of returning advice on whether client should retry requests and/or reconnect in each case. As that is future proof.

FZambia commented 8 years ago

As we already discussed in gitter chat error handling can really be better.

I am preparing v1.3.0 release at moment and for now commited small change that must help to prevent permanent disconnect in case of internal server error.

FZambia commented 8 years ago

Now in v1.3.2 errors can contain optional advice (fix or retry). fix means that developer using Centrifugo did something wrong, i.e. using unknown namespace in channels or channels too long - so developer should fix this in order things to work properly. retry signals temporary error that could theoretically be successful if client try executing command again. These advices do not handled by javascript client automatically but included in error context.

disconnect command with reconnect boolean flag used to control client's connection - disconnect forever or disconnect with reconnect. ErrInternalServerError results in sending disconnect command with reconnect.

FZambia commented 7 years ago

I think this is not actual anymore - many improvements here have been done, some will be released with Centrifugo v1.6.0, let's close this.