anycable / anycable-go

AnyCable real-time server
https://anycable.io
MIT License

Connection to the gRPC server gets stuck #79

Closed avlazarov closed 4 years ago

avlazarov commented 5 years ago

AnyCable-Go version: 0.6.3
AnyCable gem version: 0.6.3 (same anycable-rails version)
gRPC gem version: 1.20.0
nginx version: 1.17.3

What did you do?

  1. Set up nginx with the grpc module. Gist link
  2. Run two instances of the gRPC server via bundle exec anycable --rpc_host 0.0.0.0:50052 and bundle exec anycable --rpc_host 0.0.0.0:50051
  3. Run an instance of anycable-go via anycable-go --headers=origin,cookie --debug=true --rpc_host=localhost:50050
  4. Subscribe to a channel. Nothing fancy here, using the JS ActionCable.subscribe.
  5. Perform actions on the subscription periodically every 10 seconds in JS – subscription.perform 'do_stuff'.
  6. Stop both gRPC anycable instances without de-registering them from nginx.
  7. On the next do_stuff action, the anycable-go server receives error 502 from nginx since both gRPC servers are gone.
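For reference, a minimal sketch of the nginx gRPC load-balancing setup the steps above describe (upstream name and addresses are hypothetical; the actual config is in the linked gist):

```nginx
# Hypothetical sketch of the setup above; the real config is in the gist.
upstream anycable_rpc {
    server 127.0.0.1:50051;
    server 127.0.0.1:50052;
}

server {
    listen 50050 http2;

    location / {
        grpc_pass grpc://anycable_rpc;
    }
}
```

With this in place, anycable-go points at nginx (localhost:50050) rather than at either RPC server directly, which is why nginx, not anycable-go, decides when a backend is gone.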

What did you expect to happen?

The anycable-go server to raise an error similar to when no connection to the gRPC is available (Perform error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure,) and retry communicating with the gRPC server on the next do_stuff action.

What actually happened?

After 7., no attempts to send requests to the gRPC server are made (nothing is logged in the anycable-go server and nothing appears in the nginx access log), even after the gRPC servers are started up again. Meanwhile, the client gets successful ping messages and can still receive broadcasts through the WebSocket.

If another client subscribes to the same channel, they'll either 1) get an error forcing them to reconnect (when the gRPC servers are all down) or 2) successfully subscribe and perform actions (when the gRPC servers are up). The first client remains "stuck" either way.

The bottom line: once an action hits error 502, the anycable-go server stops forwarding any further actions for that particular client/subscription.

Could you please give some directions on how to deal with this scenario? One possibility is to 'ack' actions on the client side and reconnect when an ack is missed, but that adds some complexity.
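A rough sketch of that client-side ack idea (everything here is a hypothetical app-level protocol, not an AnyCable feature; `subscription` is any object with a `perform(action, data)` method, such as one returned by ActionCable's `subscriptions.create`):

```javascript
// Hypothetical sketch: wrap subscription.perform with an application-level
// ack timeout. The server-side channel is assumed to echo ack_id back in a
// broadcast; if no ack arrives in time, onStale fires so the client can
// tear down the connection and reconnect.
function performWithAck(subscription, action, data, { timeoutMs = 5000, onStale } = {}) {
  const ackId = Math.random().toString(36).slice(2);
  let acked = false;
  const timer = setTimeout(() => {
    if (!acked) onStale(ackId);
  }, timeoutMs);

  subscription.perform(action, { ...data, ack_id: ackId });

  // Call the returned function from the subscription's received() handler,
  // passing the echoed ack_id.
  return function ack(receivedAckId) {
    if (receivedAckId === ackId) {
      acked = true;
      clearTimeout(timer);
    }
  };
}
```

The `ack_id` field, `performWithAck`, and the echo behavior are all assumptions; the channel would need a matching server-side change to echo the id back.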

sponomarev commented 5 years ago

Hey @avlazarov! Have you tried playing with the grpc_read_timeout directive? What happens after the default timeout (60s) passes?

sponomarev commented 5 years ago

I assume anycable-go isn't aware that your RPC servers went down because nginx still keeps the connection open, thanks to the grpc_read_timeout and grpc_send_timeout directives.
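For reference, a sketch of the directives in question (the location name is hypothetical; 60s is the documented nginx default for both):

```nginx
location / {
    grpc_pass grpc://anycable_rpc;
    # Both default to 60s. Within these windows nginx may keep the
    # upstream connection around even after the backend has gone away,
    # so the client doesn't see a clean connection failure.
    grpc_read_timeout 60s;
    grpc_send_timeout 60s;
}
```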

palkan commented 5 years ago

I assume anycable-go isn't aware that your RPC servers went down

As I understand it, other clients (new connections) work fine, i.e., gRPC connectivity is restored.

The problem is that the first one, the one that "caught" the broken connection, is getting stuck:

Bottom line is that performing actions on a subscription after getting error 502 blocks all new actions from being performed by the anycable-go server for a particular client/subscription.

@avlazarov Right?

And that's strange: if other clients can successfully perform an action, the first one should be able to do so as well on its next attempt, since they use the same gRPC pool.

avlazarov commented 5 years ago

@palkan Yes, the odd part is that even when the next client makes a series of successful actions, the first one remains stuck. If instead of error 502 I shut down nginx entirely (causing a refused connection), anycable-go performs the operations and prints errors, but once nginx is back up, the gRPC servers correctly receive the actions and the client is no longer stuck.

palkan commented 5 years ago

I'll try to reproduce it locally and come back when I find something.

bibendi commented 4 years ago

I've tried to reproduce it with this simple chat application, but unfortunately (or fortunately) couldn't trigger the problem.

palkan commented 4 years ago

@avlazarov Please take a look at @bibendi's PR above. We couldn't reproduce the problem. Are we missing something?

avlazarov commented 4 years ago

@palkan Sorry, I can't reproduce it after upgrading from Ubuntu 16.04 to 18.04. It might have been something related to that specific nginx version for Ubuntu, or I might have misconfigured something else in nginx that I hadn't noticed.