dkryptr opened 3 years ago
Here's what I've tracked down:
`frontend.keepAliveMaxConnectionAge` defaults to 5 minutes (it wasn't set prior to being added, so it defaulted to infinity in the gRPC Go SDK). `MaxConnectionAge` is the maximum amount of time a connection may exist before it is closed by sending a GoAway.
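For illustration, here is a minimal grpc-go sketch (my own example, not Temporal's actual server wiring) of the knob this setting controls; when left unset, grpc-go defaults `MaxConnectionAge` to infinity, which matches the pre-1.9.0 behavior:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// MaxConnectionAge bounds a connection's lifetime. Once it elapses,
	// the server sends GOAWAY and stops accepting new streams on that
	// connection; in-flight streams are allowed to continue for now.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge: 5 * time.Minute, // the new frontend default
	}))
	defer srv.Stop()
}
```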
The Python SDK uses grpclib, which uses h2 (an HTTP/2 library). h2 doesn't support gracefully closing connections. I believe that explains this issue.

WORKAROUND
Set `frontend.keepAliveMaxConnectionAge` in the dynamic config to a large value (not sure how to set it to infinity like it defaulted to before the 1.9.0 release):

```yaml
frontend.keepAliveMaxConnectionAge:
  - value: 87600h # 10 years
```
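For context (an assumption about a standard deployment, not something stated in this thread): the dynamic config is the YAML file referenced by `dynamicConfigClient.filepath` in the server's static configuration, and it is re-read at the configured `pollInterval`, so the override should take effect without a restart.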
Chad, you are correct in your investigation; however, the fact that you are seeing errors on the client side is unexpected. We added this configuration to initiate healthy rotation of connections and to re-balance them better over time.
According to the gRPC keep-alive documentation, when `MaxConnectionAge` is reached, a GoAway signal is sent to the client, and only if the client doesn't close the connection within the `MaxConnectionAgeGrace` period is the connection forcibly closed. The idea is that the grace period should be longer than the long-poll request time, which should result in a clean connection closure.
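A sketch of that sizing rule in grpc-go terms (the durations are illustrative assumptions, not Temporal's actual values):

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// longPoll approximates a worker's long-poll request time; the exact
// figure is an assumption, on the order of a minute.
const longPoll = 70 * time.Second

func main() {
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge: 5 * time.Minute,
		// After GOAWAY, in-flight streams get this long to finish; only
		// once the grace elapses is the connection forcibly closed, so
		// keeping it above the long-poll time yields a clean closure.
		MaxConnectionAgeGrace: longPoll + 30*time.Second,
	}))
	defer srv.Stop()
}
```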
Is it possible that the Python implementation of gRPC that you are using is not handling the GoAway signal properly? (this issue?) My expectation is that the client should finish in-flight requests on the old connection and route new requests onto sub-channels that use a new connection.
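For comparison, grpc-go handles GOAWAY transparently: it drains in-flight RPCs on the old connection and reconnects behind the scenes, so at worst a caller sees a transient `Unavailable` that is safe to retry. The helper below is a hypothetical sketch of that defensive pattern, not SDK code:

```go
package main

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// callWithRetry retries an RPC a few times when the connection is being
// rotated out from under it; any other outcome is returned immediately.
func callWithRetry(ctx context.Context, do func(context.Context) error) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if err = do(ctx); status.Code(err) != codes.Unavailable {
			return err // success, or a non-transient error
		}
		time.Sleep(time.Duration(attempt+1) * 100 * time.Millisecond)
	}
	return err
}

func main() {} // placeholder so the sketch compiles standalone
```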
What do you think is the right thing to do here? My opinion is that we should aim for an actual fix in Python, because the server defaults seem reasonable given our usage profile and the gRPC spec, and I don't think we should change them. Meanwhile, overriding `MaxConnectionAge` on the server to a large or unlimited value sounds like a good mitigation strategy.
I agree with you. A couple of options are:
As we've continued migrating existing applications to Temporal using the python-sdk, it looks like we also need to override the config option `frontend.keepAliveTime`. The Temporal server sets it to 1 minute. Some of our activities take longer than 60 seconds, and we get the same `Connection lost` error. Setting this to a longer duration fixes the issue.
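If the name maps the way it appears to (an assumption on my part), `frontend.keepAliveTime` corresponds to the server-side keepalive ping interval in grpc-go terms:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		// Time: ping the client after this much inactivity on the
		// connection (the 1-minute value described above).
		Time: time.Minute,
		// Timeout: close the connection if the ping isn't acknowledged
		// within this window. The value here is illustrative.
		Timeout: 10 * time.Second,
	}))
	defer srv.Stop()
}
```

A client library that doesn't answer HTTP/2 PING frames promptly would then have its connection closed mid-activity, which would be consistent with the `Connection lost` symptom.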
I get a connection issue when running my worker against the latest version of Temporal: