Add a PB_NATS_CLIENT_RECONNECT_DELAY env var which defaults to the ack_timeout if not set.
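A minimal sketch of the fallback; `ack_timeout` here is a stand-in for however the client exposes its ack timeout internally:

```ruby
# Prefer the env var, otherwise reuse the ack timeout as the reconnect delay.
# ack_timeout is a stand-in for the client's real accessor.
def reconnect_delay
  if ENV["PB_NATS_CLIENT_RECONNECT_DELAY"]
    ENV["PB_NATS_CLIENT_RECONNECT_DELAY"].to_f
  else
    ack_timeout
  end
end
```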
Disable the publish buffer for java nats connection clients (rpc client connections only). This keeps us from publishing rpc requests while we're reconnecting or the cluster is down, which would otherwise send stale messages. Instead, we wait the reconnect delay, which by default is the duration of an ack timeout, so we essentially treat a reconnect like the server dropping a message. Pretty reasonable, and it helps us fail quickly.
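Roughly what disabling the buffer looks like, assuming the jnats 2.x Options builder under JRuby (the exact option name may differ in the client version the gem pins):

```ruby
# JRuby-only sketch. A reconnect buffer of 0 means publishes fail fast while
# disconnected instead of being queued and flushed (possibly stale) later.
require "java"

options = Java::IoNatsClient::Options::Builder.new
  .server("nats://127.0.0.1:4222")
  .reconnectBufferSize(0)   # 0 disables the publish buffer
  .build
connection = Java::IoNatsClient::Nats.connect(options)
```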
Avoid memoizing the java nats client if we were unable to get a connection the first time (this only happens when the cluster is down). With this change, we'll keep trying until we get a connection, and blow up if we can't. Again, pretty reasonable.
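A sketch of the memoization fix; `establish_connection` and `max_connect_attempts` are hypothetical stand-ins for the client internals:

```ruby
# Only memoize once we actually have a connection; otherwise retry,
# and blow up if we never get one.
def client
  return @client if @client

  attempts = 0
  begin
    @client = establish_connection
  rescue StandardError
    attempts += 1
    retry if attempts < max_connect_attempts
    raise # still no connection, so blow up
  end
end
```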
We weren't setting a max reconnect in pure ruby nats, so this makes max reconnect attempts configurable so java and pure ruby will remain in *NSYNC 🕺.
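Wiring the same knobs into nats-pure looks roughly like this; `reconnect_time_wait` and `max_reconnect_attempts` are real nats-pure connect options, the values are illustrative:

```ruby
require "nats/io/client"

nats = NATS::IO::Client.new
nats.connect(
  servers: ["nats://127.0.0.1:4222", "nats://127.0.0.1:4223"],
  reconnect_time_wait: 2,      # seconds between reconnect attempts
  max_reconnect_attempts: 10   # give up (and raise) after this many
)
```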
Only thing that isn't nice is the subtle deviations in behavior between java and ruby. I think it might be better to split the client into a base class and then have ruby-specific code and java-specific code, like concurrent-ruby does. But that can come as a later cut.
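One hypothetical shape for that later cut (all names here are made up): shared rpc behavior in a base class, transport details in per-platform subclasses.

```ruby
module ProtobufNats
  class BaseClient
    # shared request/response logic, env var parsing, timeouts
  end

  class PureRubyClient < BaseClient
    # nats-pure specific connection management
  end

  class JnatsClient < BaseClient
    # java (jnats) specific connection management
  end
end
```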
Real-life testing: I started a warehouse server process and an irb session. I had a loop where I would send an rpc request to the server and wait for a reply like normal. Both were connected to the simplest nats cluster config.
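The irb loop was roughly this; `warehouse_client` and `request` are stand-ins for the real app's rpc client and message:

```ruby
loop do
  response = warehouse_client.rpc_request(request) # hypothetical call
  puts response.inspect
  sleep 0.5
end
```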
First I tested reconnects: as requests were flowing, I turned off one gnatsd server in the cluster at a time and watched to make sure the reconnect worked. Many reconnects happened instantly; when a reconnect was slower, the client waited and the request completed successfully. Perfect!
Then I tested a downtime scenario: as requests were flowing, I turned off both gnatsd servers to make sure the server and client could reconnect. If the client reconnected before 3 reconnect delays had passed, the rpc request completed without error; otherwise, an error was raised as expected. Neat! The server continued processing messages like it should. Hooray!
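That matches the fail-fast request path this PR aims for; a sketch of the behavior, where the retry constant and `publish_request` are hypothetical stand-ins:

```ruby
# Each attempt fails immediately while disconnected (no publish buffer),
# waits one reconnect delay, and retries; after 3 delays we raise.
MAX_RECONNECT_DELAYS = 3

def request_with_fail_fast
  attempts = 0
  begin
    publish_request
  rescue StandardError
    attempts += 1
    raise if attempts >= MAX_RECONNECT_DELAYS
    sleep reconnect_delay
    retry
  end
end
```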
cc @quixoten @abrandoned @mmmries