choria-io / go-choria

Backplane Development Framework and Server hosting Choria Agents, Networks, Federations and Streaming Data
Apache License 2.0

SRV records do not appear to be reloaded by agent before reconnecting to broker #1167

Open sshipway opened 3 years ago

sshipway commented 3 years ago

Running Choria 0.16.0 (RPM) on CentOS 7 with MCollective plugins, installed by the Puppet module.

We recently moved our Choria brokers: we added SRV records for the new hosts and removed the old ones. However, when the old brokers were shut down, forcing agents to reconnect to another available broker, the agents did not reload the SRV records. They kept trying to connect only to the old hosts and failed, as those hosts were gone.

Restarting the agents caused the SRV records to be re-read, and the agents then connected to the new brokers as expected.

Choria agents should not cache the SRV records for the broker addresses, but should instead re-query DNS each time they attempt to reconnect, in case of changes to the list of brokers.

To duplicate

Possibly connected to the DNS TTL for the domain?

ripienaar commented 3 years ago

Agreed, this is a known issue. We use the NATS Go package, which does not let us update names like that :(

I will have a chat with the authors and see if there is something we can do, but as it is it’s a bit orthogonal - the NATS package doesn’t support SRV at all, so Choria does the lookup and configures the package, but on reconnect we have no way to do so.

ripienaar commented 3 years ago

In your exact scenario I could perhaps improve the situation - NATS can learn about new hosts on its own. So when you expanded the cluster to 2 nodes, the connected agents could have learned about the new nodes for reconnect purposes - but I don’t think we enable that behaviour.

Anyway, it’s a long-standing pain. Will try again with the authors to see if we can do something.

sshipway commented 3 years ago

Thanks for the info on this - I know it's a bit of an edge case, but it's a definite caveat that you need to be aware of when migrating.