Nats-wrapper causing delayed nats-server start up

corlettb commented 1 year ago

Release: nats/51

We had one of our development environments get into a bit of a state. We suspect that we may have had some spot terminations in aws during a deployment.

This left the environment without both nats server (2 node nats cluster was used).

Further repeated deployments to the environment would fail with the following errors:

Task 1907101 | 15:31:29 | Updating instance cc-worker: cc-worker/4f479695-0a38-4fd8-b521-977b99e74130 (0) (canary) (00:07:25)
                        L Error: 'cc-worker/4f479695-0a38-4fd8-b521-977b99e74130 (0)' is not running after update. Review logs for failed jobs: metrics-discovery-registrar
Task 1907101 | 15:31:29 | Updating instance nats: nats/0fb59f8e-287f-4510-bad1-5f4678064ee9 (0) (canary) (00:07:25)
                        L Error: 'nats/0fb59f8e-287f-4510-bad1-5f4678064ee9 (0)' is not running after update. Review logs for failed jobs: nats-tls-healthcheck, metrics-discovery-registrar
Task 1907101 | 15:31:30 | Updating instance diego-api: diego-api/0a333a85-5b12-476e-891d-d915c4c9076e (0) (canary) (00:07:26)
                        L Error: 'diego-api/0a333a85-5b12-476e-891d-d915c4c9076e (0)' is not running after update. Review logs for failed jobs: metrics-discovery-registrar
Task 1907101 | 15:31:32 | Updating instance uaa: uaa/141976a2-ce60-4da8-ac91-b103936c3a06 (0) (canary) (00:07:28)
                        L Error: 'uaa/141976a2-ce60-4da8-ac91-b103936c3a06 (0)' is not running after update. Review logs for failed jobs: route_registrar, metrics-discovery-registrar
Task 1907101 | 15:31:49 | Updating instance scheduler: scheduler/4f98a14c-ee03-4b0d-a2d3-805c92b59201 (0) (canary) (00:07:45)
                        L Error: 'scheduler/4f98a14c-ee03-4b0d-a2d3-805c92b59201 (0)' is not running after update. Review logs for failed jobs: service-discovery-controller, metrics-discovery-registrar
Task 1907101 | 15:32:41 | Updating instance api: api/21cf57a1-1d0e-43f5-b391-6673a448cf8f (0) (canary) (00:08:37)
                        L Error: 'api/21cf57a1-1d0e-43f5-b391-6673a448cf8f (0)' is not running after update. Review logs for failed jobs: route_registrar, metrics-discovery-registrar
Task 1907101 | 15:32:41 | Error: 'cc-worker/4f479695-0a38-4fd8-b521-977b99e74130 (0)' is not running after update. Review logs for failed jobs: metrics-discovery-registrar

After investigating this turned out to be a nats 'not available' issue.

When bosh was trying to start the first nats node, the nats-wrapper would not start nats-server for 22-28 minutes. This resulted in a deployment timeout that could not be resolved via bosh.

Example nats-wrapper logs from this incident.

{"timestamp":"2022-12-21T15:24:08.440126166Z","level":"info","source":"nats-migrate-server","message":"nats-migrate-server.signalled-nats","data":{}}
{"timestamp":"2022-12-21T15:52:38.097023664Z","level":"error","source":"nats-migrate-server","message":"nats-migrate-server.ignoring-machine-due-to-connection-error","data":{"error":"Error connecting: dial tcp 10.0.17.4:4224: connect: connection timed out","url":"489171d6-4bbd-439f-8211-ea32b5a97279.nats.service.cf.internal:4224"}}
{"timestamp":"2022-12-21T15:52:38.097995262Z","level":"info","source":"nats-migrate-server","message":"nats-migrate-server.started-nats","data":{}}
{"timestamp":"2022-12-21T15:52:38.101318011Z","level":"info","source":"nats-migrate-server","message":"nats-migrate-server.started","data":{}}

This was solved in the environment by hacking the config, removing the second node and retrying the deployment.

I reproduced this issue in a different environment by modifying:

/var/vcap/jobs/nats-tls/config/migrator-config.json

and replacing the second referenced server with an unused ip address.

vcap      1388  1374  0 09:13 ?        00:00:00 /var/vcap/packages/nats-v2-migrate/bin/nats-wrapper --config-file /var/vcap/jobs/nats-tls/config/migrator-config.json
vcap      3936  1388  0 09:35 ?        00:00:00 /var/vcap/packages/nats-server/bin/nats-server -c /var/vcap/jobs/nats-tls/config/nats-tls.conf

09:13-09:35 = 22 minutes

After a quick glance it might be nice if we could add a:

net.Dialer{Timeout ...

to

https://github.com/cloudfoundry/nats-release/blob/ce128befca7b7a0881d19e9f6fe81b3399b9a9e3/src/code.cloudfoundry.org/nats-v2-migrate/natsinfo/nats_version.go#L63

with a sensible timeout.

Let me know your thoughts.

ameowlia commented 1 year ago

Hi @corlettb,

That sounds like a good suggestion. Would you mind PRing it in?

corlettb commented 1 year ago

I've create this pull request.

cloudfoundry / nats-release

Nats-wrapper causing delayed nats-server start up #49