We had one of our development environments get into a bit of a state. We suspect that we may have had some spot terminations in AWS during a deployment.
This left the environment with both nats servers down (a 2-node nats cluster was in use).
Subsequent deployments to the environment would then fail with the following errors:
Task 1907101 | 15:31:29 | Updating instance cc-worker: cc-worker/4f479695-0a38-4fd8-b521-977b99e74130 (0) (canary) (00:07:25)
L Error: 'cc-worker/4f479695-0a38-4fd8-b521-977b99e74130 (0)' is not running after update. Review logs for failed jobs: metrics-discovery-registrar
Task 1907101 | 15:31:29 | Updating instance nats: nats/0fb59f8e-287f-4510-bad1-5f4678064ee9 (0) (canary) (00:07:25)
L Error: 'nats/0fb59f8e-287f-4510-bad1-5f4678064ee9 (0)' is not running after update. Review logs for failed jobs: nats-tls-healthcheck, metrics-discovery-registrar
Task 1907101 | 15:31:30 | Updating instance diego-api: diego-api/0a333a85-5b12-476e-891d-d915c4c9076e (0) (canary) (00:07:26)
L Error: 'diego-api/0a333a85-5b12-476e-891d-d915c4c9076e (0)' is not running after update. Review logs for failed jobs: metrics-discovery-registrar
Task 1907101 | 15:31:32 | Updating instance uaa: uaa/141976a2-ce60-4da8-ac91-b103936c3a06 (0) (canary) (00:07:28)
L Error: 'uaa/141976a2-ce60-4da8-ac91-b103936c3a06 (0)' is not running after update. Review logs for failed jobs: route_registrar, metrics-discovery-registrar
Task 1907101 | 15:31:49 | Updating instance scheduler: scheduler/4f98a14c-ee03-4b0d-a2d3-805c92b59201 (0) (canary) (00:07:45)
L Error: 'scheduler/4f98a14c-ee03-4b0d-a2d3-805c92b59201 (0)' is not running after update. Review logs for failed jobs: service-discovery-controller, metrics-discovery-registrar
Task 1907101 | 15:32:41 | Updating instance api: api/21cf57a1-1d0e-43f5-b391-6673a448cf8f (0) (canary) (00:08:37)
L Error: 'api/21cf57a1-1d0e-43f5-b391-6673a448cf8f (0)' is not running after update. Review logs for failed jobs: route_registrar, metrics-discovery-registrar
Task 1907101 | 15:32:41 | Error: 'cc-worker/4f479695-0a38-4fd8-b521-977b99e74130 (0)' is not running after update. Review logs for failed jobs: metrics-discovery-registrar
After investigating, this turned out to be a nats 'not available' issue.
When bosh tried to start the first nats node, the nats-wrapper would not start nats-server for 22-28 minutes, which resulted in a deployment timeout that could not be resolved via bosh.
Release: nats/51
Example nats-wrapper logs from this incident.
We resolved this in the affected environment by hand-editing the config to remove the second node and retrying the deployment.
I reproduced this issue in a different environment by modifying:
/var/vcap/jobs/nats-tls/config/migrator-config.json
and replacing the second referenced server with an unused IP address.
In this reproduction the delay ran from 09:13 to 09:35, i.e. 22 minutes.
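For anyone wondering where a delay of that size can come from: a TCP connect with no dial timeout to a peer whose packets are silently dropped only returns once the OS gives up (on a stock Linux box that is roughly two minutes per attempt), which presumably adds up when the probe is retried. The snippet below is a standalone toy to illustrate that, not nats-release code; the address and port are made up.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// 10.255.255.1:4224 stands in for a NATS peer whose VM is gone and
	// whose packets are silently dropped; adjust to taste.
	addr := "10.255.255.1:4224"

	// No timeout: this blocks until the kernel abandons the TCP connect,
	// which with default Linux SYN retries is on the order of two minutes.
	start := time.Now()
	_, err := net.Dial("tcp", addr)
	fmt.Printf("plain dial returned after %v: %v\n", time.Since(start), err)

	// With a Dialer timeout the same probe fails fast instead.
	d := net.Dialer{Timeout: 5 * time.Second}
	start = time.Now()
	_, err = d.Dial("tcp", addr)
	fmt.Printf("dial with timeout returned after %v: %v\n", time.Since(start), err)
}
```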
After a quick glance, it might be nice if we could add a net.Dialer{Timeout: ...} with a sensible timeout to the connection made at
https://github.com/cloudfoundry/nats-release/blob/ce128befca7b7a0881d19e9f6fe81b3399b9a9e3/src/code.cloudfoundry.org/nats-v2-migrate/natsinfo/nats_version.go#L63
so that an unreachable peer fails the version check quickly instead of blocking. A rough sketch of what I mean is below.
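This is only meant to show the shape of the change, not the actual natsinfo code; the helper name dialPeer, the TLS branch and the 5-second value are all illustrative assumptions.

```go
package natsinfo // hypothetical placement; the real code lives in natsinfo/nats_version.go

import (
	"crypto/tls"
	"net"
	"time"
)

// dialTimeout caps how long a single version probe may wait for a peer.
// 5s is only a placeholder for whatever "sensible" value we agree on.
const dialTimeout = 5 * time.Second

// dialPeer is an illustrative helper: connect to a NATS peer but give up
// quickly if it is unreachable, instead of hanging on the TCP connect.
func dialPeer(addr string, tlsConfig *tls.Config) (net.Conn, error) {
	dialer := &net.Dialer{Timeout: dialTimeout}
	if tlsConfig != nil {
		// The dialer's Timeout covers both the TCP connect and the TLS
		// handshake when using tls.DialWithDialer.
		return tls.DialWithDialer(dialer, "tcp", addr, tlsConfig)
	}
	return dialer.Dial("tcp", addr)
}
```

With something like this in place, a dead second node would make the version probe error out within seconds, and the wrapper could then decide how to proceed instead of stalling past the bosh update timeout.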
Let me know your thoughts.