cloudfoundry / bosh-deployment

Collection of BOSH manifests referenced by cloudfoundry/docs-bosh
Apache License 2.0

Intermittent "timed out pinging VM" - odd packets to 169.254.0.2/53 leaving VM? #445

Closed ae-govau closed 1 year ago

ae-govau commented 2 years ago

For the past few months we've been having regular intermittent issues with new versions of BOSH deployment whereby sometimes during compilation, or sometimes during regular VM creation, we'll get an error:

Timed out pinging VM 'xxx' with agent 'yyy' after 600 seconds

I can't recall the specific commit when we first observed it, other than that we have been regularly tracking against this repo (BOSH deployment) for a number of years, and until recently the BOSH aspects have been extremely reliable and predictable. Now, I'd say that VM creation is failing approx 5-10% of the time.

We are running with vSphere, in an environment with no outbound internet access.

Our workaround has simply been to re-run our Concourse pipeline job, until the overall deploy completes, which it eventually always does.

Today, while waiting, I think I've finally found some interesting data to correlate. Some network logging outside of the VMs has given us a clue - at about the same time that the troublesome VMs are starting up, we're seeing some interesting dropped packets, and we're only observing these at the same times that we're seeing the error messages:

DROP UDP <ip from BOSH subnet>/<ephemeral port> -> 169.254.0.2/53

I'm pretty sure that destination is a BOSH DNS local listening address, and that this traffic should never be leaving the VM?

I don't know exactly what happens when a BOSH VM is first created, but I'm wondering if there is some kind of race condition: some component in the stemcell tries to look something up via DNS that it expects to find in the local BOSH DNS, but maybe this happens before some route adjustments are made to stop those queries from leaving the VM, and thus it never gets its bootstrap info and the director never hears from it? I'm not a network person, so this is just speculation.
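For what it's worth, the kind of in-VM check I have in mind would look roughly like the Go sketch below (this is just an illustration, not anything from BOSH itself; it assumes 169.254.0.2 is the bosh-dns listen address, and the queried name is only a guess at a bosh-dns health-check domain). If that address isn't assigned to a local interface yet, a lookup against it would follow the default route and leave the VM, which would match the dropped packets above.

```go
// Rough diagnostic sketch only - not part of any BOSH component.
// Checks whether 169.254.0.2 is assigned to a local interface and, if so,
// whether a DNS query against it succeeds. Intended to be run inside a suspect VM.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

const boshDNSAddr = "169.254.0.2" // assumed bosh-dns listen address

// addrIsLocal reports whether the target IP is configured on any local interface.
func addrIsLocal(target string) bool {
	ifaces, err := net.Interfaces()
	if err != nil {
		return false
	}
	for _, iface := range ifaces {
		addrs, err := iface.Addrs()
		if err != nil {
			continue
		}
		for _, a := range addrs {
			if ipnet, ok := a.(*net.IPNet); ok && ipnet.IP.String() == target {
				return true
			}
		}
	}
	return false
}

func main() {
	if !addrIsLocal(boshDNSAddr) {
		fmt.Println("169.254.0.2 is NOT on a local interface; queries to it will follow the default route")
		return
	}
	// Force all lookups through the bosh-dns address rather than resolv.conf.
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, net.JoinHostPort(boshDNSAddr, "53"))
		},
	}
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	// "upcheck.bosh-dns." is an assumption about the health-check name bosh-dns serves.
	_, err := r.LookupHost(ctx, "upcheck.bosh-dns.")
	fmt.Printf("query against %s: err=%v\n", boshDNSAddr, err)
}
```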

Is anyone else seeing similar issues recently, and does anyone have suggestions on debugging this issue? It wasn't clear to me whether I should file it here, or the BOSH repo, or the BOSH agent repo, or the stemcell builder repo!

ae-govau commented 2 years ago

Looks like bosh-dns adds that address to the local loopback as part of starting up: https://github.com/cloudfoundry/bosh-dns-release/blob/cfcedb5a40ea449626edbd26bcc774ef0131be9f/jobs/bosh-dns/templates/bosh_dns_ctl.erb#L42

and I think this tells the VM to use the name server: https://github.com/cloudfoundry/bosh-dns-release/blob/e4d1779df8730cef21cfb8bf8a888b159337050e/src/bosh-dns/dns/nameserverconfig/monitor/monitor.go#L26
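To illustrate my reading of it (a rough sketch only, not the actual monitor code): I'd expect the end state to be that /etc/resolv.conf lists 169.254.0.2 as the first nameserver, which something like this would verify:

```go
// Rough sketch only - not the actual bosh-dns monitor code.
// Checks whether /etc/resolv.conf lists 169.254.0.2 as the first nameserver,
// which is roughly the state the linked monitor appears to maintain.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/etc/resolv.conf")
	if err != nil {
		fmt.Println("cannot read resolv.conf:", err)
		return
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && fields[0] == "nameserver" {
			// The first nameserver entry is the one most resolvers try first.
			fmt.Printf("first nameserver: %s (bosh-dns expected: 169.254.0.2)\n", fields[1])
			return
		}
	}
	fmt.Println("no nameserver entries found")
}
```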

Although perhaps this is a red herring, because presumably the bosh-dns add-on is not even deployed yet if the agent can't communicate with the director?

beyhan commented 2 years ago

The release v274.0.0 of the BOSH Director upgraded the NATS version and changed the logic for how VM agents authenticate with NATS. I suspect that change more than bosh-dns. The release was published on the 25th of August. Do you see anything in the NATS logs about the new VM agent trying to connect to NATS?
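If it helps, a minimal mutual-TLS connect test against NATS can show whether a given certificate is accepted at all. Below is a sketch using the nats.go client - the address and certificate paths are placeholders, not BOSH's actual paths. If the cert/CN-based auth introduced with the NATS upgrade is the problem, a connect like this should fail, or at least produce the corresponding entry in the NATS logs:

```go
// Minimal NATS mutual-TLS connect test - a sketch, not a BOSH component.
// The URL and certificate/key/CA paths below are hypothetical placeholders.
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(
		"tls://10.0.0.6:4222", // hypothetical NATS address
		nats.ClientCert("/tmp/agent-test.crt", "/tmp/agent-test.key"), // hypothetical paths
		nats.RootCAs("/tmp/nats-ca.crt"),                              // hypothetical path
		nats.Timeout(5*time.Second),
	)
	if err != nil {
		fmt.Println("connect failed:", err)
		return
	}
	defer nc.Close()
	fmt.Println("connected; the server side should show what user it mapped from the client cert")
}
```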

ae-govau commented 2 years ago

Yes, it was around that time we started seeing issues. Our setup is a little non-standard - we use one primary BOSH director that ends up deploying a number of other directors, and each of those child directors has NATS and blobstore on a separate VM to BOSH itself. We did need to make a number of changes to our operator files at that time to accommodate the NATS version update, and perhaps we missed something. The odd thing is that it usually works, but a small percentage fail. I believe we have observed this issue when deploying from both the primary (more standard) director and the child directors.

The NATS logs contain a number of warnings of the form:

[WRN] User "CN = default.director.bosh-internal" found in connect proto, but user required from cert
[WRN] User "CN = default.hm.bosh-internal" found in connect proto, but user required from cert
[WRN] User "C=USA, O=Cloud Foundry, CN=<uuid>.agent.bosh-internal" found in connect proto, but user required from cert

These seem pretty constant both when we see this issue and when we don't, so I'm not sure they're related. However, on the basis of that message I did find https://github.com/cloudfoundry/bosh-dns-release/commit/ffff97e194a072ad3cba3a76ba6657241f184835, which matches fairly closely what we see (although of course we aren't using the Docker CPI).
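As an aside, for checking which CN a given certificate actually presents (the value NATS would map the user from, and the value shown in the warnings above), a small sketch like this works - the cert path below is just a placeholder:

```go
// Print the subject CN of a PEM-encoded certificate - a sketch for comparing
// against the CNs shown in the NATS warnings above.
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

func main() {
	pemBytes, err := os.ReadFile("/tmp/agent-test.crt") // hypothetical path
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		fmt.Println("no PEM block found")
		return
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println("subject CN:", cert.Subject.CommonName)
	fmt.Println("full subject:", cert.Subject.String())
}
```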

rkoster commented 1 year ago

To debug further, could you try to get a copy of the agent logs: https://bosh.io/docs/tips/#unresponsive-vm-logs

There is an issue reported here which also causes unresponsive agents; the agent log should help determine whether you are running into the same one: https://github.com/cloudfoundry/bosh-linux-stemcell-builder/issues/261.

There is already a PR addressing the above issue under review here: https://github.com/cloudfoundry/bosh-agent/pull/298

ae-govau commented 1 year ago

Thanks @rkoster - reading that issue, it seems very plausible. I'm glad it wasn't just jammy, as I think we saw the same on bionic at around the same time.

Reading those instructions for getting the VM logs, that's going to be difficult to follow in our env, but once that PR is merged I'll be happy to report back success or failure here - we nearly always see one or more failures per environment when we update to a new stemcell (which we do regularly).

ae-govau commented 1 year ago

We've just updated to stemcell Jammy 1.75 and successfully updated all environments without seeing this issue occur once. I'm assuming that stemcell must include the newer BOSH agent with the fix.

So nice to leave the update unattended and have it be reliable again. :)

I'll mark this as Closed - thanks for your help.