cloudfoundry / bosh-dns-release

BOSH DNS release

bosh-dns agent stops responding breaking all resolves of internal addresses #55

Closed andrew-edgar closed 4 years ago

andrew-edgar commented 4 years ago

This happens with the latest bosh-dns-release, 1.19.0, as well as with 1.18.0.

On our "scheduler" VM in a busy environment the bosh-dns agent stops responding.

This fails ...

dig auctioneer.service.cf.internal @169.254.0.2 +short

When this happens, the auctioneer fails to send any work to the Diego cells, causing a complete cf push outage.

To fix the problem we restart bosh-dns via monit and it recovers.

There are no errors in the bosh-dns logs.

What we see in the auctioneer logs is this ...

{"timestamp":"2020-03-11T14:55:35.997032456Z","level":"error","source":"auctioneer","message":"auctioneer.auction.failed-to-commit","data":{"cell-guid":"b52d4499-f402-4b95-8ccc-75f95c8b518d","error":"Post https://b52d4499-f402-4b95-8ccc-75f95c8b518d.cell.service.cf.internal:1801/work: dial tcp: i/o timeout","session":"79627"}}

I know this says "dial tcp: i/o timeout", but we have proven that bosh-dns is the cause by manually stopping it.

The problem seems to occur during a bosh deployment.

cf-gitbot commented 4 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/171750308

The labels on this github issue will be updated when the story is started.

beyhan commented 4 years ago

We observed similar issues with 1.18.0, but we are not sure whether bosh-dns 1.17.0 is affected as well. @andrew-edgar can you confirm that bosh-dns 1.17.0 doesn't have this issue?

I started this Slack conversation: https://cloudfoundry.slack.com/archives/C02HPPYQ2/p1583764488097700, but it is better to move the discussion to this issue.

cjnosal commented 4 years ago

@andrew-edgar can you redeploy with log_level set to DEBUG for the bosh-dns and bosh-dns-healthserver jobs in the runtime config and provide:

cjnosal commented 4 years ago

I was able to reproduce the issue in an environment with debug logging and identified a mutex deadlock in https://github.com/cloudfoundry/bosh-dns-release/blob/master/src/bosh-dns/dns/server/records/record_set.go#L129-L139

Due to public methods calling each other, a handler thread looking up an internal domain would end up acquiring the record read lock multiple times. If a write lock is queued in between (due to records.json updating as a result of a bosh deploy) then the RecordSet becomes deadlocked.
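
To make the failure mode concrete, here is a minimal, self-contained sketch of that pattern using Go's sync.RWMutex (the names Resolver, Resolve, AllRecords and Update are hypothetical, not the actual record_set.go code):

package main

import (
	"fmt"
	"sync"
	"time"
)

type Resolver struct {
	mu      sync.RWMutex
	records map[string]string
}

// AllRecords takes the read lock, like one of the public methods described above.
func (r *Resolver) AllRecords() map[string]string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.records
}

// Resolve also takes the read lock and then calls AllRecords, so the same
// goroutine tries to acquire the read lock a second time.
func (r *Resolver) Resolve(name string) string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	time.Sleep(100 * time.Millisecond) // widen the window for a writer to queue up
	return r.AllRecords()[name]        // nested RLock on the same goroutine
}

// Update simulates records.json being rewritten during a bosh deploy.
func (r *Resolver) Update(name, ip string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.records[name] = ip
}

func main() {
	r := &Resolver{records: map[string]string{
		"auctioneer.service.cf.internal": "10.0.0.1",
	}}

	// Writer queues a Lock while Resolve still holds its first RLock.
	go func() {
		time.Sleep(10 * time.Millisecond)
		r.Update("cell.service.cf.internal", "10.0.0.2")
	}()

	fmt.Println(r.Resolve("auctioneer.service.cf.internal"))
}

Because a queued Lock blocks any new RLock, the nested RLock inside Resolve can never be granted while the writer waits for the first read lock to be released, so both goroutines wait forever; run as above, the Go runtime reports "all goroutines are asleep - deadlock!". This matches the documented sync.RWMutex restriction against recursive read locking.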

I'm aiming to have a fix pushed today, and will cut a new release once it passes through our CI.
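
For context, the usual way to avoid this class of deadlock is to have exported methods take the lock exactly once and delegate to unexported helpers that assume the lock is already held, so no code path re-acquires the read lock. A generic sketch of that pattern (not necessarily the change that shipped in 1.20.0; the method names are hypothetical):

package records

import "sync"

type RecordSet struct {
	mu      sync.RWMutex
	records map[string]string
}

// Resolve locks once and only calls helpers that never touch the mutex.
func (r *RecordSet) Resolve(name string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.resolveLocked(name)
}

// AllRecords also locks once; it is no longer called from Resolve.
func (r *RecordSet) AllRecords() map[string]string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make(map[string]string, len(r.records))
	for k, v := range r.records {
		out[k] = v
	}
	return out
}

// resolveLocked assumes the caller already holds at least the read lock.
func (r *RecordSet) resolveLocked(name string) (string, bool) {
	ip, ok := r.records[name]
	return ip, ok
}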

andrew-edgar commented 4 years ago

This is awesome! Thanks for digging into this more and finding the problems!

cjnosal commented 4 years ago

Fixed in 1.20.0

beyhan commented 4 years ago

This is really awesome and many thanks for fixing it!