Closed andrew-edgar closed 4 years ago
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/171750308
The labels on this github issue will be updated when the story is started.
We observed similar issues with 1.18.0, but we are not sure whether bosh-dns 1.17.0 is affected as well. @andrew-edgar, can you confirm whether bosh-dns 1.17.0 has this issue?
I started this Slack conversation: https://cloudfoundry.slack.com/archives/C02HPPYQ2/p1583764488097700, but it is better to switch the communication to this issue.
@andrew-edgar can you redeploy with the log_level set to DEBUG for bosh-dns and bosh-dns-healthserver jobs in the runtime config and provide:
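For reference, enabling debug logging would look roughly like the runtime-config excerpt below. This is a sketch only; the exact addon layout and property names are assumptions and should be checked against the bosh-dns-release job specs for the version in use.

```yaml
# Hypothetical runtime-config excerpt: raise bosh-dns logging to DEBUG.
# Addon/job/property names are assumptions; verify against the release's job specs.
addons:
- name: bosh-dns
  jobs:
  - name: bosh-dns
    release: bosh-dns
    properties:
      log_level: DEBUG
```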
I was able to reproduce the issue in an environment with debug logging and identified a mutex deadlock in https://github.com/cloudfoundry/bosh-dns-release/blob/master/src/bosh-dns/dns/server/records/record_set.go#L129-L139
Due to public methods calling each other, a handler thread looking up an internal domain would end up acquiring the record read lock multiple times. If a write lock is queued in between (due to records.json updating as a result of a bosh deploy), then the RecordSet becomes deadlocked.
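To illustrate the failure mode: Go's sync.RWMutex read lock is not reentrant, so if a goroutine holding RLock tries to RLock again while a writer is queued, all parties block forever. A common fix, sketched below with hypothetical names (this is not the actual bosh-dns code), is to take the lock only in exported methods and delegate to unexported helpers that assume the lock is already held:

```go
package main

import (
	"fmt"
	"sync"
)

// RecordSet is a simplified, hypothetical stand-in for the structure
// described above: a map of domains to IPs guarded by an RWMutex.
type RecordSet struct {
	mu      sync.RWMutex
	records map[string][]string
}

// Resolve is the public entry point; it acquires the read lock exactly once,
// then calls the unexported helper. Re-acquiring r.mu.RLock() inside the
// helper is what would deadlock once a writer queues up in between.
func (r *RecordSet) Resolve(domain string) []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.resolve(domain)
}

// resolve assumes r.mu is already held by the caller and must not lock again.
func (r *RecordSet) resolve(domain string) []string {
	return r.records[domain]
}

// Update takes the write lock, e.g. when records.json changes after a deploy.
func (r *RecordSet) Update(domain string, ips []string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.records[domain] = ips
}

func main() {
	rs := &RecordSet{records: map[string][]string{}}
	rs.Update("auctioneer.service.cf.internal", []string{"10.0.0.5"})
	fmt.Println(rs.Resolve("auctioneer.service.cf.internal")) // prints: [10.0.0.5]
}
```

The key property is that every public method acquires the mutex at most once, so public methods never call each other while holding it.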
I'm aiming to have a fix pushed today, and will cut a new release once it passes through our CI.
This is awesome! Thanks for digging into this more and finding the problems!
Fixed in 1.20.0
This is really awesome and many thanks for fixing it!
This occurs with the latest bosh-dns-release, 1.19.0, as well as 1.18.0.
On our "scheduler" VM in a busy environment the bosh-dns agent stops responding.
This fails ...
When this happens the auctioneer fails to send any work to the diego cells causing a complete cf push outage.
To work around the problem, we run monit restart bosh-dns and it recovers.
There are no errors in the bosh-dns logs.
What is seen in the auctioneer is this ...
I know this says "dial tcp: i/o timeout", but we have proven it is bosh-dns by manually stopping bosh-dns and reproducing the same behavior.
The problem seems to occur during a bosh deployment.