Damn.
Can you log into the container (docker exec -ti dnscrypt-server /bin/bash) and see if unbound (listening on 127.0.0.1:553) is still responding?
drill -p 553 example.com @127.0.0.1
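For reference, a healthy check from inside the container would look roughly like this (a sketch -- exact drill output varies; a prompt NOERROR answer is what matters):
# run inside the container opened with: docker exec -ti dnscrypt-server /bin/bash
drill -p 553 example.com @127.0.0.1
# a responsive unbound answers quickly with something like:
#   ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 12345
#   ;; ANSWER SECTION:
#   example.com.  86400  IN  A  <address>
# if the query hangs or times out instead, the fault is in unbound rather than the relay path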
Thanks for the input -- everything has now stabilised.
[Heisenberg in full effect - as soon as I was there with the intention of measuring this carefully, it started working].
On the image I've attached, I've highlighted in red the times it was functioning correctly following a docker restart; I'm guessing the peaks are due to retries while it was broken. The restart I did yesterday around 17:00 seemed to fix it... It was interesting to see that the anonymous service kept working for the majority of the time, which points to an issue with unbound (monitors are all here: https://status.dnscrypt.uk/ )
Thanks for the input, I'll close this off as things have stabilised now.
Sorry -- just noticed you did a push at that time and that was what fixed it -- thank you! Do you think it was the unbound downgrade?
Thanks a lot for your quick report and for the follow up!
Since relaying kept working, the issue was indeed very likely to be in Unbound.
The serve-stale feature is new, and this was the first time it had been enabled. It has been disabled in the new push. Thanks a lot for reporting that the service is stable again after this change.
Keeping the serve-stale feature disabled is no big deal anyway, since encrypted-dns-server already has a similar feature. We can totally wait a little bit before trying it again.
Thanks again for your help!
Confirmed, this is Unbound's serve-stale feature.
I was still running the previous image on scaleway-ams. While the service kept working, I just noticed that the number of used descriptors got really high since the update, along with the number of inflight queries and offline responses.
I just logged into the container and changed serve-expired-client-timeout: 1800 to serve-expired-client-timeout: 0 in order to disable the feature. And boom, everything immediately got back to normal.
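For anyone who wants to do the same, the change lives in the server: section of unbound.conf inside the container (a sketch; the config path and reload mechanism are assumptions and depend on the image):
# unbound.conf -- server: section (path inside the container is an assumption,
# e.g. /opt/unbound/etc/unbound/unbound.conf in the dnscrypt-server image)
server:
  # serve-expired must be on for expired records to be used at all
  serve-expired: yes
  # milliseconds to hold a client query while trying to refresh before falling
  # back to the expired answer; 1800 (~1.8 s) is the RFC 8767 suggestion,
  # 0 turns that waiting behaviour off
  serve-expired-client-timeout: 0
# then reload unbound or restart the container for the change to take effect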
"noticed that the number of used descriptors got really high"
What command did you use for that?
This is one of the metrics exported to Prometheus.
...
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 18
...
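These are the standard process metrics exposed on the Prometheus endpoint (here, presumably the one encrypted-dns-server provides); you can scrape them directly to watch the descriptor count, for example (the listen address, port and path are assumptions and depend on the metrics section of your configuration):
# fetch the metrics and keep only the file-descriptor gauges
# (replace 127.0.0.1:9100 with whatever listen address your config uses)
curl -s http://127.0.0.1:9100/metrics | grep -E '^process_(open|max)_fds'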
process_max_fds is the file descriptor limit, not the high mark - is that right?
Yes, this is the limit for the process. The high mark is visible on the graph :)
With Unbound 1.11, does anyone have experience with whether serve-stale (e.g. serve-expired-client-timeout: 1800) works better now?
I didn't try it again since encrypted-dns-server supports serve-stale on its own.
The recent commits don't seem to mention any changes to that feature, but it may still be worth a new try.
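If someone wants to test it again, a quick way would be to check which Unbound the image ships and temporarily re-enable the option (a sketch; the config path and option values are the same assumptions as above):
# inside the container: confirm the bundled Unbound version
unbound -V
# then, in the server: section of unbound.conf, re-enable serve-stale:
#   serve-expired: yes
#   serve-expired-client-timeout: 1800   # milliseconds, per RFC 8767
# restart the container and watch process_open_fds and the number of
# inflight queries for the symptoms described earlier in this thread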
I'm continuing to have problems with my servers since the push yesterday. They start fine, but in under an hour they simply stop responding to queries. Restarting the docker container fixes it.
There's nothing obvious like CPU/disk, and it's affecting two different cloud service providers in exactly the same way. Both systems are normally very stable.
It's the straight docker image, no customisations, created with
Can you suggest any ways I can try to find the fault? I'm a bit stumped...
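For anyone landing here with the same symptoms, a few quick checks worth trying (a sketch; the container name dnscrypt-server is an assumption, use whatever name the container was created with):
# look for unbound / encrypted-dns-server errors around the time queries stop
docker logs --since 1h dnscrypt-server
# restarting restores service until the next hang
docker restart dnscrypt-server
# open a shell in the container and run the drill test suggested above
docker exec -ti dnscrypt-server /bin/bash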