Service Deteriorates since March 20th push

DNSCrypt / dnscrypt-server-docker

A Docker image for a non-censoring, non-logging, DNSSEC-capable, DNSCrypt-enabled DNS resolver

https://dnscrypt.info

ISC License


Service Deteriorates since March 20th push #80

Closed ianbashford closed 4 years ago

ianbashford commented 4 years ago

I'm continuing to have problems with my servers since the push yesterday. They start fine, but in under an hour they simply stop responding to queries. Restarting the docker container fixes it.
There's nothing obvious like CPU/disk, and it's affecting two different cloud service providers in exactly the same way. Both systems are normally very stable.

It's the straight docker image, no customisations, created with

docker run --name=dnscrypt-server -p 443:443/udp -p 443:443/tcp --net=host \
--ulimit nofile=90000:90000 --restart=unless-stopped \
-v /home/docker/keys:/opt/encrypted-dns/etc/keys \
-v /home/docker/lists:/opt/encrypted-dns/etc/lists \
jedisct1/dnscrypt-server init -A -N v.dnscrypt.uk  -E 104.238.186.192:443,[2001:19f0:7402:1574:5400:02ff:fe66:2cff]:443 -M 0.0.0.0:9100

Can you suggest any ways I can try to find the fault? I'm a bit stumped...

jedisct1 commented 4 years ago

Damn.

Can you log into the container (docker exec -ti dnscrypt-server /bin/bash) and see if unbound (listening to 127.0.0.1:553) is still responding?

drill -p 553 example.com @127.0.0.1
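Because the failure only appeared some time after a restart, the one-off check above can be repeated on a timer so that the moment Unbound stops answering is recorded. The following is a minimal sketch, not something posted in the thread: `probe_once` is a hypothetical helper name, and it assumes that `drill` exits with a non-zero status when it receives no answer.

```shell
# probe_once CMD [ARGS...]
# Runs the given probe command and prints a UTC timestamp followed by
# UP (command succeeded) or DOWN (command failed).
# Assumption: the probe command signals failure through its exit status.
probe_once() {
    if "$@" >/dev/null 2>&1; then
        echo "$(date -u +%FT%TZ) UP"
    else
        echo "$(date -u +%FT%TZ) DOWN"
    fi
}

# Example (run inside the container, interval chosen arbitrarily):
#   while true; do
#       probe_once drill -p 553 example.com @127.0.0.1
#       sleep 30
#   done
```

The first DOWN line gives an approximate time to compare against the container logs and the monitoring graphs.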

ianbashford commented 4 years ago

Thanks for the input -- everything has now stabilised.
[Heisenberg in full effect - as soon as I was there with the intention of measuring this carefully, it started working].

On the image I've attached, I've highlighted in red the times it was correctly functioning, following a docker restart; I'm guessing that there are peaks due to retries when it was broken. The restart I did yesterday around 17.00 seemed to fix it... It was interesting to see that for the majority of the time, the anonymous service worked, so that points to an issue with unbound (monitors are all here: https://status.dnscrypt.uk/ )

Thanks for the input, I'll close this off as things have stabilised now.

[Image: Screenshot 2020-03-22 at 10 21 30]

ianbashford commented 4 years ago

Sorry, I just noticed you did a push at that time and that was what fixed it. Thank you! Do you think it was the Unbound downgrade?

jedisct1 commented 4 years ago

Thanks a lot for your quick report and for the follow up!

Since relaying kept working, the issue was indeed very likely to be in Unbound.

The serve-stale feature is new and was enabled for the first time. It has been disabled in the new push. Thanks a lot for reporting that the service is back to being stable after this.

Keeping the serve-stale feature disabled is no big deal anyway, since encrypted-dns-server already has a similar feature. We can totally wait a little bit before trying it again.

Thanks again for your help!

jedisct1 commented 4 years ago

Confirmed, this is Unbound's serve-stale feature.

I was still running the previous image on scaleway-ams. While the service kept working, I just noticed that the number of used descriptors got really high since the update, along with the number of inflight queries and offline responses.

I just logged into the container and changed serve-expired-client-timeout: 1800 to serve-expired-client-timeout: 0 in order to disable the feature. And boom, everything immediately got back to normal.
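For anyone wanting to apply the same change by script rather than by hand, the edit could look roughly like the sketch below. It is an illustration only: `set_serve_expired_timeout` is a hypothetical helper, the path of the Unbound configuration file inside the container is not given in this thread and has to be supplied by the caller, and Unbound still needs to re-read its configuration afterwards (how that was done here is not stated).

```shell
# set_serve_expired_timeout FILE VALUE
# Rewrites the serve-expired-client-timeout line in an Unbound config file,
# keeping the original indentation. FILE is supplied by the caller; the
# actual location inside the container is not stated in this thread.
set_serve_expired_timeout() {
    conf=$1
    value=$2
    tmp=$(mktemp) || return 1
    # Write the edited copy to a temp file, then copy it back in place.
    sed -E "s/^([[:space:]]*serve-expired-client-timeout:).*/\1 ${value}/" \
        "$conf" > "$tmp" && cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example (hypothetical path variable):
#   set_serve_expired_timeout "$UNBOUND_CONF" 0
```

Copying the result back with `cat` instead of `mv` keeps the file's existing permissions and ownership.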

mibere commented 4 years ago

> noticed that the number of used descriptors got really high

What command did you use for that?

jedisct1 commented 4 years ago

This is one of the metrics exported to Prometheus.

jedisct1 commented 4 years ago

...
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 18
...
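Without a Prometheus server and a graph, the same two gauges can be read straight from the metrics endpoint. A minimal sketch follows: `fd_usage` is a hypothetical helper that parses the text format shown above from standard input. The port 9100 comes from the `-M 0.0.0.0:9100` option in the original `docker run` command, while the `/metrics` path in the usage comment is an assumption.

```shell
# fd_usage
# Reads Prometheus text-format metrics on stdin and prints the number of
# open file descriptors, the per-process limit, and the percentage in use.
# Exits with status 1 if process_max_fds is missing or zero.
fd_usage() {
    awk '
        $1 == "process_open_fds" { open = $2 }
        $1 == "process_max_fds"  { max = $2 }
        END {
            if (max > 0)
                printf "open=%d max=%d used=%.1f%%\n", open, max, 100 * open / max
            else
                exit 1
        }'
}

# Example (the /metrics path is an assumption):
#   curl -s http://127.0.0.1:9100/metrics | fd_usage
```

With the sample values above this prints `open=18 max=1024 used=1.8%`. A reading that keeps climbing towards the limit would match the descriptor growth described earlier in the thread.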

ianbashford commented 4 years ago

process_max_fds is the file descriptor limit, not the high mark - is that right?

jedisct1 commented 4 years ago

Yes, this is the limit for the process. The high mark is visible on the graph :)

mibere commented 4 years ago

With Unbound 1.11, does anyone have experience with whether serve-stale (e.g. serve-expired-client-timeout: 1800) works better now?

jedisct1 commented 4 years ago

I didn't try it again since encrypted-dns-server supports serve-stale on its own.

The recent commits don't seem to mention anything about changes having been made to that feature. But it may still be worth a new try.