DNSCrypt / dnscrypt-proxy

dnscrypt-proxy 2 - A flexible DNS proxy, with support for encrypted DNS protocols.
https://dnscrypt.info
ISC License

Memory leak #1622

Closed defkev closed 3 years ago

defkev commented 3 years ago

Who is the bug affecting?

Me

What is affected by this bug?

Memory util

When does this occur?

Linear

Where does it happen?

On the server running dnscrypt

How do we replicate the issue?

Run dnscrypt-proxy using the (mostly) default configuration for 20+ days

Non-default config settings:

# since i run two instances of dnscrypt i can't use the systemd socket
listen_addresses = ['127.0.0.2:53']

ipv6_servers = true
require_dnssec = true

# was 240 minutes, this seems to alleviate the problem at least somewhat
cert_refresh_delay = 1440

dnscrypt_ephemeral_keys = true

# unbound does caching so no point to cache for the cache
cache = false

Expected behavior (i.e. solution)

Not leaking memory

Other Comments

dnscrypt-proxy version 2.0.44 (from epel) on CentOS 7 core as a systemd service

I've seen this for quite some time now since upgrading to dnscrypt-proxy2

EDIT: I used to run dnscrypt-proxy2 from epel until the problem started, then moved over to the builds from the release page in hopes of fixing it with a more recent version, but have since moved back to epel as the behavior was the same.

Memory util grows linearly (chart: mem)

until the system eventually runs out of (available) memory, causing dnscrypt-proxy to stall the CPU and queries to become sluggish and unresponsive (chart: cpu)

Charts plot the last 30 days

Effectively what #1352 and #1580 have reported.

This is the main instance, running (as of now) 104 public-resolvers (ipv4, ipv6, doh, dnscrypt, dnssec, no logging, no filter) with the default rotate of 4 hours and the default lb strategy. I can somewhat alleviate the problem (see the last ~5 days) by a) disabling doh_servers (which limits the list to ~61 resolvers) and b) increasing the cert_refresh_delay to 24 hours. I haven't tested this long term yet, though, and suspect that it (at least the latter) only delays the issue.

The problem can always be solved by restarting dnscrypt-proxy, as seen in the charts.

Oddly enough, i run a backup instance of dnscrypt-proxy on the same server using a static list of resolvers (cloudflare ipv4 and ipv6) with the exact same runtime which doesn't have the problem.

Which leads me to assume that the memory used by the resolvers doesn't get properly recycled by Go. I haven't looked at the code yet, so this is but a wild guess.
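One way to test a guess like this from inside a Go process is to compare the live heap with the memory the runtime is merely holding on to. The following is a minimal standalone sketch, not code from dnscrypt-proxy: the `heapSnapshot` type and the `snapshot` and `retained` helpers are hypothetical names introduced for illustration, and only `runtime.ReadMemStats` and `runtime.GC` come from the Go standard library. If `HeapAlloc` keeps climbing across garbage collections, something is still referencing the objects; if `HeapAlloc` stays flat while the process footprint grows, the memory is idle heap the runtime has not yet returned to the OS.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapSnapshot keeps the runtime.MemStats fields that help tell a growing
// live heap apart from memory the runtime retains but no longer uses.
type heapSnapshot struct {
	HeapAlloc    uint64 // bytes of allocated heap objects
	HeapIdle     uint64 // bytes in idle (unused) heap spans
	HeapReleased uint64 // bytes of idle heap already returned to the OS
	Sys          uint64 // total bytes obtained from the OS
}

// snapshot reads the current memory statistics of this process.
func snapshot() heapSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return heapSnapshot{
		HeapAlloc:    m.HeapAlloc,
		HeapIdle:     m.HeapIdle,
		HeapReleased: m.HeapReleased,
		Sys:          m.Sys,
	}
}

// retained estimates the idle heap that could still be returned to the OS.
func (s heapSnapshot) retained() uint64 {
	if s.HeapReleased > s.HeapIdle {
		return 0
	}
	return s.HeapIdle - s.HeapReleased
}

func report(label string, s heapSnapshot) {
	fmt.Printf("%-8s heapAlloc=%d KiB retained=%d KiB sys=%d KiB\n",
		label, s.HeapAlloc/1024, s.retained()/1024, s.Sys/1024)
}

func main() {
	report("start", snapshot())

	// Allocate about 64 MiB in 1 MiB chunks to simulate per-resolver state.
	chunks := make([][]byte, 64)
	for i := range chunks {
		chunks[i] = make([]byte, 1<<20)
	}
	report("held", snapshot())

	// Drop the references and force a collection; heapAlloc should fall
	// if nothing else points at the chunks.
	chunks = nil
	runtime.GC()
	report("freed", snapshot())
}
```

Logging such a snapshot periodically in a long-running process would show which of the two numbers tracks the chart.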

I'd appreciate any advice other than having crontab restart the service every xx days

Cheers

lifenjoiner commented 3 years ago

I'm only a PC user on Windows and haven't experienced this issue yet. After reading your long-term observations, I'm wondering:

  1. v2.0.45 has been released. Does it still have this issue? Could you kindly test it?
  2. Do you have any clue that we can follow to dig out why this happens? Error messages, crash dumps, etc., anything will be good. Maybe only someone who can reproduce this issue can dig it out. And why the peak goes up in the middle range (3 to 12 days after your restart) would be a good starting point.
  3. You'd better provide your detailed configuration (with private information removed), in case anyone tries to reproduce the issue.

defkev commented 3 years ago

  1. I run dnscrypt-proxy from EPEL, 2.0.45 hasn't hit upstream yet, but i do recall seeing this since at least 2.0.3x (from the release page and epel)
  2. The closest i have come so far is increasing the cert_refresh_delay to 24 hours; i do recall reading a blog post (or whatever it was) some time ago saying that certs shouldn't be older than that. dnscrypt itself doesn't report anything out of the ordinary (no errors, no crashes)
  3. Will update the OP to include the non-defaults

The system is nothing fancy, just a headless router doing NAT (ipv4) and router advertising (ipv6) along with iptables, DNS and some VPN (wireguard) for a small network of ~50 clients. dnscrypt is fronted by unbound, which does caching/transparent-proxying and blacklisting (mainly to block MS metrics collection from clients on the network)

I do recall having intermittent DNSSEC validation errors from the main instance using random resolvers (some do this better than others), which is why i added the backup instance of dnscrypt-proxy with a static resolver to use as a fallback, which has (mostly) resolved the validation problems.

jedisct1 commented 3 years ago

So that we don't try to chase an issue that doesn't exist any more, could you try 2.0.45, if possible official builds (so that it doesn't use an old Go version either) ?

Does that graph show virtual memory or active memory?

defkev commented 3 years ago

I'll keep an eye on epel for when 2.0.45 drops.

2.0.44 was built against golang 1.14.12 according to the build log on Koji. Current in epel is 1.15.5, so i suspect 2.0.45 will be built against that or newer

Graph shows real physical memory as reported by hrSWRunPerfMem taken from /proc
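To cross-check the virtual versus physical question directly on the box, the same numbers can be read from `/proc/<pid>/status`, where `VmSize` is the virtual size and `VmRSS` the resident set. This is a minimal sketch under stated assumptions: it is Linux-only, `parseStatusKB` is a hypothetical helper written for this example, and it has not been checked that `hrSWRunPerfMem` maps to exactly one of these fields on this system.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseStatusKB returns the value in kB of a "Field:   <n> kB" line
// from the text of /proc/<pid>/status.
func parseStatusKB(status, field string) (uint64, error) {
	prefix := field + ":"
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, prefix) {
			continue
		}
		parts := strings.Fields(line[len(prefix):])
		if len(parts) == 0 {
			return 0, fmt.Errorf("%s: empty value", field)
		}
		return strconv.ParseUint(parts[0], 10, 64)
	}
	return 0, fmt.Errorf("%s: not found", field)
}

func main() {
	// Pass the PID of the process to inspect as the first argument;
	// without one, the program inspects itself.
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	data, err := os.ReadFile("/proc/" + pid + "/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read status:", err)
		return
	}
	for _, field := range []string{"VmSize", "VmRSS"} {
		kb, err := parseStatusKB(string(data), field)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s: %d kB\n", field, kb)
	}
}
```

A large `VmSize` with a small `VmRSS` is normal for Go programs; it is a steadily growing `VmRSS` that would match the chart described above.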

jedisct1 commented 3 years ago

You can get the official builds here