NLnetLabs / nsd

The NLnet Labs Name Server Daemon (NSD) is an authoritative, RFC compliant DNS nameserver.
https://nlnetlabs.nl/nsd
BSD 3-Clause "New" or "Revised" License

NSD v 4.2.2 - fork failed: Cannot allocate memory #46

Open geertverheyen opened 5 years ago

geertverheyen commented 5 years ago

After an OS upgrade (to CentOS 7.7) and an NSD upgrade (v4.1.24-2 to v4.2.2-1) we encounter issues with NSD on our authoritative nameserver:

nsd[5346]: warning: server 16222 died unexpectedly, restarting
nsd[5346]: error: fork failed: Cannot allocate memory
nsd[5346]: warning: process 16222 terminated with status 9

kernel: nsd invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
kernel: nsd cpuset=/ mems_allowed=0-1
kernel: CPU: 23 PID: 11340 Comm: nsd Not tainted 3.10.0-1062.4.1.el7.x86_64 #1

The NSD package installed is from the EPEL repository.

The EPEL NSD SPEC file lists the options used for v4.2.2-1: --enable-recvmmsg --enable-packed --enable-memclean --enable-zone-stats

I wonder whether either of these options (--enable-packed, --enable-memclean) could be the cause of our issue?

geertverheyen commented 5 years ago

Furthermore, we noticed a significant increase in CPU usage by NSD during an IXFR, compared to the CPU usage of the previous NSD version (v4.1.24).

wcawijngaards commented 5 years ago

The --enable-packed option makes NSD use less memory, but the unaligned accesses are a little slower. I believe the difference in speed is very small, but it should indeed be slower and use less memory.

The --enable-memclean option also performs cleanup during IXFR processing that is not needed except for debugging. That takes time too.

But I cannot say how that would make it very slow, or even noticeably slower, except perhaps for very large zones. Nor can I say how it would affect total memory usage.

SvenVD-be commented 5 years ago

We have IXFR updates every x seconds. Perhaps --enable-memclean cannot handle that rate in combination with a rather large zone of more than 4M records?

What we noticed is that NSD runs for some time, from half an hour to several hours, and then suddenly uses up all available memory and swap on the server, after which the oom-killer kills it.

wcawijngaards commented 5 years ago

Perhaps an IXFR fails to apply cleanly: NSD logs that it was not clean and then has to fall back to an AXFR. That uses more memory when forking. Or maybe it is the memclean feature, as you suggest.

wcawijngaards commented 5 years ago

The Linux OOM killer is also heuristic-based, and NSD has specific behaviour. It is possible to adjust the OOM behaviour to allow more memory use, because NSD shares a lot of its memory between processes after fork. That gives more headroom when you get close to the memory limits of the system: with the OOM defaults NSD would fail, but because it shares so much memory between the forks it could still fit in the memory of the machine. This is more a workaround than an explanation or fix for the memory usage.
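To see how much memory the forked NSD processes really share, one option is to compare Rss and Pss from /proc/<pid>/smaps. Rss counts shared pages in full for every process, while Pss divides each shared page among the processes mapping it, so a large gap between the two suggests heavy sharing. The following is a minimal Python sketch under that assumption; the helper names (sum_rss_pss, read_smaps) are made up for illustration and are not part of NSD.

```python
import re
import sys

# Match only the plain "Rss:" and "Pss:" lines of /proc/<pid>/smaps,
# not related fields such as "Pss_Anon:" on newer kernels.
_FIELD = re.compile(r"^(Rss|Pss):\s+(\d+)\s+kB$", re.MULTILINE)


def sum_rss_pss(smaps_text):
    """Return (rss_kb, pss_kb) summed over all mappings in smaps text."""
    totals = {"Rss": 0, "Pss": 0}
    for name, value in _FIELD.findall(smaps_text):
        totals[name] += int(value)
    return totals["Rss"], totals["Pss"]


def read_smaps(pid):
    """Read the raw smaps text for one process (usually needs root)."""
    with open(f"/proc/{pid}/smaps") as f:
        return f.read()


if __name__ == "__main__":
    # Pass the PIDs of the nsd processes on the command line.
    for pid in sys.argv[1:]:
        rss, pss = sum_rss_pss(read_smaps(pid))
        print(f"pid {pid}: Rss={rss} kB Pss={pss} kB")
```

Summing Pss over all nsd processes gives a rough figure for what NSD actually occupies, whereas summing Rss counts the shared pages once per process.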

wcawijngaards commented 5 years ago

There is a setting somewhere in /proc to disable the OOM killer, but a quick search reveals that you can also adjust it in systemd startup files and so on: https://unix.stackexchange.com/questions/58872/how-to-set-oom-killer-adjustments-for-daemons-permanently/409486 Perhaps this link can help with the workaround to allow more memory usage by NSD.
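For reference, the per-process knob on Linux is /proc/<pid>/oom_score_adj, which accepts values from -1000 (never pick this process) to 1000, and systemd exposes the same value through the OOMScoreAdjust= unit directive. A minimal Python sketch of setting it on a running process follows; the function names are hypothetical, the value to use is a local choice, and lowering it normally requires root.

```python
# Valid range of /proc/<pid>/oom_score_adj on Linux.
OOM_ADJ_MIN, OOM_ADJ_MAX = -1000, 1000


def oom_adj_path(pid):
    """Path of the OOM adjustment file for a given process ID."""
    return f"/proc/{pid}/oom_score_adj"


def checked_adj(value):
    """Convert value to int and reject anything outside the kernel's range."""
    value = int(value)
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError(f"oom_score_adj must be in [{OOM_ADJ_MIN}, {OOM_ADJ_MAX}]")
    return value


def get_oom_score_adj(pid):
    """Read the current adjustment of a process."""
    with open(oom_adj_path(pid)) as f:
        return int(f.read().strip())


def set_oom_score_adj(pid, value):
    """Write a new adjustment; lowering it normally requires root."""
    with open(oom_adj_path(pid), "w") as f:
        f.write(str(checked_adj(value)))
```

A call such as set_oom_score_adj(pid, -500) with the PID of the nsd parent makes the kernel less likely to pick that process, but the setting is lost on restart, which is why the systemd route in the link is the more permanent option.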