NLnetLabs / nsd

The NLnet Labs Name Server Daemon (NSD) is an authoritative, RFC compliant DNS nameserver.
https://nlnetlabs.nl/nsd
BSD 3-Clause "New" or "Revised" License

nsd verification processing hangs, activity stopped for 20-30 minutes #338

Open ttyS4 opened 5 months ago

ttyS4 commented 5 months ago

hi nsd folks,

We have a setup where NSD is used for zone verification. Because of IXFR-related issues it currently runs 4.9.1-1 on Debian 12; the package was compiled in a Debian 12 chroot from the official Debian packaging, so it is basically a backport.

A new zone is generated every 10 minutes. Knot signs the zone, then NSD verifies it and distributes it (notify-out + XFR). A normal cycle looks like this in the logs:

nsd[32438]: notify for xy. from ::1 serial 1718515802
nsd[22942]: xfrd: zone xy committed "received update to serial 1718515802 at 2024-06-16T07:30:28 from ::1@52"
nsd[22943]: zone xy. received update to serial 1718515802 at 2024-06-16T07:30:28 from ::1@52 of 7204 bytes in 7.9e-05 seconds
nsd[22943]: verify: started verifier for zone xy (pid 35409)
...
nsd[22943]: verify: verifier for zone xy (pid 35409) exited with 0
nsd[22942]: zone xy serial 1718515202 is updated to 1718515802
nsd[35663]: ixfr for xy. from IP1
nsd[35663]: ixfr for xy. from IP2
...
nsd[22942]: xfrd: zone xy: received notify response error .... from IP6

Today, however, we saw no follow-up after the verifier exited with 0. About 20 minutes after the verification finished, this message appeared:

nsd[4819]: handle_child_command: read: Connection reset by peer

Normal activity then resumed, starting with:

nsd[22942]: zone xy serial 1718516403 is updated to 1718517002

Notify messages were still received (and logged) while NSD was in this state, but no progress was made.

Do you think upgrading to 4.10 could help? Is this a known issue, or does it need further investigation?

Regards, Tamás

wtoorop commented 5 months ago

Hi Tamás, I don't think upgrading to 4.10 would make a difference in this case. However, the 20-minute timeout (during which NSD stays in reload mode) could perhaps be reduced by setting verifier-timeout: to something reasonable, for example about 200% of the time the script takes to verify the zone.
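One rough way to arrive at such a value is to time the verifier a few times outside NSD and double the slowest run. The sketch below does only that. The helper names `time_command` and `suggest_timeout` are made up for illustration, and it assumes the verifier can be run standalone, which may not hold if it relies on the environment or the zone feed that NSD provides.

```python
import math
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run argv `runs` times and return the wall-clock duration of each run."""
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        subprocess.run(
            argv,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.monotonic() - start)
    return durations


def suggest_timeout(durations, factor=2.0):
    """Return `factor` times the slowest run, rounded up to whole seconds."""
    return max(1, math.ceil(max(durations) * factor))


if __name__ == "__main__":
    # Usage (hypothetical): python3 suggest_timeout.py /path/to/verifier args...
    seconds = suggest_timeout(time_command(sys.argv[1:]))
    print(f"verifier-timeout: {seconds}")
```

The printed line is only a suggestion to copy into the NSD configuration; check the nsd.conf(5) man page of the installed version for where verifier-timeout: belongs.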

wtoorop commented 5 months ago

But I still want to look into the specific case (by manual code inspection) in which the verifier process has already exited, but NSD is still reading from the verifier's stdout and stderr.
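To illustrate one general way such a state can arise (an assumption for illustration, not a confirmed diagnosis of NSD or of this verifier): if a verifier script leaves a background process running, that process inherits the stdout pipe, so the reader sees no end-of-file until it exits too, even though the verifier itself is long gone. A minimal sketch using `sh` and `sleep` as a stand-in verifier:

```python
import subprocess
import time


def measure(hold_seconds=2):
    """Return (seconds until child exit, seconds until EOF on stdout, output)."""
    start = time.monotonic()
    # The shell backgrounds `sleep`, which inherits the stdout pipe,
    # then prints one line and exits right away.
    proc = subprocess.Popen(
        ["sh", "-c", f"sleep {int(hold_seconds)} & echo verified"],
        stdout=subprocess.PIPE,
    )
    proc.wait()  # the direct child (the shell) has exited at this point
    exit_elapsed = time.monotonic() - start
    # read() returns only when every holder of the write end has closed it,
    # which here means when the backgrounded `sleep` finishes.
    output = proc.stdout.read()
    eof_elapsed = time.monotonic() - start
    proc.stdout.close()
    return exit_elapsed, eof_elapsed, output


if __name__ == "__main__":
    exited, eof, _ = measure()
    print(f"child exited after {exited:.2f}s, EOF on stdout after {eof:.2f}s")
```

If the real verifier starts anything in the background, redirecting that process's stdout and stderr away from the inherited pipes would avoid this particular effect.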

ttyS4 commented 5 months ago

If you need any info from us, just let us know. (I can also try to collect data for you as long as it is considered safe.)

ttyS4 commented 1 month ago

This issue happened again.

# grep -E 'handle_child_command|Broken' /var/log/daemon.log
Oct 22 04:10:31 myhost nsd[27260]: handle_child_command: read: Connection reset by peer
Oct 22 05:48:55 myhost nsd[16206]: svrmain: problems sending quit to child 8223 command: Broken pipe
Oct 22 05:48:55 myhost nsd[16206]: handle_child_command: read: Connection reset by peer
Oct 22 05:48:55 myhost nsd[16206]: svrmain: problems sending quit to child 8223 command: Broken pipe
Oct 22 06:11:01 myhost nsd[24647]: handle_child_command: read: Connection reset by peer
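
To see how long each stall lasted, the gap between a "verifier ... exited" line and the next "is updated to" line for the same zone can be measured from the log. This is a rough sketch that assumes the syslog prefix shown in the grep output above and the message wording from the first comment. The function name `find_stalls` is made up, and the year is an assumption because syslog timestamps omit it.

```python
import re
from datetime import datetime

# Assumed line shape: "Oct 22 04:10:31 myhost nsd[27260]: message"
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ nsd\[\d+\]: (.*)$")
EXITED_RE = re.compile(r"verify: verifier for zone (\S+) \(pid \d+\) exited with \d+")
UPDATED_RE = re.compile(r"zone (\S+) serial \d+ is updated to \d+")


def find_stalls(lines, threshold_seconds=60, year=2024):
    """Return (zone, verifier_exit_time, gap_seconds) for gaps >= threshold."""
    pending = {}  # zone name -> time its verifier exited
    stalls = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        # %b is locale dependent; this assumes English month names.
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        message = match.group(2)
        exited = EXITED_RE.search(message)
        if exited:
            pending[exited.group(1).rstrip(".")] = stamp
            continue
        updated = UPDATED_RE.search(message)
        if updated:
            zone = updated.group(1).rstrip(".")
            if zone in pending:
                exited_at = pending.pop(zone)
                gap = (stamp - exited_at).total_seconds()
                if gap >= threshold_seconds:
                    stalls.append((zone, exited_at, gap))
    return stalls


if __name__ == "__main__":
    with open("/var/log/daemon.log", encoding="utf-8", errors="replace") as log:
        for zone, exited_at, gap in find_stalls(log):
            print(f"{zone}: verifier exited at {exited_at}, update logged {gap:.0f}s later")
```

Comparing the reported times with the handle_child_command lines should show whether every stall ends with a "Connection reset by peer" message.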