NLnetLabs / unbound

Unbound is a validating, recursive, and caching DNS resolver.
https://nlnetlabs.nl/unbound
BSD 3-Clause "New" or "Revised" License
3.08k stars 351 forks source link

Occasional abort related to dnssec-trigger #24

Open pemensik opened 5 years ago

pemensik commented 5 years ago

It seems this issue is related to some action done by dnssec-trigger we are using. It happens sometime on Fedora, bug #1667387. We were not able to find a reason for it, maybe you could help us?

(gdb) where
#0  0x00007f8f19bb053f in raise () from /lib64/libc.so.6
#1  0x00007f8f19b9a895 in abort () from /lib64/libc.so.6
#2  0x00007f8f19bf3927 in __libc_message () from /lib64/libc.so.6
#3  0x00007f8f19c870f5 in __fortify_fail_abort () from /lib64/libc.so.6
#4  0x00007f8f19c87127 in __fortify_fail () from /lib64/libc.so.6
#5  0x00007f8f19c850e6 in __chk_fail () from /lib64/libc.so.6
#6  0x00005583269dca0d in memmove (__len=<optimized out>, __src=<optimized out>, __dest=0x7fffad5ea930)
    at /usr/include/bits/string_fortified.h:40
#7  xfr_probe_send_probe (xfr=xfr@entry=0x5583287edea0, env=env@entry=0x5583288151c8, timeout=timeout@entry=100)
    at services/authzone.c:5793
#8  0x00005583269dd760 in xfr_probe_send_or_end (xfr=xfr@entry=0x5583287edea0, env=env@entry=0x5583288151c8)
    at services/authzone.c:6074
#9  0x00005583269ddf6d in xfr_probe_send_or_end (env=0x5583288151c8, xfr=0x5583287edea0) at services/authzone.c:6147
#10 auth_xfer_probe_lookup_callback (arg=0x5583287edea0, rcode=<optimized out>, buf=0x0, sec=<optimized out>, 
    why_bogus=<optimized out>, was_ratelimited=<optimized out>) at services/authzone.c:6147
#11 0x000055832699b70e in mesh_state_cleanup (mstate=0x558328d644c0) at services/mesh.c:744
#12 0x000055832699d0d3 in mesh_state_delete (qstate=<optimized out>) at services/mesh.c:795
#13 mesh_state_delete (qstate=<optimized out>) at services/mesh.c:761
#14 0x000055832699d219 in mesh_delete_helper (n=<optimized out>) at services/mesh.c:293
#15 mesh_delete_all (mesh=0x558328e60860) at services/mesh.c:293
#16 0x0000558326973ce6 in do_flush_requestlist (worker=0x558328813d10, ssl=0x7fffad5eb260) at daemon/remote.c:2947
#17 execute_cmd (rc=rc@entry=0x5583286efa40, ssl=ssl@entry=0x7fffad5eb260, 
    cmd=cmd@entry=0x7fffad5eae20 " flush_requestlist", worker=0x558328813d10) at daemon/remote.c:2947
#18 0x0000558326974b6f in handle_req (rc=rc@entry=0x5583286efa40, res=res@entry=0x7fffad5eb260, s=<optimized out>)
    at daemon/remote.c:3095
#19 0x0000558326974c72 in remote_control_callback (c=0x5583287e0b40, arg=arg@entry=0x558328eb9e20, err=err@entry=0, 
    rep=rep@entry=0x0) at daemon/remote.c:3177
#20 0x0000558326974f45 in remote_accept_callback (err=0, rep=<optimized out>, arg=0x5583286efa40, c=<optimized out>)
    at daemon/remote.c:521
#21 remote_accept_callback (c=<optimized out>, arg=0x5583286efa40, err=<optimized out>, rep=<optimized out>)
    at daemon/remote.c:443
#22 0x00007f8f1a3d6031 in event_process_active_single_queue () from /lib64/libevent-2.1.so.6
#23 0x00007f8f1a3d67c7 in event_base_loop () from /lib64/libevent-2.1.so.6
#24 0x0000558326a14530 in comm_base_dispatch (b=<optimized out>) at util/netevent.c:244
#25 0x00005583269796ed in worker_work (worker=<optimized out>) at daemon/worker.c:1884
#26 0x000055832696d26b in daemon_fork (daemon=0x5583286bb260) at daemon/daemon.c:666
#27 0x0000558326968c6f in run_daemon (need_pidfile=1, log_default_identity=0x7fffad5ecdf3 "unbound", debug_mode=1, 
    cmdline_verbose=0, cfgfile=0x558326a2fe1f "/etc/unbound/unbound.conf") at daemon/unbound.c:652
#28 main (argc=<optimized out>, argv=<optimized out>) at daemon/unbound.c:749
(gdb) frame 7
#7  xfr_probe_send_probe (xfr=xfr@entry=0x5583287edea0, env=env@entry=0x5583288151c8, timeout=timeout@entry=100)
    at services/authzone.c:5793
5793            memmove(&addr, &xfr->task_probe->scan_addr->addr, addrlen);
(gdb) p *xfr->task_probe->scan_addr
$1 = {next = 0x558328f599b0, addr = {ss_family = 161, 
    __ss_padding = "\000\000\000\000\000\000\360\\\361(\203U\000\000 \022~(\203U\000\000\200\000\000\000\000\000\000\000\340@\241&\203U\000\000`\205\353(\203U\000\000\377\377\377\377", '\000' <repeats 12 times>, "\377\377\377\377\000\000\000\000\360\364\200(\203U", '\000' <repeats 34 times>, "\001\000\000\000\000\000\000", __ss_align = 56728}, 
  addrlen = 151524}
(gdb) p *xfr->task_probe->scan_addr->next
$2 = {next = 0x0, addr = {ss_family = 177, 
    __ss_padding = "\000\000\000\000\000\000`\215\323\031\217\177\000\000pT\365(\203U\000\000\200\000\000\000\000\000\000\000\340@\241&\203U\000\000p\261\355(\203U\000\000\377\377\377\377", '\000' <repeats 12 times>, "\377\377\377\377\000\000\000\000\360\364\200(\203U", '\000' <repeats 34 times>, "\001\000\000\000\000\000\000", __ss_align = 56727}, 
  addrlen = 862559}
(gdb) p *xfr->task_probe->scan_target
$3 = {next = 0x5583287ee120, host = 0x5583287ee100 "k.root-servers.net", file = 0x0, http = 0, ixfr = 1, 
  allow_notify = 0, ssl = 0, port = 0, list = 0x0}
(gdb)

Happens often in version 1.8.3, but I think it happens sometime in 1.9.x sometime.

wcawijngaards commented 5 years ago

This looks like a problem that is fixed in one of these issues. https://github.com/NLnetLabs/unbound/commit/474afc9016d34a98537a97cc94e14d329c7d8aeb https://github.com/NLnetLabs/unbound/commit/c6369e9ffa59eb5a9f714f57810ab6ed389866b7 specifically the first one one address type that could cause it to fail.

So, updating to the code repository version may fix that problem. It is the auth-zone code, and not really dnssec-trigger. dnssec-trigger is also active I guess, during network start. Thanks for the stacktrace and report, by the way.

wcawijngaards commented 5 years ago

In any case, the new https://github.com/NLnetLabs/unbound/commit/c26fc8494538cba37ad15f13c1e80cb48fea3d0b auth zone logging code could be useful to diagnose the trouble more clearly if the existing fixes do not cover the issues you are facing. They log at verbosity high levels, with more information why unbound is doing lookups. (edit to fix commit link)

pemensik commented 5 years ago

Thanks for points to release. Unfortunately, it crashed to me again recently with version containing all those commits. Again on root zone check. Backtrace on bugzilla comment

wcawijngaards commented 5 years ago

Added a new commit that fixes the scan_addr that is referenced in the memmove from the backtrace, so that it is zeroed when the list is freed. That should stop that from getting used if the lookups fail and it points to the result of a previous scan. But I did not reproduce the issue, so although this fixes elements from the failure and stacktrace, I cannot say for sure. Thanks for the detailed stack traces, by the way, those are very helpful.