Closed iruzanov closed 1 year ago
Hello,
Thanks for the backtrace! This seems to be the same as #411 and #439 and the backtrace will hopefully help narrow down the cause of the corruption.
In this case w->next_waiting is supposed to be cast to struct pending_tcp* because of the w->on_tcp_waiting_list check above.
So the error is not there per se; rather, the w struct was corrupted at some earlier point.
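As a minimal sketch of the pattern being described here, a field that is either a wait-list link or a pending_tcp pointer depending on a flag (struct layout is heavily simplified from services/outside_network.h, and pending_of() is a hypothetical helper, not Unbound's API):

```c
#include <assert.h>
#include <stddef.h>

struct pending_tcp;                 /* an in-use TCP channel (opaque here) */

/* Heavily simplified: only the two fields discussed above are modeled. */
struct waiting_tcp {
        /* If on_tcp_waiting_list is nonzero, next_waiting links the next
         * waiting_tcp on the wait list; otherwise it actually holds the
         * struct pending_tcp* that services this query. */
        void* next_waiting;
        int on_tcp_waiting_list;
};

/* The cast is only valid when the query is NOT on the waiting list,
 * which is exactly what the on_tcp_waiting_list check guards. */
static struct pending_tcp* pending_of(struct waiting_tcp* w)
{
        assert(!w->on_tcp_waiting_list);
        return (struct pending_tcp*)w->next_waiting;
}
```

If the w struct is corrupted, this flag no longer tells the truth and the cast yields a garbage pointer, matching the crashes in this thread.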
~~I also see the cause of the crash for the other issues, namely:~~

```
reuse = {node = {parent = 0xdc94953a0,
left = 0x3287d0 <rbtree_null_node>, right = 0x3287d0 <rbtree_null_node>, key = 0x0, color = 1 '\001'}
```

~~there is a node in the reuse tree with key=NULL. That is the case that shouldn't be happening in the first place.~~

No, scratch that: this is just a node, and having key=NULL just means that this node is not part of the tree.
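A tiny sketch of the convention mentioned here, key == NULL marking a node that is currently unlinked from any tree (toy types and names, not Unbound's real rbtree API):

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_node {
        struct toy_node *parent, *left, *right;
        const void* key;        /* NULL <=> node is not in any tree */
} toy_node;

/* Membership test by convention, as described above. */
static int toy_in_tree(const toy_node* n)
{
        return n->key != NULL;
}

/* Detaching a node clears its key, so later code can tell it apart
 * from a live tree node. */
static void toy_detach(toy_node* n)
{
        n->parent = n->left = n->right = NULL;
        n->key = NULL;
}
```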
Btw, can you somehow reproduce the issue reliably-ish?
I can see that in my case the key is NULL too:
(gdb) print w->outnet->tcp_free.reuse.node
$27 = {parent = 0xdc94953a0, left = 0x3287d0
By the way, I have another core dump where the key is NULL:

```
Core was generated by `/usr/local/sbin/unbound -c /srvs/unbound/etc/unbound/unbound.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
174             if(w1->id < w2->id)
[Current thread is 1 (LWP 100527)]
(gdb) bt
    at ./services/outside_network.c:400
    reply_info=<optimized out>) at ./services/outside_network.c:1125
Backtrace stopped: Cannot access memory at address 0x7fffdf1f7000
(gdb) fr 1
241                     if ((r = rbtree->cmp(data->key, node->key)) == 0) {
(gdb) print *node
$14 = {parent = 0xf0686c928, left = 0x3287d0
```
This node appears in the while(node != RBTREE_NULL) loop (in rbtree_insert() in util/rbtree.c), and the program terminates abnormally when rbtree->cmp(data->key, node->key) is called in frame 0.
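The loop in question can be sketched as follows (modeled on the descent in rbtree_insert() in util/rbtree.c; rebalancing is omitted, the sentinel and comparator are simplified stand-ins, and descend() is a hypothetical name):

```c
#include <assert.h>
#include <stddef.h>

typedef struct rbnode {
        struct rbnode *parent, *left, *right;
        const void* key;
} rbnode_type;

/* Stand-in for Unbound's shared RBTREE_NULL sentinel node. */
static rbnode_type sketch_null;
#define SKETCH_NULL (&sketch_null)

/* Toy comparator over int keys, standing in for reuse_id_cmp(). */
static int int_cmp(const void* a, const void* b)
{
        int x = *(const int*)a, y = *(const int*)b;
        return (x < y) ? -1 : (x > y) ? 1 : 0;
}

/* Descend from the root exactly as the insert loop does.  If a
 * corrupted node with key == NULL is linked into the tree, the
 * comparator is handed that NULL and dereferences it -- which is the
 * SIGSEGV seen in frame 0 of these backtraces. */
static rbnode_type* descend(rbnode_type* node, const void* key)
{
        while(node != SKETCH_NULL) {
                int r = int_cmp(key, node->key);
                if(r == 0)
                        return node;            /* key already present */
                node = (r < 0) ? node->left : node->right;
        }
        return NULL;                            /* would insert here */
}
```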
> Btw, can you somehow reproduce the issue reliably-ish?
It would be hard. Unbound crashes under massive TCP load, when the server generates a huge number of iterative requests to upstreams. So I will first need to research this traffic in order to reproduce the issue in my sandbox.
There is a possible fix on the master branch (https://github.com/NLnetLabs/unbound/commit/ff6b527184b33ffe1e2b643db8a32fae8061fc5a) for this. It would be great if you could test it and provide feedback!
Hi @iruzanov, just checking if you were able to test with the aforementioned fix.
Hello colleagues! Please excuse me for the long delay! I'm just very "loaded" with other tasks right now ;)
I will definitely test your patch on my loaded Unbound resolvers this week, and then I will report the results! Thank you very much for the help!
Today I have patched one of my loaded Unbound resolvers. My plan is:
Thank you very much for the big help!
The patched Unbound on one of the most loaded servers is still running without any core dumps. Looks good. In the coming days I'm going to patch even more heavily loaded servers.
Thanks for letting us know, very good news! Would you normally expect a core dump by now?
Hi! The server has worked fine for 4 days without core dumps. Super! ;) I'm going to apply your patch to a list of other servers. Next week I will let you know about the success! (I hope so ;)
I have patched three more Unbound resolvers that are running under heavy load. None of the processes have crashed with a core dump. I would like a bit more time to observe the patched software, and then I will let you know my final conclusion.
Hello colleagues!
Unfortunately I have detected core dumps from two Unbound processes.
[New LWP 100571]
[New LWP 148594]
[New LWP 100646]
[New LWP 101330]
[New LWP 101361]
[New LWP 101577]
[New LWP 101603]
[New LWP 101611]
[New LWP 101613]
[New LWP 101619]
[New LWP 101620]
[New LWP 101631]
[New LWP 101649]
[New LWP 101653]
[New LWP 101672]
[New LWP 101674]
[New LWP 101689]
[New LWP 101693]
[New LWP 101702]
[New LWP 101719]
[New LWP 101721]
Core was generated by `/usr/local/sbin/unbound -c /srvs/unbound/etc/unbound/unbound.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000000031344a in reuse_id_cmp (key1=0x7fffdfffda68, key2=0x0) at ./services/outside_network.c:178
178 if(w1->id < w2->id)
[Current thread is 1 (LWP 100571)]
(gdb) bt
#0 0x000000000031344a in reuse_id_cmp (key1=0x7fffdfffda68, key2=0x0) at ./services/outside_network.c:178
#1 0x00000000003156e9 in rbtree_find_less_equal (rbtree=0xdc7262a78, key=0x7fffdfffda68, result=<optimized out>) at ./util/rbtree.c:527
#2 rbtree_search (rbtree=0xdc7262a78, key=0x7fffdfffda68) at ./util/rbtree.c:285
#3 reuse_tcp_by_id_find (reuse=0xdc72629a0, id=6060) at ./services/outside_network.c:425
#4 reuse_tcp_select_id (reuse=0xdc72629a0, outnet=<optimized out>) at ./services/outside_network.c:2116
#5 0x0000000000314a38 in tcp_select_id (outnet=0xd40aa3000, reuse=0xdc72629a0) at ./services/outside_network.c:2092
#6 use_free_buffer (outnet=0xd40aa3000) at ./services/outside_network.c:745
#7 0x000000000031437f in outnet_tcp_cb (c=<optimized out>, arg=<optimized out>, error=<optimized out>, reply_info=<optimized out>)
at ./services/outside_network.c:1150
#8 0x000000080045e2b1 in ?? () from /usr/local/lib/libevent-2.1.so.7
#9 0x000000080045a09c in event_base_loop () from /usr/local/lib/libevent-2.1.so.7
#10 0x000000000024dc84 in thread_start (arg=0x8014ba800) at ./util/ub_event.c:280
#11 0x0000000800780fac in ?? () from /lib/libthr.so.3
#12 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffdfffe000
(gdb) list
173
174 int reuse_id_cmp(const void* key1, const void* key2)
175 {
176 struct waiting_tcp* w1 = (struct waiting_tcp*)key1;
177 struct waiting_tcp* w2 = (struct waiting_tcp*)key2;
178 if(w1->id < w2->id)
179 return -1;
180 if(w1->id > w2->id)
181 return 1;
182 return 0;
I.e. we are dealing with the same issue, related to key2=NULL:
(gdb) print *w2
Cannot access memory at address 0x0
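For illustration only, a guarded variant of the comparator makes the failing case visible; note this is not a real fix, since a node with a NULL key should never be in the tree in the first place (the corruption happens earlier). The sketch_ names are hypothetical stand-ins for the real structs in services/outside_network.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct waiting_tcp; only the id matters here. */
struct sketch_waiting_tcp { uint16_t id; };

/* Same shape as reuse_id_cmp() in the listing above, plus a guard that
 * traps the NULL-key case instead of segfaulting on w2->id. */
static int sketch_id_cmp(const void* key1, const void* key2)
{
        const struct sketch_waiting_tcp* w1 = key1;
        const struct sketch_waiting_tcp* w2 = key2;
        assert(w1 != NULL && w2 != NULL);  /* key2 == 0x0 in the core dump */
        if(w1->id < w2->id)
                return -1;
        if(w1->id > w2->id)
                return 1;
        return 0;
}
```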
[New LWP 101107]
[New LWP 104932]
[New LWP 100498]
[New LWP 100601]
[New LWP 100606]
[New LWP 100622]
[New LWP 100719]
[New LWP 100915]
[New LWP 100936]
[New LWP 100997]
[New LWP 101099]
[New LWP 101113]
[New LWP 101115]
[New LWP 101128]
[New LWP 101132]
[New LWP 101138]
[New LWP 101146]
[New LWP 101161]
[New LWP 101171]
[New LWP 101178]
[New LWP 101196]
Core was generated by `/usr/local/sbin/unbound -c /srvs/unbound/etc/unbound/unbound.conf'.
Program terminated with signal SIGABRT, Aborted.
#0 0x0000000800955c2a in thr_kill () from /lib/libc.so.7
[Current thread is 1 (LWP 101107)]
(gdb) bt
#0 0x0000000800955c2a in thr_kill () from /lib/libc.so.7
#1 0x0000000800954084 in raise () from /lib/libc.so.7
#2 0x00000008008ca279 in abort () from /lib/libc.so.7
#3 0x0000000800464641 in ?? () from /usr/local/lib/libevent-2.1.so.7
#4 0x0000000800464939 in event_errx () from /usr/local/lib/libevent-2.1.so.7
#5 0x000000080045ec54 in evmap_io_del_ () from /usr/local/lib/libevent-2.1.so.7
#6 0x0000000800457e8f in event_del_nolock_ () from /usr/local/lib/libevent-2.1.so.7
#7 0x000000080045ada8 in event_del () from /usr/local/lib/libevent-2.1.so.7
#8 0x000000000030e2bb in ub_event_del (ev=<optimized out>) at ./util/ub_event.c:395
#9 comm_point_close (c=0xdd6d3cc00) at ./util/netevent.c:3860
#10 0x000000000031601b in decommission_pending_tcp (outnet=<optimized out>, pend=0xdd6721d80) at ./services/outside_network.c:967
#11 0x0000000000314836 in reuse_cb_and_decommission (outnet=0x18af3, pend=0x6, error=-2) at ./services/outside_network.c:1011
#12 0x0000000000317901 in outnet_tcptimer (arg=0xe9a748500) at ./services/outside_network.c:2058
#13 0x000000080045e0ed in ?? () from /usr/local/lib/libevent-2.1.so.7
#14 0x000000080045a09c in event_base_loop () from /usr/local/lib/libevent-2.1.so.7
#15 0x000000000024dc84 in thread_start (arg=0x801543000) at ./util/ub_event.c:280
#16 0x0000000800780fac in ?? () from /lib/libthr.so.3
#17 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffdedf5000
```
2058                    reuse_cb_and_decommission(outnet, pend, NETEVENT_TIMEOUT);
(gdb) list
2053                    waiting_tcp_callback(w, NULL, NETEVENT_TIMEOUT, NULL);
2054                    waiting_tcp_delete(w);
2055            } else {
2056                    /* it was in use */
2057                    struct pending_tcp* pend = (struct pending_tcp*)w->next_waiting;
2058                    reuse_cb_and_decommission(outnet, pend, NETEVENT_TIMEOUT);
2059            }
2060            use_free_buffer(outnet);
2061    }
2062
(gdb) print *(struct pending_tcp*)pend
Cannot access memory at address 0x6
(gdb) print *(struct pending_tcp*)w->next_waiting
$11 = {next_free = 0xdd67214c0, pi = 0xd0184b0c0, c = 0xdd6d3cc00, query = 0x0, reuse = {node = {parent = 0xdd6721b20, left = 0x328890
```
I have applied the patch from: https://github.com/NLnetLabs/unbound/commit/ff6b527184b33ffe1e2b643db8a32fae8061fc5a
Oh, bad news but thanks for the information as always! I'll start looking into this again right away. Any chance that you can make out any similarities traffic-wise I guess between the two crashes?
I will dig through the Unbound logs. Maybe I will find some common points describing the traffic under which the program crashed. I will also recompile libevent and Unbound itself with the -O0 flag, to get maximum debug info.
Would you be able to share your config file? Also, do you happen to issue unbound-control commands to unbound?
My config file:

```
server:
    verbosity: 1
    extended-statistics: yes
    num-threads: 20
    interface: 127.0.0.1
    interface: ::1
    interface: 193.x.x.x
    interface: 2001:x:x:x:x:x:x:x
    interface: 62.x.x.x
    interface: 62.x.x.x
    interface: 195.x.x.x
    interface: 195.x.x.x
    interface: 2001:x:x::x
    interface: 2001:x:x::x
    interface: 2a0c:x:x::x
    interface: 2a0c:x:x::x
    outgoing-interface: 193.x.x.x
    outgoing-interface: 2001:x:x:x:x:x:x:x
    outgoing-range: 32768
    outgoing-num-tcp: 2048
    incoming-num-tcp: 1024
    so-reuseport: yes    # we are running under FreeBSD 12.2 with SO_REUSEPORT_LB
    msg-cache-size: 1024m
    msg-cache-slabs: 16
    num-queries-per-thread: 2048
    rrset-cache-size: 2048m
    rrset-cache-slabs: 16
    infra-cache-numhosts: 100000
    do-ip4: yes
    do-ip6: yes
    do-udp: yes
    do-tcp: yes
    do-daemonize: yes
    access-control: 0.0.0.0/0 allow
    access-control: 127.0.0.0/8 allow
    access-control: ::0/0 allow
    access-control: ::1 allow
    chroot: "/srvs/unbound"
    username: "bind"
    directory: "/srvs/unbound/etc/unbound"
    use-syslog: yes
    pidfile: "/srvs/unbound/var/run/unbound/unbound.pid"
    hide-identity: no
    hide-version: yes
    identity: "res1"
    aggressive-nsec: yes
    do-not-query-address: ::1
    do-not-query-localhost: no
    ratelimit: 1000
    minimal-responses: no
    module-config: "respip validator iterator"
    auto-trust-anchor-file: "/srvs/unbound/etc/unbound/autokey/auto.key"
    neg-cache-size: 100m
    deny-any: yes
    serve-expired: yes
    serve-expired-ttl: 14400
    prefetch: yes
    prefetch-key: yes

rpz:
    name: "blocklist."    # name of the policy zone
    master: 194.x.x.x     # address of the name server to transfer from

rpz:
    name: "operator."
    master: 194.x.x.x

python:

dnstap:
    dnstap-enable: yes
    dnstap-socket-path: "/tmp/dnstap.sock"
    dnstap-send-identity: yes
    dnstap-send-version: yes
    dnstap-identity: "res1"
    dnstap-log-client-query-messages: yes
    dnstap-log-client-response-messages: yes
    dnstap-log-forwarder-query-messages: no
    dnstap-log-forwarder-response-messages: no
    dnstap-log-resolver-query-messages: no
    dnstap-log-resolver-response-messages: no

remote-control:
    control-enable: yes
    control-interface: ::1
    control-use-cert: "yes"
    server-key-file: "/srvs/unbound/etc/unbound/unbound_server.key"
    server-cert-file: "/srvs/unbound/etc/unbound/unbound_server.pem"
    control-key-file: "/srvs/unbound/etc/unbound/unbound_control.key"
    control-cert-file: "/srvs/unbound/etc/unbound/unbound_control.pem"

stub-zone:
    name: "."
    stub-prime: yes
    stub-addr: 127.12.12.12
    stub-addr: 193.x.x.x
```
As for traffic similarity - I have grepped the Unbound logs on the two servers. The log records (tons of them) before Unbound crashed look like:
first server:

```
[...]
Jun 18 18:50:07 res1 unbound[28935]: [28935:11] error: read (in tcp s): Connection refused for 2a01:5b40:ac3::1 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:1] error: read (in tcp s): Connection refused for 2a01:5b40:ac2::1 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:1] error: read (in tcp s): Connection refused for 151.249.125.2 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:11] error: read (in tcp s): Connection refused for 2a01:5b40:ac1::1 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:10] error: read (in tcp s): Connection refused for 151.249.125.2 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:c] error: read (in tcp s): Connection refused for 151.249.124.1 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:9] error: read (in tcp s): Connection refused for 2a01:5b40:ac3::1 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:1] error: read (in tcp s): Connection refused for 2a01:5b40:ac3::1 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:a] error: read (in tcp s): Connection refused for 151.249.125.2 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:5] error: read (in tcp s): Connection refused for 2a01:5b40:ac2::1 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:1] error: read (in tcp s): Connection refused for 151.249.126.3 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:11] error: read (in tcp s): Connection refused for 2a01:5b40:ac2::1 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:1] error: read (in tcp s): Connection refused for 151.249.125.2 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:10] error: read (in tcp s): Connection refused for 2a01:5b40:ac3::1 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:11] error: read (in tcp s): Connection refused for 2a01:5b40:ac1::1 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:6] error: read (in tcp s): Connection refused for 2a01:5b40:ac3::1 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:5] error: read (in tcp s): Connection refused for 151.249.126.3 port 53
Jun 18 18:50:07 res1 unbound[28935]: [28935:a] error: read (in tcp s): Connection refused for 151.249.125.2 port 53
[...]
```
second server:

```
[...]
Jun 20 15:52:42 res2 syslogd: last message repeated 1 times
Jun 20 15:52:44 res2 unbound[14692]: [14692:11] error: read (in tcp s): Connection refused for 199.249.112.1 port 53
Jun 20 15:52:44 res2 syslogd: last message repeated 1 times
Jun 20 15:52:44 res2 unbound[14692]: [14692:6] error: read (in tcp s): Connection refused for 199.249.127.1 port 53
Jun 20 15:52:44 res2 syslogd: last message repeated 1 times
Jun 20 15:52:46 res2 unbound[14692]: [14692:c] error: read (in tcp s): Connection refused for 199.249.112.1 port 53
Jun 20 15:52:46 res2 syslogd: last message repeated 2 times
Jun 20 15:52:47 res2 unbound[14692]: [14692:a] error: read (in tcp s): Connection refused for 199.249.112.1 port 53
Jun 20 15:52:47 res2 syslogd: last message repeated 1 times
Jun 20 15:52:48 res2 unbound[14692]: [14692:10] error: read (in tcp s): Connection refused for 199.249.112.1 port 53
Jun 20 15:52:48 res2 syslogd: last message repeated 1 times
Jun 20 15:52:49 res2 unbound[14692]: [14692:0] error: read (in tcp s): Connection refused for 199.249.120.1 port 53
Jun 20 15:52:49 res2 syslogd: last message repeated 1 times
Jun 20 15:52:50 res2 unbound[14692]: [14692:e] error: read (in tcp s): Connection refused for 199.249.120.1 port 53
Jun 20 15:52:51 res2 syslogd: last message repeated 2 times
Jun 20 15:52:53 res2 unbound[14692]: [14692:a] error: read (in tcp r): Connection reset by peer for 199.81.212.13 port 53
Jun 20 15:52:53 res2 unbound[14692]: [14692:10] notice: ratelimit exceeded corp.ds.fedex.com. 1000 query pwn90150.corp.ds.fedex.com. IN A
Jun 20 15:52:55 res2 unbound[14692]: [14692:1] error: read (in tcp s): Connection refused for 199.249.112.1 port 53
Jun 20 15:52:56 res2 syslogd: last message repeated 1 times
Jun 20 15:52:56 res2 unbound[14692]: [14692:5] error: read (in tcp r): Connection reset by peer for 199.81.212.13 port 53
Jun 20 15:52:56 res2 unbound[14692]: [14692:11] error: read (in tcp r): Connection reset by peer for 199.81.212.13 port 53
Jun 20 15:52:57 res2 unbound[14692]: [14692:2] error: read (in tcp s): Connection refused for 199.249.112.1 port 53
Jun 20 15:52:57 res2 unbound[14692]: [14692:13] error: read (in tcp r): Connection reset by peer for 204.135.12.100 port 53
Jun 20 15:52:59 res2 unbound[14692]: [14692:5] error: read (in tcp r): Connection reset by peer for 199.81.212.13 port 53
Jun 20 15:53:03 res2 unbound[14692]: [14692:9] error: read (in tcp s): Connection refused for 199.249.112.1 port 53
Jun 20 15:53:03 res2 syslogd: last message repeated 3 times
Jun 20 15:53:05 res2 unbound[14692]: [14692:a] error: read (in tcp s): Connection refused for 199.249.112.1 port 53
Jun 20 15:53:06 res2 syslogd: last message repeated 1 times
Jun 20 15:53:09 res2 unbound[14692]: [14692:7] error: read (in tcp s): Connection refused for 199.249.112.1 port 53
Jun 20 15:53:10 res2 unbound[14692]: [14692:7] error: read (in tcp s): Connection refused for 199.249.120.1 port 53
[...]
```
I.e. there are tons of outbound TCP requests to upstreams (probably fake upstreams received in crafted DNS packets from spoofed clients or intruders). The code responsible for this functionality is in the outside_network.c source file.
Hello, Colleagues!
Last weekend I detected that Unbound crashed, and there is an interesting core dump:

```
Core was generated by `/usr/local/sbin/unbound -c /srvs/unbound/etc/unbound/unbound.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
179             if(w1->id < w2->id)
[Current thread is 1 (LWP 101705)]
(gdb) bt
Backtrace stopped: Cannot access memory at address 0x7fffdeff6000
(gdb) fr 1
530                     r = rbtree->cmp(key, node->key);
(gdb) print *node
$1 = {parent = 0xe8626cea8, left = 0xec64aab28, right = 0xee7b31428, key = 0x0, color = 1 '\001'}
(gdb) print *node->parent
$2 = {parent = 0xeca76ada8, left = 0xf27fdcd28, right = 0xede50e5a8, key = 0xe8626ce80, color = 0 '\000'}
(gdb) print *node->left
$6 = {parent = 0xede50e5a8, left = 0x73abc0
```
As we can see, there are nodes colored BLACK with non-NULL keys and one node colored RED with a NULL key.
And one more thing - our traffic includes tons of queries to which Unbound replies with a SERVFAIL status.
Do you mean that you see an increase of SERVFAIL responses with the affected versions or that your usual traffic includes a lot of SERVFAIL responses?
The second - usual traffic includes a lot of SERVFAIL responses.
Hello, colleagues! I've done some more research on the nodes with NULL-pointer keys. For this purpose I wrote a small function, rbtree_traverse(), to trap the code points where such nodes are already present:

```c
void rbtree_traverse(rbnode_type* node)
{
	if(!node || node == RBTREE_NULL)
		return;
	if(node->key == NULL)
		abort();
	rbtree_traverse(node->left);
	rbtree_traverse(node->right);
}
```

I call this function from reuse_tcp_find(), lines 524 and 525 (the while() loop) of services/outside_network.c:

```c
rbtree_traverse(result);
rbtree_traverse(((struct reuse_tcp*)result)->tree_by_id.root);
```
And this might be interesting: the program crashes on the second call, i.e. when we traverse the TCP reuse tree after the cast to (struct reuse_tcp*):

(gdb) print *((struct reuse_tcp*)result)->tree_by_id.root
$5 = {parent = 0x73afe0
But for (rbnode_type*)result everything is OK:

(gdb) print *result
$2 = {parent = 0xddb2f1d20, left = 0x73afe0
I'm not sure that I dug into the code properly, but I hope this might be helpful for you.
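For what it's worth, the cast above only works because struct reuse_tcp embeds its tree node as the first member, so a pointer to the rbnode_type and a pointer to the containing struct share an address. A hedged sketch (layout heavily simplified from services/outside_network.h; most fields omitted):

```c
#include <assert.h>
#include <stddef.h>

typedef struct rbnode {
        struct rbnode *parent, *left, *right;
        const void* key;
        unsigned char color;
} rbnode_type;

typedef struct rbtree {
        rbnode_type* root;
        size_t count;
} rbtree_type;

/* Simplified: node must stay the first member so that
 * (struct reuse_tcp*)some_rbnode_ptr is a valid cast. */
struct reuse_tcp {
        rbnode_type node;        /* node in the tree of reusable streams */
        rbtree_type tree_by_id;  /* per-stream tree of outstanding query IDs */
};
```

So a bogus tree_by_id.root with the node itself looking fine suggests the containing struct, not the node header, is what got clobbered.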
Hi Igor, right on time!
The end result of the issue is that there seem to be nodes in the tree with NULL keys but the issue is manifesting way before this part of the code is reached.
I have merged some recent fixes about this issue (PR #513) and it would be great if you could test and provide feedback!
Excellent news! :) I will gladly test your fixes! I hope this will solve the problem. So I will start applying your 9 fixes right now, and next week I will report the first results!
Hello Colleagues!
Last week I patched 6 servers. They are working FINE for now! ;) So my plan is to upgrade Unbound on all the other servers and monitor my service during this week. Next Monday I will report the results.
Hi Igor, these are some good news for now! Waiting for your results and then we are planning on moving on with a release. In the mean time, can you verify that the patched servers have considerable TCP traffic to upstreams, so that we know the different stream reuse codepaths are used? Thanks!
Yes, I will! But I would still like to ask you about the following logs on all patched servers:

```
Aug 2 17:07:06 unbound[3701]: [3701:6] error: internal error: reuse_tcp_insert: in lru list without key
Aug 2 17:07:06 unbound[3701]: [3701:0] error: internal error: reuse_tcp_insert: in lru list without key
Aug 2 17:07:06 unbound[3701]: [3701:d] error: internal error: reuse_tcp_insert: in lru list without key
Aug 2 17:07:06 unbound[3701]: [3701:c] error: internal error: reuse_tcp_insert: in lru list without key
[...]
```

Is this a normal situation?
Hello, colleagues! Last Monday and Tuesday I patched all of my Unbound servers, and they work without any crash dumps! Great patches! ;) As for considerable TCP traffic to upstreams - I can see that traffic on all of my patched servers too.
Hi Igor, great news! These patches will be part of the upcoming 1.13.2 version coming later this week.
The error message is a bit worrying since it reveals a situation that shouldn't happen normally. But the non-crashing part seems to indicate that this may happen during a callback and no extra harm at this point (like https://github.com/NLnetLabs/unbound/blob/8e538dcaa8df2d0fab8ff3dcf94ac1f972450b66/services/outside_network.c#L1264, or https://github.com/NLnetLabs/unbound/blob/8e538dcaa8df2d0fab8ff3dcf94ac1f972450b66/services/outside_network.c#L2590).
Unfortunately I haven't come across this in any of my tests to provide more useful information.
Before debugging further I would wait for you to test the full 1.13.2 release because, as I understand it, you currently test 1.13.1 with patches from the related commits. If these error messages keep appearing you can then debug further.
If you want to debug further you can configure unbound with --enable-debug. This will enable the various log_assert() checks we have in place during development, and they may shed more light on your case. It is usually not recommended to run like that in production, since log_assert will just terminate unbound with a log message when a condition is not met.
This WILL NOT produce a core dump. If you prefer that, you can change https://github.com/NLnetLabs/unbound/blob/8e538dcaa8df2d0fab8ff3dcf94ac1f972450b66/util/log.h#L222-L226 to just a normal assert like a few lines above.
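A hedged sketch of that change (the real macro lives at util/log.h lines 222-226; the UNBOUND_DEBUG guard and exact form here are assumptions for illustration, not the actual Unbound source):

```c
#include <assert.h>

/* Assume a debug build for this sketch. */
#define UNBOUND_DEBUG 1

#ifdef UNBOUND_DEBUG
/* assert() calls abort() on failure, so the kernel writes a core dump
 * that can be inspected in gdb, instead of unbound logging a message
 * and exiting. */
#define log_assert(x) assert(x)
#else
#define log_assert(x) /* disabled outside debug builds */
#endif
```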
Thank you very much! I will update Unbound to the 1.13.2 release first, and then I will debug the program with your proposed asserts. I think this way is more accurate. I might even use abort() to produce core dumps at the program point being asserted ;)
Well, tomorrow I will be on vacation until August 29. On Monday, August 30, I will return and start step 1: upgrading Unbound to 1.13.2. I hope this release will be available ;)
Actually it was released not more than 30 minutes ago :)
Thank you for your help thus far and enjoy your vacation!
Hello colleagues!
It's my pleasure to use Unbound 1.13.2. The program is working fine, but I detected one more bug a month ago: Unbound is crashing via SIGSEGV. So I have recompiled the program with the following flags: "-g -O0 -fsanitize=address -fno-omit-frame-pointer". Below I put the full stack trace of the problem thread caught by libasan:
(gdb) thr 3
[Switching to thread 3 (Thread 0x7f1ae971e700 (LWP 3271756))]
50          return ret;
(gdb) bt full
set = {__val = {18446744067266838239, 139753671451101, 106790115319360, 72197347709311613, 139753671629156, 3271753, 206158430216, 139753562427008,
139753562426816, 7012368136350825216, 139753676056768, 139753390149632, 139753562430192, 139753562427046, 139753678510688, 106858908091888}}
pid = <optimized out>
tid = <optimized out>
ret = <optimized out>
save_stage = 1
act = {__sigaction_handler = {sa_handler = 0x7f1aefef47e1 <__interceptor_free(void*)+209>,
sa_sigaction = 0x7f1aefef47e1 <__interceptor_free(void*)+209>}, sa_mask = {__val = {4838403, 5068870, 5070683, 5101591, 5103337, 5063904,
4431218, 6578138, 139753562428096, 106858908091888, 139753562426760, 8, 139753562428096, 139753562428096, 139753670522756, 139753562427504}},
sa_flags = -1183483136, sa_restorer = 0x6130074205b0}
sigs = {__val = {32, 0 <repeats 15 times>}}
No locals.
No locals.
at ../../../../libsanitizer/asan/asan_report.cc:181
buffer_copy = <optimized out>
is_write=is_write@entry=false, access_size=access_size@entry=8, exp=0, fatal=true) at ../../../../libsanitizer/asan/asan_report.cc:397
in_report = {error_report_lock_ = {<No data fields>}, static current_error_ = {kind = __asan::kErrorKindGeneric, {Base = {scariness = {score = 51,
descr = "8-byte-read-heap-use-after-free\000\032\177\000\000\210\317q\351\032\177\000\000\001\000\000\000\000\000\000\000\016\325\335\360\032\177\000\000\230\002\000\000\000\000\000\000P.\341\357\032\177\000\000\320\321\377\360\032\177\000\000\210\317q\351\032\177\000\000\204\317q\351\032\177\000\000h9\306\360\032\177\000\000 \320q\351\032\177\000\000\000\000\000\000\000\000\000\000\360\205\340\357\032\177\000\000\063!@\000\000\000\000\000\070\333\347\303\000\000\000\000H\030@\000\000\000\000\000P.\341\357\032\177\000\000l\237\017\003\000\000\000\000P\320q\351\032\177\000\000@\320q\351\032\177\000\000\200H\377\360\032\177\000\000\017\000\000\000\000\000\000\000"...}, tid = 3}, DeadlySignal = {<__asan::ErrorBase> = {scariness = {score = 51,
descr = "8-byte-read-heap-use-after-free\000\032\177\000\000\210\317q\351\032\177\000\000\001\000\000\000\000\000\000\000\016\325\335\360\032\177\000\000\230\002\000\000\000\000\000\000P.\341\357\032\177\000\000\320\321\377\360\032\177\000\000\210\317q\351\032\177\000\000\204\317q\351\032\177\000\000h9\306\360\032\177\000\000 \320q\351\032\177\000\000\000\000\000\000\000\000\000\000\360\205\340\357\032\177\000\000\063!@\000\000\000\000\000\070\333\347\303\000\000\000\000H\030@\000\000\000\000\000P.\341\357\032\177\000\000l\237\017\003\000\000\000\000P\320q\351\032\177\000\000@\320q\351\032\177\000\000\200H\377\360\032\177\000\000\017\000\000\000\000\000\000\000"...}, tid = 3}, signal = {siginfo = 0x2, context = 0x6130074205f0, addr = 3, pc = 3,
sp = 5440349067550916617, bp = 106858908091888, is_memory_access = 48, write_flag = __sanitizer::SignalContext::UNKNOWN}},
DoubleFree = {<__asan::ErrorBase> = {scariness = {score = 51,
descr = "8-byte-read-heap-use-after-free\000\032\177\000\000\210\317q\351\032\177\000\000\001\000\000\000\000\000\000\000\016\325\335\360\032\177\000\000\230\002\000\000\000\000\000\000P.\341\357\032\177\000\000\320\321\377\360\032\177\000\000\210\317q\351\032\177\000\000\204\317q\351\032\177\000\000h9\306\360\032\177\000\000 \320q\351\032\177\000\000\000\000\000\000\000\000\000\000\360\205\340\357\032\177\000\000\063!@\000\000\000\000\000\070\333\347\303\000\000\000\000H\030@\000\000\000\000\000P.\341\357\032\177\000\000l\237\017\003\000\000\000\000P\320q\351\032\177\000\000@\320q\351\032\177\000\000\200H\377\360\032\177\000\000\017\000\000\000\000\000\000\000"...}, tid = 3}, second_free_stack = 0x2, addr_description = {addr = 106858908091888,
alloc_tid = 3, free_tid = 3, alloc_stack_id = 427819017, free_stack_id = 1266679975, chunk_access = {bad_addr = 106858908091888,
offset = 48, chunk_begin = 106858908091840, chunk_size = 355, access_type = 2, alloc_type = 1}}},
NewDeleteSizeMismatch = {<__asan::ErrorBase> = {scariness = {score = 51,
descr = "8-byte-read-heap-use-after-free\000\032\177\000\000\210\317q\351\032\177\000\000\001\000\000\000\000\000\000\000\016\325\335\360\032\177\000\000\230\002\000\000\000\000\000\000P.\341\357\032\177\000\000\320\321\377\360\032\177\000\000\210\317q\351\032\177\000\000\204\317q\351\032\177\000\000h9\306\360\032\177\000\000 \320q\351\032\177\000\000\000\000\000\000\000\000\000\000\360\205\340\357\032\177\000\000\063!@\000\000\000\000\000\070\333\347\303\000\000\000\000H\030@\000\000\000\000\000P.\341\357\032\177\000\000l\237\017\003\000\000\000\000P\320q\351\032\177\000\000@\320q\351\032\177\000\000\200H\377\360\032\177\000\000\017\000\000\000\000\000\000\000"...}, tid = 3}, free_stack = 0x2, addr_description = {addr = 106858908091888, alloc_tid = 3,
free_tid = 3, alloc_stack_id = 427819017, free_stack_id = 1266679975, chunk_access = {bad_addr = 106858908091888, offset = 48,
chunk_begin = 106858908091840, chunk_size = 355, access_type = 2, alloc_type = 1}}, delete_size = 7012368136350825216},
FreeNotMalloced = {<__asan::ErrorBase> = {scariness = {score = 51,
descr = "8-byte-read-heap-use-after-free\000\032\177\000\000\210\317q\351\032\177\000\000\001\000\000\000\000\000\000\000\016\325\335\360\032\177\000\000\230\002\000\000\000\000\000\000P.\341\357\032\177\000\000\320\321\377\360\032\177\000\000\210\317q\351\032\177\000\000\204\317q\351\032\177\000\000h9\306\360\032\177\000\000 \320q\351\032\177\000\000\000\000\000\000\000\000\000\000\360\205\340\357\032\177\000\000\063!@\000\000\000\000\000\070\333\347\303\000\000\000\000H\030@\000\000\000\000\000P.\341\357\032\177\000\000l\237\017\003\000\000\000\000P\320q\351\032\177\000\000@\320q\351\032\177\000\000\200H\377\360\032\177\000\000\017\000\000\000\000\000\000\000"...}, tid = 3}, free_stack = 0x2, addr_description = {data = {kind = 121767408, {shadow = {
addr = 3, kind = (__asan::kShadowKindGap | __asan::kShadowKindHigh), shadow_byte = 0 '\000'}, heap = {addr = 3, alloc_tid = 3,
free_tid = 5440349067550916617, alloc_stack_id = 121767408, free_stack_id = 24880, chunk_access = {bad_addr = 48,
offset = 106858908091840, chunk_begin = 355, chunk_size = 5432374, access_type = 0, alloc_type = 0}}, stack = {addr = 3, tid = 3,
bp = 139753562430240
pc = <optimized out>
local_stack = 139753562430384
sp = 139753562430224
parent = 0x61300ade2fc0
w = 0x61300ade2fc0
node = 0x61300ade2fe8
store = {root = 0x61300ade2fe8, count = 2, cmp = 0x632871 <reuse_id_cmp>}
tm = 0x6040089ac910
__func__ = "comm_timer_callback"
max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0) at event.c:1691
evcb_callback = <optimized out>
res = 1
ev = 0x60c0246c97c0
evcb = <optimized out>
count = 2
__func__ = "event_process_active_single_queue"
activeq = <optimized out>
i = 0
c = 0
tv = {tv_sec = 8936595, tv_usec = 347428}
maxcb = 2147483647
endtime = 0x0
limit_after_prio = 2147483647
activeq = <optimized out>
i = <optimized out>
c = <optimized out>
endtime = <optimized out>
tv = <optimized out>
maxcb = <optimized out>
limit_after_prio = <optimized out>
n = <optimized out>
evsel = 0x7f1aef965b20 <epollops>
tv = {tv_sec = 0, tv_usec = 880}
tv_p = <optimized out>
res = <optimized out>
done = 0
retval = 0
__func__ = "event_base_loop"
No locals.
retval = 32767
No locals.
worker = 0x62300000e100
port_num = 3
ret = <optimized out>
pd = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139753562433280, 5629422553894277874, 140735375915374, 140735375915375, 0, 139753562431296,
-5752667792546409742, -5752671231268433166}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0,
canceltype = 0}}}
not_first_call = <optimized out>
No locals.
(gdb)
Now, if we dig into the point of the thread where Unbound was terminated by the sanitizer:

(gdb) bt
at ../../../../libsanitizer/asan/asan_report.cc:181
is_write=is_write@entry=false, access_size=access_size@entry=8, exp=0, fatal=true) at ../../../../libsanitizer/asan/asan_report.cc:397
max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0) at event.c:1691
(gdb) fr 7
```
577                     for (node = node->right; node->left != RBTREE_NULL; node = node->left);
(gdb) list
572     {
573             rbnode_type *parent;
574
575             if (node->right != RBTREE_NULL) {
576                     /* One right, then keep on going left... */
577                     for (node = node->right; node->left != RBTREE_NULL; node = node->left);
578             } else {
579                     parent = node->parent;
580                     while (parent != RBTREE_NULL && node == parent->right) {
581                             node = parent;
```
So the thread is crashing in the for() loop (line 577 in util/rbtree.c). As I understand it, free() was called on the node somewhere before this loop, so the node = node->right operation provoked the crash. Could you please look at the stack trace I put above? Maybe it will point to the place in your code with the root cause.
Big thanks in advance!
Hi @iruzanov! I am currently working on some race condition bugs (related to TCP code) and this one seems like one of them. I was planning to update you here when they were solved but good that you already reported. Can I count you in for testing? Also, it would be interesting to see how often you get those, so that we know what to expect for reproducibility in production.
Sure, I am totally at your service! ;) As for the rate of crashes: on less loaded sites Unbound works well, maybe one core dump per month. On more loaded sites, which also have a lot of TCP traffic, I see one or two crashes per day.
Hello, @gthess!
A couple of days ago I received an email with the subject "Unbound DoS vulnerability (only with debugging enabled!)" from your colleagues. In short, it is a warning to use "--disable-debug --disable-checking" when compiling Unbound, to avoid resolver termination when it receives a crafted packet from an upstream. It affects versions 1.13.x and 1.14.0 of the program. So my question is: does this relate to the race conditions in the TCP code that you mentioned above? And are there ready fixes that I could test myself in the Unbound (1.13.2) running on my servers? ;)
Using '--disable-debug --disable-checking' explicitly turns off debugging (if you are not sure what is happening with the configure line in your environment). If you don't specify them at all, debugging is off by default. It does not relate to the race condition. It happened to be found during testing for the race condition bug. Apart from that I don't have complete fixes for the race condition yet.
Thanks for the detailed clarification! My Unbound resolvers are now running without deep debugging. And I am ready to recompile the program at any time you ask.
Hi @iruzanov, The above PR addresses the issues I was seeing with the race condition. It will be merged into master somewhere this week but if you would like to test already you can build from that branch in the meantime.
Hello @gthess!
I will gladly test your fixpack mentioned above, BIG thank you! Next week I will report my first observations for Unbound on a couple of my loaded servers.
Just a heads up that the branch is now merged to master which also includes other (relevant) fixes.
Ohh, ok! So I need to recompile my first patched Unbound again, because I fetched the master branch with the fixes from #612 yesterday :)
I would just go for the master branch as is since it contains other fixes. For this case, https://github.com/NLnetLabs/unbound/commit/8e76eb95a0d4fcb8bdfe4a6b5d84f279b71d8bc1 addresses a dnstap issue that could trigger something similar.
Thank you @gthess! I compiled Unbound from the master branch once again a few hours ago. My plan is to observe the server with the patched program next week, and then I would like to upgrade some other of my loaded servers. I will report my first results next week!
Hello @gthess!
My first resolver with the new code from the master branch has worked fine during the last week. Today I upgraded the next two of my loaded resolvers to the latest code. I will monitor these servers for 5-7 days and then report on the status of the upgrade procedure.
Hi, @gthess! I gladly report the stable operation of my three patched, very loaded resolvers over the last 7 days. So I've decided to patch the next three on Monday and monitor them until Friday.
That's always good to hear, thanks for always reporting back!
Hello, Wouter!
I am actively using unbound-1.13.1 (with our DNSTAP patches, issue #367). Sometimes my Unbound crashes under high load, with massive recursive TCP requests. All abnormal terminations are caused by the services/outside_network.c code. Now I have one of those core dumps:
(gdb) bt
0 0x0000000800955c2a in thr_kill () from /lib/libc.so.7
1 0x0000000800954084 in raise () from /lib/libc.so.7
2 0x00000008008ca279 in abort () from /lib/libc.so.7
3 0x0000000800464641 in ?? () from /usr/local/lib/libevent-2.1.so.7
4 0x0000000800464939 in event_errx () from /usr/local/lib/libevent-2.1.so.7
5 0x000000080045ec54 in evmap_iodel () from /usr/local/lib/libevent-2.1.so.7
6 0x0000000800457e8f in event_delnolock () from /usr/local/lib/libevent-2.1.so.7
7 0x000000080045ada8 in event_del () from /usr/local/lib/libevent-2.1.so.7
8 0x000000000030e25b in ub_event_del (ev=<optimized out>) at ./util/ub_event.c:395
9 comm_point_close (c=0xdc97b7c00) at ./util/netevent.c:3860
10 0x0000000000315bab in decommission_pending_tcp (outnet=<optimized out>, pend=0xdc9494980)
11 0x00000000003147d6 in reuse_cb_and_decommission (outnet=0x18e75, pend=0x6, error=-2)
12 0x0000000000317491 in outnet_tcptimer (arg=0xee67c2300) at ./services/outside_network.c:2033
13 0x000000080045e0ed in ?? () from /usr/local/lib/libevent-2.1.so.7
14 0x000000080045a09c in event_base_loop () from /usr/local/lib/libevent-2.1.so.7
15 0x000000000024dc54 in thread_start (arg=0x8014c0800) at ./util/ub_event.c:280
16 0x0000000800780fac in ?? () from /lib/libthr.so.3
17 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffdf7fa000 (gdb)
If we enter frame 12 (outnet_tcptimer) and print the pend structure, we will see the following:
(gdb) print pend
$15 = (struct pending_tcp *) 0x6
(gdb) print *pend
Cannot access memory at address 0x6
(gdb)
And this corrupt pend structure is passed to the reuse_cb_and_decommission() function (frame 11) and further up in the stack trace output above.
In the outnet_tcptimer() function we can see the following code (in services/outside_network.c):
/* it was in use */
struct pending_tcp* pend = (struct pending_tcp*)w->next_waiting;
But the w->next_waiting structure is of type waiting_tcp:
(gdb) print w->next_waiting
$18 = (struct waiting_tcp *) 0xdc9494980
(gdb)
So my question is: is the type cast in the outnet_tcptimer() function correct? And does this corrupt pend structure cause event_errx() in libevent? If it might help, I found a structure of type pending_tcp through the w structure:
(gdb) print *w->outnet->tcp_free
..., right = 0x3287d0 <rbtree_null_node>, key = 0x0, color = 1 '\001'}, addr = {
ss_len = 0 '\000', ss_family = 2 '\002', ss_pad1 = "\000\065X\320\017\067", __ss_align = 0,
ss_pad2 = "\000\000\000\000\000\000\000\016", '\000' <repeats 103 times>}, addrlen = 16, is_ssl = 0,
lru_next = 0xdc9494ae0, lru_prev = 0x0, item_on_lru_list = 0, pending = 0xdc9494980, cp_more_read_again = 0,
cp_more_write_again = 0, tree_by_id = {root = 0x3287d0 <rbtree_null_node>, count = 0,
cmp = 0x3133e0}, write_wait_first = 0x0, write_wait_last = 0x0, outnet = 0xd7d805000}
(gdb)
$23 = (struct pending_tcp *) 0xdc9494980
(gdb)
(gdb) print *w->outnet->tcp_free
$24 = {next_free = 0xdc9493e40, pi = 0xd7da2c000, c = 0xdc97b7c00, query = 0x0, reuse = {node = {parent = 0xdc94953a0, left = 0x3287d0 <rbtree_null_node>
Big thank you in advance!
PS: I did not send the core file itself because it is 31 GB in size.