NLnetLabs / unbound

Unbound is a validating, recursive, and caching DNS resolver.
https://nlnetlabs.nl/unbound
BSD 3-Clause "New" or "Revised" License

Unbound-1.13.1 crashed by SIGABRT #469

Closed iruzanov closed 1 year ago

iruzanov commented 3 years ago

Hello, Wouter!

I am actively using unbound-1.13.1 (with our DNSTAP patches, issue #367). Sometimes my unbound crashes under high load, with massive numbers of recursive TCP requests. All of the abnormal terminations are caused by code in services/outside_network.c. Here is one of the core dumps:

(gdb) bt

#0  0x0000000800955c2a in thr_kill () from /lib/libc.so.7
#1  0x0000000800954084 in raise () from /lib/libc.so.7
#2  0x00000008008ca279 in abort () from /lib/libc.so.7
#3  0x0000000800464641 in ?? () from /usr/local/lib/libevent-2.1.so.7
#4  0x0000000800464939 in event_errx () from /usr/local/lib/libevent-2.1.so.7
#5  0x000000080045ec54 in evmap_iodel () from /usr/local/lib/libevent-2.1.so.7
#6  0x0000000800457e8f in event_delnolock () from /usr/local/lib/libevent-2.1.so.7
#7  0x000000080045ada8 in event_del () from /usr/local/lib/libevent-2.1.so.7
#8  0x000000000030e25b in ub_event_del (ev=) at ./util/ub_event.c:395
#9  comm_point_close (c=0xdc97b7c00) at ./util/netevent.c:3860
#10 0x0000000000315bab in decommission_pending_tcp (outnet=, pend=0xdc9494980)
    at ./services/outside_network.c:945
#11 0x00000000003147d6 in reuse_cb_and_decommission (outnet=0x18e75, pend=0x6, error=-2)
    at ./services/outside_network.c:986
#12 0x0000000000317491 in outnet_tcptimer (arg=0xee67c2300) at ./services/outside_network.c:2033
#13 0x000000080045e0ed in ?? () from /usr/local/lib/libevent-2.1.so.7
#14 0x000000080045a09c in event_base_loop () from /usr/local/lib/libevent-2.1.so.7
#15 0x000000000024dc54 in thread_start (arg=0x8014c0800) at ./util/ub_event.c:280
#16 0x0000000800780fac in ?? () from /lib/libthr.so.3
#17 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffdf7fa000
(gdb)

If we enter frame 12 (outnet_tcptimer) and print the pend structure, we see the following:

(gdb) print pend
$15 = (struct pending_tcp *) 0x6
(gdb) print *pend
Cannot access memory at address 0x6
(gdb)

This corrupt pend pointer is then passed to the reuse_cb_and_decommission() function (frame 11) and on to the frames above it in the backtrace.

In the outnet_tcptimer() function we can see the following code (in services/outside_network.c):

/* it was in use */
struct pending_tcp* pend=(struct pending_tcp*)w->next_waiting;

But w->next_waiting is a pointer of type waiting_tcp:

(gdb) print w->next_waiting
$18 = (struct waiting_tcp *) 0xdc9494980
(gdb)

So my question is: is the type cast in the outnet_tcptimer() function correct? And does this corrupt pend structure cause the event_errx() in libevent? In case it helps, I found a structure of type pending_tcp reachable from the w structure:

(gdb) print w->outnet->tcp_free
$23 = (struct pending_tcp *) 0xdc9494980
(gdb) print *w->outnet->tcp_free
$24 = {next_free = 0xdc9493e40, pi = 0xd7da2c000, c = 0xdc97b7c00, query = 0x0, reuse = {node = {parent = 0xdc94953a0, left = 0x3287d0, right = 0x3287d0, key = 0x0, color = 1 '\001'}, addr = {ss_len = 0 '\000', ss_family = 2 '\002', ss_pad1 = "\000\065X\320\017\067", __ss_align = 0, ss_pad2 = "\000\000\000\000\000\000\000\016", '\000' <repeats 103 times>}, addrlen = 16, is_ssl = 0, lru_next = 0xdc9494ae0, lru_prev = 0x0, item_on_lru_list = 0, pending = 0xdc9494980, cp_more_read_again = 0, cp_more_write_again = 0, tree_by_id = {root = 0x3287d0, count = 0, cmp = 0x3133e0}, write_wait_first = 0x0, write_wait_last = 0x0, outnet = 0xd7d805000}}
(gdb)
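
On the cast question, one possible reading (an assumption here, not something this thread confirms) is that the next_waiting field is deliberately reused: while a query waits for a free TCP slot it links to the next waiting_tcp, and once the query is in use on a connection the same field carries the pending_tcp pointer. Under that assumption the cast in outnet_tcptimer() would be intentional, which would fit the gdb output above, where w->next_waiting and w->outnet->tcp_free hold the same address 0xdc9494980. A minimal C sketch of that pattern follows; the `_sketch` structs and both helper functions are hypothetical and simplified, not Unbound's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a TCP connection slot (not Unbound's struct). */
struct pending_tcp_sketch {
    void* c;   /* placeholder for a connection handle */
    int id;    /* illustrative payload */
};

/* Hypothetical stand-in for a queued or in-flight query. */
struct waiting_tcp_sketch {
    /* Assumption: this field has two roles. On the waiting list it points
     * to the next waiting entry; once in use it stores the connection slot. */
    struct waiting_tcp_sketch* next_waiting;
    int on_tcp_waiting_list;
};

/* Attach the query to a connection slot, reusing next_waiting as storage. */
void mark_in_use(struct waiting_tcp_sketch* w, struct pending_tcp_sketch* pend)
{
    assert(w != NULL && pend != NULL);
    w->on_tcp_waiting_list = 0;
    w->next_waiting = (struct waiting_tcp_sketch*)pend;
}

/* Recover the slot; the cast is only meaningful when the entry is in use. */
struct pending_tcp_sketch* pending_of(struct waiting_tcp_sketch* w)
{
    if(w->on_tcp_waiting_list)
        return NULL; /* field currently links the waiting list instead */
    return (struct pending_tcp_sketch*)w->next_waiting;
}
```

If this reading holds, the value 0x6 shown for pend in frame 11 would not by itself prove the cast is wrong: gdb can display unreliable argument values for frames in optimized builds, and frame 10 reports pend=0xdc9494980, the same address as w->next_waiting.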

Big thank you in advance!

PS: I did not send the core file itself because it is 31 GB in size.

iruzanov commented 2 years ago

Hi, @gthess!

I'm sorry for the delay. I was waiting for a core dump on one of the loaded resolvers that I had not patched yet, and today such a resolver crashed ;) So, a few hours ago I patched the next three loaded resolvers. The resolvers patched last week work fine. Please give me some time to monitor all of the patched resolvers alongside the unpatched ones, to see whether a core dump happens on any of them.

iruzanov commented 2 years ago

Hello, @gthess! I plan to upgrade the next set of my resolvers from the master branch (which I saved one month ago, version 1.14.1). But I saw that the new version 1.15.0 of Unbound was released on 10 February. So can I upgrade to this version, 1.15.0? Or is 1.14.1 more suitable for my testing?

gthess commented 2 years ago

Hi! 1.14.1 was supposed to be the version after 1.14.0, but since the new code included changes to the ratelimit logic and the default value of aggressive nsec, we had to increase the major version. Thus, 1.14.1 turned into 1.15.0 upon release. So to answer your question, 1.15.0 is better to use than a specific point during development (and it includes all the relevant fixes). It also went through the testing we do before releasing.

Btw, are you subscribed to the unbound-users mailing list? Releases (and release candidates) are announced there early.

iruzanov commented 2 years ago

Thank you for your answer! I will upgrade all of my resolvers to version 1.15.0 then.

No, I am not subscribed yet. Usually I just visit your website to check for the latest news about NSD and Unbound ;) OK, I will subscribe.

gthess commented 2 years ago

In that case nsd-users can also be useful to you.

gthess commented 1 year ago

Closing this as resolved by now.