Closed jaredmauch closed 3 years ago
Something must be wrong. qname NULL is not possible, it should have returned a formerr. Also your debug shows that delegation_domain is NULL, and this is asserted the line above the delegation_rrset. So it should have failed one line above. These inconsistencies make me believe that there is a miscompile of some sort. Like the debug information does not match the executable. Or the executable is compiled with mismatched .o files, for example from different NSD versions; make clean and make it again.
Yeah, let me make sure running version matches package installed version, but it should be, it's quite odd and started in the past hour.
I also checked for q->delegation_rrset NULL as condition, and from reading the code it should be impossible to have delegation_rrset NULL at that point, because in an earlier if statement the delegation_domain was checked before the routine is called, and is thus certain to be nonNULL, and if the delegation_domain is not NULL, then in the lookup routine for the delegation_rrset we are sure that the delegation_rrset is not NULL.
This trail is code is namedb.c:587 domain_find_ns_rrsets assigns to delegation_rrset and only then is delegation_domain returned, to query.c:1420 where the domain_find_ns_rrsets is called. Then a couple lines lower it checks if delegation_domain is NULL and then does not call answer_delegation() at all.
Looks to be related to v6 RDNS zone?
(gdb) up
#1 0x00007fb6a44c88a4 in abort () from /lib64/libc.so.6
(gdb) up
#2 0x00007fb6a44c8789 in __assert_fail_base.cold () from /lib64/libc.so.6
(gdb) up
#3 0x00007fb6a44d7a16 in __assert_fail () from /lib64/libc.so.6
(gdb) up
#4 0x000055e2adb12b43 in answer_delegation (answer=0x7ffc9ab81d20, query=0x55e2c36010a0) at /usr/src/debug/nsd-4.3.7-2.fc34.x86_64/query.c:894
894 assert(query->delegation_rrset);
(gdb) up
#5 answer_lookup_zone (nsd=0x55e2adbad340 <nsd>, q=0x55e2c36010a0, answer=0x7ffc9ab81d20, domain_number=0, exact=<optimized out>, closest_match=0x55e2b1c86573,
closest_encloser=<optimized out>, qname=0x55e2c3601300) at /usr/src/debug/nsd-4.3.7-2.fc34.x86_64/query.c:1438
1438 answer_delegation(q, answer);
(gdb) print q
$1 = (struct query *) 0x55e2c36010a0
(gdb) print q->qname
$2 = (const dname_type *) 0x55e2c3601300
(gdb) print *q->qname
$3 = {name_size = 74 'J', label_count = 35 '#'}
(gdb) print*q->qname
$4 = {name_size = 74 'J', label_count = 35 '#'}
(gdb) print answer
$5 = (answer_type *) 0x7ffc9ab81d20
(gdb) print *answer
value of type `answer_type' requires 204808 bytes, which is more than max-value-size
(gdb) down
#4 0x000055e2adb12b43 in answer_delegation (answer=0x7ffc9ab81d20, query=0x55e2c36010a0) at /usr/src/debug/nsd-4.3.7-2.fc34.x86_64/query.c:894
894 assert(query->delegation_rrset);
(gdb) up
#5 answer_lookup_zone (nsd=0x55e2adbad340 <nsd>, q=0x55e2c36010a0, answer=0x7ffc9ab81d20, domain_number=0, exact=<optimized out>, closest_match=0x55e2b1c86573,
closest_encloser=<optimized out>, qname=0x55e2c3601300) at /usr/src/debug/nsd-4.3.7-2.fc34.x86_64/query.c:1438
1438 answer_delegation(q, answer);
(gdb) up
#6 0x000055e2adb12cbc in answer_query (nsd=nsd@entry=0x55e2adbad340 <nsd>, q=q@entry=0x55e2c36010a0) at /usr/src/debug/nsd-4.3.7-2.fc34.x86_64/query.c:1456
1456 answer_lookup_zone(nsd, q, &answer, 0, exact, closest_match,
(gdb) print q
$6 = (struct query *) 0x55e2c36010a0
(gdb) print *q
$7 = {region = 0x55e2c158b410, addr = {ss_family = 2, __ss_padding = "\242\346\003\357\233\354", '\000' <repeats 111 times>, __ss_align = 0}, addrlen = 16, maxlen = 1232,
reserved_space = 11, edns = {status = EDNS_OK, position = 90, maxlen = 1232, opt_reserved_space = 0, dnssec_ok = 32768, nsid = 0, cookie_status = COOKIE_NOT_PRESENT,
cookie_len = 0, cookie = '\000' <repeats 39 times>, ede = -1, ede_text = 0x0, ede_text_len = 0}, tsig = {status = TSIG_NOT_PRESENT, position = 0, response_count = 0,
updates_since_last_prepare = 0, context = 0x0, algorithm = 0x0, key = 0x0, prior_mac_size = 0, prior_mac_data = 0x0, rr_region = 0x55e2c1679e30,
context_region = 0x55e2c17b7d00, key_name = 0x0, algorithm_name = 0x0, signed_time_high = 0, signed_time_low = 0, signed_time_fudge = 0, mac_size = 0, mac_data = 0x0,
original_query_id = 0, error_code = 0, other_size = 0, other_data = 0x0}, tsig_prepare_it = 1, tsig_update_it = 1, tsig_sign_it = 1, tcp = 0, tcplen = 0,
packet = 0x55e2c293cdf8, qname = 0x55e2c3601300, qtype = 12, qclass = 1, zone = 0x55e2b1c78806, delegation_domain = 0x55e2b1c86573, delegation_rrset = 0x0, opcode = 0 '\000',
cname_count = 0, compressed_dname_count = 0, compressed_dnames = 0x55e2adbaf040 <compressed_dnames.lto_priv>, compressed_dname_offsets = 0x55e2c29a8ad0,
compressed_dname_offsets_size = 415137, number_temporary_domains = 0, axfr_is_done = 0, axfr_zone = 0x0, axfr_current_domain = 0x0, axfr_current_rrset = 0x0,
axfr_current_rr = 0, wildcard_domain = 0x0}
(gdb) print *q->qname
$8 = {name_size = 74 'J', label_count = 35 '#'}
(gdb) print *q->packet
$9 = {_position = 90, _limit = 131337, _capacity = 131337, _data = 0x55e2c3605310 "s4\200", _fixed = 0}
(gdb) print *q->packet._data
$10 = 115 's'
(gdb) print q->packet._data
$11 = (uint8_t *) 0x55e2c3605310 "s4\200"
(gdb) print q->packet
$12 = (buffer_type *) 0x55e2c293cdf8
(gdb) print *q->packet
$13 = {_position = 90, _limit = 131337, _capacity = 131337, _data = 0x55e2c3605310 "s4\200", _fixed = 0}
(gdb) up
#7 0x000055e2adb1586f in query_process (q=0x55e2c36010a0, nsd=0x55e2adbad340 <nsd>, now_p=0x7ffc9abb3f7c) at /usr/src/debug/nsd-4.3.7-2.fc34.x86_64/query.c:1679
1679 answer_query(nsd, q);
(gdb) print q
$14 = (query_type *) 0x55e2c36010a0
(gdb) print *q
$15 = {region = 0x55e2c158b410, addr = {ss_family = 2, __ss_padding = "\242\346\003\357\233\354", '\000' <repeats 111 times>, __ss_align = 0}, addrlen = 16, maxlen = 1232,
reserved_space = 11, edns = {status = EDNS_OK, position = 90, maxlen = 1232, opt_reserved_space = 0, dnssec_ok = 32768, nsid = 0, cookie_status = COOKIE_NOT_PRESENT,
cookie_len = 0, cookie = '\000' <repeats 39 times>, ede = -1, ede_text = 0x0, ede_text_len = 0}, tsig = {status = TSIG_NOT_PRESENT, position = 0, response_count = 0,
updates_since_last_prepare = 0, context = 0x0, algorithm = 0x0, key = 0x0, prior_mac_size = 0, prior_mac_data = 0x0, rr_region = 0x55e2c1679e30,
context_region = 0x55e2c17b7d00, key_name = 0x0, algorithm_name = 0x0, signed_time_high = 0, signed_time_low = 0, signed_time_fudge = 0, mac_size = 0, mac_data = 0x0,
original_query_id = 0, error_code = 0, other_size = 0, other_data = 0x0}, tsig_prepare_it = 1, tsig_update_it = 1, tsig_sign_it = 1, tcp = 0, tcplen = 0,
packet = 0x55e2c293cdf8, qname = 0x55e2c3601300, qtype = 12, qclass = 1, zone = 0x55e2b1c78806, delegation_domain = 0x55e2b1c86573, delegation_rrset = 0x0, opcode = 0 '\000',
cname_count = 0, compressed_dname_count = 0, compressed_dnames = 0x55e2adbaf040 <compressed_dnames.lto_priv>, compressed_dname_offsets = 0x55e2c29a8ad0,
compressed_dname_offsets_size = 415137, number_temporary_domains = 0, axfr_is_done = 0, axfr_zone = 0x0, axfr_current_domain = 0x0, axfr_current_rrset = 0x0,
axfr_current_rr = 0, wildcard_domain = 0x0}
(gdb) print *q->zone
$16 = {node = 0x55e2b1c7888b, apex = 0x55e2b1c78afe, soa_rrset = 0x55e2b1c78c06, soa_nx_rrset = 0x55e2b1c78c3a, ns_rrset = 0x55e2b1c78c76, nsec3_param = 0x0, nsec3_last = 0x0,
nsec3tree = 0x0, hashtree = 0x0, wchashtree = 0x0, dshashtree = 0x0, opts = 0x55e2afbfa85b, filename = 0x55e2b1c05922 "/etc/nsd/secondaries/0.8.c.7.7.0.6.2.ip6.arpa",
logstr = 0x0, mtime = {tv_sec = 1615943727, tv_nsec = 346618416}, zonestatid = 729, is_secure = 0, is_ok = 1, is_changed = 0}
(gdb) print *q->zone->node
$17 = {elem = 0x55e2b1c78806, parent = 0x55e2b1c86bc2, pidx = 0 '\000', offset = 0 '\000', len = 0, capacity = 0, array = 0x0}
(gdb) print *q->zone->node->elem
Attempt to dereference a generic pointer.
(gdb) print *q->zone->node->parent
$18 = {elem = 0x0, parent = 0x55e2b1c4bf84, pidx = 6 '\006', offset = 56 '8', len = 47, capacity = 47, array = 0x55e2b1c6ed40}
So this trace has qname and also delegation_domain not null. It is a query of type PTR.
It shows that delegation_rrset is NULL. But from the code that looks impossible. namedb.c:587, there the rrset is assigned nonNULL, and only then, if that is nonNULL, the result is notNULL and that is the delegation_domain. Later processing can make delegation_domain NULL. So it is not possible, from the code read to have delegation_domain != NULL and delegation_rrset == NULL. And this is what your debug trace shows.
If you had a manual compile I would suggest lowering optimisation options in the compiler. Like, not use -flto and drop -O. But you have a package. flto can be disabled with the configure switch --disable-flto and the -O2 option by passing CFLAGS="-g" to configure. I have heard of compilers that fail with high -O settings and failures (at compile time) for -flto, but have not heard of this kind of trouble, but I have no other idea how the code can make domain && !rrset happen here.
I could add a check in line query.c:1426 to check for delegation_rrset, but it is already guaranteed to be nonNULL by the previous check, really.
index 44b6690a..d0fba064 100644
--- a/query.c
+++ b/query.c
@@ -1424,6 +1424,7 @@ answer_lookup_zone(struct nsd *nsd, struct query *q, answer_type *answer,
}
if (!q->delegation_domain
+ || !q->delegation_rrset
|| (exact && q->qtype == TYPE_DS && closest_encloser == q->delegation_domain))
{
if (q->qclass == CLASS_ANY) {
some feedback:
[2021-08-21 00:51:29.888] nsd[2188979]: notice: setup TCP for TLS service on interface 204.42.254.5@853
[2021-08-21 00:51:29.889] nsd[2188979]: notice: setup TCP for TLS service on interface 2001:418:3f4::5@853
nsd: query.c:894: answer_delegation: Assertion `query->delegation_rrset' failed.
[2021-08-21 00:51:31.358] nsd[2188873]: warning: server 2188942 died unexpectedly, restarting
[2021-08-21 00:51:31.366] nsd[2188873]: warning: process 2188942 terminated with status 134
[2021-08-21 00:51:31.383] nsd[2188990]: notice: setup TCP for TLS service on interface 204.42.254.5@853
[2021-08-21 00:51:31.384] nsd[2188990]: notice: setup TCP for TLS service on interface 2001:418:3f4::5@853
nsd: query.c:894: answer_delegation: Assertion `query->delegation_rrset' failed.
[2021-08-21 00:51:37.438] nsd[2188873]: warning: server 2188979 died unexpectedly, restarting
I went and pulled from git and applied your change to query.c as well and I no longer see the issue when the query comes inbound, perhaps via TCP/TLS as noted above.
After:
[2021-08-21 00:57:29.961] nsd[2190098]: info: service remaining TCP connections[2021-08-21 00:57:29.961] nsd[2190099]: info: NSTATS 1629521849 1629521829 A=2452 NS=241 CNAME=8 SOA=21 PTR=267 MX=185 TXT=55 AAAA=1162 SRV=279 TYPE38=1 DS=1 DNSKEY=77 TLSA=4 HTTPS=23 SPF=1 TYPE255=2
[2021-08-21 00:57:29.961] nsd[2190099]: info: XSTATS 1629521849 1629521829 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=0 SAns=4677 SFwdQ=0 SDupQ=0 SErr=0 RQ=4780 RIQ=0 RFwdQ=0 RDupQ=0 RTCP=104 SFwdR=0 SFail=219 SFErr=0 SNaAns=29 SNXD=373 RUQ=0 RURQ=0 RUXFR=0 RUUpd=0
[2021-08-21 00:57:29.964] nsd[2190154]: notice: setup TCP for TLS service on interface 204.42.254.5@853
[2021-08-21 00:57:29.964] nsd[2190154]: notice: setup TCP for TLS service on interface 2001:418:3f4::5@853
[2021-08-21 00:57:29.968] nsd[2190155]: notice: setup TCP for TLS service on interface 204.42.254.5@853
[2021-08-21 00:57:29.969] nsd[2190155]: notice: setup TCP for TLS service on interface 2001:418:3f4::5@853
[2021-08-21 00:57:30.063] nsd[2190098]: info: NSTATS 1629521850 1629521829 A=2566 NS=213 CNAME=8 SOA=19 PTR=278 MX=179 TXT=52 AAAA=1229 SRV=282 TYPE38=3 DS=8 DNSKEY=79 TLSA=7 HTTPS=17 TYPE255=2
[2021-08-21 00:57:30.063] nsd[2190098]: info: XSTATS 1629521850 1629521829 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=0 SAns=4858 SFwdQ=0 SDupQ=0 SErr=0 RQ=4942 RIQ=0 RFwdQ=0 RDupQ=0 RTCP=84 SFwdR=0 SFail=196 SFErr=0 SNaAns=27 SNXD=437 RUQ=0 RURQ=0 RUXFR=0 RUUpd=0
Okay I have committed the fix, because it stops the problem from happening. Still think something must be wrong for this to happen, but the fix stops the crash so added that to the code repository.
712296f only hides the problem, it doesn't fix anything. The real fix is ba0002e
See #195 for an explanation why I think that is.
Starting today I've seen this crash occuring on my system where it appears that q->qname is NULL
here's some information, I have several core files.
Aug 20 12:00:49 puck nsd[1983194]: nsd: query.c:894: answer_delegation: Assertion
query->delegation_rrset' failed. Aug 20 12:00:49 puck nsd[1983197]: nsd: query.c:894: answer_delegation: Assertion
query->delegation_rrset' failed.