SSSD / sssd

A daemon to manage identity, authentication and authorization for centrally-managed systems.
https://sssd.io
GNU General Public License v3.0

If DNS service discovery is enabled, and name resolution for any IPA server times out, sssd gets stuck in offline mode #6512

Open yrro opened 1 year ago

yrro commented 1 year ago

My ISP inspects DNS traffic. Any responses with A records in the answer section for RFC 1918 addresses are dropped.

I believe this is being done by the router they provide because I see this behaviour when querying DNS servers on the internet as well as my ISP's own servers (i.e., dig @1.1.1.1 10-0-0-1.nip.io times out).
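
For illustration, the contrast looks roughly like this (nip.io simply echoes the embedded address back as an A record; the exact timeout message varies with the dig version):

    $ dig @1.1.1.1 +short 8-8-8-8.nip.io    # public address: answered normally
    8.8.8.8
    $ dig @1.1.1.1 +short 10-0-0-1.nip.io   # RFC 1918 answer: silently dropped
    ;; communications error to 1.1.1.1#53: timed out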

This causes SSSD to go into offline mode shortly after it starts up.

I'm seeing the following behaviour:

  1. SSSD queries the DNS for SRV records for _ldap._tcp.example.com
  2. SSSD receives the correct response (specifically, the answer section contains 3 SRV records and the additional section contains an A record for each SRV target, with a mixture of public and RFC 1918 addresses); the equivalent of:

    $ dig +noall +comments +answer +additional srv _ldap._tcp.example.com 
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61105
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 3, ADDITIONAL: 5
    
    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 65494
    ;; ANSWER SECTION:
    _ldap._tcp.example.com. 83071 IN  CNAME   _ldap._tcp.int._locations.example.com.
    _ldap._tcp.int._locations.example.com. 83071 IN SRV 50 100 389 ipa5.example.com.
    _ldap._tcp.int._locations.example.com. 83071 IN SRV 50 100 389 ipa6.example.com.
    _ldap._tcp.int._locations.example.com. 83071 IN SRV 0 100 389 ipa3.example.com.
    
    ;; ADDITIONAL SECTION:
    ipa3.example.com. 186     IN      A       192.0.2.38
    ipa5.example.com. 186     IN      A       10.0.1.18
    ipa6.example.com. 186     IN      A       10.0.2.20

    Here ipa3.example.com has a public IP address and the other two servers have RFC 1918 addresses.

  3. SSSD queries the DNS for A and AAAA records for each of the names in the answer section
  4. SSSD receives a response for ipa3.example.com; no answers are received for the other two names.
  5. SSSD goes into offline mode.

I think the problem is with the part of SSSD that tries to write out the /var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM file. If I set krb5_use_kdcinfo to False then I do not observe the queries from step 3 above, and SSSD stays in online mode.
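
For reference, that test amounts to the following sssd.conf change (the domain section name here is illustrative):

    [domain/example.com]
    # stop writing kdcinfo files; libkrb5 then locates KDCs on its own
    krb5_use_kdcinfo = False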

So it looks like if any of the IPA servers for a domain can't be resolved, the 'write out kdcinfo' process fails and SSSD goes into offline mode.

What I'd like to see is sssd handling this situation and continuing to talk to ipa3.example.com even though ipa5.example.com and ipa6.example.com are not resolvable.

i.e., a bit like what SSSD_KRB5_LOCATOR_IGNORE_DNS_FAILURES does in the Kerberos locator plugin, but controlling the behaviour of SSSD itself.
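
(For comparison, that locator-plugin knob is just an environment variable on whichever process uses libkrb5, e.g.:)

    $ SSSD_KRB5_LOCATOR_IGNORE_DNS_FAILURES=1 kinit yrro@EXAMPLE.COM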

Seen with sssd-2.7.4-1.fc37.x86_64.rpm on Fedora 37 and sssd 2.4.1-2 on Debian 11 "bullseye".

yrro commented 1 year ago

I've tried to reproduce this in a lab where I can block queries for individual DNS names and I can't get sssd to go into offline mode. I think this is because the blocking method I'm using causes my DNS server to return an error to the client. But that's not reproducing the same condition that I'm seeing on the real network. I need to figure out a way to have my DNS server drop the query and not provide a response to the client.
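
One way to get that drop-without-reply behaviour, sketched here assuming Unbound as the lab resolver ("deny" drops matching queries without sending any answer, unlike "refuse"/"static", which still reply):

    server:
        # queries for these names are dropped silently, mimicking the ISP's filtering
        local-zone: "ipa5.example.com." deny
        local-zone: "ipa6.example.com." deny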

My suspicion is that SSSD gets stuck in offline mode if its attempt to resolve one of the IPA server hostnames times out, but not if it fails within the deadline.

alexey-tikhonov commented 1 year ago

Hi @thalman,

I think the problem is with the part of SSSD that tries to write out the /var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM file. If I set krb5_use_kdcinfo to False then I do not observe the queries from step 3 above, and SSSD stays in online mode.

So it looks like if any of the IPA servers for a domain can't be resolved, the 'write out kdcinfo' process fails and SSSD goes into offline mode.

does this ^^ ring a bell to you?

thalman commented 1 year ago

I think the problem is with the part of SSSD that tries to write out the /var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM file. If I set krb5_use_kdcinfo to False then I do not observe the queries from step 3 above, and SSSD stays in online mode. So it looks like if any of the IPA servers for a domain can't be resolved, the 'write out kdcinfo' process fails and SSSD goes into offline mode.

does this ^^ ring a bell to you?

If I recall correctly, writing the kdcinfo file does not perform DNS resolution. It uses IP addresses if the resolution has already been done by the failover mechanism. If the name is not resolved yet, then it is written as an FQDN.

If there is a device dropping packets, then tuning timeouts might help. I would try to shorten dns_resolver_op_timeout and see if that helps.
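
(For reference, that tuning is a one-line sssd.conf change, value in seconds:)

    [domain/example.com]
    # per-request DNS resolver timeout; the default is 3
    dns_resolver_op_timeout = 1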

HTH

yrro commented 3 months ago

If there is a device dropping packets, then tuning timeouts might help. I would try to shorten dns_resolver_op_timeout and see if that helps.

It defaults to 3 seconds, which is pretty low, but lowering it to 1 (the lowest possible value) doesn't help here.

yrro commented 3 months ago

I think the problem is with the part of SSSD that tries to write out the /var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM file. If I set krb5_use_kdcinfo to False then I do not observe the queries from step 3 above, and SSSD stays in online mode.

It's a couple of years later and the behaviour of sssd (2.9.5) is a bit different.

First of all, sssd takes a long time (upwards of 90 seconds) to start up (to be precise, until it tells systemd that it has completed starting up). This looks like a DNS query for $HOSTNAME that the ISP's DNS server forwards to my IPA domain's DNS server, which responds with an answer containing an RFC 1918 address, causing the ISP's DNS server to drop the transaction and send no reply to sssd. If I put an entry for $HOSTNAME into /etc/hosts then it starts up instantly.
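
The workaround is an ordinary hosts entry (the address below is a placeholder for the machine's real one):

    # /etc/hosts -- avoids the DNS lookup for the host's own name at startup
    192.0.2.10   myhost.example.com myhost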

Second of all, I can see in the backend logs attempts to query DNS for the two IPA servers with RFC1918 addresses:

Failed to resolve server 'ipa6.example.com': Timeout while contacting DNS servers
Marking server 'ipa6.example.com' as 'not working'
Couldn't resolve server (ipa6.example.com), resolver returned [5]: Input/output error

which is fine, because there's a third server that should be working. But then we end up with:

(2024-06-13  9:26:55): [be[example.com]] [be_resolve_server_process] (0x1000): [RID#5] Trying with the next one! 
(2024-06-13  9:26:55): [be[example.com]] [fo_resolve_service_send] (0x0100): [RID#5] Trying to resolve service 'IPA'
(2024-06-13  9:26:55): [be[example.com]] [get_server_status] (0x1000): [RID#5] Status of server 'ipa6.example.com' is 'not working'
(2024-06-13  9:26:55): [be[example.com]] [get_server_status] (0x1000): [RID#5] Status of server 'ipa3.example.com' is 'name resolved'
(2024-06-13  9:26:55): [be[example.com]] [get_port_status] (0x1000): [RID#5] Port status of port 389 for server 'ipa3.example.com' is 'not working'
(2024-06-13  9:26:55): [be[example.com]] [get_port_status] (0x0080): [RID#5] SSSD is unable to complete the full connection request, this internal status does not necessarily indicate network port issues.
(2024-06-13  9:26:55): [be[example.com]] [get_server_status] (0x1000): [RID#5] Status of server 'ipa5.example.com' is 'not working'
(2024-06-13  9:26:55): [be[example.com]] [get_server_status] (0x1000): [RID#5] Status of server 'ipa6.example.com' is 'not working'
(2024-06-13  9:26:55): [be[example.com]] [fo_resolve_service_send] (0x0020): [RID#5] No available servers for service 'IPA'
(2024-06-13  9:26:55): [be[example.com]] [be_resolve_server_done] (0x1000): [RID#5] Server [NULL] resolution failed: [5]: Input/output error
(2024-06-13  9:26:55): [be[example.com]] [dp_req_done] (0x0400): [RID#5] DP Request [Online Check #5]: Request handler finished [0]: Success
(2024-06-13  9:26:55): [be[example.com]] [dp_req_done] (0x20000): [RID#5] DP Request [Online Check #5]: Handling request took [2125.513] milliseconds.
(2024-06-13  9:26:55): [be[example.com]] [_dp_req_recv] (0x0400): [RID#5] DP Request [Online Check #5]: Receiving request data.
(2024-06-13  9:26:55): [be[example.com]] [dp_req_destructor] (0x0400): [RID#5] DP Request [Online Check #5]: Request removed. 
(2024-06-13  9:26:55): [be[example.com]] [dp_req_destructor] (0x0400): [RID#5] Number of active DP request: 0
(2024-06-13  9:26:55): [be[example.com]] [be_check_online_done] (0x0400): [RID#5] Backend is offline

The logs don't say why sssd isn't able to talk to ipa3.example.com. It resolves fine, and I can connect to it with ldapwhoami fine. But according to tcpdump, sssd doesn't even try to connect to it: no outgoing TCP packets are seen to port 389 or 636.
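
A capture filter along these lines is enough to confirm that (the interface choice is environment-specific):

    $ tcpdump -ni any 'tcp port 389 or tcp port 636'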

Third, disabling krb5_use_kdcinfo doesn't help any more. But as you said above, perhaps it never did and I was mistaken about this.

I'll continue to debug this later...

yrro commented 3 months ago

OK, I've realised the problem.

To reduce noise in the logs, I set ipa_server = ipa3.example.com to disable DNS service discovery.

(2024-06-13  9:36:42): [be[example.com]] [set_server_common_status] (0x0100): [RID#1] Marking server 'ipa3.example.com' as 'name resolved' 
(2024-06-13  9:36:42): [be[example.com]] [be_resolve_server_process] (0x1000): [RID#1] Saving the first resolved server 
(2024-06-13  9:36:42): [be[example.com]] [be_resolve_server_process] (0x0200): [RID#1] Found address for server ipa3.example.com: [2a01:4f8:c17:d9c6::1] TTL 996 
(2024-06-13  9:36:42): [be[example.com]] [ipa_resolve_callback] (0x0400): [RID#1] Constructed uri 'ldap://ipa3.example.com' 
(2024-06-13  9:36:42): [be[example.com]] [sssd_async_socket_init_send] (0x4000): [RID#1] Using file descriptor [25] for the connection. 
(2024-06-13  9:36:42): [be[example.com]] [sssd_async_connect_send] (0x0020): [RID#1] connect failed [101][Network is unreachable]. 
(2024-06-13  9:36:42): [be[example.com]] [sssd_async_socket_init_send] (0x0400): [RID#1] Setting 6 seconds timeout [ldap_network_timeout] for connecting 
(2024-06-13  9:36:42): [be[example.com]] [sssd_async_socket_init_done] (0x0040): [RID#1] sdap_async_sys_connect request failed: [101]: Network is unreachable. 
(2024-06-13  9:36:42): [be[example.com]] [sssd_async_socket_state_destructor] (0x0400): [RID#1] closing socket [25] 
(2024-06-13  9:36:42): [be[example.com]] [sss_ldap_init_sys_connect_done] (0x0020): [RID#1] sssd_async_socket_init request failed: [101]: Network is unreachable.
(2024-06-13  9:36:42): [be[example.com]] [sdap_sys_connect_done] (0x0020): [RID#1] sdap_async_connect_call request failed: [101]: Network is unreachable. 
(2024-06-13  9:36:42): [be[example.com]] [sdap_handle_release] (0x2000): [RID#1] Trace: sh[0x55aa316eefa0], connected[0], ops[(nil)], ldap[(nil)], destructor_lock[0], release_memory[0] 
(2024-06-13  9:36:42): [be[example.com]] [_be_fo_set_port_status] (0x8000): [RID#1] Setting status: PORT_NOT_WORKING. Called from: src/providers/ldap/sdap_async_connection.c: sdap_cli_connect_done: 1665 
(2024-06-13  9:36:42): [be[example.com]] [fo_set_port_status] (0x0100): [RID#1] Marking port 0 of server 'ipa3.example.com' as 'not working' 
(2024-06-13  9:36:42): [be[example.com]] [fo_set_port_status] (0x0400): [RID#1] Marking port 0 of duplicate server 'ipa3.example.com' as 'not working' 
(2024-06-13  9:36:42): [be[example.com]] [fo_resolve_service_send] (0x0100): [RID#1] Trying to resolve service 'IPA' 
(2024-06-13  9:36:42): [be[example.com]] [get_server_status] (0x1000): [RID#1] Status of server 'ipa3.example.com' is 'name resolved' 
(2024-06-13  9:36:42): [be[example.com]] [get_port_status] (0x1000): [RID#1] Port status of port 0 for server 'ipa3.example.com' is 'not working' 
(2024-06-13  9:36:42): [be[example.com]] [get_port_status] (0x0080): [RID#1] SSSD is unable to complete the full connection request, this internal status does not necessarily indicate network port issues. 
(2024-06-13  9:36:42): [be[example.com]] [fo_resolve_service_send] (0x0020): [RID#1] No available servers for service 'IPA' 
(2024-06-13  9:36:42): [be[example.com]] [be_resolve_server_done] (0x1000): [RID#1] Server [NULL] resolution failed: [5]: Input/output error
(2024-06-13  9:36:42): [be[example.com]] [sdap_id_op_connect_done] (0x0040): [RID#1] Failed to connect, going offline (5 [Input/output error])
(2024-06-13  9:36:42): [be[example.com]] [be_mark_offline] (0x2000): [RID#1] Going offline!

So the problem is that ipa3.example.com has both an AAAA record and an A record, but the ISP I'm on doesn't provide IPv6 connectivity, and sssd only tries to connect to the IPv6 address; it doesn't fall back to trying the IPv4 address.

So the workaround for that is to add an entry for the server's IPv4 address to /etc/hosts. With that, sssd goes and stays online!
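
Concretely, the entry is something like this, using ipa3's IPv4 address as it appears in the resolver logs elsewhere in this thread:

    # /etc/hosts -- pin ipa3 to its IPv4 address so the unreachable AAAA is never tried
    203.0.111.60   ipa3.example.com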

I then set ipa_server = _srv_ and restarted sssd, and it went back to going offline. I'll continue investigating that later. But in the meantime I'll summarize what I've found. These are all probably issues to raise separately; I'll do that later.

yrro commented 3 months ago

I had lookup_family_order = ipv6_first in my config file, as a holdover from debugging this back when I first opened the issue. I've now removed it and done another pass at looking into ipa_server = _srv_ causing sssd to get stuck in offline mode. Fortunately I don't think that setting is relevant to the failure.

With debug_level = 10 things are a little clearer...

==> /var/log/sssd/sssd_example.com.log <==
[...]
(2024-06-13 14:21:57): [be[example.com]] [sdap_get_rootdse_done] (0x2000): [RID#1] Got rootdse
[...]
(2024-06-13 14:21:57): [be[example.com]] [be_resolve_server_process] (0x0200): [RID#1] Found address for server ipa3.example.com: [203.0.111.60] TTL 7200
(2024-06-13 14:21:57): [be[example.com]] [sdap_kinit_kdc_resolved] (0x1000): [RID#1] KDC resolved, attempting to get TGT...
(2024-06-13 14:21:57): [be[example.com]] [create_tgt_req_send_buffer] (0x0400): [RID#1] buffer size: 72
(2024-06-13 14:21:57): [be[example.com]] [child_handler_setup] (0x2000): [RID#1] Setting up signal handler up for pid [356389]
(2024-06-13 14:21:57): [be[example.com]] [child_handler_setup] (0x2000): [RID#1] Signal handler set up for pid [356389]
(2024-06-13 14:21:57): [be[example.com]] [set_tgt_child_timeout] (0x0400): [RID#1] Setting 8 seconds timeout for TGT child
(2024-06-13 14:21:57): [be[example.com]] [sdap_process_result] (0x2000): Trace: sh[0x5555b17dea50], connected[1], ops[(nil)], ldap[0x5555b17d7700]
(2024-06-13 14:21:57): [be[example.com]] [sdap_process_result] (0x2000): Trace: end of ldap_result list
(2024-06-13 14:21:57): [be[example.com]] [_write_pipe_handler] (0x0400): [RID#1] All data has been sent!

==> /var/log/sssd/ldap_child.log <==
(2024-06-13 14:21:57): [ldap_child[356389]] [main] (0x0400): ldap_child started.
(2024-06-13 14:21:57): [ldap_child[356389]] [main] (0x2000): context initialized
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x1000): total buffer size: 72
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x1000): realm_str size: 17
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x1000): got realm_str: EXAMPLE.COM
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x1000): princ_str size: 31
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x1000): got princ_str: host/myhost.example.com
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x1000): keytab_name size: 0
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x1000): lifetime: 86400
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x0200): Will run as [0][0].
(2024-06-13 14:21:57): [ldap_child[356389]] [privileged_krb5_setup] (0x2000): Kerberos context initialized
(2024-06-13 14:21:57): [ldap_child[356389]] [main] (0x2000): Kerberos context initialized
(2024-06-13 14:21:57): [ldap_child[356389]] [become_user] (0x0200): Trying to become user [0][0].
(2024-06-13 14:21:57): [ldap_child[356389]] [become_user] (0x0200): Already user [0].
(2024-06-13 14:21:57): [ldap_child[356389]] [main] (0x2000): Running as [0][0].
(2024-06-13 14:21:57): [ldap_child[356389]] [main] (0x2000): getting TGT sync
(2024-06-13 14:21:57): [ldap_child[356389]] [ldap_child_get_tgt_sync] (0x2000): got realm_name: [EXAMPLE.COM]
(2024-06-13 14:21:57): [ldap_child[356389]] [ldap_child_get_tgt_sync] (0x0100): Principal name is: [host/myhost.example.com@EXAMPLE.COM]
(2024-06-13 14:21:57): [ldap_child[356389]] [ldap_child_get_tgt_sync] (0x0100): Using keytab [MEMORY:/etc/krb5.keytab]
(2024-06-13 14:21:57): [ldap_child[356389]] [ldap_child_get_tgt_sync] (0x0100): Will canonicalize principals
(2024-06-13 14:21:57): [ldap_child[356389]] [sss_child_krb5_trace_cb] (0x4000): [356389] 1718284917.303677: Getting initial credentials for host/myhost.example.com@EXAMPLE.COM
(2024-06-13 14:21:57): [ldap_child[356389]] [sss_child_krb5_trace_cb] (0x4000): [356389] 1718284917.303678: Found entries for host/myhost.example.com@EXAMPLE.COM in keytab: aes256-cts, aes128-cts
(2024-06-13 14:21:57): [ldap_child[356389]] [sss_child_krb5_trace_cb] (0x4000): [356389] 1718284917.303680: Sending unauthenticated request
(2024-06-13 14:21:57): [ldap_child[356389]] [sss_child_krb5_trace_cb] (0x4000): [356389] 1718284917.303681: Sending request (216 bytes) to EXAMPLE.COM

==> /var/log/sssd/sssd_example.com.log <==
(2024-06-13 14:21:58): [be[example.com]] [be_run_unconditional_online_cb] (0x4000): List of unconditional online callbacks is empty, nothing to do.
(2024-06-13 14:21:58): [be[example.com]] [check_if_online_delayed] (0x2000): Backend is already online, nothing to do.
(2024-06-13 14:22:05): [be[example.com]] [get_tgt_timeout_handler] (0x4000): [RID#1] timeout for sending SIGTERM to TGT child [356389] reached.
(2024-06-13 14:22:05): [be[example.com]] [get_tgt_timeout_handler] (0x0400): [RID#1] Setting 2 seconds timeout for sending SIGKILL to TGT child
(2024-06-13 14:22:05): [be[example.com]] [_read_pipe_handler] (0x0400): [RID#1] EOF received, client finished
(2024-06-13 14:22:05): [be[example.com]] [child_sig_handler] (0x1000): [RID#1] Waiting for child [356389].
(2024-06-13 14:22:05): [be[example.com]] [child_sig_handler] (0x0020): [RID#1] child [356389] failed with status [7].
(2024-06-13 14:22:05): [be[example.com]] [child_callback] (0x0020): [RID#1] LDAP child was terminated due to timeout
(2024-06-13 14:22:05): [be[example.com]] [sdap_kinit_done] (0x0080): [RID#1] Communication with KDC timed out, trying the next one
(2024-06-13 14:22:05): [be[example.com]] [_be_fo_set_port_status] (0x8000): [RID#1] Setting status: PORT_NOT_WORKING. Called from: src/providers/ldap/sdap_async_connection.c: sdap_kinit_done: 1269
(2024-06-13 14:22:05): [be[example.com]] [fo_set_port_status] (0x0100): [RID#1] Marking port 389 of server 'ipa3.example.com' as 'not working'
(2024-06-13 14:22:05): [be[example.com]] [fo_set_port_status] (0x0400): [RID#1] Marking port 389 of duplicate server 'ipa3.example.com' as 'not working'

So it looks like ldap_child is getting stuck and then being killed, resulting in ipa3 being marked as not responding.

I'm running wireshark while starting sssd up and I see the following:

  1. DNS SRV query for _ldap._tcp.example.com
  2. DNS A query for ipa3.example.com
  3. LDAP query of ipa3.example.com for root DSE (connection held open)
  4. DNS A/AAAA queries for ipa5.example.com and ipa6.example.com
  5. (10 seconds after the prior root DSE query) unbind request followed by TCP connection close

What's missing is any Kerberos traffic: ldap_child isn't sending anything on either UDP or TCP port 88. But the last thing it logged before being terminated looks exactly like KRB5_TRACE-style output from libkrb5... so why doesn't it actually send anything?

sumit-bose commented 3 months ago

Hi,

the Sending request (216 bytes) to EXAMPLE.COM message indicates that libkrb5 wants to send a request to a KDC for the EXAMPLE.COM realm. Given the old discussion I assume you have set krb5_use_kdcinfo = False, and libkrb5 will try DNS SRV and/or TXT records depending on the settings in /etc/krb5.conf, or look at the corresponding [realms] section in /etc/krb5.conf directly.

Are ipa5.example.com and ipa6.example.com listed as KDCs for EXAMPLE.COM in /etc/krb5.conf?

Maybe you can attach strace with the --follow-forks option to the sssd_be process to get more details about what libkrb5 is trying to do at this stage.
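
A sketch of that, assuming the backend process is named sssd_be (there may be one per domain; the syscall filter just keeps the output readable):

    # follow forks so ldap_child is traced too; log to a file for later inspection
    $ strace --follow-forks -p "$(pidof sssd_be)" -e trace=%network -o /tmp/sssd_be.trace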

bye, Sumit

yrro commented 3 months ago

Thanks for taking a look at this again.

I've been running all along with krb5_use_kdcinfo unset, so the plugin has been enabled throughout today's debugging.

Here's a backtrace of ldap_child after it logs Sending request (216 bytes) to IPA.EXAMPLE.COM:

#0  0x00007fb4f80d8c0b in ppoll () from /lib64/libc.so.6
#1  0x00007fb4f7716e0f in ppoll_usec.constprop () from /lib64/libnss_resolve.so.2
#2  0x00007fb4f77101f3 in varlink_call.constprop () from /lib64/libnss_resolve.so.2
#3  0x00007fb4f7707a38 in _nss_resolve_gethostbyname4_r () from /lib64/libnss_resolve.so.2
#4  0x00007fb4f8113941 in getaddrinfo () from /lib64/libc.so.6
#5  0x00007fb4f7cbe1b4 in sssd_krb5_locator_lookup () from /usr/lib64/krb5/plugins/libkrb5/sssd_krb5_locator_plugin.so
#6  0x00007fb4f822718a in locate_server () from /lib64/libkrb5.so.3
#7  0x00007fb4f8227ef0 in k5_locate_server () from /lib64/libkrb5.so.3
#8  0x00007fb4f822a1b8 in krb5_sendto_kdc () from /lib64/libkrb5.so.3
#9  0x00007fb4f8201a22 in k5_init_creds_get () from /lib64/libkrb5.so.3
#10 0x00007fb4f820430f in get_init_creds_keytab () from /lib64/libkrb5.so.3
#11 0x00007fb4f82043a9 in krb5_get_init_creds_keytab () from /lib64/libkrb5.so.3
#12 0x000055e3fa193192 in main ()

So I've just tested again: after setting krb5_use_kdcinfo = false, sssd stays online!

I should have tried this option earlier, but trust me, there were a lot of other combinations of options I tested to try to pin this down; I just hadn't tried krb5_use_kdcinfo yet.

BTW, from the man page:

If the environment variable SSSD_KRB5_LOCATOR_IGNORE_DNS_FAILURES is set to any value plugin will try to resolve all DNS names in kdcinfo file. By default plugin returns KRB5_PLUGIN_NO_HANDLE to the caller immediately on first DNS resolving failure.

I dunno if returning KRB5_PLUGIN_NO_HANDLE is good, neutral or bad in a case like mine, where effectively at least one server can't be resolved. Does it just disable use of the plugin so that libkrb5 falls back to classic behaviour?

Anyway, in this case it doesn't really matter: it looks like the plugin is taking too long to run, so the backend kills off the ldap_child before the plugin has a chance to finish.

I can't find where this 8-second timeout is set, but if there's a setting you know of then I'll try increasing the timeout and leaving the locator plugin enabled.

For the record, when the plugin is enabled, here's what the kdcinfo file looks like:

# cat /var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM
203.0.111.60
ipa6.example.com
ipa5.example.com

And with that file, KRB5_TRACE=/dev/stderr kinit hangs in a way very similar to ldap_child:

[374953] 1718291644.881060: Getting initial credentials for yrro@EXAMPLE.COM
[374953] 1718291644.881062: Sending unauthenticated request
[374953] 1718291644.881063: Sending request (187 bytes) to EXAMPLE.COM
[374953] 1718291657.802752: Sending initial UDP request to dgram 203.0.111.60:88
[374953] 1718291657.802753: Received answer (253 bytes) from dgram 203.0.111.60:88
[374953] 1718291658.552102: Response was from primary KDC

... note the 13-second delay between lines 3 and 4 of the above output.

yrro commented 3 months ago
  1. DNS SRV query for _ldap.tcp.example.com
  2. DNS A query for ipa3.example.com
  3. LDAP query of ipa3.example.com for root DSE (connection held open)
  4. DNS A/AAAA queries for ipa5.example.com and ipa6.example.com
  5. (10 seconds after the prior root DSE query) unbind request followed by TCP connection close

So the queries in no. 4 are actually from the locator plugin, and they are made via NSS, not c-ares. If I add addresses for ipa{5,6}.example.com to /etc/hosts then the locator plugin completes immediately.

So here's what I think happens:

  1. backend launches ldap_child to test if ipa3 is online
  2. ldap_child invokes krb5 locator plugin
  3. krb5 locator plugin tries to resolve the hostnames of ipa5/ipa6 from the kdcinfo file
  4. the configured DNS server never replies
  5. backend kills ldap_child

As for what to do about it... I think sssd should be more robust when the configured DNS server behaves badly. But the problem is really in the locator plugin: as long as the plugin relies on NSS for hostname resolution, we're at the mercy of NSS plugins that behave weirdly or take a very long time to return a result.

Maybe, for the purpose of discovering whether server X is online, sssd could do some hacky stuff to force ldap_child to only use server X for Kerberos as well. It wouldn't fix the performance of the locator plugin in general, but it would let sssd stay online in the presence of an uncooperative or unreliable DNS server.

sumit-bose commented 3 months ago

Hi,

thanks, I agree with your analysis. You might want to try to set

krb5_kdcinfo_lookahead = 1:0

which should only add the IP address of the KDC currently selected by SSSD to the kdcinfo file.
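
With that setting, the kdcinfo file should end up holding just the currently selected KDC's address; based on the file shown earlier, something like:

    # cat /var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM
    203.0.111.60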

Originally only this IP address was in the kdcinfo file. But this caused issues for other processes using Kerberos if the KDC that the IP address pointed to went offline and SSSD didn't run any Kerberos operations to detect this. Unfortunately, if the reply from the locator plugin does not work, libkrb5 does not fall back to other means of locating a KDC. So we added some other KDCs to the kdcinfo file as well to offer a larger selection. Currently those extra KDCs are added by name.

The locator plugin in general was added to allow the client to stick to a single KDC for as long as possible. This is important, e.g., in larger environments where password changes need some time to spread.

bye, Sumit

yrro commented 3 months ago

That makes sense, thanks. With the locator plugin enabled and krb5_kdcinfo_lookahead = 1:0 sssd is able to authenticate to ipa3's LDAP server and stay online.

Is it worth populating the additional entries in the kdcinfo file only once their names can be resolved? That way the IP address entry would always be there, but in my case the entries for ipa5/ipa6 would only show up once sssd had been able to resolve them. Where I currently am, they never resolve, so sssd would never add them, and the locator plugin then wouldn't cause a problem for sssd or other clients.