Open yrro opened 1 year ago
I've tried to reproduce this in a lab where I can block queries for individual DNS names and I can't get sssd to go into offline mode. I think this is because the blocking method I'm using causes my DNS server to return an error to the client. But that's not reproducing the same condition that I'm seeing on the real network. I need to figure out a way to have my DNS server drop the query and not provide a response to the client.
My suspicion is that SSSD gets stuck in offline mode if its attempt to resolve one of the IPA server hostnames times out, but not if it fails within deadline.
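One way to reproduce the "no response at all" condition in a lab is a tiny stub resolver that forwards every query upstream except the names under test, which it silently drops so the client only ever sees a timeout. A minimal sketch (the blocked names, bind address and upstream resolver are illustrative, not from this thread):

```python
import socket

def parse_qname(packet: bytes) -> str:
    """Extract the query name from a raw DNS query packet."""
    labels = []
    i = 12  # skip the fixed 12-byte DNS header
    while packet[i] != 0:
        n = packet[i]
        labels.append(packet[i + 1:i + 1 + n].decode("ascii"))
        i += 1 + n
    return ".".join(labels)

BLOCKED = {"ipa5.example.com", "ipa6.example.com"}  # names to break
UPSTREAM = ("9.9.9.9", 53)                          # any real resolver

def serve(bind=("127.0.0.1", 5353)):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(bind)
    while True:
        query, client = sock.recvfrom(512)
        if parse_qname(query) in BLOCKED:
            continue  # silently drop: no error, the client just times out
        fwd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        fwd.settimeout(3)
        fwd.sendto(query, UPSTREAM)
        reply, _ = fwd.recvfrom(4096)
        sock.sendto(reply, client)
```

Pointing the test host's resolver at this (e.g. nameserver 127.0.0.1 with the stub on port 53) should reproduce the timeout rather than the SERVFAIL/NXDOMAIN behaviour described above.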
Hi @thalman,
I think the problem is with the part of SSSD that tries to write out the /var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM file. If I set krb5_use_kdcinfo to False then I do not observe the queries from step 3 above, and SSSD stays in online mode. So it looks like if any of the IPA servers for a domain can't be resolved, the 'write out kdcinfo' process fails and SSSD goes into offline mode.

Does this ^^ ring a bell to you?
> I think the problem is with the part of SSSD that tries to write out the /var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM file. If I set krb5_use_kdcinfo to False then I do not observe the queries from step 3 above, and SSSD stays in online mode. So it looks like if any of the IPA servers for a domain can't be resolved, the 'write out kdcinfo' process fails and SSSD goes into offline mode.
>
> Does this ^^ ring a bell to you?
If I recall correctly, writing the kdcinfo file does not perform DNS resolution. It uses IP addresses if the resolution has already been done by the failover mechanism. If the name is not resolved yet, then it is written as an FQDN.

If there is a device dropping packets, then tuning timeouts might help. I would try to shorten dns_resolver_op_timeout and see if that helps.
HTH
> If there is a device dropping packets, then tuning timeouts might help. I would try to shorten dns_resolver_op_timeout and see if that helps.
It defaults to 3, which is pretty low, but lowering it to 1 (the lowest possible value) doesn't help here.
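For reference, the resolver tuning discussed here lives in the [domain] section of sssd.conf; a sketch (domain name illustrative, defaults as reported in this thread and the sssd.conf man page):

```ini
[domain/example.com]
# per-attempt timeout for a single DNS request (default 3 seconds)
dns_resolver_op_timeout = 1
# overall timeout for resolving one hostname (default 6 seconds)
dns_resolver_timeout = 4
```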
> I think the problem is with the part of SSSD that tries to write out the /var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM file. If I set krb5_use_kdcinfo to False then I do not observe the queries from step 3 above, and SSSD stays in online mode.
It's a couple of years later and the behaviour of sssd (2.9.5) is a bit different.
First of all, sssd takes a long time (upwards of 90 seconds) to start up (to be precise, until it tells systemd that it has finished starting up). This looks like a DNS A query for $HOSTNAME that the ISP's DNS server forwards to my IPA domain's DNS server, which responds with an answer containing an RFC 1918 address, causing the ISP's DNS server to drop the transaction, sending no reply to sssd. If I put an entry for $HOSTNAME into /etc/hosts then it starts up instantly.
Second of all, I can see in the backend logs attempts to query DNS for the two IPA servers with RFC1918 addresses:
Failed to resolve server 'ipa6.example.com': Timeout while contacting DNS servers
Marking server 'ipa6.example.com' as 'not working'
Couldn't resolve server (ipa6.example.com), resolver returned [5]: Input/output error
which is fine, because there's a third server that should be working. But then we end up with:
(2024-06-13 9:26:55): [be[example.com]] [be_resolve_server_process] (0x1000): [RID#5] Trying with the next one!
(2024-06-13 9:26:55): [be[example.com]] [fo_resolve_service_send] (0x0100): [RID#5] Trying to resolve service 'IPA'
(2024-06-13 9:26:55): [be[example.com]] [get_server_status] (0x1000): [RID#5] Status of server 'ipa6.example.com' is 'not working'
(2024-06-13 9:26:55): [be[example.com]] [get_server_status] (0x1000): [RID#5] Status of server 'ipa3.example.com' is 'name resolved'
(2024-06-13 9:26:55): [be[example.com]] [get_port_status] (0x1000): [RID#5] Port status of port 389 for server 'ipa3.example.com' is 'not working'
(2024-06-13 9:26:55): [be[example.com]] [get_port_status] (0x0080): [RID#5] SSSD is unable to complete the full connection request, this internal status does not necessarily indicate network port issues.
(2024-06-13 9:26:55): [be[example.com]] [get_server_status] (0x1000): [RID#5] Status of server 'ipa5.example.com' is 'not working'
(2024-06-13 9:26:55): [be[example.com]] [get_server_status] (0x1000): [RID#5] Status of server 'ipa6.example.com' is 'not working'
(2024-06-13 9:26:55): [be[example.com]] [fo_resolve_service_send] (0x0020): [RID#5] No available servers for service 'IPA'
(2024-06-13 9:26:55): [be[example.com]] [be_resolve_server_done] (0x1000): [RID#5] Server [NULL] resolution failed: [5]: Input/output error
(2024-06-13 9:26:55): [be[example.com]] [dp_req_done] (0x0400): [RID#5] DP Request [Online Check #5]: Request handler finished [0]: Success
(2024-06-13 9:26:55): [be[example.com]] [dp_req_done] (0x20000): [RID#5] DP Request [Online Check #5]: Handling request took [2125.513] milliseconds.
(2024-06-13 9:26:55): [be[example.com]] [_dp_req_recv] (0x0400): [RID#5] DP Request [Online Check #5]: Receiving request data.
(2024-06-13 9:26:55): [be[example.com]] [dp_req_destructor] (0x0400): [RID#5] DP Request [Online Check #5]: Request removed.
(2024-06-13 9:26:55): [be[example.com]] [dp_req_destructor] (0x0400): [RID#5] Number of active DP request: 0
(2024-06-13 9:26:55): [be[example.com]] [be_check_online_done] (0x0400): [RID#5] Backend is offline
The logs don't say why sssd isn't able to talk to ipa3.example.com. It resolves fine. I can connect to it with ldapwhoami fine. But according to tcpdump, sssd doesn't even try to connect to it: no outgoing TCP packets are seen to port 389 or 636.
Third, disabling krb5_use_kdcinfo doesn't help any more. But as you said above, perhaps it never did and I was mistaken about this.
I'll continue to debug this later...
Ok, I realise the problem.
To reduce noise in the logs, I set ipa_server = ipa3.example.com to disable DNS service discovery.
(2024-06-13 9:36:42): [be[example.com]] [set_server_common_status] (0x0100): [RID#1] Marking server 'ipa3.example.com' as 'name resolved'
(2024-06-13 9:36:42): [be[example.com]] [be_resolve_server_process] (0x1000): [RID#1] Saving the first resolved server
(2024-06-13 9:36:42): [be[example.com]] [be_resolve_server_process] (0x0200): [RID#1] Found address for server ipa3.example.com: [2a01:4f8:c17:d9c6::1] TTL 996
(2024-06-13 9:36:42): [be[example.com]] [ipa_resolve_callback] (0x0400): [RID#1] Constructed uri 'ldap://ipa3.example.com'
(2024-06-13 9:36:42): [be[example.com]] [sssd_async_socket_init_send] (0x4000): [RID#1] Using file descriptor [25] for the connection.
(2024-06-13 9:36:42): [be[example.com]] [sssd_async_connect_send] (0x0020): [RID#1] connect failed [101][Network is unreachable].
(2024-06-13 9:36:42): [be[example.com]] [sssd_async_socket_init_send] (0x0400): [RID#1] Setting 6 seconds timeout [ldap_network_timeout] for connecting
(2024-06-13 9:36:42): [be[example.com]] [sssd_async_socket_init_done] (0x0040): [RID#1] sdap_async_sys_connect request failed: [101]: Network is unreachable.
(2024-06-13 9:36:42): [be[example.com]] [sssd_async_socket_state_destructor] (0x0400): [RID#1] closing socket [25]
(2024-06-13 9:36:42): [be[example.com]] [sss_ldap_init_sys_connect_done] (0x0020): [RID#1] sssd_async_socket_init request failed: [101]: Network is unreachable.
(2024-06-13 9:36:42): [be[example.com]] [sdap_sys_connect_done] (0x0020): [RID#1] sdap_async_connect_call request failed: [101]: Network is unreachable.
(2024-06-13 9:36:42): [be[example.com]] [sdap_handle_release] (0x2000): [RID#1] Trace: sh[0x55aa316eefa0], connected[0], ops[(nil)], ldap[(nil)], destructor_lock[0], release_memory[0]
(2024-06-13 9:36:42): [be[example.com]] [_be_fo_set_port_status] (0x8000): [RID#1] Setting status: PORT_NOT_WORKING. Called from: src/providers/ldap/sdap_async_connection.c: sdap_cli_connect_done: 1665
(2024-06-13 9:36:42): [be[example.com]] [fo_set_port_status] (0x0100): [RID#1] Marking port 0 of server 'ipa3.example.com' as 'not working'
(2024-06-13 9:36:42): [be[example.com]] [fo_set_port_status] (0x0400): [RID#1] Marking port 0 of duplicate server 'ipa3.example.com' as 'not working'
(2024-06-13 9:36:42): [be[example.com]] [fo_resolve_service_send] (0x0100): [RID#1] Trying to resolve service 'IPA'
(2024-06-13 9:36:42): [be[example.com]] [get_server_status] (0x1000): [RID#1] Status of server 'ipa3.example.com' is 'name resolved'
(2024-06-13 9:36:42): [be[example.com]] [get_port_status] (0x1000): [RID#1] Port status of port 0 for server 'ipa3.example.com' is 'not working'
(2024-06-13 9:36:42): [be[example.com]] [get_port_status] (0x0080): [RID#1] SSSD is unable to complete the full connection request, this internal status does not necessarily indicate network port issues.
(2024-06-13 9:36:42): [be[example.com]] [fo_resolve_service_send] (0x0020): [RID#1] No available servers for service 'IPA'
(2024-06-13 9:36:42): [be[example.com]] [be_resolve_server_done] (0x1000): [RID#1] Server [NULL] resolution failed: [5]: Input/output error
(2024-06-13 9:36:42): [be[example.com]] [sdap_id_op_connect_done] (0x0040): [RID#1] Failed to connect, going offline (5 [Input/output error])
(2024-06-13 9:36:42): [be[example.com]] [be_mark_offline] (0x2000): [RID#1] Going offline!
So the problem is that ipa3.example.com has both an AAAA record and an A record, but the ISP I'm on doesn't provide IPv6 connectivity. And sssd only tries to connect to the IPv6 address; it doesn't fall back to trying the IPv4 address.

So the workaround for that is to add an entry for the server's IPv4 address to /etc/hosts. With that, sssd goes and stays online!
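For comparison, the usual client-side pattern is to walk the whole getaddrinfo() result list instead of giving up after the first unreachable address. A minimal sketch of that fallback (this is not sssd's actual code, just an illustration of the behaviour that's missing here):

```python
import socket

def connect_with_fallback(host, port, timeout=5.0):
    """Try every address returned for host in order, so an
    unreachable IPv6 address doesn't mask a working IPv4 one."""
    last_err = None
    for family, type_, proto, _, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        s = socket.socket(family, type_, proto)
        s.settimeout(timeout)
        try:
            s.connect(addr)
            return s  # first address that actually accepts wins
        except OSError as e:
            last_err = e
            s.close()
    raise last_err or OSError("no addresses for %r" % host)
```

With this pattern, a host whose AAAA record points into an unroutable network still gets connected via its A record.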
I then set ipa_server = _srv_ and restarted sssd, and it returned to going offline. I'll continue investigating that later. But in the meantime I'll summarize what I've found. These are all probably issues to raise separately; I'll do that later.

- If DNS queries for $HOSTNAME go unanswered, then sssd takes a very long time to start up (I'll raise a separate issue for this)
- If the server named by ipa_server has both AAAA and A records, sssd will try to connect to the IPv6 address; if this fails it considers the server offline, instead of trying the IPv4 address (I'll raise a separate issue for this)
- If ipa_servers = _srv_, and the configured DNS server does not respond to queries for A records for some of the discovered servers, sssd goes offline, even though if you set ipa_server = working-server.example.com, sssd stays online (this issue)

I had lookup_family_order = ipv6_first in my config file, as a holdover from debugging this back when I first opened it. I've now removed it and done another pass at looking into ipa_servers = _srv_ causing sssd to get stuck in offline mode. Fortunately I don't think that setting is relevant to the failure.
With debug_level = 10 things are a little clearer...
==> /var/log/sssd/sssd_example.com.log <==
[...]
(2024-06-13 14:21:57): [be[example.com]] [sdap_get_rootdse_done] (0x2000): [RID#1] Got rootdse
[...]
(2024-06-13 14:21:57): [be[example.com]] [be_resolve_server_process] (0x0200): [RID#1] Found address for server ipa3.example.com: [203.0.111.60] TTL 7200
(2024-06-13 14:21:57): [be[example.com]] [sdap_kinit_kdc_resolved] (0x1000): [RID#1] KDC resolved, attempting to get TGT...
(2024-06-13 14:21:57): [be[example.com]] [create_tgt_req_send_buffer] (0x0400): [RID#1] buffer size: 72
(2024-06-13 14:21:57): [be[example.com]] [child_handler_setup] (0x2000): [RID#1] Setting up signal handler up for pid [356389]
(2024-06-13 14:21:57): [be[example.com]] [child_handler_setup] (0x2000): [RID#1] Signal handler set up for pid [356389]
(2024-06-13 14:21:57): [be[example.com]] [set_tgt_child_timeout] (0x0400): [RID#1] Setting 8 seconds timeout for TGT child
(2024-06-13 14:21:57): [be[example.com]] [sdap_process_result] (0x2000): Trace: sh[0x5555b17dea50], connected[1], ops[(nil)], ldap[0x5555b17d7700]
(2024-06-13 14:21:57): [be[example.com]] [sdap_process_result] (0x2000): Trace: end of ldap_result list
(2024-06-13 14:21:57): [be[example.com]] [_write_pipe_handler] (0x0400): [RID#1] All data has been sent!
==> /var/log/sssd/ldap_child.log <==
(2024-06-13 14:21:57): [ldap_child[356389]] [main] (0x0400): ldap_child started.
(2024-06-13 14:21:57): [ldap_child[356389]] [main] (0x2000): context initialized
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x1000): total buffer size: 72
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x1000): realm_str size: 17
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x1000): got realm_str: EXAMPLE.COM
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x1000): princ_str size: 31
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x1000): got princ_str: host/myhost.example.com
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x1000): keytab_name size: 0
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x1000): lifetime: 86400
(2024-06-13 14:21:57): [ldap_child[356389]] [unpack_buffer] (0x0200): Will run as [0][0].
(2024-06-13 14:21:57): [ldap_child[356389]] [privileged_krb5_setup] (0x2000): Kerberos context initialized
(2024-06-13 14:21:57): [ldap_child[356389]] [main] (0x2000): Kerberos context initialized
(2024-06-13 14:21:57): [ldap_child[356389]] [become_user] (0x0200): Trying to become user [0][0].
(2024-06-13 14:21:57): [ldap_child[356389]] [become_user] (0x0200): Already user [0].
(2024-06-13 14:21:57): [ldap_child[356389]] [main] (0x2000): Running as [0][0].
(2024-06-13 14:21:57): [ldap_child[356389]] [main] (0x2000): getting TGT sync
(2024-06-13 14:21:57): [ldap_child[356389]] [ldap_child_get_tgt_sync] (0x2000): got realm_name: [EXAMPLE.COM]
(2024-06-13 14:21:57): [ldap_child[356389]] [ldap_child_get_tgt_sync] (0x0100): Principal name is: [host/myhost.example.com@EXAMPLE.COM]
(2024-06-13 14:21:57): [ldap_child[356389]] [ldap_child_get_tgt_sync] (0x0100): Using keytab [MEMORY:/etc/krb5.keytab]
(2024-06-13 14:21:57): [ldap_child[356389]] [ldap_child_get_tgt_sync] (0x0100): Will canonicalize principals
(2024-06-13 14:21:57): [ldap_child[356389]] [sss_child_krb5_trace_cb] (0x4000): [356389] 1718284917.303677: Getting initial credentials for host/myhost.example.com@EXAMPLE.COM
(2024-06-13 14:21:57): [ldap_child[356389]] [sss_child_krb5_trace_cb] (0x4000): [356389] 1718284917.303678: Found entries for host/myhost.example.com@EXAMPLE.COM in keytab: aes256-cts, aes128-cts
(2024-06-13 14:21:57): [ldap_child[356389]] [sss_child_krb5_trace_cb] (0x4000): [356389] 1718284917.303680: Sending unauthenticated request
(2024-06-13 14:21:57): [ldap_child[356389]] [sss_child_krb5_trace_cb] (0x4000): [356389] 1718284917.303681: Sending request (216 bytes) to EXAMPLE.COM
==> /var/log/sssd/sssd_example.com.log <==
(2024-06-13 14:21:58): [be[example.com]] [be_run_unconditional_online_cb] (0x4000): List of unconditional online callbacks is empty, nothing to do.
(2024-06-13 14:21:58): [be[example.com]] [check_if_online_delayed] (0x2000): Backend is already online, nothing to do.
(2024-06-13 14:22:05): [be[example.com]] [get_tgt_timeout_handler] (0x4000): [RID#1] timeout for sending SIGTERM to TGT child [356389] reached.
(2024-06-13 14:22:05): [be[example.com]] [get_tgt_timeout_handler] (0x0400): [RID#1] Setting 2 seconds timeout for sending SIGKILL to TGT child
(2024-06-13 14:22:05): [be[example.com]] [_read_pipe_handler] (0x0400): [RID#1] EOF received, client finished
(2024-06-13 14:22:05): [be[example.com]] [child_sig_handler] (0x1000): [RID#1] Waiting for child [356389].
(2024-06-13 14:22:05): [be[example.com]] [child_sig_handler] (0x0020): [RID#1] child [356389] failed with status [7].
(2024-06-13 14:22:05): [be[example.com]] [child_callback] (0x0020): [RID#1] LDAP child was terminated due to timeout
(2024-06-13 14:22:05): [be[example.com]] [sdap_kinit_done] (0x0080): [RID#1] Communication with KDC timed out, trying the next one
(2024-06-13 14:22:05): [be[example.com]] [_be_fo_set_port_status] (0x8000): [RID#1] Setting status: PORT_NOT_WORKING. Called from: src/providers/ldap/sdap_async_connection.c: sdap_kinit_done: 1269
(2024-06-13 14:22:05): [be[example.com]] [fo_set_port_status] (0x0100): [RID#1] Marking port 389 of server 'ipa3.example.com' as 'not working'
(2024-06-13 14:22:05): [be[example.com]] [fo_set_port_status] (0x0400): [RID#1] Marking port 389 of duplicate server 'ipa3.example.com' as 'not working'
So it looks like a problem with ldap_child getting stuck and then getting killed, resulting in ipa3 being marked as not responding.
I'm running wireshark while starting sssd up and I see the following:

- DNS SRV query for _ldap._tcp.example.com
- DNS A query for ipa3.example.com
- LDAP query of ipa3.example.com for root DSE (connection held open)
- DNS A/AAAA queries for ipa5.example.com and ipa6.example.com

What's missing is any Kerberos traffic: ldap_child isn't sending anything on either UDP or TCP port 88. But the last thing it logged before being terminated looks exactly like KRB5_TRACE style output from libkrb5... so why doesn't it actually send anything?
Hi,
the "Sending request (216 bytes) to EXAMPLE.COM" message indicates that libkrb5 wants to send a request to a KDC from the EXAMPLE.COM realm. Given the old discussion I assume you have set krb5_use_kdcinfo = False, so libkrb5 will try DNS SRV and/or TXT lookups, depending on the settings in /etc/krb5.conf, or look at the corresponding [realms] section in /etc/krb5.conf directly.

Are ipa5.example.com and ipa6.example.com listed as KDCs for EXAMPLE.COM in /etc/krb5.conf?
Maybe you can try to attach strace with the --follow-forks option to the sssd_be process to get more details about what libkrb5 is trying to do at this stage.
bye, Sumit
Thanks for taking a look at this again.
I've been running all along with krb5_use_kdcinfo unset, so the plugin has been enabled throughout today's debugging.

Here's a backtrace of ldap_child after it logs "Sending request (216 bytes) to IPA.EXAMPLE.COM":
#0 0x00007fb4f80d8c0b in ppoll () from /lib64/libc.so.6
#1 0x00007fb4f7716e0f in ppoll_usec.constprop () from /lib64/libnss_resolve.so.2
#2 0x00007fb4f77101f3 in varlink_call.constprop () from /lib64/libnss_resolve.so.2
#3 0x00007fb4f7707a38 in _nss_resolve_gethostbyname4_r () from /lib64/libnss_resolve.so.2
#4 0x00007fb4f8113941 in getaddrinfo () from /lib64/libc.so.6
#5 0x00007fb4f7cbe1b4 in sssd_krb5_locator_lookup () from /usr/lib64/krb5/plugins/libkrb5/sssd_krb5_locator_plugin.so
#6 0x00007fb4f822718a in locate_server () from /lib64/libkrb5.so.3
#7 0x00007fb4f8227ef0 in k5_locate_server () from /lib64/libkrb5.so.3
#8 0x00007fb4f822a1b8 in krb5_sendto_kdc () from /lib64/libkrb5.so.3
#9 0x00007fb4f8201a22 in k5_init_creds_get () from /lib64/libkrb5.so.3
#10 0x00007fb4f820430f in get_init_creds_keytab () from /lib64/libkrb5.so.3
#11 0x00007fb4f82043a9 in krb5_get_init_creds_keytab () from /lib64/libkrb5.so.3
#12 0x000055e3fa193192 in main ()
So I've just tested again: after setting krb5_use_kdcinfo = false, sssd stays online!

I should have tried this option earlier, but trust me, there were a lot of other combinations of options I tested to try to pin this down; I just hadn't tried krb5_use_kdcinfo yet.
BTW, from the man page:

> If the environment variable SSSD_KRB5_LOCATOR_IGNORE_DNS_FAILURES is set to any value plugin will try to resolve all DNS names in kdcinfo file. By default plugin returns KRB5_PLUGIN_NO_HANDLE to the caller immediately on first DNS resolving failure.
I dunno if returning KRB5_PLUGIN_NO_HANDLE is good, neutral or bad in a case like mine where effectively at least one server can't be resolved. Does it just disable use of the plugin, with libkrb5 falling back to classic behaviour?

Anyway, in this case it doesn't really matter: it looks like the plugin is taking too long to run, so the backend kills off the ldap_child before the plugin has a chance to finish.

I can't find where this 8 second timeout is set, but if there's a setting you know of then I'll try increasing the timeout and leaving the locator plugin enabled.
For the record, when the plugin is enabled, here's what the kdcinfo file looks like:
# cat /var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM
203.0.111.60
ipa6.example.com
ipa5.example.com
And with that file, KRB5_TRACE=/dev/stderr kinit hangs in a way very similar to ldap_child:
[374953] 1718291644.881060: Getting initial credentials for yrro@EXAMPLE.COM
[374953] 1718291644.881062: Sending unauthenticated request
[374953] 1718291644.881063: Sending request (187 bytes) to EXAMPLE.COM
[374953] 1718291657.802752: Sending initial UDP request to dgram 203.0.111.60:88
[374953] 1718291657.802753: Received answer (253 bytes) from dgram 203.0.111.60:88
[374953] 1718291658.552102: Response was from primary KDC
... note the 13 second delay between lines 3 and 4 of the above output.
- DNS SRV query for _ldap.tcp.example.com
- DNS A query for ipa3.example.com
- LDAP query of ipa3.example.com for root DSE (connection held open)
- DNS A/AAAA queries for ipa5.example.com and ipa6.example.com
- (10 seconds after the prior root DSE query) unbind request followed by TCP connection close
So the queries in no. 4 are actually from the locator plugin, and they are made via NSS, not c-ares. If I add addresses for ipa{5,6}.example.com to /etc/hosts then the locator plugin completes immediately.
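The stall is easy to demonstrate from the plugin's point of view: getaddrinfo() goes through NSS, so a name whose DNS replies are silently dropped blocks the caller for the full resolver timeout, which is longer than the 8 seconds the backend allows the TGT child. A small timing sketch (hostnames are the illustrative ones from this thread):

```python
import socket
import time

def time_lookup(name):
    """Measure how long the system resolver (NSS/getaddrinfo) takes
    for one name; a name that never gets a DNS reply blocks here for
    the full resolver timeout, just like the locator plugin does."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(name, None)
        outcome = "resolved"
    except socket.gaierror:
        outcome = "failed"
    return outcome, time.monotonic() - start

if __name__ == "__main__":
    for name in ("ipa3.example.com", "ipa5.example.com",
                 "ipa6.example.com"):
        outcome, secs = time_lookup(name)
        print(f"{name}: {outcome} after {secs:.1f}s")
```

On a network where responses are dropped rather than refused, the blocked names report double-digit durations while the working one returns almost instantly.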
So here's what I think happens: sssd connects to ipa3 and writes the kdcinfo file with ipa3's address plus the names of ipa5 and ipa6; when ldap_child asks libkrb5 for a TGT, the locator plugin tries to resolve those names via NSS; because the DNS responses are dropped, those lookups block until the 8 second TGT child timeout expires, the backend kills ldap_child, the server is marked 'not working', and sssd goes offline.
As for what to do about it... I think sssd should be more robust when the configured DNS server behaves badly. But the problem is really in the locator plugin: as long as the plugin relies on NSS for hostname resolution, we're at the mercy of NSS plugins that behave weirdly or take a very long time to return a result.
Maybe, for the purpose of discovering whether server X is online, sssd could do some hacky stuff to force the ldap_child to only use server X for Kerberos as well. It wouldn't fix the performance of the locator plugin in general, but it would let sssd stay online in the presence of an un-cooperative or unreliable DNS server.
Hi,
thanks, I agree with your analysis. You might want to try to set krb5_kdcinfo_lookahead = 1:0, which should only add the IP address of the KDC currently selected by SSSD to the kdcinfo file.
Originally only this IP address was in the kdcinfo file. But this caused issues for other processes using Kerberos if the KDC the IP address was pointing to went offline and SSSD didn't run any Kerberos operations to detect this. Unfortunately, if the reply from the locator plugin does not work, libkrb5 does not fall back to other means to detect a KDC. So we added some other KDCs to the kdcinfo file as well to offer a larger selection. Currently those extra KDCs are added by name.
The locator plugin in general was added to allow the client to stick to a single KDC as long as possible. This is important, e.g., in a larger environment where password changes need some time to spread.
bye, Sumit
That makes sense, thanks. With the locator plugin enabled and krb5_kdcinfo_lookahead = 1:0, sssd is able to authenticate to ipa3's LDAP server and stay online.
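For anyone landing here later, the two workarounds confirmed in this thread would look roughly like this in sssd.conf (domain name illustrative):

```ini
[domain/example.com]
# Workaround A: write only the currently selected KDC's IP address to
# the kdcinfo file, with no extra lookahead entries added by name.
krb5_kdcinfo_lookahead = 1:0

# Workaround B (heavier-handed): stop writing kdcinfo files entirely
# and let libkrb5 locate KDCs on its own.
# krb5_use_kdcinfo = False
```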
Is it worth populating the additional entries in the kdcinfo file only once their names have been resolved? The IP address entry would always be there, but the entries for ipa5/ipa6 would only show up once sssd has been able to resolve them. In my case, where I currently am, they never resolve, so sssd would never add them, and the locator plugin then wouldn't cause a problem for sssd or other clients.
My ISP inspects DNS traffic. Any responses with A records in the answer section for RFC 1918 addresses are dropped. I believe this is being done by the router they provide, because I see this behaviour when querying DNS servers on the internet as well as my ISP's own servers (i.e., dig @1.1.1.1 10-0-0-1.nip.io times out). This causes SSSD to go into offline mode shortly after it starts up.
I'm seeing the following behaviour:

1. SSSD sends a DNS SRV query for _ldap._tcp.example.com.
2. SSSD receives the correct response (specifically, the answer section contains 3 SRV records and the additional section contains A records for each of the SRV records, with a mixture of public and RFC 1918 addresses). Here ipa3.example.com has a public IP address and the other two servers have RFC 1918 addresses.
3. SSSD sends A and AAAA queries for each of the names in the answer section.
4. An answer is received for ipa3.example.com; no answers are received for the other two names.

I think the problem is with the part of SSSD that tries to write out the /var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM file. If I set krb5_use_kdcinfo to False then I do not observe the queries from step 3 above, and SSSD stays in online mode. So it looks like if any of the IPA servers for a domain can't be resolved, the 'write out kdcinfo' process fails and SSSD goes into offline mode.
What I'd like to see is sssd handling this situation and continuing to talk to ipa3.example.com even though ipa5.example.com and ipa6.example.com are not resolvable. I.e., a bit like what SSSD_KRB5_LOCATOR_IGNORE_DNS_FAILURES does in the Kerberos locator plugin, but controlling the behaviour of SSSD itself.

Seen with sssd-2.7.4-1.fc37.x86_64.rpm on Fedora 37 and sssd 2.4.1-2 on Debian 11 "bullseye".