SSSD / sssd

A daemon to manage identity, authentication and authorization for centrally-managed systems.
https://sssd.io
GNU General Public License v3.0

Provider sporadically considered Offline and never restores #7045

Open vudex opened 10 months ago

vudex commented 10 months ago

Hello,

I am encountering a persistent issue with sssd intermittently identifying the ipa backend as offline and failing to return online. Initially, I temporarily resolved this by restarting the service, but the problem persists without a permanent solution. I am reluctant to restart the service each time a user encounters this issue.

When sssd indicates in the logs that the backend is offline, I can still successfully execute 'id' and 'kinit' commands for the affected user. The 'id' command retrieves the actual groups stored in FreeIPA, confirming that FreeIPA is operational and healthy. However, sssd seems to disagree and indicates otherwise.
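For illustration, the checks I mean look roughly like this (run on the client, with 'test-user-ssh' being the affected account):

id test-user-ssh      # returns the user's actual FreeIPA groups
kinit test-user-ssh   # obtains a Kerberos ticket without any error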

I've provided a link to a comprehensive log file containing all entries from /var/log/sssd/ during the SSH login attempt for the 'test-user-ssh':

sssd_all.log

My system configurations are as follows:

sssd version: sssd-2.6.1-alt2.x86_64
freeipa-client version: freeipa-client-4.8.9-alt4.c9f2.3.x86_64

Here is a snippet of my sssd.conf file, in its default state post ipa-client-install:

[domain/custom.in-realm.domain]
id_provider = ipa
ipa_server = ipa-01.my-realm.internal
ipa_domain = custom.in-realm.domain
ipa_hostname = studio-01.custom.in-realm.domain
krb5_realm = MY-REALM.INTERNAL
auth_provider = ipa
chpass_provider = ipa
access_provider = ipa
cache_credentials = True
ldap_tls_cacert = /etc/ipa/ca.crt
krb5_store_password_if_offline = True

[sssd]
config_file_version = 2
services = nss, pam, ssh, sudo
user = _sssd
domains = custom.in-realm.domain

[nss]
[ssh]
[sudo]

Any insights or assistance in resolving this recurring sssd issue would be greatly appreciated.

sumit-bose commented 10 months ago

Hi,

unfortunately the most important part of the logs, the backend log, is missing. Would it be possible to attach the full /var/log/sssd/sssd_custom.in-realm.domain.log file?

bye, Sumit

vudex commented 10 months ago

> Hi,
>
> unfortunately the most important part of the logs, the backend log, is missing. Would it be possible to attach the full /var/log/sssd/sssd_custom.in-realm.domain.log file?
>
> bye, Sumit

Well, thank you for the quick reply. The point is, I used tail -f /var/log/sssd/*, so if there is no sssd_be.log here then it wasn't writing anything during my ssh connection. Maybe it uses a buffer, though, and writing to the log is delayed.

Yesterday, after I gathered the logs, I restarted sssd and was able to successfully log in to the host. But today the problem reappeared.

pam_sss(sshd:auth): received for user ***: 4 (System error)

I'm attaching sssd_be.log. It's quite big for just one second. I tried to remove sensitive info. Maybe it would help if you could explain other approaches to debugging, or what exactly to look for in those logs. I can't see anything suspicious in them.

sssd_be2.log

sumit-bose commented 10 months ago

Hi,

what I'm looking for is the reason why the backend switched into the offline state, which typically happens before you see issues like a failed login or similar. In sssd_be2.log only some group lookups by GID are recorded, and during that time the backend was online and was able to read data from the server. There is nothing related to the failed authentication attempt in the logs.

So maybe it can help if you grep the logs for 'Going offline' and check or send the messages before this to see what might be the reason for going offline.
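For example, something along these lines (adjust the path to your domain log):

grep -B 30 'Going offline' /var/log/sssd/sssd_custom.in-realm.domain.log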

bye, Sumit

vudex commented 10 months ago

Well, thank you for the quick responses, I understand a bit more now. What I see are some LDAP connection issues (maybe dirsrv hung during some heavy write operation? hard to know now).

But why does it never recover? Or maybe it does recover, but users are unable to log in until the service is restarted.

sssd_be_offline.log

I grepped those logs with

grep -ri -B 40 offline sssd_custom.in-realm.domain.log.1

sumit-bose commented 10 months ago

Hi,

thanks for the new log lines. There are 'LDAP child was terminated due to timeout' messages. Can you check ldap_child.log, especially for PIDs 1464433 and 1470320?
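For example, something like:

grep -E '1464433|1470320' /var/log/sssd/ldap_child.log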

bye, Sumit

andreboscatto commented 9 months ago

Hi @vudex,

Did you have a chance to take a look at it?

Kindly

vudex commented 8 months ago

@sumit-bose @andreboscatto Hi! We are still sporadically experiencing issues, but I couldn't investigate further on the hosts from which I attached logs earlier. I still cannot replicate the problem either: if I do a simple test of disabling the backend, then after enabling it again sssd turns back online immediately, as it should.
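Roughly what I did for that test (commands approximate, server name taken from my first config):

iptables -A OUTPUT -d ipa-01.my-realm.internal -j DROP    # cut the client off from the IPA server
sssctl domain-status custom.in-realm.domain               # reports the backend as offline after a short while
iptables -D OUTPUT -d ipa-01.my-realm.internal -j DROP    # restore connectivity
sssctl domain-status custom.in-realm.domain               # back online almost immediately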

Yesterday I encountered the same problem in a new environment, and I would like to discuss it here if I may.

So /etc/sssd/sssd.conf looks something like this:

[domain/pd40.other-domain.mtp]

id_provider = ipa
dns_discovery_domain = pd40.other-domain.mtp
ipa_server = ipa.pd40.gtp
ipa_domain = pd40.other-domain.mtp
ipa_hostname = host-01.pd40.other-domain.mtp
krb5_realm = PD40.SOL.MTP
auth_provider = ipa
chpass_provider = ipa
access_provider = ipa
cache_credentials = True
ldap_tls_cacert = /etc/ipa/ca.crt
krb5_store_password_if_offline = True
[sssd]
services = nss, pam, ssh, sudo

domains = pd40.other-domain.mtp
[nss]
homedir_substring = /home

[pam]

[sudo]

[autofs]

[ssh]

[pac]

[ifp]

[session_recording]

SSSD decided to resolve the backend and failed. Well, it happens; it should return online shortly after, I guess:

*  (2024-02-01 10:48:49): [be[pd40.other-domain.mtp]] [get_server_status] (0x0100): Hostname resolution expired, resetting the server status of 'ipa.pd40.gtp'

(2024-02-01 10:49:04): [be[pd40.other-domain.mtp]] [fo_resolve_service_done] (0x0020): Failed to resolve server 'ipa.pd40.gtp': Timeout while contacting DNS servers
(2024-02-01 10:49:06): [be[pd40.other-domain.mtp]] [fo_resolve_service_send] (0x0020): No available servers for service 'IPA'
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2024-02-01 10:49:04): [be[pd40.other-domain.mtp]] [fo_resolve_service_done] (0x0020): Failed to resolve server 'ipa.pd40.gtp': Timeout while contacting DNS servers
   *  (2024-02-01 10:49:05): [be[pd40.other-domain.mtp]] [set_server_common_status] (0x0100): Marking server 'ipa.pd40.gtp' as 'not working'
   *  (2024-02-01 10:49:06): [be[pd40.other-domain.mtp]] [be_resolve_server_process] (0x0080): Couldn't resolve server (ipa.pd40.gtp), resolver returned [5]: Input/output error
   *  (2024-02-01 10:49:06): [be[pd40.other-domain.mtp]] [be_resolve_server_process] (0x1000): Trying with the next one!
   *  (2024-02-01 10:49:06): [be[pd40.other-domain.mtp]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'IPA'
   *  (2024-02-01 10:49:06): [be[pd40.other-domain.mtp]] [get_server_status] (0x1000): Status of server 'ipa.pd40.gtp' is 'not working'
   *  (2024-02-01 10:49:06): [be[pd40.other-domain.mtp]] [get_server_status] (0x1000): Status of server 'ipa.pd40.gtp' is 'not working'
   *  (2024-02-01 10:49:06): [be[pd40.other-domain.mtp]] [fo_resolve_service_send] (0x0020): No available servers for service 'IPA'

After that I think SSSD tries to restart and re-read configuration:

(2024-02-01 11:02:18): [be[pd40.other-domain.mtp]] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
(2024-02-01 11:02:20): [be[pd40.other-domain.mtp]] [ipa_get_config_done] (0x0040): [RID#1] Unexpected number of results, expected 1, got 0.

We see here that the operation (ipa_get_config_done) failed because it returned 0 results instead of 1. So I guess sssd could not read the config here? During this operation several LDAP searches are executed, for example (from the backtrace that followed the ipa_get_config_done message):

   *  (2024-02-01 11:02:19): [be[pd40.other-domain.mtp]] [sdap_set_search_base] (0x0100): [RID#1] Setting option [ldap_ipnetwork_search_base] to [dc=pd40,dc=sol,dc=mtp].

And this base is correct: as you can see in the sssd.conf above, my realm is PD40.SOL.MTP, so my LDAP root is dc=pd40,dc=sol,dc=mtp.
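Just as a cross-check, the equivalent manual query with the correct base would be something like this (run from the client with a valid Kerberos ticket; shown purely for illustration):

ldapsearch -Y GSSAPI -H ldap://ipa.pd40.gtp -b 'cn=etc,dc=pd40,dc=sol,dc=mtp' '(&(cn=ipaConfig)(objectClass=ipaGuiConfig))'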

But in that backtrace I can find some searches that were executed with the wrong search base:

   *  (2024-02-01 11:02:19): [be[pd40.other-domain.mtp]] [sdap_get_generic_ext_step] (0x0400): [RID#1] calling ldap_search_ext with [(|(&(objectClass=ipaCertMapRule)(ipaEnabledFlag=TRUE))(objectClass=ipaCertMapConfigObject))][cn=certmap,dc=pd40,dc=other-domain,dc=mtp].

   *  (2024-02-01 11:02:19): [be[pd40.other-domain.mtp]] [sdap_get_generic_ext_step] (0x0400): [RID#1] calling ldap_search_ext with [no filter][cn=default,cn=views,cn=accounts,dc=pd40,dc=other-domain,dc=mtp].
   *  (2024-02-01 11:02:19): [be[pd40.other-domain.mtp]] [sdap_get_generic_ext_step] (0x1000): [RID#1] Requesting attrs: [ipaDomainResolutionOrder]
   *  (2024-02-01 11:02:20): [be[pd40.other-domain.mtp]] [sdap_get_generic_ext_step] (0x0400): [RID#1] calling ldap_search_ext with [(&(cn=ipaConfig)(objectClass=ipaGuiConfig))][cn=etc,dc=pd40,dc=other-domain,dc=mtp].
   *  (2024-02-01 11:02:20): [be[pd40.other-domain.mtp]] [sdap_get_generic_ext_step] (0x1000): [RID#1] Requesting attrs: [ipaDomainResolutionOrder]

Here we can see that the search base for some reason is dc=pd40,dc=other-domain,dc=mtp (the same as the host's domain).

And after that we see the final message:

(2024-02-01 11:02:20): [be[dc=pd40,dc=other-domain,dc=mtp]] [ipa_domain_resolution_order_done] (0x0040): [RID#1] Failed to get the domains' resolution order configuration from the server [22]: Invalid argument
(2024-02-01 11:02:20): [be[dc=pd40,dc=other-domain,dc=mtp]] [ipa_subdomains_handler_done] (0x0020): [RID#1] Unable to refresh subdomains [22]: Invalid argument
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
   *  (2024-02-01 11:02:20): [be[dc=pd40,dc=other-domain,dc=mtp]] [ipa_domain_resolution_order_done] (0x0040): [RID#1] Failed to get the domains' resolution order configuration from the server [22]: Invalid argument
   *  (2024-02-01 11:02:20): [be[dc=pd40,dc=other-domain,dc=mtp]] [ipa_domain_refresh_resolution_order_done] (0x0080): [RID#1] Unable to get the domains order resolution [22]: Invalid argument
   *  (2024-02-01 11:02:20): [be[dc=pd40,dc=other-domain,dc=mtp]] [sdap_id_op_done] (0x4000): [RID#1] releasing operation connection
   *  (2024-02-01 11:02:20): [be[dc=pd40,dc=other-domain,dc=mtp]] [ipa_domain_refresh_resolution_order_done] (0x0400): [RID#1] Unable to refresh subdomains [22]: Invalid argument
   *  (2024-02-01 11:02:20): [be[dc=pd40,dc=other-domain,dc=mtp]] [ipa_subdomains_handler_done] (0x0020): [RID#1] Unable to refresh subdomains [22]: Invalid argument

Could this be somehow related to the problem? Or maybe it is not such an important message?

vudex commented 8 months ago

This issue looks similar to my case, and the major versions of sssd match, though I would not say that the hosts are experiencing any load or memory issues: https://github.com/SSSD/sssd/issues/6803

vudex commented 8 months ago

Here, shouldn't we convert krb5_realm to the basedn instead of the domain section, in the IPA case?

sumit-bose commented 8 months ago

Hi,

what is the reason for using ipa_domain = pd40.other-domain.mtp? If you install FreeIPA the baseDN of the LDAP server is generated from the domain name you provide and not from the realm name.
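To make that concrete (a rough sketch of the mapping, not the installer's actual code):

echo "dc=$(echo pd40.sol.mtp | sed 's/\./,dc=/g')"            # -> dc=pd40,dc=sol,dc=mtp (the server's real baseDN)
echo "dc=$(echo pd40.other-domain.mtp | sed 's/\./,dc=/g')"   # -> dc=pd40,dc=other-domain,dc=mtp (what your ipa_domain yields)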

bye, Sumit

vudex commented 7 months ago

Hi, @sumit-bose. The reason is that the ipa-client-install command on the host was invoked with the host's domain. Maybe it is incorrect to do so, but in the end it doesn't affect anything besides producing a few log messages. I just thought there might be some corner case where the watchdog kills the process. I hung the sssd_be process (using gdb) and achieved termination by the watchdog, but the process just restarts normally.
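For reference, roughly how I hung the backend (details approximate):

gdb -p "$(pidof sssd_be)"   # attaching stops sssd_be; left stopped past the watchdog timeout, it gets terminated and started again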

All I can say is that the host is experiencing load during the problem, whether it is excessive CPU load, RAM exhaustion, or I/O latency. But I couldn't reproduce anything by loading a test host with stress-ng.

It is very unfortunate that I cannot reproduce the problem.

hexum commented 3 months ago

I have a similar issue:

Jun 05 00:40:45 eugene2.servers.bright.gdn sssd[3357484]: (2024-06-05  0:40:45): [be[BRIGHT.GDN]] [sbus_issue_request_done] (0x0040): sssd.dataprovider.getAccountInfo: Error [1432158212]: SSSD is offline
Jun 05 00:40:45 eugene2.servers.bright.gdn sssd[3357485]: (2024-06-05  0:40:45): [nss] [cache_req_common_process_dp_reply] (0x3f7c0): [CID#28217] CR #50033: Could not get account info [1432158212]: SSSD is offline

After an sssd restart it recovers.

sssd --version
2.9.4

Gentoo Linux Installed versions: 2.9.4 (05:50:04 PM 04/30/2024)(man netlink nls python sudo systemd -acl -doc -nfsv4 -samba -selinux -subid -systemtap -test ABI_MIPS="-n32 -n64 -o32" ABI_S390="-32 -64" ABI_X86="64 -32 -x32" PYTHON_SINGLE_TARGET="python3_11 -python3_10 -python3_12")

aneagoe commented 2 weeks ago

Running into this as well, on sssd version 2.9.5, Fedora 40. It happens only on some machines, but they're all configured via Ansible, so the sssd config looks identical everywhere, as do the versions. Restarting sssd resolves the issue. I even added reconnection_retries = 200 under the [sssd] section in the hope it would recover by itself, but it doesn't.
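i.e. something like this in /etc/sssd/sssd.conf (shown just for completeness):

[sssd]
reconnection_retries = 200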

jorhett commented 1 week ago

We're seeing the same issue. In our case, sssd_be is what fails (for reasons unclear) but the parent process isn't aware of the failure and does nothing to restart it.

Would it not be simpler to make each sssd daemon a separate systemd service, so that systemd can take care of them and restart individual daemons when they fail?
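Purely as an illustration of the idea (this unit does not exist in sssd today; the path and flags are guesses based on how sssd_be is normally spawned):

# hypothetical /etc/systemd/system/sssd-be@.service
[Unit]
Description=SSSD backend for domain %i
PartOf=sssd.service

[Service]
ExecStart=/usr/libexec/sssd/sssd_be --domain %i --uid 0 --gid 0
Restart=on-failure
RestartSec=5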