vudex opened this issue 10 months ago
Hi,
unfortunately the most important part of the logs, the backend log, is missing. Would it be possible to attach the full /var/log/sssd/sssd_custom.in-realm.domain.log file?
bye, Sumit
Well, thank you for the quick reply. The point is, I used tail -f /var/log/sssd/*, so if there is no sssd_be.log then nothing was being written to it during my ssh connection. Maybe it uses a buffer, though, and writing to the log is delayed.
Yesterday, after I gathered the logs, I restarted sssd and was able to log in to the host successfully. But today the problem reappeared.
pam_sss(sshd:auth): received for user ***: 4 (System error)
I'm attaching sssd_be.log. It's quite big for just one second. I tried to remove sensitive info. Maybe it would help if you could suggest other debugging approaches, or explain what exactly to look for in those logs. I can't see anything suspicious in them.
Hi,
what I'm looking for is the reason why the backend switched into the offline state, which typically happens before you see issues like a failed login. In sssd_be2.log only some group lookups by GID are recorded, and during that time the backend was online and able to read data from the server. There is nothing related to the failed authentication attempt in the logs.
So maybe it can help if you grep the logs for 'Going offline' and check or send the messages before this to see what might be the reason for going offline.
bye, Sumit
Well, thank you for the quick responses, I understand something now. What I see are some LDAP connection issues (maybe dirsrv hung during some heavy write operation? hard to know now).
But why is it never recovering? Or maybe it is recovering, but users are unable to log in until the service is restarted.
I grepped those logs with
grep -ri -B 40 offline sssd_custom.in-realm.domain.log.1
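For reference, the -B 40 above makes grep print the 40 lines preceding each match, which is what exposes the events leading up to going offline. A minimal, self-contained illustration of the same filtering, run against a fabricated three-line file rather than a real SSSD log:

```shell
# Illustration only: grep -i -B N shows the N lines *before* each
# (case-insensitive) match, revealing what happened just before.
log=$(mktemp)
printf '%s\n' \
  'resolving server' \
  'connection timed out' \
  'Going offline' > "$log"
grep -i -B 2 'going offline' "$log"   # prints all three lines
rm -f "$log"
```

On a real host the target would be the domain log, e.g. /var/log/sssd/sssd_custom.in-realm.domain.log and its rotated copies.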
Hi,
thanks for the new log lines. There are 'LDAP child was terminated due to timeout' messages. Can you check ldap_child.log, especially for PIDs 1464433 and 1470320?
bye, Sumit
Hi @vudex,
Did you have a chance to take a look at it?
Kindly
@sumit-bose @andreboscatto Hi! We are still sporadically experiencing issues, but I couldn't investigate further on the hosts from which I attached logs earlier. I still cannot replicate the problem either: if I do a simple test of disabling the backend, then after enabling it sssd turns back online immediately, as it should.
Yesterday I encountered the same problem in a new environment and I would like to discuss it here, if you don't mind.
So /etc/sssd/sssd.conf looks something like this:
[domain/pd40.other-domain.mtp]
id_provider = ipa
dns_discovery_domain = pd40.other-domain.mtp
ipa_server = ipa.pd40.gtp
ipa_domain = pd40.other-domain.mtp
ipa_hostname = host-01.pd40.other-domain.mtp
krb5_realm = PD40.SOL.MTP
auth_provider = ipa
chpass_provider = ipa
access_provider = ipa
cache_credentials = True
ldap_tls_cacert = /etc/ipa/ca.crt
krb5_store_password_if_offline = True
[sssd]
services = nss, pam, ssh, sudo
domains = pd40.other-domain.mtp
[nss]
homedir_substring = /home
[pam]
[sudo]
[autofs]
[ssh]
[pac]
[ifp]
[session_recording]
SSSD tried to resolve the backend server and failed. Well, it happens; it should return online shortly after, I guess:
* (2024-02-01 10:48:49): [be[pd40.other-domain.mtp]] [get_server_status] (0x0100): Hostname resolution expired, resetting the server status of 'ipa.pd40.gtp'
(2024-02-01 10:49:04): [be[pd40.other-domain.mtp]] [fo_resolve_service_done] (0x0020): Failed to resolve server 'ipa.pd40.gtp': Timeout while contacting DNS servers
(2024-02-01 10:49:06): [be[pd40.other-domain.mtp]] [fo_resolve_service_send] (0x0020): No available servers for service 'IPA'
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
* (2024-02-01 10:49:04): [be[pd40.other-domain.mtp]] [fo_resolve_service_done] (0x0020): Failed to resolve server 'ipa.pd40.gtp': Timeout while contacting DNS servers
* (2024-02-01 10:49:05): [be[pd40.other-domain.mtp]] [set_server_common_status] (0x0100): Marking server 'ipa.pd40.gtp' as 'not working'
* (2024-02-01 10:49:06): [be[pd40.other-domain.mtp]] [be_resolve_server_process] (0x0080): Couldn't resolve server (ipa.pd40.gtp), resolver returned [5]: Input/output error
* (2024-02-01 10:49:06): [be[pd40.other-domain.mtp]] [be_resolve_server_process] (0x1000): Trying with the next one!
* (2024-02-01 10:49:06): [be[pd40.other-domain.mtp]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'IPA'
* (2024-02-01 10:49:06): [be[pd40.other-domain.mtp]] [get_server_status] (0x1000): Status of server 'ipa.pd40.gtp' is 'not working'
* (2024-02-01 10:49:06): [be[pd40.other-domain.mtp]] [get_server_status] (0x1000): Status of server 'ipa.pd40.gtp' is 'not working'
* (2024-02-01 10:49:06): [be[pd40.other-domain.mtp]] [fo_resolve_service_send] (0x0020): No available servers for service 'IPA'
After that, I think SSSD tries to restart and re-read its configuration:
(2024-02-01 11:02:18): [be[pd40.other-domain.mtp]] [server_setup] (0x1f7c0): Starting with debug level = 0x0070
(2024-02-01 11:02:20): [be[pd40.other-domain.mtp]] [ipa_get_config_done] (0x0040): [RID#1] Unexpected number of results, expected 1, got 0.
Here we see that the ipa_get_config_done operation failed because it returned 0 results instead of the expected 1. So I guess sssd could not read the config here? During this operation several LDAP searches are executed, for example (from the backtrace that followed the ipa_get_config_done operation):
* (2024-02-01 11:02:19): [be[pd40.other-domain.mtp]] [sdap_set_search_base] (0x0100): [RID#1] Setting option [ldap_ipnetwork_search_base] to [dc=pd40,dc=sol,dc=mtp].
And this base is correct: as you can see in the sssd.conf above, my realm is PD40.SOL.MTP, so my LDAP root is dc=pd40,dc=sol,dc=mtp.
But in that backtrace I can also find some searches that were executed with the wrong search base:
* (2024-02-01 11:02:19): [be[pd40.other-domain.mtp]] [sdap_get_generic_ext_step] (0x0400): [RID#1] calling ldap_search_ext with [(|(&(objectClass=ipaCertMapRule)(ipaEnabledFlag=TRUE))(objectClass=ipaCertMapConfigObject))][cn=certmap,dc=pd40,dc=other-domain,dc=mtp].
* (2024-02-01 11:02:19): [be[pd40.other-domain.mtp]] [sdap_get_generic_ext_step] (0x0400): [RID#1] calling ldap_search_ext with [no filter][cn=default,cn=views,cn=accounts,dc=pd40,dc=other-domain,dc=mtp].
* (2024-02-01 11:02:19): [be[pd40.other-domain.mtp]] [sdap_get_generic_ext_step] (0x1000): [RID#1] Requesting attrs: [ipaDomainResolutionOrder]
* (2024-02-01 11:02:20): [be[pd40.other-domain.mtp]] [sdap_get_generic_ext_step] (0x0400): [RID#1] calling ldap_search_ext with [(&(cn=ipaConfig)(objectClass=ipaGuiConfig))][cn=etc,dc=pd40,dc=other-domain,dc=mtp].
* (2024-02-01 11:02:20): [be[pd40.other-domain.mtp]] [sdap_get_generic_ext_step] (0x1000): [RID#1] Requesting attrs: [ipaDomainResolutionOrder]
Here we can see that the search base for some reason is dc=pd40,dc=other-domain,dc=mtp (the same as the host's domain).
And after that we see the final messages:
(2024-02-01 11:02:20): [be[dc=pd40,dc=other-domain,dc=mtp]] [ipa_domain_resolution_order_done] (0x0040): [RID#1] Failed to get the domains' resolution order configuration from the server [22]: Invalid argument
(2024-02-01 11:02:20): [be[dc=pd40,dc=other-domain,dc=mtp]] [ipa_subdomains_handler_done] (0x0020): [RID#1] Unable to refresh subdomains [22]: Invalid argument
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
* (2024-02-01 11:02:20): [be[dc=pd40,dc=other-domain,dc=mtp]] [ipa_domain_resolution_order_done] (0x0040): [RID#1] Failed to get the domains' resolution order configuration from the server [22]: Invalid argument
* (2024-02-01 11:02:20): [be[dc=pd40,dc=other-domain,dc=mtp]] [ipa_domain_refresh_resolution_order_done] (0x0080): [RID#1] Unable to get the domains order resolution [22]: Invalid argument
* (2024-02-01 11:02:20): [be[dc=pd40,dc=other-domain,dc=mtp]] [sdap_id_op_done] (0x4000): [RID#1] releasing operation connection
* (2024-02-01 11:02:20): [be[dc=pd40,dc=other-domain,dc=mtp]] [ipa_domain_refresh_resolution_order_done] (0x0400): [RID#1] Unable to refresh subdomains [22]: Invalid argument
* (2024-02-01 11:02:20): [be[dc=pd40,dc=other-domain,dc=mtp]] [ipa_subdomains_handler_done] (0x0020): [RID#1] Unable to refresh subdomains [22]: Invalid argument
Could this somehow be related to the problem? Or maybe this message is not so important?
This issue looks similar to my case, and the major versions of sssd match, though I would not say the hosts are experiencing any load or memory issues: https://github.com/SSSD/sssd/issues/6803
Shouldn't we derive the baseDN from krb5_realm instead of from the domain in the IPA case?
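To make the mismatch concrete, here is a rough sketch (a throwaway shell helper introduced for illustration, not an SSSD tool) of the dotted-name-to-baseDN mapping, applied to both values from the config above:

```shell
# to_basedn: map a dotted DNS-style name (or a realm, lowercased)
# to an LDAP base DN. Hypothetical helper, for illustration only.
to_basedn() {
  printf 'dc=%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/\./,dc=/g'
}

to_basedn "pd40.other-domain.mtp"  # -> dc=pd40,dc=other-domain,dc=mtp
to_basedn "PD40.SOL.MTP"           # -> dc=pd40,dc=sol,dc=mtp
```

The first result is what the ipa_domain setting yields and matches the certmap/views search bases in the backtrace; the second is the realm-derived base that appears in the ldap_ipnetwork_search_base message.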
Hi,
what is the reason for using ipa_domain = pd40.other-domain.mtp? If you install FreeIPA, the baseDN of the LDAP server is generated from the domain name you provide and not from the realm name.
bye, Sumit
Hi, @sumit-bose The reason is that the ipa-client-install command on the host was invoked with the host's domain. Maybe it is incorrect to do so, but in the end it doesn't affect anything besides producing a few log messages. I just thought there might be some corner case where the watchdog kills the process. I hung the sssd_be process (using gdb) and triggered termination by the watchdog, but the process just restarted normally.
All I can say is that the host is experiencing load during the problem: either excessive CPU load, RAM exhaustion, or I/O latency. But I couldn't achieve anything by loading a test host with stress-ng.
It is very sad that I cannot reproduce the problem.
I have a similar issue:
Jun 05 00:40:45 eugene2.servers.bright.gdn sssd[3357484]: (2024-06-05 0:40:45): [be[BRIGHT.GDN]] [sbus_issue_request_done] (0x0040): sssd.dataprovider.getAccountInfo: Error [1432158212]: SSSD is offline
Jun 05 00:40:45 eugene2.servers.bright.gdn sssd[3357485]: (2024-06-05 0:40:45): [nss] [cache_req_common_process_dp_reply] (0x3f7c0): [CID#28217] CR #50033: Could not get account info [1432158212]: SSSD is offline
After an sssd restart it recovers.
sssd --version
2.9.4
Gentoo Linux Installed versions: 2.9.4 (05:50:04 PM 04/30/2024)(man netlink nls python sudo systemd -acl -doc -nfsv4 -samba -selinux -subid -systemtap -test ABI_MIPS="-n32 -n64 -o32" ABI_S390="-32 -64" ABI_X86="64 -32 -x32" PYTHON_SINGLE_TARGET="python3_11 -python3_10 -python3_12")
Running into this as well, on sssd version 2.9.5, Fedora 40. It happens only on some machines, but they're all configured via Ansible, so the sssd config is identical everywhere, as are the versions. Restarting sssd resolves the issue. I even added reconnection_retries = 200 under the [sssd] section in the hope it would resolve itself, but it doesn't.
We're seeing the same issue. In our case, sssd_be is what fails (for reasons unclear), but the parent process isn't aware of the failure and does nothing to restart it.
Wouldn't it be simpler to make each sssd daemon a separate systemd service, so that systemd can take care of them and restart individual daemons when they fail?
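Until something like that exists, one stopgap is a systemd drop-in (a sketch, assuming sssd is systemd-managed and the distribution unit does not already set this) that restarts the whole service if the monitor process itself dies. Note this would not catch the case described above, where sssd_be fails but the parent stays up:

```ini
# /etc/systemd/system/sssd.service.d/restart.conf (hypothetical drop-in path)
[Service]
Restart=on-failure
RestartSec=5s
```

Apply with systemctl daemon-reload followed by systemctl restart sssd.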
Hello,
I am encountering a persistent issue where sssd intermittently identifies the IPA backend as offline and fails to return online. Initially I worked around it by restarting the service, but the problem keeps coming back and I have no permanent solution. I am reluctant to restart the service every time a user hits this issue.
While sssd's logs report the backend as offline, I can still successfully execute 'id' and 'kinit' for the affected user. The 'id' command retrieves the actual groups stored in FreeIPA, confirming that FreeIPA is operational and healthy. However, sssd seems to disagree.
I've provided a link to a comprehensive log file containing all entries from /var/log/sssd/ during the SSH login attempt for 'test-user-ssh': sssd_all.log
My system configurations are as follows:
Here is a snippet of my sssd.conf file, in its default state after ipa-client-install:
Any insights or assistance in resolving this recurring sssd issue would be greatly appreciated.