SSSD / sssd

A daemon to manage identity, authentication and authorization for centrally-managed systems.
https://sssd.io
GNU General Public License v3.0

sssd does not lookup user gid's at reboot without *.ldb files #7612

Open joakim-tjernlund opened 1 month ago

joakim-tjernlund commented 1 month ago

cd /var/lib/sss/db/
rm -f *.ldb
reboot

As soon as the machine is up, ssh in as root or log in as local root and run:

> id labuser

uid=10019(labuser) gid=100(users) groups=100(users)

Only system GIDs are returned.

Wait ca. 2 minutes and try another user; id will then return the AD GIDs. Try the first user again:

> id labuser

uid=10019(labuser) gid=100(users) groups=100(users)

It still returns just the system GIDs and will do so for a while (minutes).

This is on current master but the issue has been present for months I think.
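
For anyone trying to reproduce the timing, a small polling helper can record how long it takes before groups beyond the system ones show up. This is only a sketch: the names lookup_gids and wait_for_extra_groups are made up for illustration, it assumes id -G is available, and the timeout and interval defaults are guesses.

```python
import subprocess
import time


def lookup_gids(user):
    """Return the set of numeric GIDs that `id -G` reports for user."""
    out = subprocess.run(["id", "-G", user], capture_output=True, text=True, check=True)
    return {int(gid) for gid in out.stdout.split()}


def wait_for_extra_groups(user, baseline, timeout=600.0, interval=5.0,
                          lookup=lookup_gids, clock=time.monotonic, sleep=time.sleep):
    """Poll until user has GIDs beyond baseline; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        gids = lookup(user)
        if gids - baseline:
            # At least one GID outside the baseline (e.g. an AD group) is visible.
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

A call such as wait_for_extra_groups("labuser", baseline={100}) would return the number of seconds until groups other than GID 100 appear. Keep in mind that every poll is itself a lookup, so the polling may influence what gets cached.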

alexey-tikhonov commented 1 month ago

Hi.

What is the setup configuration?

And what step do you consider a bug? Does it help if you call 2nd id as SSS_NSS_USE_MEMCACHE=NO id...?

joakim-tjernlund commented 1 month ago

I consider it a bug that the id command returns just system GIDs when sssd is running and the network is up, and also that sssd caches this false entry for a long time.

./configure --prefix=/usr --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --datarootdir=/usr/share --disable-dependency-tracking --disable-silent-rules --disable-static --docdir=/usr/share/doc/sssd-9999 --htmldir=/usr/share/doc/sssd-9999/html --with-sysroot=/ --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --runstatedir=/run --sbindir=/usr/sbin --with-pid-path=/run --with-plugin-path=/usr/lib64/sssd --enable-pammoddir=//lib64/security --with-ldb-lib-dir=/usr/lib64/samba/ldb --with-db-path=/var/lib/sss/db --with-gpo-cache-path=/var/lib/sss/gpo_cache --with-pubconf-path=/var/lib/sss/pubconf --with-pipe-path=/var/lib/sss/pipes --with-mcache-path=/var/lib/sss/mc --with-secrets-db-path=/var/lib/sss/secrets --with-log-path=/var/log/sssd --with-kcm --enable-kcm-renewal --with-os=gentoo --disable-rpath --disable-static --disable-valgrind --with-samba --enable-cifs-idmap-plugin --without-selinux --enable-krb5-locator-plugin --disable-pac-responder --with-nfsv4-idmapd-plugin --enable-nls --with-libnl --with-manpages --without-sudo --with-autofs --with-ssh --without-oidc-child --without-passkey --without-subid --disable-systemtap --without-python2-bindings --with-python3-bindings --with-initscript=systemd --with-systemdunitdir=/usr/lib/systemd/system

alexey-tikhonov commented 1 month ago

Sorry, I meant sssd.conf. Is this user, labuser, an AD user?

joakim-tjernlund commented 1 month ago

Sorry, I meant sssd.conf. Is this user, labuser, an AD user?

yes, it is an AD user

sssd.conf:

[sssd]
#config_file_version = 2
domains = infinera.com
#domains = infinera.com,transmode.se
services = nss, pam
#debug_level = 0x0fff
debug_level = 0x0000

[nss]
fallback_homedir = /home/%u
default_shell = /bin/bash
#debug_level = 0x0fff
enum_cache_timeout = 3600
entry_negative_timeout = 300
debug_level = 0x0000

[kcm]
tgt_renewal = true

# will inherit all KCM krb5_xxx values
#tgt_renewal_inherit = infinera.com
krb5_renewable_lifetime = 7d
krb5_lifetime = 10h
krb5_renew_interval = 2h
debug_level = 0x0000

[pam]
#Needs patch ?
pam_account_locked_message = "Account Locked"
#debug_level = 0x0fff
pam_response_filter = -ENV:KRB5CCNAME:sudo-i, -ENV:KRB5CCNAME:sudo

[domain/infinera.com]
dns_resolver_use_search_list = false
ad_enabled_domains = infinera.com
#debug_level = 0xffff
debug_level = 0x0000

timeout = 30
ad_maximum_machine_account_password_age = 0

#Do not think we need referals? Is a performance drain
ldap_referrals = false

ignore_group_members = false
ldap_id_mapping = false
cache_credentials = true
enumerate = false
ldap_enumeration_refresh_timeout = 1800
entry_cache_timeout = 3600
refresh_expired_interval = 2700

id_provider = ad
auth_provider = ad
access_provider = permit
chpass_provider = ad

ad_server = x.y.com

dyndns_auth = none
dyndns_auth_ptr = GSS-TSIG
dyndns_update = true
dyndns_refresh_interval = 60
dyndns_update_ptr = true
dyndns_ttl = 3600
case_sensitive = false

ldap_referrals = false
ldap_sasl_mech = GSSAPI
ldap_schema = rfc2307bis

ldap_access_order = expire
ldap_account_expire_policy = ad
ldap_force_upper_case_realm = true

krb5_realm = INFINERA.COM
krb5_canonicalize = true
krb5_store_password_if_offline = true
krb5_use_kdcinfo = False
krb5_renewable_lifetime = 7d
krb5_lifetime = 24h
krb5_renew_interval = 4h

alexey-tikhonov commented 1 month ago

Why do you use ldap_schema = rfc2307bis with id_provider = ad?

Does it help if you call 2nd id as SSS_NSS_USE_MEMCACHE=NO id...?

Did you have a chance to check this? I guess "try first user again... still returns just system gid's and will do so for a while(minutes)" is due to mem-cache.

Wrt first lookup returning correct UID but empty GIDs - one needs to check logs. Those should be deduced from tokenGroups.

joakim-tjernlund commented 1 month ago

Why do you use ldap_schema = rfc2307bis with id_provider = ad?

That is ancient, probably a leftover from the LDAP days. I will try plain rfc2307.

Does it help if you call 2nd id as SSS_NSS_USE_MEMCACHE=NO id...?

Did you have a chance to check this? I guess "try first user again... still returns just system gid's and will do so for a while(minutes)" is due to mem-cache.

I did systemctl edit sssd.service and added:

[Service]
Environment=SSS_NSS_USE_MEMCACHE=NO

This did not change anything. Should I have done it differently?

Wrt first lookup returning correct UID but empty GIDs - one needs to check logs. Those should be deduced from tokenGroups.

joakim-tjernlund commented 1 month ago

Why do you use ldap_schema = rfc2307bis with id_provider = ad?

That is ancient, probably a leftover from the LDAP days. I will try plain rfc2307.

Does it help if you call 2nd id as SSS_NSS_USE_MEMCACHE=NO id...?

Did you have a chance to check this? I guess "try first user again... still returns just system gid's and will do so for a while(minutes)" is due to mem-cache.

I did systemctl edit sssd.service and added:

[Service]
Environment=SSS_NSS_USE_MEMCACHE=NO

This did not change anything. Should I have done it differently?

Oh, I misread. I did this instead: SSS_NSS_USE_MEMCACHE=NO id labuser

But that did not change anything either; in fact it got worse. Now it won't resolve GIDs for any user even if I wait a few minutes. Then I skipped the SSS_NSS_USE_MEMCACHE=NO part and id would fetch GIDs again for new users.
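
The difference between the two attempts above is where the variable is set: in the sssd service unit versus in the environment of the process doing the lookup. As far as I understand, SSS_NSS_USE_MEMCACHE is read by the SSSD client code loaded into the calling process, which is why the suggestion was to prefix the id command itself. A minimal sketch of doing the same from a script, where run_without_memcache is a hypothetical helper name:

```python
import os
import subprocess


def run_without_memcache(cmd):
    """Run cmd with SSS_NSS_USE_MEMCACHE=NO set in the child's own environment."""
    env = dict(os.environ, SSS_NSS_USE_MEMCACHE="NO")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)
```

For example, run_without_memcache(["id", "labuser"]).stdout holds the same text as running SSS_NSS_USE_MEMCACHE=NO id labuser in a shell.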

alexey-tikhonov commented 1 month ago

Why do you use ldap_schema = rfc2307bis with id_provider = ad?

That is ancient, probably a leftover from the LDAP days. I will try plain rfc2307.

Why not leave it at the default, ldap_schema = ad?

joakim-tjernlund commented 1 month ago

Why do you use ldap_schema = rfc2307bis with id_provider = ad?

That is ancient, probably a leftover from the LDAP days. I will try plain rfc2307.

Why not leave it at the default, ldap_schema = ad?

I can try that too. I don't recall why that was there; it was added many years ago.

joakim-tjernlund commented 1 month ago

Nothing I have done above has helped; sssd simply does NOT speak to AD until a few minutes (2-3) have passed. Running id before that has happened will prolong this time by several minutes.

I hope you can reproduce this?

alexey-tikhonov commented 1 month ago

Would it be possible to get SSSD logs (sssd_nss.log and sssd_$domain.log) with debug_level = 9?

joakim-tjernlund commented 1 month ago

Would it be possible to get SSSD logs (sssd_nss.log and sssd_$domain.log) with debug_level = 9?

sssd_infinera.com.log sssd_nss.log

alexey-tikhonov commented 1 month ago

Would it be possible to get SSSD logs (sssd_nss.log and sssd_$domain.log) with debug_level = 9?

sssd_infinera.com.log sssd_nss.log

It looks more or less fine, 'tokenGroups' lookup seems to return a list of SIDs.

The problem is that, at this moment, the domain those SIDs belong to isn't yet(?) known (not discovered by SSSD):

$ grep "Domain not found for SID" sssd_infinera.com.log 
... [sdap_ad_tokengroups_get_posix_members] (0x0080): [RID#28] Domain not found for SID S-1-5-21-1757981266-1085031214-682003330-46276
... [sdap_ad_tokengroups_get_posix_members] (0x0080): [RID#28] Domain not found for SID S-1-5-21-1757981266-1085031214-682003330-89875
... [sdap_ad_tokengroups_get_posix_members] (0x0080): [RID#28] Domain not found for SID S-1-5-21-1757981266-1085031214-682003330-92642
... [sdap_ad_tokengroups_get_posix_members] (0x0080): [RID#28] Domain not found for SID S-1-5-21-1757981266-1085031214-682003330-513
... [sdap_ad_tokengroups_get_posix_members] (0x0080): [RID#28] Domain not found for SID S-1-5-21-1757981266-1085031214-682003330-92633
... [sdap_ad_tokengroups_get_posix_members] (0x0080): [RID#28] Domain not found for SID S-1-5-21-1757981266-1085031214-682003330-56419
... [sdap_ad_tokengroups_get_posix_members] (0x0080): [RID#28] Domain not found for SID S-1-5-21-1757981266-1085031214-682003330-92645
... [sdap_ad_tokengroups_get_posix_members] (0x0080): [RID#28] Domain not found for SID S-1-5-21-1757981266-1085031214-682003330-92638
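
To check whether all of these SIDs belong to a single domain, the log lines can be grouped by domain SID, that is, the SID with its final component (the RID) removed. This is only an illustrative sketch: unresolved_domain_sids is a made-up name, and the regular expression assumes the message format shown in the lines above.

```python
import re
from collections import Counter

# Matches the message text seen in the domain log excerpt above.
SID_RE = re.compile(r"Domain not found for SID (S-\d+(?:-\d+)+)")


def unresolved_domain_sids(log_lines):
    """Count 'Domain not found' messages per domain SID (SID minus its trailing RID)."""
    counts = Counter()
    for line in log_lines:
        match = SID_RE.search(line)
        if match:
            domain_sid, _, _rid = match.group(1).rpartition("-")
            counts[domain_sid] += 1
    return counts
```

Fed the eight lines above, this yields one key, S-1-5-21-1757981266-1085031214-682003330, with a count of 8, so every unresolved SID comes from the same domain.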

@sumit-bose, @justin-stephenson, this looks familiar but I can't recall details...

sumit-bose commented 1 month ago

Hi,

I guess you are thinking of https://github.com/SSSD/sssd/issues/7250.

bye, Sumit

alexey-tikhonov commented 1 month ago

Hi,

I guess you are thinking of #7250.

Indeed. But it was fixed quite some time ago and, IIUC, @joakim-tjernlund is using a build of the latest 'master'?

joakim-tjernlund commented 1 month ago

Hi, I guess you are thinking of #7250.

Indeed. But it was fixed quite some time ago and, IIUC, @joakim-tjernlund is using build of latest 'master'?

Yes, I am on master. I vaguely remember the 7250 issue but no details I am afraid.

joakim-tjernlund commented 1 month ago

for fun I did:

diff --git a/src/providers/ad/ad_subdomains.c b/src/providers/ad/ad_subdomains.c
index d8f3738ce..fe8b823d6 100644
--- a/src/providers/ad/ad_subdomains.c
+++ b/src/providers/ad/ad_subdomains.c
@@ -1582,7 +1582,7 @@ static void ad_get_root_domain_done(struct tevent_req *subreq)
         return;
     }

-    ret = ad_get_root_domain_refresh(state, false);
+    ret = ad_get_root_domain_refresh(state, true);
     if (ret != EOK) {
         DEBUG(SSSDBG_OP_FAILURE, "ad_get_root_domain_refresh() failed.\n");
     }

but that didn't help.

sumit-bose commented 1 month ago

Hi,

according to the logs DNS is not available when SSSD is starting. Is this expected?

bye, Sumit

joakim-tjernlund commented 1 month ago

Hi,

according to the logs DNS is not available when SSSD is starting. Is this expected?

bye, Sumit

Yes, sssd starts before the network is up, and the network may never come up if the machine is not connected at all. The network is started by NetworkManager, which uses DHCP.

joakim-tjernlund commented 1 month ago

Restarting sssd after the network is up does not help either.

joakim-tjernlund commented 1 month ago

I am getting more complaints/support requests now as people upgrade their computers. Any progress?

sumit-bose commented 1 month ago

Hi,

it looks like there is a race condition between bringing up the network, reading the domain topology (including domain SIDs) and handling requests which depend on the domain topology. I'm looking for a way to avoid it.

bye, Sumit

joakim-tjernlund commented 1 month ago

Any luck? Eager to test something

joakim-tjernlund commented 2 weeks ago

Ping?

sumit-bose commented 1 week ago

Hi,

thank you for your patience. Please have a look at https://github.com/SSSD/sssd/pull/7673, there are copr builds for recent Fedora and RHEL releases at https://copr.fedorainfracloud.org/coprs/g/sssd/pr7673/.

The pull request is currently in Draft state because I'm not sure it will be the final solution; I still have to figure out whether some race conditions remain. So it would be nice if you could also check whether you still see failures after reboot. Additionally, the patch refreshes the sub-domain data at every switch from offline to online, rather than only ensuring that it is done when getting online after a restart.

bye, Sumit

joakim-tjernlund commented 1 week ago

A quick test on my test system works. Now the id command just hangs for a few seconds and then I get the full groups back.

I guess the initial id request does quite some extra work?

joakim-tjernlund commented 1 week ago

I have added the patch to our Gentoo so it will get some more testing the coming week.

another unrelated observation:

sss_cache -E
id user1 - takes about 5 secs
id user2 - well below 1 second

The initial id command after sss_cache -E, or after rm *.ldb; restart sssd, always takes several seconds (5 or so) to complete, but any id command after that is fast.

sumit-bose commented 1 week ago

I have added the patch to our Gentoo so it will get some more testing the coming week.

another unrelated observation:

sss_cache -E
id user1 - takes about 5 secs
id user2 - well below 1 second

The initial id cmd after sss_cache -E or rm *ldb files ; restart sssd always takes several secs(5 or so) to complete but any other id cmd after that is fast.

Hi,

thanks for testing. The different times are expected since you are using ignore_group_members = false. This means that for the first id call SSSD has to read all groups the user is a member of, and all the members of those groups as well. If the second user is a member of similar groups to the first, all groups for the second user are already in the cache.

bye, Sumit

joakim-tjernlund commented 1 week ago

I have added the patch to our Gentoo so it will get some more testing the coming week. another unrelated observation:

sss_cache -E
id user1 - takes about 5 secs
id user2 - well below 1 second

The initial id cmd after sss_cache -E or rm *ldb files ; restart sssd always takes several secs(5 or so) to complete but any other id cmd after that is fast.

Hi,

thanks for testing. The different times are expected since you are using ignore_group_members = false. This means for the first id call SSSD has to read all groups the user is a member of and all the members of those groups as well. If the second user is a member of similar groups than the first all groups for the second user are already in the cache.

bye, Sumit

Not anymore! :) Seriously, what use case needs that? Samba file servers or something more exotic?

joakim-tjernlund commented 1 week ago

This extra work to read all members of a group, could it not be a background task? The id command is not asking for it, so it seems that work could be batched in the background.

alexey-tikhonov commented 1 week ago

id cmd is not asking for that so it seems that work can be batched in background.

id gets a list of GIDs the user is a member of (using getgrouplist()) and then needs to resolve every GID to a group name. This resolution is done using getgrgid(), which returns a struct group including all members. The fact that id doesn't use the group::gr_mem data later doesn't matter.
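
The same two-step pattern can be tried from Python, whose os.getgrouplist() and grp.getgrgid() wrap the corresponding libc calls. This is only a sketch of the lookup pattern, not the actual id source, and resolve_groups is a name made up for illustration.

```python
import grp
import os
import pwd


def resolve_groups(username):
    """List the user's GIDs, then resolve each GID to a name, roughly as id does."""
    pw = pwd.getpwnam(username)
    gids = os.getgrouplist(username, pw.pw_gid)  # step 1: GIDs only
    resolved = []
    for gid in gids:
        try:
            # Step 2: one full group entry per GID. The member list (gr_mem)
            # is part of the entry even though only gr_name is needed here.
            entry = grp.getgrgid(gid)
            resolved.append((gid, entry.gr_name, len(entry.gr_mem)))
        except KeyError:
            # GID with no matching group entry.
            resolved.append((gid, None, 0))
    return resolved
```

The third element of each tuple is the size of the member list that came back with the entry, which gives a rough idea of how much membership data each name resolution carries.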

joakim-tjernlund commented 4 days ago

Hi,

thank you for your patience. Please have a look at #7673, there are copr build for recent Fedora and RHEL releases at https://copr.fedorainfracloud.org/coprs/g/sssd/pr7673/.

The pull-request is currently in Draft state, because I'm not sure if it will be the final solution because I have to figure out if there are still some race conditions. So it would be nice if you can check as well if you still see failures after reboot. Additionally, the patch does a refresh of the sub-domain data at every switch from offline to online and not only ensures that it is done after restart when getting online.

bye, Sumit

A handful of people or so have tested this now and it still looks good. Ship it! :)

joakim-tjernlund commented 2 days ago

I have added the patch to our Gentoo so it will get some more testing the coming week. another unrelated observation:

sss_cache -E
id user1 - takes about 5 secs
id user2 - well below 1 second

The initial id cmd after sss_cache -E or rm *ldb files ; restart sssd always takes several secs(5 or so) to complete but any other id cmd after that is fast.

Hi, thanks for testing. The different times are expected since you are using ignore_group_members = false. This means for the first id call SSSD has to read all groups the user is a member of and all the members of those groups as well. If the second user is a member of similar groups than the first all groups for the second user are already in the cache. bye, Sumit

Not anymore ! :) Seriously, what use case needs that? samba file servers or something more exotic?

So ignore_group_members = true failed for www-apache/mod_authz_unixgroup. If you need it, there is a PR to make it work: https://github.com/phokz/mod-auth-external/pull/54/commits/687b088c2b703243036cfbf8b3b5692dd7177bc5