SSSD / sssd

A daemon to manage identity, authentication and authorization for centrally-managed systems.
https://sssd.io
GNU General Public License v3.0

SSSD-GETENT Not returning any group members or a few probably based on cache #4883

Closed sssd-bot closed 4 years ago

sssd-bot commented 4 years ago

Cloned from Pagure issue: https://pagure.io/SSSD/sssd/issue/3898


We have an environment that was migrated from NIS, with the NIS attributes imported into AD as shown below. All of the user/group mappings work perfectly, but I can't figure out why getent group does not work, which is very odd. That command is used frequently in the environment to determine access, and we don't want to have to use ldapsearch for this when getent should work. I've looked all through the forums, but none of the ideas work.

adcli was used to join the domains.
Oracle release: Oracle Linux Server release 6.8

User (NIS -> AD):
uid -> uidNumber
username -> sAMAccountName
gid -> gidNumber

Group (NIS -> AD):
groupname -> sAMAccountName
gid -> gidNumber
username -> Member DN in AD (normal Member)

getent works only if I comment out the entry below, and then it works perfectly and quickly. But then users can't log in: sssd_be goes through every item in AD trying to figure things out, the CPU spikes, and the user login finally fails.

I can't figure out why getent group only works properly if I comment out the entry below.

ldap_group_member = member I can login successfully. (then getent group works)

id with the user logged in always works properly

domains = xxx.xxx.com
services = nss, pam, pac
debug_level = 0

[domain/xxx.xxx.com]
id_provider = ad
auth_provider = ad
chpass_provider = ad
access_provider = ad
cache_credentials = true
ldap_id_mapping = false
ldap_group_member = member

[nss]
filter_groups = root
filter_users = root
reconnection_retries = 3
entry_cache_timeout = 300
entry_cache_nowait_percentage = 75

Comments


Comment from adssd at 2018-12-07 21:23:21

Meant to say when this is remarked out: ldap_group_member = uniqueMember (logins work but getent group does not work)

ldap_group_member = uniqueMember (logins don't work but getent group does work)


Comment from jhrozek at 2018-12-10 09:23:13

Do you mean getent group w/o any additional parameters to display all the groups or getent group $groupname?

enumerate=true is required to be set in order for 'getent group' to list all the groups, but enumeration is very, very slow.

ldap_group_member=member is already the default for the AD provider.

btw RHEL-6.8 (of which OEL 6.8 is a rebuild) is very old. Upgrading to 6.10 might be a good idea.


Comment from adssd at 2018-12-10 21:13:46

For example getent group mygroupname only returns the group name and number like: mygroupname*:4367:
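A minimal sketch of how to check a getent group line for an empty member list, assuming the standard name:password:gid:members layout described in group(5). The names GroupEntry and parse_group_line are hypothetical helpers, not part of SSSD or getent.

```python
from typing import List, NamedTuple


class GroupEntry(NamedTuple):
    name: str
    gid: int
    members: List[str]


def parse_group_line(line: str) -> GroupEntry:
    """Split one group(5)-style line: name:password:gid:member1,member2,..."""
    # Assumption: four colon-separated fields, as printed by `getent group NAME`.
    name, _password, gid, members = line.rstrip("\n").split(":", 3)
    # An empty fourth field means no members were returned for the group.
    return GroupEntry(name, int(gid), [m for m in members.split(",") if m])
```

An entry whose members list is empty corresponds to the symptom reported here: the group resolves by name and GID, but no users come back with it.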

What is odd is if I use this parameter in /etc/sssd/sssd.conf (ldap_group_member = member) when I am logged in as root and perform the getent it works perfectly and retrieves the users of the group every time quickly.

However, when that parameter is used I can't log in to the server. This is why I don't think enumeration has to be turned on, as it can grab the users without it. I've also turned enumeration on, and getent still doesn't retrieve the users.

Also, when using the AD provider, which parameters like ldap_group_member are available?

Thanks for responding; I really appreciate the feedback on this. I've spent months trying to figure this out. I've come from a Centrify environment where everything just works and you don't have to know all the details in the background.

[domain/xxx.xxx.com]
id_provider = ad
auth_provider = ad
chpass_provider = ad
access_provider = ad
cache_credentials = True
ldap_id_mapping = False
enumerate = False
debug_level = 9
ldap_group_member = member


Comment from jhrozek at 2018-12-10 21:21:25

Group members shouldn't have much connection to the login process (the get-groups-for-user call uses a different codepath than get-members-of-a-group). So at this point I think it would be best to see logs of the failing login with configuration that sticks to defaults as much as possible.


Comment from adssd at 2018-12-10 21:42:15

I agree you need the logs. Is there a way you can keep the logs secure?


Comment from adssd at 2018-12-11 17:22:23

Hi Jakub, could I send you just the log snippets you are looking for, so that they contain no sensitive data, or only a limited amount where I can at least change names to protect the data?



Comment from adssd at 2018-12-11 17:39:35

When using the basic settings, getent group works perfectly if you are logged in as root. If I try to log in, it eventually comes back and fails after about 30-45 seconds. In the logs I can see it going through the whole AD structure (names, groups, etc.), like a full enumeration.

[domain/xxx.xxx.com]
id_provider = ad
auth_provider = ad
chpass_provider = ad
access_provider = ad
cache_credentials = false
ldap_id_mapping = false
enumerate = false
debug_level = 9


Comment from jhrozek at 2018-12-17 12:55:51

I guess you can send the logs to my nick at redhat dot com


Comment from adssd at 2018-12-18 04:45:29

I'm going to have a hard time sending log files because there is a lot of sensitive data in them. Is there another parameter I can change, or something I can grab out of a log file, so that it doesn't include sensitive data?


Comment from adssd at 2018-12-21 23:59:28

This is very interesting.... OEL 7.5 is working as expected with the AD provider

OEL 6 and assuming any Red Hat 6 will have these issues with AD provider (losing secondary groups sporadically, getent group not returning group members, etc). I think this should be easily reproducible in the Linux 6 environment with AD as it's consistent in our environment. The good news is OEL 7.5 seems to be very stable.

Was there more development and fixes in 7 and will Linux 6 have any more fixes?

Oracle Linux 7.5 does not have the issues above: getent works properly and the configuration works correctly. This looks like a much more mature version: sssd-1.16.2-13.el7.x86_64

Oracle Linux 6.x versions have the issues above: sssd-1.13.3-60.0.2.el6.x86_64


Comment from adssd at 2019-01-21 22:43:38

Jakub, after a long time, and while troubleshooting another issue, I've finally found a solution to this issue, which is probably affecting a lot of others on Linux 6. I had seen these errors frequently on other servers over time as well but could never figure out why. Today they were repeating about every minute on one server, so I took the opportunity to dig further into the issue.

The original question in this whole thread was with getent so I wasn't suspecting this issue was also related but thought it might possibly be. I did notice when using getent group xxx it was traversing all over the domain and I saw it going to other subdomains we don't use in our unix environment.

GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Cannot find KDC for requested realm)
Unspecified GSS failure. Minor code may provide more information (Server not found in Kerberos database)

With the debug level at 10 I could see it erroring out while trying to log in to multiple subdomains in our trusted environment, failing similarly to what I was seeing with the getent group xxxx command.

When I added this to sssd.conf and restarted sssd, the errors went away immediately:
subdomains_provider = none
I could log in without issues. Then I ran getent group xxx and it worked exactly as expected. Wow! This is great...

However, the docs for sssd.conf state for this parameter: "This value should be always the same as id_provider."
I'm using the AD provider, yet this setting fixes the issue. Is it not going to be supported, and why did it fix our issue? On OEL 7 I don't have to use this parameter and everything works as expected; it's only the OEL 6 clients (6.8+).

Can you provide me any feedback on this? Thanks!


Comment from sbose at 2019-01-22 08:08:03

Hi,

by default the AD provider tries to discover the whole AD forest and tries to make all users and groups available.

Using subdomains_provider = none effectively disables this, so that SSSD will only look at the configured domain you are joined to. With ldap_id_mapping = false this should mostly work. What you might want to check is whether the members of a group (getent group groupname) and the group memberships of a user (id username) are consistent. The reason I'm asking is that the AD provider by default determines the group membership of a user with a special call which returns all groups the user is a member of from the AD perspective. But if you manage the members of a group with the uniqueMember attribute, which is not used by AD, there might be a difference.
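A minimal sketch of that consistency check, comparing the groups reported for a user with the member lists reported for each group. The function names inconsistent_groups and collect_from_nss are hypothetical; the second one assumes the lookups go through NSS (and therefore SSSD) via Python's standard pwd, grp and os modules.

```python
import grp
import os
import pwd


def inconsistent_groups(user, user_groups, group_members):
    """Compare the user-side view (like `id username`) with the
    group-side view (like `getent group groupname`)."""
    claimed = set(user_groups)
    listed = {g for g, members in group_members.items() if user in members}
    return {
        "in_id_only": sorted(claimed - listed),      # user has the group, group lacks the user
        "in_getent_only": sorted(listed - claimed),  # group lists the user, user lacks the group
    }


def collect_from_nss(user):
    """Gather both views from the local NSS stack.

    Caveats: the primary group usually does not list the user as a member,
    and only groups already reported for the user are inspected, so this
    helper can only surface the "in_id_only" direction.
    """
    primary_gid = pwd.getpwnam(user).pw_gid
    gids = os.getgrouplist(user, primary_gid)
    user_groups = [grp.getgrgid(gid).gr_name for gid in gids]
    group_members = {name: list(grp.getgrnam(name).gr_mem) for name in user_groups}
    return user_groups, group_members
```

Calling inconsistent_groups(user, *collect_from_nss(user)) on an affected client would list the groups whose member list does not include the user, which is the mismatch to look for when membership is stored in a non-default attribute.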

There is a less radical way to tell SSSD not to look at all domains in a forest. Recent versions of SSSD have the ad_enabled_domains option, where you can list the domains you are interested in. This way you can keep subdomains_provider = ad and let it discover the forest with all the additional details like short domain names, domain SIDs, etc., but not use all of them. I cannot tell off the top of my head if this option is backported to the version you are using; please check whether the sssd-ad man page describes this option.

HTH

bye, Sumit


Comment from jhrozek at 2019-01-22 09:21:22

It should also be backported to el6


Comment from adssd at 2019-01-23 14:41:26

I added ad_enabled_domains = xxx.xxx.xxx to the domain section, cleaned the cache, and restarted sssd; however, this didn't work and I couldn't log in. In the logs it looked like it was going through all the accounts/users/groups, creating a large log and load on the server until it failed.

Maybe our version in Oracle doesn't support that backport. We have hardware requirements to stay at this version (OEL 6) for a while.
This is the version of sssd: sssd-client-1.13.3-60.0.2.el6.x86_64

Sumit also commented above. It appears, though, that subdomains_provider is our only solution. Do we then lose failover capabilities if a domain controller goes down for any reason? We still have lots of domain controllers. What are the issues if we use this?


Comment from adssd at 2019-01-23 16:52:17

One more item we found during testing: with subdomains_provider = none, groups xxxx doesn't return all the groups the user is a member of. When users log in to a server and their groups are checked, this causes security issues, so it's a show stopper. The real workaround, as Sumit stated, is to use ad_enabled_domains. However, on our version it must not be working or available. Can you please help us with this?


Comment from sbose at 2019-01-28 11:15:53

The matching version of RHEL6 has this option enabled, so I doubt that it is disabled in OEL6.

To debug this further it would be good to see SSSD logs with debug_level=9, at least in the [domain/...] section of sssd.conf.

About the group memberships: if the user is a member of groups from other domains in the forest, you are right that subdomains_provider = none cannot be used as long as those groups are needed. But please note that you have to add those domains to the list in ad_enabled_domains as well.

bye, Sumit


Comment from adssd at 2019-01-28 22:21:46

Thanks, Sumit, for getting back to me. When I added subdomains_provider = none, the getent group and id commands don't return any secondary groups; it says it can't find the SID in the domain. As you stated, it breaks discovering SIDs.

With ad_enabled_domains = xxx.xxx.xxxx, getent passwd and getent group are working; however, I can't log in. Looking at the logs I can see it going through every userid/group in the domain, probably adding them to the cache. Then it finally comes back as a failed login. I can't provide a full log file right now as it has a lot of sensitive data in it. I can provide you highlights. If you let me know what you are looking for, I can blank out any sensitive userid/domain, etc.

[sdap_process_ghost_members]... tons of these messages at the end of the log


Comment from sbose at 2019-01-29 09:18:07

> Thanks Sumit for getting back with me. When I added subdomains_provider = none , getent group, id commands don't return any secondary groups says it can't find the SID in the domain. As you stated it's breaking discovering SIDS.

In this case it might help to specify the domain SID of the AD domain you are joined to with ldap_idmap_default_domain_sid and if needed the proper domain name with ldap_idmap_default_domain, see man sssd-ldap for details.

> With ad_enabled_domains = xxx.xxx.xxxx getent passwd/getent group are working, however I can't login. When looking at the logs I can see it's going through every userid/group in the domain and probably adding them in the cache. Then it finally comes back as failed login. I can't provide a full log file right now as it has alot of sensitive data in it. I can provide you highlights. If you can let me know what you are looking for I can remark out any sensitive userid/domain, etc.
>
> [sdap_process_ghost_members]... ton's of these messages at the end of the log

So you might run into a timeout while processing all the group members. For testing you might want to try to speed this up by ignoring the group members (options ignore_group_members and subdomain_inherit, see man sssd.conf for details). Alternatively, you can put SSSD's cache into memory as described in the 'Mount the cache in tmpfs' section of https://jhrozek.wordpress.com/2015/08/19/performance-tuning-sssd-for-large-ipa-ad-trust-deployments/.

bye, Sumit


Comment from adssd at 2019-01-29 18:40:38

Sumit, thanks for the information. We need secondary groups in our environment, as they provide security for folders. I've tried the suggestions above, but they don't work and I get the same errors.

The main issue is that something in the code is not working properly: getent group works (see Table 1) and returns secondary groups properly, but I can't log in.

-- Table 1 --
[domain/xxx.xxx.xxxx]
id_provider = ad
ad_enabled_domains = xxx.xxx.xxx (with this commented out, same results)
auth_provider = ad
chpass_provider = ad
access_provider = ad
cache_credentials = false
ldap_id_mapping = false
enumerate = false
debug_level = 9

-- Table 2 -- (This should be a default sssd.conf scenario for AD with POSIX attributes.) I can log in, but getent does not return secondary groups properly, and out of the blue secondary groups start disappearing, which is a huge issue as users lose access.

[domain/xxx.xxx.xxxx]
id_provider = ad
auth_provider = ad
chpass_provider = ad
access_provider = ad
cache_credentials = false
ldap_id_mapping = false
ldap_group_member = memberUid
** I can only log in when I put something here; if I put member instead of memberUid (which should be the default setting), it hangs on login.
enumerate = false
debug_level = 9

Something in the login code path with Table 1 makes it try to go through all the entries in AD. getent group and getent passwd don't do that.


Comment from adssd at 2019-01-29 18:45:13

Also, with OEL 7 the configs above in Table1 work with no issues. OEL 6 has the bug.


Comment from sbose at 2019-01-30 11:01:34

Is the 'id username' command working as well on OEL6 with Table1?

If the domain you are joined to is not the forest root, can you try to add the forest root domain to ad_enabled_domains as well?

bye, Sumit


Comment from adssd at 2019-01-30 15:15:06

Sumit,

id username does return the correct values with Table 1 as well. We just can't log in with Table 1. getent group, getent passwd, and id all work with this configuration, but the login fails.


Comment from sbose at 2019-01-30 15:55:15

Ok, then it looks like a pure authentication/authorization issue. Getting the groups the user is a member of is part of the authentication, that's why I asked if 'id' is working.

Please have a look at /var/log/secure for PAM-related messages during the login attempt and check whether authentication (auth) or access control (acct_mgmt) fails. If it is access control you might try 'access_provider = permit' as a workaround.

To debug this further you should add debug_level=9 to the [pam] and [domain/...] sections of sssd.conf, restart SSSD and follow the authentication and authorization requests through sssd_pam.log, sssd_domain.name.log and krb5_child.log. You can look for "command: SSS_PAM_AUTHENTICATE" in the pam and domain logs for the authentication request.

bye, Sumit


Comment from adssd at 2019-01-30 16:35:10

Sumit,

Here's what I've found in the log files when using Table 1. I tried access_provider = permit but I couldn't log in, so I changed it back and then tried to log in again to capture the logs below.

Jan 30 15:12:33 servernameXXX sshd[8852]: pam_sss(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=DOMAINserverXXX user=XXXXX
Jan 30 15:12:33 servernameXXX sshd[8852]: pam_sss(sshd:auth): received for user XXXXX: 9 (Authentication service cannot retrieve authentication info)
Jan 30 15:12:35 servernameXXX sshd[8852]: Failed password for UserNameXXX from xx.xx.xx.xxx port 49223 ssh2

(Wed Jan 30 15:19:03 2019) [sssd[pam]] [pam_dp_process_reply] (0x0200): received: [9 (Authentication service cannot retrieve authentication info)][xxx.xxx.xxx]
(Wed Jan 30 15:19:03 2019) [sssd[pam]] [pam_reply] (0x0200): pam_reply called with result [9]: Authentication service cannot retrieve authentication info.
(Wed Jan 30 15:19:03 2019) [sssd[pam]] [filter_responses] (0x0100): [pam_response_filter] not available, not fatal.
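A minimal sketch for cutting down debug noise by parsing SSSD log lines of the shape shown above: (timestamp) [component] [function] (0xNNNN): message. The regular expression is inferred from the lines quoted in this thread only, and the helper names parse_sssd_line and at_most are hypothetical. The filter assumes that smaller level masks mark more severe messages, as the debug level table in the sssd.conf man page suggests.

```python
import re

# Pattern inferred from the log excerpts in this thread; other SSSD versions
# may format lines differently.
LOG_LINE = re.compile(
    r"\((?P<time>[^)]*)\) "
    r"\[(?P<component>.+?)\] "
    r"\[(?P<function>[^\]]+)\] "
    r"\((?P<level>0x[0-9a-fA-F]+)\): "
    r"(?P<message>.*)"
)


def parse_sssd_line(line):
    """Return a dict with time, component, function, level (int), message, or None."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["level"] = int(entry["level"], 16)
    return entry


def at_most(lines, max_level):
    """Keep only parsed entries whose level mask is <= max_level."""
    parsed = (parse_sssd_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] <= max_level]
```

Because search() is used rather than match(), lines carrying a file-name prefix (as grep prints across several log files) are parsed as well.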


Comment from sbose at 2019-01-30 17:25:33

SSSD logs will have more details, but it sounds a bit like SSSD thinks it is offline. Since the user and group lookups are working fine, it might be related to a timeout during authentication. Maybe adding 'krb5_auth_timeout=15' to the [domain/...] section of sssd.conf helps. This would allow Kerberos authentication, including ticket validation, to take up to 15s.

bye, Sumit


Comment from adssd at 2019-01-30 17:47:59

Thanks for helping, I really appreciate it. I've been trying to figure this out for a couple of months. What's odd is that it works fine with Table 2, apart from getent group and the disappearing groups, and the authentication works quickly.


Comment from sbose at 2019-02-01 16:15:57

> Thanks for helping I really appreciate it. I've been trying to figure this out for a couple of months. What's odd is how it works find in Table2 just not the getent groups and groups disappearing but the authentication works quickly.

Yes, this looks odd. Maybe the timeout was a bit on the edge, so that it worked more often with Table 2. Please note that with 'cache_credentials=True', after a successful online authentication SSSD will cache a password hash and use it for upcoming authentications if the system is offline, and a timeout during authentication will be considered as offline. The difference is that you do not have a fresh Kerberos ticket after authentication.

Do you agree to close this ticket?

bye, Sumit


Comment from adssd at 2019-02-01 16:40:06

Sumit, we haven't resolved this: getent group still fails and groups are still disappearing. We don't have a configuration that fixes this yet for OEL 6. Table 1 is the one that fixes getent group, and I suspect it will fix the disappearing groups, but we can't authenticate with this configuration. Please keep helping us with this.


Comment from sbose at 2019-02-01 17:23:46

> Sumit, we haven't resolved this as getent group, and groups are disappearing. We don't have a configuration that fixes this yet for OEL 6. Table1 is the one that fixes the getent group and I'm susspecting will fix the disappearing groups, but we can't authenticate with this configuration. Please keep helping us with this.

Of course, I must have misinterpreted your last comment and thought that authentication is working now. But I'm afraid we have to find a way so that I can have a look at the SSSD debug logs. Would it be possible to send them by email to my name here '@redhat.com' ?

bye, Sumit


Comment from adssd at 2019-02-01 19:13:23

Let me check to get the approval.


Comment from adssd at 2019-02-01 23:35:16

Hi Sumit, have you collaborated/worked with the Oracle Technical Support development team on the open-source sssd? I think we could do something with them and your team if you have a working partnership. Let me know.


Comment from adssd at 2019-02-06 18:02:33

Here's the latest of what I found yesterday and today. I turned down the debugging level to cut the noise and see if something stuck out. What I found was very interesting, and it pinpoints where the issue is coming from. I received the error below in krb5_child.log (dealing with groups) when I logged in as myself:

krb5_child.log:(Wed Feb 6 15:23:01 2019) [[sssd[krb5_child[27197]]]] [sss_send_pac] (0x0040): sss_pac_make_request failed [-1][0].
krb5_child.log:(Wed Feb 6 15:23:01 2019) [[sssd[krb5_child[27197]]]] [validate_tgt] (0x0040): sss_send_pac failed, group membership for user with principal [MyFullName\@xxx.xxx.COM@xxx.xxx.COM] might not be correct.

In AD I'm a member of about 30 groups and only a few of them are posix groups. When I have full debugging turned on I can see it massively going through all the users/groups loading it into cache.

I then tested using a user with only 3 posix groups and it logged in successfully. One other thing I noticed was that the sssd_be daemon goes to 100% as well. With my userid it stays like that for quite a while until it times out; with the other user it returns 15-20 seconds after the login. In OEL6 it has something to do with getting all the group members put into the groups and possibly timing out. This behavior is not seen in OEL 7, so some code has fixed the issue.

Sumit, as you suggested, ignore_group_members fixes this issue.
If I add ignore_group_members = True to sssd.conf, I can log in successfully. getent doesn't show the members, as this option stops that from occurring, but id works. When I log in to a server it sees my secondary groups, which is important. I need to test more deeply what this parameter affects in our environment.
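A minimal sketch of applying that option to a domain section programmatically, assuming sssd.conf can be read as plain INI by Python's configparser. The helper name set_domain_options is hypothetical, and configparser drops comments and normalizes spacing when it rewrites the text, so this is better suited to test machines than to hand-maintained production files.

```python
import configparser
import io


def set_domain_options(conf_text, domain, options):
    """Return conf_text with the given options set in [domain/<domain>]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    section = f"domain/{domain}"
    if not parser.has_section(section):
        parser.add_section(section)
    for key, value in options.items():
        # configparser requires string values, e.g. "True" rather than True.
        parser.set(section, key, value)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

For example, set_domain_options(text, "xxx.xxx.com", {"ignore_group_members": "True"}) returns the configuration text with the option added; SSSD still has to be restarted for the change to take effect.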

Is there anything that was fixed in OEL7 but maybe not in OEL6 that works around this issue?


Comment from adssd at 2019-02-20 17:40:36

Metadata Update from @adssd:

kevdogg commented 9 months ago

Sucks this issue was closed w/o resolution.