389ds / 389-ds-base

The enterprise-class Open Source LDAP server for Linux
https://www.port389.org/

Server hangs during group membership update #6237

Open vudex opened 1 week ago

vudex commented 1 week ago

Issue Description

Currently, I have two groups named external-users and internal-users, which identify users requiring access to specific virtual machine environments. Whenever a new environment is established, a corresponding group is created in the 389 Directory Server (we actually manage this through FreeIPA; I describe it in 389 DS terms for clarity). After the environment group is created, the two groups mentioned above are added as members to grant access to the new environment.

At present, each such operation, for example adding the external-users and internal-users groups to a new environment group, takes several minutes. While it runs, the 389 Directory Server becomes unresponsive and responses are substantially delayed. We notice this through monitoring alerts, in particular from our VPN checker (which authenticates via the sssd package and is sensitive to LDAP responsiveness) and from our DNS server (using the freeipa-dns package), both of which see delayed responses.

Therefore, I have collected stack traces generated during these operations in the hope that someone could assist me by offering advice on actions to alleviate this issue.

stacktraces.tar.gz

Package Version and Platform:

Steps to Reproduce

Steps to reproduce the behavior (a reproduction sketch follows the list):

  1. Create 100 users and place them in the internal-users group
  2. Create 150 groups
  3. For every such group, add the internal-users group as a group member
  4. Observe the slowdown in operations
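
For context, a minimal reproduction sketch using the FreeIPA CLI (the user and group names are placeholders based on the description above; it assumes an admin Kerberos ticket):

# Create 100 test users and add them to the internal-users group
for i in $(seq 1 100); do
    ipa user-add "testuser$i" --first Test --last "User$i"
done
ipa group-add-member internal-users --users="$(seq -s, -f 'testuser%g' 1 100)"

# Create 150 environment groups, each with internal-users as a nested member group
for i in $(seq 1 150); do
    ipa group-add "env-group-$i"
    ipa group-add-member "env-group-$i" --groups=internal-users
done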

Expected results

Group membership updates that do not noticeably affect normal operations.

progier389 commented 1 week ago

Hi, there could be a lot of reasons:

IMHO the first thing to check is the internal operations in the access log: make sure that internal operations are logged (i.e. you see some conn=Internal entries in the access log), and check for notes=F and the etime.
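
A rough sketch of those checks (the log path is an assumption; on a FreeIPA server it is typically /var/log/dirsrv/slapd-<REALM>/access):

# Confirm internal operations are being logged
grep -c 'Internal' /var/log/dirsrv/slapd-EXAMPLE-TEST/access

# Look for operations flagged with notes= (e.g. notes=F) and list the largest etimes
grep 'notes=' /var/log/dirsrv/slapd-EXAMPLE-TEST/access | head
grep -o 'etime=[0-9.]*' /var/log/dirsrv/slapd-EXAMPLE-TEST/access | sort -t= -k2 -rn | head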

FYI: the stack traces are not much help: the server is mostly idle, with occasionally a thread trying to add the memberof attribute in the db with an internal modify operation. So mostly a negative result: nothing unexpected in the stack traces.

vudex commented 1 week ago

Thank you for your prompt response.

Regarding the missing index, could you provide more details? Since we run with the default settings, 389-ds should already have the essential database indexes in place, which usually do not require user intervention.
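
In case it is useful, a sketch of how to list which attributes are actually indexed (the backend name userRoot and the Directory Manager bind are assumptions; adjust to your setup):

# List configured indexes and their types for the userRoot backend
ldapsearch -x -D 'cn=Directory Manager' -W \
    -b 'cn=index,cn=userRoot,cn=ldbm database,cn=plugins,cn=config' \
    '(objectClass=nsIndex)' cn nsIndexType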

Here are the DN cache and entry cache sizes for an 8 GB RAM replica (screenshot attached). The configuration appears to have been set automatically, and it seems appropriate.

Regarding the memberOf plugin: while it may have some known performance pitfalls, I presume my current scenario is too small to experience a noticeable impact. Could you specify the group size or quantity you are referring to? Are you talking about the number of groups listed in the memberOf attribute of a user entry?

Is there a way to enhance the performance of the memberOf plugin? Perhaps by making it more asynchronous? Despite observing many idle threads in the LDAP server, the source of the performance impact is still unclear to me.
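
For reference, a sketch of how to dump the current MemberOf plugin configuration, to see its scope and the attributes it maintains (bind credentials are placeholders):

# Show the MemberOf plugin configuration entry
ldapsearch -x -D 'cn=Directory Manager' -W \
    -b 'cn=MemberOf Plugin,cn=plugins,cn=config' -s base '(objectClass=*)'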

I will attempt to log internal operations and provide you with the access log for further analysis.

tbordaz commented 3 days ago

Hi, something noticeable is that the pstacks always show Thread 27 processing a MOD. Were they captured during a single MOD? In that case, it took ~1 min to update 100 entries, which looks quite slow. I agree that enabling internal op logging (nsslapd-plugin-logging=on, nsslapd-accesslog-level=260) would confirm what memberof was doing and how fast. It may be spending a lot of time updating indexes: is 'memberof' indexed in substring?
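
A sketch of both suggestions, using the attribute names from the comment above (bind credentials and the userRoot backend name are placeholders):

# Enable internal operation and plugin logging
ldapmodify -x -D 'cn=Directory Manager' -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-plugin-logging
nsslapd-plugin-logging: on
-
replace: nsslapd-accesslog-level
nsslapd-accesslog-level: 260
EOF

# Check how memberOf is indexed (equality only vs. substring as well)
ldapsearch -x -D 'cn=Directory Manager' -W \
    -b 'cn=memberof,cn=index,cn=userRoot,cn=ldbm database,cn=plugins,cn=config' \
    -s base nsIndexType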

In the pstacks I did not see any threads other than the MOD. When you say the server was unresponsive, do you mean that during a MOD other SRCH operations are impacted (slowdown, hanging)? That matches a known issue with memberof, but it was only seen with a larger DB, not with 100 users.

vudex commented 1 day ago

@tbordaz @progier389 thank you for the hints. I enabled internal logging and this is what I found:

This time I did exactly the same operations as last time, but with a new group:

  1. Created a new user group another-test-plat-l2
  2. Added the user groups team-internal-l2 and team-external-l2 as members of this group

(screenshot attached)

These are all operations that took over 1 second:

[04/Jul/2024:02:18:34.178577356 +0300] conn=Internal(0) op=0(58)(1) RESULT err=0 tag=48 nentries=0 wtime=0.000026208 optime=23.334019584 etime=23.334042863
[04/Jul/2024:02:18:34.189065387 +0300] conn=65 op=3 RESULT err=0 tag=103 nentries=0 wtime=0.000054426 optime=17.115815016 etime=17.115866908 csn=6685dc750006000d0000
[04/Jul/2024:02:18:34.197934981 +0300] conn=66 (Internal) op=0(9)(1) RESULT err=0 tag=48 nentries=0 wtime=0.000002844 optime=12.040072384 etime=12.040072563 csn=6685dc750009000d0000
[04/Jul/2024:02:18:34.198039567 +0300] conn=62 op=3 RESULT err=0 tag=103 nentries=0 wtime=0.000085583 optime=42.039195112 etime=42.039279372 csn=6685dc4b0001000d0000
[04/Jul/2024:02:18:34.202183974 +0300] conn=67 op=3 RESULT err=0 tag=103 nentries=0 wtime=0.000080047 optime=7.117827099 etime=7.117904894 csn=6685dc75000a000d0000
[04/Jul/2024:02:18:35.825313963 +0300] conn=62 op=5 RESULT err=0 tag=103 nentries=0 wtime=0.001186630 optime=1.625993952 etime=1.627176466 csn=6685dc75000c000d0000

I looked them up further in the logs:

[04/Jul/2024:02:18:10.844558474 +0300] conn=Internal(0) op=0(58)(1) MOD dn="dc=pd15,dc=ipa,dc=gtp"
[04/Jul/2024:02:18:17.073252627 +0300] conn=65 op=3 MOD dn="uid=portal_monitoring,cn=users,cn=accounts,dc=pd15,dc=ipa,dc=gtp"
[04/Jul/2024:02:18:22.157861160 +0300] conn=66 (Internal) op=0(9)(1) MOD dn="uid=portal_monitoring,cn=users,cn=accounts,dc=pd15,dc=ipa,dc=gtp"
[04/Jul/2024:02:17:52.158850026 +0300] conn=62 op=3 MOD dn="cn=another-test-plat-l2,cn=groups,cn=accounts,dc=pd15,dc=ipa,dc=gtp"
[04/Jul/2024:02:18:27.084359473 +0300] conn=67 op=3 MOD dn="uid=portal_monitoring,cn=users,cn=accounts,dc=pd15,dc=ipa,dc=gtp"
[04/Jul/2024:02:18:34.199321012 +0300] conn=62 op=5 MOD dn="cn=another-test-plat-l2,cn=groups,cn=accounts,dc=pd15,dc=ipa,dc=gtp"

What really surprised me is the portal_monitoring account, which is used to monitor various external resources. That user actually has over 100 HBAC rules.

(screenshot of the HBAC rules attached)

But that user is not a member of team-internal-l2, team-external-l2, or the newly added another-test-plat-l2 group. Maybe I don't understand some of the mechanics, but why is this user even checked or modified?

The 4th row is the large one and contains a lot of modify and search operations. It is hard to sanitize such logs, so I removed the detailed information from the SRCH and MOD operations, just to give a hint of the number of operations occurring. access_conn_62.log

Overall, it seems like this is intended behavior.

Also, maybe I was misunderstood, but we don't have only 100 users in the directory; that number applies only to the groups team-internal-l2/team-external-l2. I thought only those users would be affected by the modifications. Overall the directory has ~4000 entries, ~2800 entries, 750 HBAC rules and 750 sudo rules.

tbordaz commented 5 hours ago

Hi @vudex ,

A few comments.

There are SRCH operations (tag=48) with a high etime; you may look at the filter, as there could be an unindexed component (which I doubt) or the filter could be expensive to evaluate. What type of entries matches the filter? (Groups?)

Internal operations are showing a deep nesting level (up to 6), which is quite surprising. For example, if a direct update on a group triggers the update of a member (memberof plugin), then nest = 1. If the update of the member triggers an update of a group (automember), then nest = 2. If that update of the group then triggers the update of a member, then nest = 3, and so on. This is a known behavior, but up to 6 seems quite high.

There are many direct operations (like the MOD of portal_monitoring); you may focus on one of the long operations and try to diagnose the reason for its etime. Possibly fixing this specific operation will help the others.
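
As a sketch of how to isolate one long operation for analysis (the log path is an assumption; conn=62 op=3 is taken from the excerpts above):

# Pull every access-log line belonging to one slow operation, e.g. conn=62 op=3
grep 'conn=62 op=3 ' /var/log/dirsrv/slapd-EXAMPLE-TEST/access

# And the internal (nested) operations it triggered
grep 'conn=62 (Internal)' /var/log/dirsrv/slapd-EXAMPLE-TEST/access | head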