389ds / 389-ds-base

The enterprise-class Open Source LDAP server for Linux
https://www.port389.org/
Other
210 stars 90 forks source link

Paged search impacts performance #4551

Closed vashirov closed 1 year ago

vashirov commented 3 years ago

Issue Description When a server is under search load, a paged search from a regular user impacts the performance. Same paged search requested by DM has much smaller impact.

Package Version and Platform:

Steps to Reproduce Steps to reproduce the behavior:

  1. Create a db with 40000 users
  2. Run search load using ldclt -D "uid=test,ou=people,dc=example,dc=com" -w test -e bindeach,esearch -b "ou=people,dc=example,dc=com" -f "uid=00001"
  3. While the search load is running, run paged search in a loop that requests all ids: while : ; do ldapsearch -D "uid=test,ou=people,dc=example,dc=com" -w test -b dc=example,dc=com 'uid=*' -E pr=100/noprompt; done

Actual results

ldclt[1353]: Average rate: 2415.70/thr  (2415.70/sec), total:  24157 -- only ldclt is running 
ldclt[1353]: Average rate: 2342.50/thr  (2342.50/sec), total:  23425
ldclt[1353]: Average rate: 1048.90/thr  (1048.90/sec), total:  10489 -\
ldclt[1353]: Average rate:  413.10/thr  ( 413.10/sec), total:   4131   | paged search from a regular user
ldclt[1353]: Average rate:  461.00/thr  ( 461.00/sec), total:   4610   |
ldclt[1353]: Average rate: 1759.30/thr  (1759.30/sec), total:  17593 -/
ldclt[1353]: Average rate: 2374.70/thr  (2374.70/sec), total:  23747
ldclt[1353]: Average rate: 1952.70/thr  (1952.70/sec), total:  19527 -\
ldclt[1353]: Average rate: 1783.00/thr  (1783.00/sec), total:  17830   | paged search from DM
ldclt[1353]: Average rate: 1749.70/thr  (1749.70/sec), total:  17497 -/
ldclt[1353]: Average rate: 2319.80/thr  (2319.80/sec), total:  23198
ldclt[1353]: Average rate: 2378.80/thr  (2378.80/sec), total:  23788

Expected results There shouldn't be a significant drop in performance.

Additional information @wisebaldone who reported the issue, also mentioned that 1.2.11.15-48 doesn't have the issue and 1.2.11.15-97 does have.

Firstyear commented 3 years ago

Are you able to take flamegraphs from an ldclt vs ldclt with regular user paged searches occuring at the same time? That could help us identify which code paths are under more stress in this situation.

I'm immediately suspicious of access controls since that's the major difference between cn=dm and a regular user.

vashirov commented 3 years ago

Here are the flamegraphs.

Firstyear commented 3 years ago

Thanks @vashirov. Looking at it, there is a bit more work in acl (1.1% vs 0.3%), but the big shift appears to be in slapi_attr_type_comp(2.9 vs 1.2), slapi_sdn_issuffix(4.2 vs 1.4), slapi_utf8_casecmp(2.5 vs 0.7), value_cmp (3.0 vs 1.0) and value_normalize_ext(3.6 vs 0.9). There are a few others like this too. It looks like the traces are not correctly stacked up but I wonder if that's to do with our plugin system causing perf to not see the correct locations.

Saying this two of the mentioned functions are in acl:

ds/ldap/servers/plugins/acl # grep -i -r -n -e 'slapi_attr_type' .
./aclparse.c:2381:        if (slapi_attr_type_cmp(filter_type, t, SLAPI_TYPE_CMP_BASE) == 0) {
./acl.c:1331:        if (slapi_attr_type_cmp(aclpb->aclpb_Evalattr, attr, SLAPI_TYPE_CMP_SUBTYPE) == 0) {
./acl.c:1351:        if (slapi_attr_type_cmp(aclpb->aclpb_Evalattr, attr, SLAPI_TYPE_CMP_SUBTYPE) == 0) {
./acl.c:1775:            if (slapi_attr_type_cmp(mods[j]->mod_type, aci_attr_type,
./acl.c:2619:                slapi_attr_type_cmp(aclpb->aclpb_curr_attrEval->attrEval_name,
./acl.c:2751:                    if (slapi_attr_type_cmp(attr->u.attr_str, res_attr, SLAPI_TYPE_CMP_SUBTYPE) == 0) {
./acl.c:3625:        if (slapi_attr_type_cmp(a_eval->attrEval_name, attr, SLAPI_TYPE_CMP_SUBTYPE) == 0) {
./acl.c:3816:            if (slapi_attr_type_cmp(src->acle_attrEval[i].attrEval_name,
./acl.c:4035:            slapi_attr_type_cmp(c_attrEval->attrEval_name,
./aclanom.c:490:                slapi_attr_type_cmp(attr, destArray[j], 1) == 0) {
ds/ldap/servers/plugins/acl # grep -i -r -n -e 'slapi_sdn_issuff' .
./aclparse.c:134:            if (!slapi_sdn_issuffix(&targdn, aci_item->aci_sdn)) {
./acl.c:2138:    if (!slapi_sdn_issuffix(aclpb->aclpb_curr_entry_sdn, aci->aci_sdn) || (!slapi_is_rootdse(res_ndn) && slapi_is_rootdse(aci_ndn))) {
./aclanom.c:386:        if (!slapi_sdn_issuffix(e_sdn, acl_anom_profile->anom_targetinfo[i].anom_target) || (!slapi_is_rootdse(ndn) && slapi_is_rootdse(aci_ndn)))

And I would assume the remaining are nested in calls from acl.

I think the next step would be some systemtap probes around acl to see if that really is the source of the timing differential? It could be especially true if acl is recomputing the access each iteration of the paged search handler.

progier389 commented 1 year ago

I think I have finally found the root cause:
In fact it is not acl related but because the conn->c_mutex is held while sending page search result And it generate contention with handle_pr_read_ready that delay the listening thread A solution is to use a table of locks hashed by conn address for page search handing instead of the c_mutex.

progier389 commented 1 year ago

4c6b2ecae..4c6b2ecae main -> main
cf4fe8e49..7ce99c28a 389-ds-base-2.3 -> 389-ds-base-2.3 17a6cc86c..c11e86f6b 389-ds-base-2.2 -> 389-ds-base-2.2 1415bc017..2983d762c 389-ds-base-2.1 -> 389-ds-base-2.1 eb342461c..35ef59e49 389-ds-base-2.0 -> 389-ds-base-2.0 578903bab..e8f0b0e18 389-ds-base-1.4.4 -> 389-ds-base-1.4.4 175a031fe..a6f0981de 389-ds-base-1.4.3 -> 389-ds-base-1.4.3

progier389 commented 10 months ago

Beware: There is a known regression for that one. See https://github.com/389ds/389-ds-base/issues/5984

tbordaz commented 9 months ago

28d22a143..79be7bc99 389-ds-base-1.3.11