389ds / 389-ds-base

The enterprise-class Open Source LDAP server for Linux
https://www.port389.org/

Search request executes slowly, if filter contains many values (more than 100 in 'or' list). #6275

Open OGSmv opened 1 month ago

OGSmv commented 1 month ago

Issue Description Filter evaluation ('or' matching) is not optimal. This causes slow execution of search requests with large filters: one search request with 1000 uids in the filter is slower than two requests with 500 uids each.

Package Version and Platform:

Steps to Reproduce

  1. Start 389ds with a simple structure
  2. Create a test organizational unit
  3. Create many user accounts (for example 100000 accounts) in the test ou.
  4. Execute a search request for 1000 accounts with a filter on uids.
  5. Split the request into two requests of 500 accounts each, execute the search twice, and concatenate the results. This is faster than the single search for 1000 accounts.

Expected results A single search should not be slower than several smaller ones, yet it is. The screenshots show the result of reading 1000 entries with several strategies (time spent in processing).

Screenshots (two images attached to the issue showing the timing results)

Additional context A Python script to reproduce the bug is attached. Command: python3 read_test.py --host <ip> --port <port> --user "cn=Directory Manager" --pass <password> --root <root_dn>
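For context, the shape of such a test can be sketched as follows (hypothetical helper names, not the actual read_test.py): build one large OR filter over all uids, or split the uid list into chunks and build one smaller filter per chunk.

```python
def build_or_filter(uids):
    """Build an LDAP OR filter string such as (|(uid=u1)(uid=u2)...)."""
    return "(|" + "".join(f"(uid={u})" for u in uids) + ")"

def chunked(seq, size):
    """Split a uid list into chunks so each search uses a smaller filter."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

uids = [f"user{i}" for i in range(1000)]
one_big_filter = build_or_filter(uids)                            # one search, 1000 terms
split_filters = [build_or_filter(c) for c in chunked(uids, 500)]  # two searches, 500 terms each
```

Each filter string would then be sent as a separate search request, and the issue is that the single large filter takes longer than the sum of the split ones.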

progier389 commented 1 month ago

Hi, interesting study. I am not surprised: IMHO the test is showing the impact of the candidate-list merge in the 'or' operation (the one that ensures the search operation returns each entry only once). As the worst case here is merging a single candidate 1000 times, the algorithm cannot be very efficient. I think the complexity is something like O(n * n), since we store candidates in an array.

I do not think we will try to improve that (in any case, we do not want to encourage developers to use complex filters). At least this clearly shows that when writing an LDAP application, you had better use a reasonable number of sub-filters in an OR filter.

OGSmv commented 1 month ago

Hi @progier389,

Do you mean it is normal to use many simple requests to obtain 1000+ unique entries rather than a single request?

Firstyear commented 1 month ago

@progier389 See:

https://github.com/389ds/389-ds-base/blob/main/ldap/servers/slapd/back-ldbm/back-ldbm.h#L270 https://github.com/389ds/389-ds-base/blob/main/ldap/servers/slapd/back-ldbm/filterindex.c#L960 https://github.com/389ds/389-ds-base/blob/main/ldap/servers/slapd/back-ldbm/idl_set.c#L249

I think that if we have a ton of highly fragmented small idls in the OR condition, rather than a few large ones, we end up with this slowdown. The problem is that we have to iterate over every idl to find the next minimum value, and as idls are exhausted we prune them out of the set. So with tons of small idls, each holding a unique value, we are likely doing n*n work, as you say. With that in mind, it is still better than the "older approach" as I recall (where we continually unioned each idl into an in-progress result, one at a time).
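The union described can be modelled like this (a Python sketch of the general idea behind the idl_set union linked above, not the actual C implementation): scan all list heads for the current minimum, emit it once, advance every list whose head equals it, and prune lists as they are exhausted.

```python
def idl_set_union(idls):
    """K-way union of sorted ID lists: repeatedly take the minimum head,
    emit it once, advance the matching lists, and drop empty ones."""
    heads = [(lst, 0) for lst in idls if lst]    # (list, cursor) pairs
    out = []
    while heads:
        m = min(lst[pos] for lst, pos in heads)  # O(k) scan per emitted ID
        out.append(m)
        remaining = []
        for lst, pos in heads:
            if lst[pos] == m:
                pos += 1                         # advance past the emitted ID
            if pos < len(lst):
                remaining.append((lst, pos))     # prune exhausted lists
        heads = remaining
    return out
```

With k lists of one element each, every emitted ID costs a scan over all surviving lists, which is where the roughly n*n behaviour comes from.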

The OR condition here was optimised to assist SSSD, which issues stupid queries that create very few but very large OR conditions instead.

After a bit of thought I think there is a pretty simple optimisation here:

That way we still get the faster union behaviour over multiple OR conditions, we minimise re-allocs, and we maintain the IDList's need for sorted results.
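One standard way to get those three properties (a fast multi-way union, few re-allocs, sorted output) is a min-heap over the idl heads, which makes finding the next minimum O(log k) instead of an O(k) scan. This is an illustration of the general technique, not necessarily the author's actual proposal:

```python
import heapq

def idl_set_union_heap(idls):
    """Union of sorted ID lists via a min-heap of (head, list_index, cursor),
    so the next minimum is found in O(log k) rather than an O(k) scan."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(idls) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, pos = heapq.heappop(heap)
        if not out or out[-1] != val:            # skip duplicates across lists
            out.append(val)
        if pos + 1 < len(idls[i]):
            heapq.heappush(heap, (idls[i][pos + 1], i, pos + 1))
    return out
```

Appending to a single output list (rather than inserting mid-array) also keeps re-allocs to amortised O(1), and the output stays sorted as the IDList code requires.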