Adldap2 / Adldap2

A PHP LDAP Package for humans.
MIT License
710 stars 160 forks

Unable to retrieve paginated users #676

Open Albvadi opened 5 years ago

Albvadi commented 5 years ago

Description:

Trying to retrieve paginated users results in an empty recordset.

Steps To Reproduce:

If I test the query with limit, the user recordset is returned correctly:

$paginator = $ad->search()->users()->select(['uid', 'mail'])->limit(5)->get();


But if I use paginate, I get two different errors, depending on the system:

$paginator = $ad->search()->users()->select(['uid', 'mail'])->paginate(5);

Sun One Directory Server: empty paginator

Active Directory: memory allocation exhausted

Any idea what I can test?

Thanks!

stevebauman commented 5 years ago

Hi @Albvadi,

Strange, I haven't encountered this issue with ActiveDirectory - only with paginating extremely large sets (exceeding 1000 records).

I'm not sure why a ForeignSecurityPrincipal model is being created when you've scoped your query using the search()->users() method. This scope should limit the results to only include User models.

Can you post the configuration array that you're using for your Active Directory server connection? I don't have a Sun Directory server to test with - so let's debug the AD connection first before getting into Sun Directory.

stevebauman commented 5 years ago

Also, if you could post your full stack trace of the exception - that would be great too (with any sensitive details omitted).

Albvadi commented 5 years ago

Ok, I think my problem is that my Active Directory has more than 20k users... and I think that it is dying while trying to determine the total number of users in order to calculate the number of paginator pages... :disappointed:

Any idea how to get all the users in blocks without having to obtain the total number of users and pages first? :confused:

Albvadi commented 5 years ago

On my "Nth" try... I obtained the paginator correctly!


There are a lot of users... :see_no_evil: I have increased the memory limit in php.ini from 128M to 512M and it seems that it no longer fails... If you don't have any other ideas, you can close this issue without a problem.

jspringe commented 5 years ago

Pagination should not hydrate all of the records. Is there a way to deterministically paginate?

If not, then caching might be a better option, but there's some complexity that comes with that.

Edit: I sense a use for factories for seeding/testing.

stevebauman commented 5 years ago

No I haven’t implemented a way to do that yet besides scoping your query with more filters unfortunately :(

There isn’t really a way to prevent exhaustion of memory with large query results unless you specify raw returns (using raw() on the query builder) or by caching like you mentioned.

Any thoughts or suggestions?

jspringe commented 5 years ago

Does LDAP have a limit clause similar to MySQL?

If not, then I believe caching might be the only answer, and that could be a real pain. You need to be able to reference the cache and appropriately expire it per user configuration, but I think 300 seconds (5 minutes) is a sensible default.

The idea being that on the first call you create a cache, and on each subsequent reference to the pagination you refer to the cache to prevent the extra calls to LDAP for every user again. You also have more control over the cache, so you could implement a limit clause by partitioning it. I also work with SQL Server, so I'd probably implement it similarly to ROW_NUMBER() OVER.

Edit: If I were to implement a cache I would start with a file system cache driver but use FlySystem as an abstraction.

stevebauman commented 5 years ago

Does LDAP have a limit clause similar to MySQL?

It does. There is a $sizelimit parameter in ldap_search(), ldap_list() & ldap_read() mentioned here:

This is utilized here in the query builder:

https://github.com/Adldap2/Adldap2/blob/63c024ba4f7eef09cd56c57be606b8352986f806/src/Query/Builder.php#L347

The idea being that on the first call you create a cache, and on each subsequent reference to the pagination you refer to the cache to prevent the extra calls to LDAP for every user again.

Yeah this might be the only option.

@Albvadi There is the option of going through each letter of the alphabet and querying for users that start with that letter. This type of chunking would limit memory usage, as you loop through each result set and then overwrite your variables on each chunk.

For example:

$letters = range('A', 'Z');

foreach($letters as $letter) {
    $users = $ad->search()->users()->select(['uid', 'mail'])->whereStartsWith('cn', $letter)->get();

    foreach ($users as $user) {
        // Do something with each user.
    }
}

Though this will of course only return users that start with a letter - so users (such as service accounts you may have) that start with numbers will not be returned.

jspringe commented 5 years ago

Would this function help: ldap_control_paged_result? Haven't had a chance to really dig into it, but it might be something to consider.

Looks like it's only supported in LDAP protocol version 3, but that could be a simple check and exception, or a fall back to the cache.

Sorry, you are using that... I should've looked at the code first... long day that won't quit.

EDIT: What a PITA that cookie is. I can't think of any way of utilizing that function properly without setting a session and storing the cookie.
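For reference, the cookie handoff the paged-results control requires can be sketched with a small simulation (everything below is hypothetical stand-in code, not real ldap_* calls): a fake "server" returns one page of results plus an opaque cookie, and the client must feed the previous cookie back to get the next page - which is exactly why random page access is awkward.

```php
<?php

// Hypothetical in-memory "server" simulating RFC 2696 paged results:
// each request returns one page plus an opaque cookie; an empty cookie
// means there are no more pages. Real code would use the LDAP paged
// results control instead of this function.
function fakePagedSearch(array $allEntries, int $pageSize, string $cookie): array
{
    $offset = $cookie === '' ? 0 : (int) $cookie;
    $page = array_slice($allEntries, $offset, $pageSize);
    $next = $offset + $pageSize;

    // The cookie is opaque to the client; here it simply encodes the offset.
    $nextCookie = $next < count($allEntries) ? (string) $next : '';

    return [$page, $nextCookie];
}

// Client loop: pages must be fetched sequentially, carrying the cookie
// forward on every request.
$entries = range(1, 12);
$cookie = '';
$pages = [];

do {
    [$page, $cookie] = fakePagedSearch($entries, 5, $cookie);
    $pages[] = $page;
} while ($cookie !== '');
```

Since the cookie only moves forward, jumping straight to page 5 means replaying pages 1 through 4 first - hence the idea of persisting the last cookie between requests.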

jspringe commented 5 years ago

Okay, so I'm gonna "think out loud" here. Instead of caching all of the results we cache the cookie and previous results.

Here is a directory structure I was thinking about:

/paginations
    /$connection_name
        /$actual_query_coerced_into_something_file_systems_wont_complain_about
            $page_number.php

The file will contain:

return [
    'expiry' => $defined_expiration_datetime, // this is also controlled by the server as cookies do expire
    'previous_results_file' => $previous_results_file_or_null,
    'current_results' => $current_results,
    'cookie' => $cookie,
];

Page 1 would just return $current_results and should have 'previous_results_file' set to null. When page 2 is requested, it will use the $current_results of page 1 (read back via file_get_contents($previous_results_file_or_null)) as the results for ldap_control_paged_result_response(), along with the $cookie.

If pages get skipped (jump from page 1 to 5) then we drop back to the last known page and build it forward.

jspringe commented 5 years ago

Reading through the RFC, this would have to be done a little differently. The paging has to be done sequentially, which was already the plan, but I think I might use 2 files instead.

last_cookie.php:

return [
    'expiry' => $expiration_date,
    'cookie' => $cookie,
    'last_results_file' => $last_results_file,
    'page' => $page_number,
];

page#.php

return [
    'expiry' => $expiration_date,
    'results' => $results,
];

First try to find the page in cache if it doesn't exist then look at the last cookie and build the results forward to that page.

Some thought needs to go into how to deal with expiration. If you request page 5 of 100 pages of results and page 5 has expired, but the last_cookie hasn't and is ahead... you have to rebuild from the start, because cookies can't be reused - or rather, the reuse of cookies is "undefined": maybe supported, but not defined by the RFC.

stevebauman commented 5 years ago

EDIT: What a PITA that cookie is. I can't think of anyway of utilizing that function properly without setting a session and storing the cookie.

Yeah definitely, that's why I've been particularly stumped on how I could cache the paginated results.

I have found a similar package that has implemented caching functionality on LDAP queries here:

https://github.com/ldaptools/ldaptools/commit/283321299c4a45c95ee443883dca991e569e8cfb

He then uses a CacheInterface for each driver and typically stores the cached results in PHP's system temp folder:

https://github.com/ldaptools/ldaptools/blob/master/src/LdapTools/Cache/DoctrineCache.php

Though that implementation is overly complex in my opinion - I'd rather something much simpler.

jspringe commented 5 years ago

I'd probably consider building off of a known Cache library - I'd want to look around first. Also, I think there should be an abstraction on how it's stored. We may only create a FileSystemCache implementation, but it should be open to allow for things like Redis/DB/Memcached/etc.

We could just roll our own Cache to keep it simple, but still worth looking at the work that has already been done.

PHP's system temp could be a good fallback directory, but it should be configurable per provider. Also, I have it in my head that it gets cleared per lifecycle, but maybe that's just temp files.

stevebauman commented 5 years ago

I'd probably consider building off of a known Cache library - I'd want to look around first. Also, I think there should be an abstraction on how it's stored. We may only create a FileSystemCache implementation, but it should be open to allow for things like Redis/DB/Memcached/etc.

Yeah absolutely, a built-in cache driver using PHP's temp directory would be great - then you could utilize your own driver with a standard cache interface.

What about the PSR-6 Cache interface?

https://github.com/php-fig/cache

jspringe commented 5 years ago

Works for me. I'm just gonna place these 2 links here for reference later:

https://docs.microsoft.com/en-us/windows-server/identity/ad-ds/manage/how-ldap-server-cookies-are-handled

https://tools.ietf.org/html/rfc2696

It's possible my idea is moot and won't work, but it's worth a try.

Albvadi commented 5 years ago

@Albvadi There is the option of going through each letter of the alphabet and querying for users that start with that letter. This type of chunking would limit memory usage, as you loop through each result set and then overwrite your variables on each chunk.

For example:

$letters = range('A', 'Z');

foreach($letters as $letter) {
    $users = $ad->search()->users()->select(['uid', 'mail'])->whereStartsWith('cn', $letter)->get();

    foreach ($users as $user) {
        // Do something with each user.
    }
}

Though this will of course only return users that start with a letter - so users (such as service accounts you may have) that start with numbers will not be returned.

I had already solved the task I had to do in another way, but I had not thought of doing it like this. I'll take note for another occasion. Thanks!!

As for your discussion on how to solve the root problem, I'm a bit lost. I don't understand how having a cache would solve the problem. To have that cache, you first have to fetch the users and save them in it... and that is precisely the complicated part... no?

stevebauman commented 5 years ago

Ok sounds good! Any thoughts on the API? I think it would be nice to add a cache() method to the Builder class.

The cache() method will accept a DateTimeInterface param that sets the expiry time of the cache. Explicitly setting the param to null will set the cache to live forever.

If the results are in the cache, they will be retrieved. If the retrieved cached results have expired, the query is executed against the LDAP server, those results are cached and returned, and a new expiry time is set.

Calling the cache() method will also validate that a cache driver is present and immediately throw an exception when one isn't available.

Here's what I was thinking:

$provider = new Provider();

$until = new \DateTime('+ 1 day');

$users = $provider->search()->users()->cache($until)->get();

We could also add a secondary param to forcibly flush the cache that matches the current query:

$provider = new Provider();

$until = new \DateTime('+1 day');

$flush = true;

$users = $provider->search()->users()->cache($until, $flush = true)->get();

We might also be able to offer a getCache() method on the Builder so devs can perform caching themselves with the cache object. The Cache::has() method could accept a Builder instance from which it would retrieve a unique cache key (maybe from Builder::getCacheKey()?) by md5ing the search params, options, and filters:

$provider = new Provider();

$query = $provider->search()->newQuery();

$cache = $query->getCache();

$query->search()->users()->whereStartsWith('cn', 'John');

if ($cache->has($query) && ! $cache->expired($query)) {
    return $cache->get($query);
} else {
    // Maybe a 'remember' method that can execute the given query and cache the results?
    return $cache->remember($query, $until = new \DateTime('+1 day'));
}
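As a sketch of how a Builder::getCacheKey() built by md5ing the query parts might behave (the helper below is hypothetical - which parts are hashed and how they're normalized are assumptions, not Adldap2's actual implementation):

```php
<?php

// Hypothetical getCacheKey()-style helper: hash the parts that make a
// query unique (base DN, filter, selected attributes) so two identical
// queries resolve to the same cache entry.
function getCacheKey(string $baseDn, string $filter, array $selects): string
{
    // Sorting means the order of selected attributes doesn't change the key.
    sort($selects);

    return md5(implode('|', [$baseDn, $filter, implode(',', $selects)]));
}

$a = getCacheKey('dc=acme,dc=org', '(objectclass=user)', ['uid', 'mail']);
$b = getCacheKey('dc=acme,dc=org', '(objectclass=user)', ['mail', 'uid']);
$c = getCacheKey('dc=acme,dc=org', '(objectclass=group)', ['uid', 'mail']);

// $a === $b (same query, reordered selects), while $c differs.
```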

Anything I'm missing? I think this type of caching interaction would be great and simple.

EDIT: We would likely have to build something specific internally for paginated queries - but I think if we got the basics first, we could build off of it and work pagination in.

stevebauman commented 5 years ago

@Albvadi

As for your discussion on how to solve the root problem, I'm a bit lost. I don't understand how having a cache would solve the problem. To have that cache, you first have to fetch the users and save them in it... and that is precisely the complicated part... no?

Having a cache means we could store each paginated result into a different system (such as the file system) instead of storing the results in memory. This would avoid running into PHP's memory limit issues as you're experiencing.

Right now, the paginate() method will continuously execute queries and store all of the results until your LDAP server responds and says there are no more records to return. Since we store all of the paged results of a paginated query, this will cause a memory limit exception - but if we cache each paged result as @jspringe is saying, we can completely bypass the memory limit.
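A minimal sketch of that idea (the $fetchPage callback below is a hypothetical stand-in for the real LDAP round trips, not an Adldap2 API): each paged result is written to its own cache file and released from memory before the next page is fetched, so memory use stays bounded by a single page.

```php
<?php

// Sketch: spill each paged result to its own file instead of
// accumulating everything in memory. $fetchPage returns one page of
// results, or null when the server has no more records.
function cachePages(callable $fetchPage, string $dir): int
{
    $page = 0;

    while (($results = $fetchPage($page)) !== null) {
        file_put_contents("$dir/page_$page.php", serialize($results));
        unset($results); // only one page lives in memory at a time
        $page++;
    }

    return $page; // number of pages written
}

$dir = sys_get_temp_dir() . '/adldap_pages_' . uniqid();
if (!is_dir($dir)) {
    mkdir($dir);
}

// Fake three pages of "users" standing in for LDAP responses.
$fake = [['alice', 'bob'], ['carol'], ['dave', 'erin']];
$written = cachePages(fn ($n) => $fake[$n] ?? null, $dir);
```

Individual pages can then be unserialized back from disk on demand, rather than all 30k records sitting in one PHP array.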

Albvadi commented 5 years ago

Right now, the paginate() method will continuously execute queries and store all of the results until your LDAP server responds and says there are no more records to return.

This is the crucial information for understanding how a cache system can solve the problem. I thought that obtaining the users happened in a single query instead of many queries in a row... :man_facepalming:

Thanks so much for the explanation @stevebauman!

jspringe commented 5 years ago

This is the crucial information for understanding how a cache system can solve the problem. I thought that obtaining the users happened in a single query instead of many queries in a row...

Yeah, unfortunately the "paging" capabilities aren't great for this use-case. Fine for something like a Windows application where you are storing the cookie in memory and just clicking from page to page.

@stevebauman did his best to implement pagination with the tools that are available, but that unfortunately meant still pulling all records which is what sort of keyed me into this issue you opened.

To make matters worse, LDAP itself doesn't support any sort of incrementing key, so you can't even work around it that way.

Ok sounds good! Any thoughts on the API? I think it would be nice to add a cache() method to the Builder class.

I haven't gone that deep into it yet. One thing I would like to do first is create some sort of factory to create a ton of responses for testing. Something akin to Laravel's Model Factories, but maybe at the LDAP query response level because that's what I'd prefer to cache.

stevebauman commented 5 years ago

I haven't gone that deep into it yet. One thing I would like to do first is create some sort of factory to create a ton of responses for testing.

Good idea! Maybe we can leverage Faker for this when running tests and have our own LdapResponseFactory?

but maybe at the LDAP query response level because that's what I'd prefer to cache.

Yea you're right, caching the response itself would definitely be ideal here.

jspringe commented 5 years ago

Good idea! Maybe we can leverage Faker for this when running tests and have our own LdapResponseFactory?

Absolutely, and also would like to put together a suite of testing tools to be used when testing your own application. Like Guzzle does, but I can open a new issue for that when I'm ready to put a couple together.

stevebauman commented 5 years ago

Okay, I took my first stab at query caching, take a look: https://github.com/Adldap2/Adldap2/commit/096e4ce96f7f760324c26ee8ea601c5d9dc04281

Thoughts?

stevebauman commented 5 years ago

Usage (using Laravel for brevity and for its Cache repository):

Caching a regular query request:

$cache = app(\Illuminate\Cache\Repository::class);

$provider = Adldap::getDefaultProvider();

$provider->setCache($cache);

$users = $provider->users()->cache(new \DateTime('+1 day'))->get();

Caching a pagination request:

$cache = app(\Illuminate\Cache\Repository::class);

$provider = Adldap::getDefaultProvider();

$provider->setCache($cache);

$users = $provider->users()->cache(new \DateTime('+1 day'))->paginate();

I wasn't able to figure out how I could easily cache each paged result as its own response and retrieve each page from the cache upon request. Right now during a paginate request, all pages are cached in a single item.

EDIT: I also ended up using PSR-16 instead of PSR-6. PSR-6 is overly complex for the caching we need to perform here (in my opinion), and Laravel is compatible with it out of the box, along with Symfony and other frameworks.

Caching sped up large queries immensely (from 1000ms down to 15ms in some cases) - I'll post some benchmarks shortly.

Albvadi commented 5 years ago

Wow!! Super fast first implementation!! I'm not sure if it'll be tomorrow, but I'll try to test it as soon as possible with my production AD and its 30k users...

Thanks so much for the superb work!!

stevebauman commented 5 years ago

Thanks @Albvadi, much appreciated! 😄

I'm curious how cached queries speed up your environment - my directory has just over 1000 users, so I'm not able to test massive result sets like that!

Albvadi commented 5 years ago

This morning I was able to take a few minutes to do some quick tests, but there is something I don't understand (or I thought it worked differently)...

Adldap2 logging

[2019-05-31 11:43:16] local.INFO: LDAP (ldap://company.com:389) - Connection: default - Operation: Binding - Username: administrator@company.com 
[2019-05-31 11:43:16] local.INFO: LDAP (ldap://company.com:389) - Connection: default - Operation: Bound - Username: administrator@company.com  

Adldap2 logging

[2019-05-31 11:51:43] local.INFO: LDAP (ldap://company.com:389) - Connection: default - Operation: Binding - Username: administrator@company.com 
[2019-05-31 11:51:43] local.INFO: LDAP (ldap://company.com:389) - Connection: default - Operation: Bound - Username: administrator@company.com  
[2019-05-31 11:51:43] local.INFO: LDAP (ldap://company.com:389) - Connection: default - Operation: Paginate - Base DN: DC=company,DC=com - Filter: (&(objectclass=user)(objectcategory=person)(!(objectclass=contact))) - Selected: (samaccountname,email,cn,objectcategory,objectclass) - Time Elapsed: 4907.28

Now, the part that I don't understand. Starting from the third test: if I load the page, it takes, for example, 10 minutes to retrieve all the users... The next time I refresh the page with the same query, should it be immediate? If I jump from page 1 to page 2, should it be immediate too (just what it takes to load 50 users, which is nothing), since it's loading data that it already has in the cache from the same query?

If the answer is affirmative, then it isn't working, because every time I refreshed the same page, it ran the complete query against the Active Directory again, taking the same 10 minutes as the first time...

P.S.: Sorry for my poor English... it's a mix of my own English and Google Translate's English... :sweat_smile:

stevebauman commented 5 years ago

Hi @Albvadi, which cache driver are you using in Laravel (file, array, database etc.)?

Also, the first LDAP query you execute will cache the returned results - the next time the same query is executed it will use the cached results, so you will only notice a speed increase on the second time it is executed, not the first.

The LDAP query also needs to be identical on the second request, otherwise a new cache will be created.
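To illustrate that hit/miss behaviour (a sketch using a plain array in place of the real PSR-16 cache; the md5-of-filter key is an illustration, not Adldap2's actual scheme): repeating the identical query is a cache hit, while any change to the filter produces a new key and a miss.

```php
<?php

// Minimal first-hit/second-hit sketch with an array standing in for a
// PSR-16 cache store.
$cache = [];
$queriesExecuted = 0;

$run = function (string $filter) use (&$cache, &$queriesExecuted): array {
    $key = md5($filter); // identical filter -> identical key

    if (!isset($cache[$key])) {
        $queriesExecuted++;                  // simulated LDAP round trip
        $cache[$key] = ["results for $filter"];
    }

    return $cache[$key];                     // served from cache thereafter
};

$run('(cn=John*)');  // miss: executes the "query"
$run('(cn=John*)');  // hit: no new query
$run('(cn=Jane*)');  // different filter -> new key -> miss
```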

This cache update won't fix memory issues or speed up queries that return tens of thousands of users - there is no way around this unless you limit the scope of your query.

I think there may be a misunderstanding of how pagination is working in Adldap2. When you call paginate() on a query, it will retrieve all results from your LDAP query - for example, calling this will return all 30,000 users from your LDAP server in chunks of 50:

// Returns all users from your directory in chunks of 50:
$users = Adldap::search()->users()->paginate(50);

This operates completely differently than pagination in Laravel with a SQL database. Calling paginate(50) with a Laravel model will only retrieve 50 results from your SQL database. For example:

// Returns only 50 users from your database:
$users = \App\User::paginate(50);

// Returns **all** users from your LDAP directory in **chunks** of 50:
$ldapUsers = Adldap::search()->users()->paginate(50);

I'm still trying to figure out a way for Adldap2 pagination to operate similarly to Laravel pagination. Updates will be coming!

jspringe commented 5 years ago

@stevebauman - Awesome work! Unfortunately, I've been knee deep in a work project that has me doing JS/Vue/Popper/etc. so I haven't had time to circle back to this and get my brain back to PHP. Hopefully, I'll have time this weekend to grab your implementation and see how we can use it for pagination.

Albvadi commented 5 years ago

I'm using file cache driver.

So, starting from the first example: on the first load it makes a query against the Active Directory, and on the second load it doesn't make that connection and takes the results directly from the cache?

ApoNie commented 5 years ago

$letters = range('A', 'Z');

foreach ($letters as $letter) {
    $users = $ad->search()->users()->select(['uid', 'mail'])->whereStartsWith('cn', $letter)->get();

    foreach ($users as $user) {
        // Do something with each user.
    }
}

Better to add paginate() if the data set is large:


$letters = range('A', 'Z');

foreach($letters as $letter) {
    $users = $ad->search()->users()->select(['uid', 'mail'])->whereStartsWith('cn', $letter)->paginate(1000)->getResults();

    foreach ($users as $user) {
        // Do something with each user.
    }
}