Albvadi opened this issue 5 years ago
Hi @Albvadi,
Strange, I haven't encountered this issue with ActiveDirectory - only when paginating extremely large sets (exceeding 1000 records).
I'm not sure why a ForeignSecurityPrincipal model is being created when you've scoped your query using the search()->users() method. This scope should limit the results to only include User models.
Can you post the configuration array that you're using for your ActiveDirectory server connection? I don't have a Sun Directory server to test with - so let's debug the AD connection first before getting into Sun Directory.
Also, if you could post your full stack trace of the exception - that would be great too (with any sensitive details omitted).
Ok, I think my problem is that my Active Directory has more than 20k users... and I think it is dying trying to find out the total number of users in order to compute the number of paginator pages... :disappointed:
Any idea to get all the users in blocks without having to obtain the total number of users and pages before? :confused:
On my "Nth" try... I obtained the paginator correctly...
There are a lot of users... :see_no_evil:
I have increased the memory in php.ini from 128M to 512M and it seems that it no longer fails... If you don't have any other idea, you can close this issue without a problem...
Pagination should not hydrate all of the records. Is there a way to deterministically paginate?
If not, then caching might be a better option, but there's some complexity that comes with that.
Edit: I sense a use for factories for seeding/testing.
No I haven’t implemented a way to do that yet besides scoping your query with more filters unfortunately :(
There isn't really a way to prevent exhaustion of memory with large query results unless you specify raw returns (using raw() on the query builder) or by caching like you mentioned.
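For reference, a raw query might look something like the sketch below. This assumes $provider is an already-connected Adldap2 provider and that raw() on the query builder switches the results to plain arrays, as mentioned above:

```php
<?php

// Sketch of the raw() approach mentioned above; $provider is assumed
// to be an already-connected Adldap2 provider. raw() skips model
// hydration, so large result sets use far less memory.
function dumpRawUsers($provider): void
{
    // get() now returns plain ldap_get_entries()-style arrays
    // instead of hydrated User models:
    $entries = $provider->search()->users()->raw()->get();

    foreach ($entries as $entry) {
        // Work with the raw attribute arrays directly.
        echo ($entry['cn'][0] ?? '(no cn)') . PHP_EOL;
    }
}
```

The trade-off is that you lose the model helpers (accessors, relations, etc.) and have to handle the raw LDAP attribute arrays yourself.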
Any thoughts or suggestions?
Does LDAP have a limit clause similar to MySQL?
If not, then I believe caching might be the only answer, and that could be a real pain. You need to be able to reference the cache and appropriately expire it per user configuration, but I think 300 seconds (5 minutes) is a sensible default.
Idea being that on the first call you create a cache, and on each subsequent reference to the pagination you refer to the cache to prevent the extra calls to LDAP for every user again. You also have more control over the cache, so you could implement a limit clause by partitioning the cache. I also work with SQL Server, so I'd probably implement it similar to row_number() over.
Edit: If I were to implement a cache I would start with a file system cache driver but use FlySystem as an abstraction.
Does LDAP have a limit clause similar to MySQL?
It does. There is a $sizelimit parameter in ldap_search(), ldap_list() & ldap_read() mentioned here:
This is utilized here in the query builder:
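As a quick illustration of the $sizelimit parameter on ldap_search() (a sketch only - host, credentials, and DNs are placeholders):

```php
<?php

// Sketch of the $sizelimit parameter on PHP's ldap_search().
// Host, credentials, and DNs below are placeholders.
function searchWithSizeLimit(int $limit): array
{
    $link = ldap_connect('ldap://dc.example.com');
    ldap_set_option($link, LDAP_OPT_PROTOCOL_VERSION, 3);
    ldap_bind($link, 'user@example.com', 'secret');

    // The 6th argument caps how many entries the server returns:
    $result = ldap_search(
        $link,
        'dc=example,dc=com',      // base DN
        '(objectclass=user)',     // filter
        ['cn', 'mail'],           // attributes to fetch
        0,                        // attrsonly
        $limit                    // $sizelimit
    );

    return ldap_get_entries($link, $result);
}
```

Note that the server's own size limit (e.g. Active Directory's MaxPageSize) can still truncate results below whatever $sizelimit you pass.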
Idea being that on the first call you create a cache and on each subsequent reference to the pagination you refer to cache to prevent the extra calls to LDAP for every user again.
Yeah this might be the only option.
@Albvadi There is the option of going through each letter in the alphabet and querying for users that start with that letter. This type of chunking would limit memory usage as you loop through each result set and then overwrite your variables on each chunk.
For example:
$letters = range('A', 'Z');
foreach($letters as $letter) {
$users = $ad->search()->users()->select(['uid', 'mail'])->whereStartsWith('cn', $letter)->get();
foreach ($users as $user) {
// Do something with each user.
}
}
Though this will of course only return users that start with a letter - so users (such as service accounts you may have) that start with numbers will not be returned.
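To also pick up accounts that start with digits, the prefix list could be widened - a sketch, with $ad assumed to be a connected Adldap2 provider as in the snippet above:

```php
<?php

// Widening the chunking idea above to cover accounts starting with
// digits as well as letters. $ad is assumed to be a connected
// Adldap2 provider, as in the snippet above.
$prefixes = array_merge(range('A', 'Z'), array_map('strval', range(0, 9)));

function chunkUsersByPrefix($ad, array $prefixes): void
{
    foreach ($prefixes as $prefix) {
        $users = $ad->search()
            ->users()
            ->select(['uid', 'mail'])
            ->whereStartsWith('cn', $prefix)
            ->get();

        foreach ($users as $user) {
            // Handle each user; the chunk is freed on the next iteration.
        }
    }
}
```

(Accounts starting with other symbols would still be missed; the prefix list can be extended the same way.)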
Would this function help: ldap_control_paged_result? Haven't had a chance to really dig into it, but it might be something to consider.
Looks like it's only supported in version 3, but that could be a simple check and exception, or a fallback to cache.
Sorry, you are using that.. I should've looked at the code first.. long day that won't quit.
EDIT: What a PITA that cookie is. I can't think of any way of utilizing that function properly without setting a session and storing the cookie.
Okay, so I'm gonna "think out loud" here. Instead of caching all of the results we cache the cookie and previous results.
Here is a directory structure I was thinking about:
/paginations
/$connection_name
/$actual_query_coerced_into_something_file_systems_wont_complain_about
$page_number.php
The file will contain:
return [
'expiry' => $defined_expiration_datetime, // this is also controlled by the server as cookies do expire
'previous_results_file' => $previous_results_file_or_null,
'current_results' => $current_results,
'cookie' => $cookie,
];
Page 1 would just return $current_results and should have $previous_results_file = null;. When page 2 is requested, it will use the $current_results of page 1 (file_get_contents($previous_results_file_or_null);) as the $results on ldap_control_paged_result_response, along with the $cookie.
If pages get skipped (jump from page 1 to 5) then we drop back to the last known page and build it forward.
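The sequential forward-build described above hinges on the paged-results cookie. A rough sketch of that loop using PHP's ldap_control_paged_result() / ldap_control_paged_result_response() (available before PHP 8.0, deprecated in 7.4; $link is assumed to be a bound LDAP connection):

```php
<?php

// Rough sketch of RFC 2696 paging with ldap_control_paged_result()
// and ldap_control_paged_result_response(). The cookie from each
// response must be sent with the next request - which is why it has
// to be persisted (file/session) between page loads, as described
// above. $link is assumed to be a bound LDAP connection.
function pageThroughResults($link, string $baseDn, string $filter, int $pageSize): void
{
    $cookie = '';

    do {
        // Ask the server for the next page of at most $pageSize entries:
        ldap_control_paged_result($link, $pageSize, true, $cookie);

        $result  = ldap_search($link, $baseDn, $filter, ['cn']);
        $entries = ldap_get_entries($link, $result);

        // ... cache $entries as one "page" file here ...

        // The server hands back a cookie pointing at the next page:
        ldap_control_paged_result_response($link, $result, $cookie);
    } while ($cookie !== null && $cookie !== '');
}
```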
Reading through the RFC, this would have to be done a little differently. The paging has to be done sequentially, which was already the plan, but I think I might use 2 files instead.
last_cookie.php:
return [
'expiry' => $expiration_date,
'cookie' => $cookie,
'last_results_file' => $last_results_file,
'page' => $page_number,
];
page#.php
return [
'expiry' => $expiration_date,
'results' => $results,
];
First, try to find the page in the cache; if it doesn't exist, then look at the last cookie and build the results forward to that page.
Some thought needs to go into how to deal with expiration. If you request page 5 of 100 pages of results and page 5 has expired, but the last_cookie hasn't and is ahead, you have to rebuild from the start, because cookies can't be reused - or rather, the reuse of cookies is "undefined": maybe supported, but not defined by the RFC.
EDIT: What a PITA that cookie is. I can't think of any way of utilizing that function properly without setting a session and storing the cookie.
Yeah definitely, that's why I've been particularly stumped on how I could cache the paginated results.
I have found a similar package that has implemented caching functionality on LDAP queries here:
https://github.com/ldaptools/ldaptools/commit/283321299c4a45c95ee443883dca991e569e8cfb
He then uses a CacheInterface for each driver and typically stores the cached results in PHP's system temp folder:
https://github.com/ldaptools/ldaptools/blob/master/src/LdapTools/Cache/DoctrineCache.php
Though that implementation is overly complex in my opinion - I'd rather something much simpler.
I'd probably consider building off of a known Cache library - I'd want to look around first. Also, I think there should be an abstraction on how it's stored. We may only create a FileSystemCache implementation, but it should be open to allow for things like Redis/DB/Memcached/etc.
We could just roll our own Cache to keep it simple, but still worth looking at the work that has already been done.
PHP's system temp could be a good fallback directory, but it should be configurable per provider. Also, I have it in my head that it gets cleared per lifecycle, but maybe that's just temp files.
I'd probably consider building off of a known Cache library - I'd want to look around first. Also, I think there should be an abstraction on how it's stored. We may only create a FileSystemCache implementation, but it should be open to allow for things like Redis/DB/Memcached/etc.
Yeah absolutely, a built-in cache driver using PHP's temp directory would be great - then you could utilize your own driver with a standard cache interface.
What about the PSR-6 Cache interface?
Works for me. I'm just gonna place these 2 links here for reference later:
https://tools.ietf.org/html/rfc2696
It's possible my idea is moot and won't work, but it's worth a try.
@Albvadi There is the option of going through the each letter in the alphabet and querying for users that start with that letter. This type of chunking would limit memory usage as you loop through each result set and then overwrite your variables on each chunk.
For example:
$letters = range('A', 'Z');
foreach($letters as $letter) {
    $users = $ad->search()->users()->select(['uid', 'mail'])->whereStartsWith('cn', $letter)->get();
    foreach ($users as $user) {
        // Do something with each user.
    }
}
Though this will of course only return users that start with a letter - so users (such as service accounts you may have) that start with numbers will not be returned.
I have already solved the task I had to do in another way, but I had not thought of doing it like this. I'll keep it in mind for another occasion. Thanks!!
As for your debate on how to solve the root problem, I am a bit lost. I do not understand how having a cache would solve the problem, because to have that cache you first have to fetch the users and save them in it... and that is precisely the complicated point... no?
Ok sounds good! Have any thoughts on the API? I think it would be nice to add a cache() method to the Builder class.
The cache() method will accept a DateTimeInterface param that will set the expiry time of the cache. Setting the param explicitly to null will set the cache to live forever.
If the results are inside the cache, they will be retrieved. If the retrieved cached results have expired, then the query is executed against the LDAP server, those results are cached and returned, and a new expiry time is set.
Calling the cache() method will also validate that a cache driver is present, and immediately throw an exception when one isn't available.
Here's what I was thinking:
$provider = new Provider();
$until = new \DateTime('+ 1 day');
$users = $provider->search()->users()->cache($until)->get();
We could also add a secondary param to forcibly flush the cache that matches the current query:
$provider = new Provider();
$until = new \DateTime('+1 day');
$flush = true;
$users = $provider->search()->users()->cache($until, $flush = true)->get();
We might also be able to offer a getCache() method on the Builder so devs can perform caches themselves with the cache object. The Cache::has() method could accept a Builder instance from which it would retrieve a unique cache key (maybe from Builder::getCacheKey()?) by md5-ing the search params, options, and filters:
$provider = new Provider();
$query = $provider->search()->newQuery();
$cache = $query->getCache();
$query->search()->users()->whereStartsWith('cn', 'John');
if ($cache->has($query) && ! $cache->expired($query)) {
return $cache->get($query);
} else {
// Maybe a 'remember' method that can execute the given query and cache the results?
return $cache->remember($query, $until = new \DateTime('+1 day'));
}
Anything I'm missing? I think this type of caching interaction would be great and simple.
EDIT: We would likely have to build something specific internally for paginated queries - but I think if we got the basics first, we could build off of it and work pagination in.
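The hypothetical Builder::getCacheKey() mentioned above could be as simple as hashing the query's components. A sketch with a plain function (the function name and inputs are illustrative only - this is not Adldap2's actual API):

```php
<?php

// Hypothetical sketch of the Builder::getCacheKey() idea above:
// md5 the query's filter, selected attributes, and options into a
// stable key. Function name and inputs are illustrative only.
function getCacheKey(string $filter, array $selects, array $options = []): string
{
    sort($selects);   // make the key independent of select order
    ksort($options);  // ...and of option order

    return md5($filter . '|' . implode(',', $selects) . '|' . json_encode($options));
}

// The same query components in a different order yield the same key:
$key = getCacheKey('(&(objectclass=user)(cn=John*))', ['mail', 'cn'], ['sizelimit' => 50]);
```

Normalizing the inputs before hashing matters: otherwise select(['cn', 'mail']) and select(['mail', 'cn']) would create two separate cache entries for the same logical query.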
@Albvadi
As for your debate on how to solve the root problem, I am a bit lost. I do not understand how having a cache would solve the problem, because to have that cache you first have to fetch the users and save them in it... and that is precisely the complicated point... no?
Having a cache means we could store each paginated result into a different system (such as the file system) instead of storing the results in memory. This would avoid running into PHP's memory limit issues as you're experiencing.
Right now, the paginate() method will continuously execute queries and store all of the results until your LDAP server responds and says there are no more records to return. Since we store all of the paged results of a paginated query, this will cause a memory limit exception - but if we cache each paged result as @jspringe is saying, we can completely bypass the memory limit.
Right now, the paginate() method will continuously execute queries and store all of the results until your LDAP server responds and says there are no more records to return.
This is the crucial piece of information for understanding how a cache system can solve the problem. I thought that obtaining the users was done in a single query instead of many queries in a row... :man_facepalming:
Thanks so much for the explanation @stevebauman!
This is the crucial piece of information for understanding how a cache system can solve the problem. I thought that obtaining the users was done in a single query instead of many queries in a row...
Yeah, unfortunately the "paging" capabilities aren't great for this use-case. They're fine for something like a Windows application where you are storing the cookie in memory and just clicking from page to page.
@stevebauman did his best to implement pagination with the tools that are available, but that unfortunately meant still pulling all records which is what sort of keyed me into this issue you opened.
To make matters worse, LDAP itself doesn't support any sort of incrementing key, so you can't even work around it that way.
Ok sounds good! Have any thoughts of API? I think it would be nice to add a cache() method to the Builder class.
I haven't gone that deep into it yet. One thing I would like to do first is create some sort of factory to create a ton of responses for testing. Something akin to Laravel's Model Factories, but maybe at the LDAP query response level because that's what I'd prefer to cache.
I haven't gone that deep into it yet. One thing I would like to do first is create some sort of factory to create a ton of responses for testing.
Good idea! Maybe we can leverage Faker for this when running tests and have our own LdapResponseFactory?
but maybe at the LDAP query response level because that's what I'd prefer to cache.
Yeah, you're right - caching the response itself would definitely be ideal here.
Good idea! Maybe we can leverage Faker for this when running tests and have our own LdapResponseFactory?
Absolutely, and also would like to put together a suite of testing tools to be used when testing your own application. Like Guzzle does, but I can open a new issue for that when I'm ready to put a couple together.
Okay, I took my first stab at query caching, take a look: https://github.com/Adldap2/Adldap2/commit/096e4ce96f7f760324c26ee8ea601c5d9dc04281
Thoughts?
Usage (using Laravel for brevity and for its Cache repository):
Caching a regular query request:
$cache = app(\Illuminate\Cache\Repository::class);
$provider = Adldap::getDefaultProvider();
$provider->setCache($cache);
$users = $provider->users()->cache(new \DateTime('+1 day'))->get();
Caching a pagination request:
$cache = app(\Illuminate\Cache\Repository::class);
$provider = Adldap::getDefaultProvider();
$provider->setCache($cache);
$users = $provider->users()->cache(new \DateTime('+1 day'))->paginate();
I wasn't able to figure out how I could easily cache each paged result as its own response and retrieve each page from the cache upon request. Right now during a paginate request, all pages are cached in a single item.
EDIT: I also ended up using PSR-16 instead of PSR-6. PSR-6 is overly complex for the caching we need to perform here (in my opinion), and Laravel is compatible with it out of the box, along with Symfony and other frameworks.
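For anyone unfamiliar, the PSR-16 surface is just get/set/has/delete (plus the *Multiple variants). Here's a toy in-memory stand-in illustrating the shape of the interface the commit codes against - not Adldap2's actual cache integration:

```php
<?php

// Toy in-memory stand-in illustrating the PSR-16 style surface
// (get/set/has) chosen above. This is not Adldap2's actual cache
// integration - just the shape of the interface.
class ArrayCache
{
    private $items = [];

    public function get($key, $default = null)
    {
        return array_key_exists($key, $this->items) ? $this->items[$key] : $default;
    }

    public function set($key, $value, $ttl = null)
    {
        $this->items[$key] = $value;
        return true;
    }

    public function has($key)
    {
        return array_key_exists($key, $this->items);
    }
}

// Cache a "query result" under an md5 of its filter:
$cache = new ArrayCache();
$cache->set(md5('(objectclass=user)'), ['results' => ['user1', 'user2']]);
```

Any framework cache implementing Psr\SimpleCache\CacheInterface (Laravel's repository, Symfony's adapter, etc.) exposes this same surface, which is what makes it pluggable.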
Caching sped up large queries immensely (from 1000ms down to 15ms in some cases) - I'll post some benchmarks shortly.
Wow!! Super fast first implementation!! I'm not sure if tomorrow, but I'll try to test it as soon as possible with my production AD and the 30k users...
Thanks so much for the superb work!!
Thanks @Albvadi, much appreciated! 😄
I'm curious how cached queries speed up your environment - my directory has just over 1000 users, so I'm not able to test massive result sets like that!
This morning I was able to take a few minutes to do some quick tests, but there is something I don't understand (or I thought it worked differently)...
First test: I made a query for 1000 users and it takes about 4-5 seconds to get all the results. It's practically the same time it takes without using the cache.
$cache = app(\Illuminate\Cache\Repository::class);
$provider = Adldap::getDefaultProvider();
$provider->setCache($cache);
// Caches query for 1000 users
$users = $provider->search()->users()->cache(new \DateTime('+1 day'))->limit(1000)->get();
Second test: I made a pagination query with all the attributes. Laravel directly dies with a 500 error; it doesn't write anything in the error log, doesn't write anything in the cache, and doesn't write the search query in the Adldap logging.
$cache = app(\Illuminate\Cache\Repository::class);
$provider = Adldap::getDefaultProvider();
$provider->setCache($cache);
$recordsPerPage = 50;
$currentPage = $_GET['page'];
$paginator = $provider->search()->users()->cache(new \DateTime('+1 day'))->paginate($recordsPerPage, $currentPage);
Adldap2 logging
[2019-05-31 11:43:16] local.INFO: LDAP (ldap://company.com:389) - Connection: default - Operation: Binding - Username: administrator@company.com
[2019-05-31 11:43:16] local.INFO: LDAP (ldap://company.com:389) - Connection: default - Operation: Bound - Username: administrator@company.com
Third test: I launched the pagination query but limited it to only 3 attributes. It takes a while (about 10 minutes), but it completes fine, with no memory error, writing the results to the cache and the search filter to the Adldap logging. I couldn't analyze the file written to the cache because it was too big and I did not have time, but I understand that the 30k users will be there.
$cache = app(\Illuminate\Cache\Repository::class);
$provider = Adldap::getDefaultProvider();
$provider->setCache($cache);
$recordsPerPage = 50;
$currentPage = $_GET['page'];
$paginator = $provider->search()->users()->cache(new \DateTime('+1 day'))->select(['samaccountname', 'email', 'cn'])->paginate($recordsPerPage, $currentPage);
Adldap2 logging
[2019-05-31 11:51:43] local.INFO: LDAP (ldap://company.com:389) - Connection: default - Operation: Binding - Username: administrator@company.com
[2019-05-31 11:51:43] local.INFO: LDAP (ldap://company.com:389) - Connection: default - Operation: Bound - Username: administrator@company.com
[2019-05-31 11:51:43] local.INFO: LDAP (ldap://company.com:389) - Connection: default - Operation: Paginate - Base DN: DC=company,DC=com - Filter: (&(objectclass=user)(objectcategory=person)(!(objectclass=contact))) - Selected: (samaccountname,email,cn,objectcategory,objectclass) - Time Elapsed: 4907.28
Now, the part that I don't understand. Starting from the third test, if I load the page, it takes 10 minutes, for example, to retrieve all the users... The next time I refresh the page with the same query, should it be immediate? If I jump from page 1 to page 2, should it be immediate too (just what it takes to load 50 users, which is nothing), since it's loading data that it already has in the cache from the same query?
If the answer is affirmative, then it isn't working, because every time I refreshed the same page, it made the complete query against the Active Directory again, taking those 10 minutes of the first time...
P.S.: Sorry for my poor English... It's a mix between my own English and Google Translate's English... :sweat_smile:
Hi @Albvadi, which cache driver are you using in Laravel (file, array, database, etc.)?
Also, the first LDAP query you execute will cache the returned results - the next time the same query is executed it will use the cached results, so you will only notice a speed increase on the second time it is executed, not the first.
The LDAP query also needs to be identical on the second request, otherwise a new cache will be created.
This cache update won't fix memory issues or speed up queries that return tens of thousands of users - there is no way around this unless you limit the scope of your query.
I think there may be a misunderstanding of how pagination works in Adldap2. When you call paginate() on a query, it will retrieve all results of your LDAP query - for example, calling this will return all 30,000 users from your LDAP server in chunks of 50:
// Returns all users from your directory in chunks of 50:
$users = Adldap::search()->users()->paginate(50);
This operates completely differently than pagination in Laravel with a SQL database. Calling paginate(50) on a Laravel model will only retrieve 50 results from your SQL database. For example:
// Returns only 50 users from your database:
$users = \App\User::paginate(50);
// Returns **all** users from your LDAP directory in **chunks** of 50:
$ldapUsers = Adldap::search()->users()->paginate(50);
I'm still trying to figure out a way for Adldap2 pagination to operate similarly to Laravel pagination. Updates will be coming!
@stevebauman - Awesome work! Unfortunately, I've been knee-deep in a work project that has me doing JS/Vue/Popper/etc., so I haven't had time to circle back to this and get my brain back to PHP. Hopefully I'll have time this weekend to grab your implementation and see how we can use it for pagination.
I'm using the file cache driver.
So, starting from the first example: on the first load it makes a query against the Active Directory, and on the second load it does not make that connection and takes the results directly from the cache?
$letters = range('A', 'Z');
foreach($letters as $letter) {
    $users = $ad->search()->users()->select(['uid', 'mail'])->whereStartsWith('cn', $letter)->get();
    foreach ($users as $user) {
        // Do something with each user.
    }
}
Better to add pagination if the data set is large:
$letters = range('A', 'Z');
foreach($letters as $letter) {
$users = $ad->search()->users()->select(['uid', 'mail'])->whereStartsWith('cn', $letter)->paginate(1000)->getResults();
foreach ($users as $user) {
// Do something with each user.
}
}
Description:
Trying to retrieve paginated users results in an empty recordset.
Steps To Reproduce:
If I test the query with a limit, the users recordset is returned correctly:
But if I use pagination, I get two different errors, depending on the system:
Sun One Directory Server: Empty paginator
Active Directory: Memory allocation exhausted
Any idea what I can test??
Thanks!