Closed upats closed 3 years ago
It sounds like the script is having trouble querying the group members. The previous developer indicated that Active Directory had problems above 999 users and coded logic into ldap_get_group_members_ad()
to query for 999 members at a time. The code looks sane, but I don't have access to an AD environment.
It would be interesting to know if AD is actually returning information. Is it possible for you to check what's set in $resultg
after lines 299-302 in locallib.php
, before the if block?
As a test, I created a branch which syncs 10,000 users. The tests are passing: https://travis-ci.org/LafColITS/moodle-local_ldap/builds/328489722. Unfortunately the tests only run against OpenLDAP, so if there's some specific issue with the AD code it won't be revealed here.
I turned debugging on when I run the scheduled task manually and I'm getting this:
Warning: ldap_search(): Search: Can't contact LDAP server in /var/www/html/moodle32/local/ldap/locallib.php on line 303
So there's definitely some type of timeout type thing going on with the LDAP server. Unfortunately, instead of the plugin checking to see if a connection was successful, it instead thinks it returned an empty array which then causes the remainder of the script to remove everyone in the remaining cohorts.
It seems like there should be a if connection_successful
to see if the LDAP connection even worked before it decides to remove users.
As for our LDAP connection problem, I'll have to look into that further and report back what might be holding it up. Maybe there is a way we can get around it if I can get more info.
@upats I've written a patch which adds more error handling; it ought to prevent the mass removals that you're seeing. It's on the 8-error-handling
branch, or you can download it here: https://raw.githubusercontent.com/LafColITS/moodle-local_ldap/8-error-handling/locallib.php. Please let me know what results you see.
@mackensen This seems to work as a failsafe but I wonder if we couldn't re-initiate the LDAP connection in case it disconnects due to this limitation? It's interesting that it actually populates the huge group but any remaining groups that it tries to build through that same LDAP connection will fail.
Do you think it'd be possible, if $ldapconnection
returns false
, to use ldap_close()
and try another ldap_connect()
, then resume through the list of cohorts? Maybe it could try up to 3 times and if it fails on the 3rd attempt then it really does failsafe out of the script because in that case it would be apparent the LDAP connection is actually down.
Also, i've been using your MOODLE_32_STABLE branch and your changes were made to the master branch which is for 3.3 and has that dependency set. I changed the dependency to work with 3.2 and the plugin still seems to work fine... do you know if there were actually any 3.3-based functions you switched to that will cause it not to work in 3.2?
@upats The differences aren't significant, but I've rolled a version against MOODLE_32_STABLE: https://raw.githubusercontent.com/LafColITS/moodle-local_ldap/8-error-handling-32/locallib.php.
I have just installed this plugin and like the original poster, we have a very large Active Directory domain (hundreds of thousands of users and groups). Not all users are in our Moodle, neither are all those groups, though.
The script consistently runs for 22 minutes. All/most users are added to their group based cohorts, and are then removed again. It's likely to be a time-out issue on the LDAP side as described above - what can I do to assist with debugging this?
@vk6hgr I've rebased the old patch here for handling server failures more cleanly, which might at least help with the user removals. I'm thinking about ways to make the retrieval more resilient.
Is there any update on this? The fix branch is now way behind master. we are suffering from this problem and desperately need a fix.
@durzo I've rebased the error-handling patch against current main
; I was hoping to get some feedback (and possibly a clue as to the root issue) before including it in a release. Are you able to test it in your current environment?
I will get started on testing it and report back when I have any information
@mackensen This looks great! I was able to test it on a site that the LDAP connection went away with two syncs in a row and the intended behavior of this fix, to bail out and not remove users from cohorts, worked perfectly.
@mackensen any chance this can be merged into master?
@jtsafran yes, sorry for the delay, I've been out of the office the last two weeks. I'll package a new release today.
@mackensen No worries! Thank you so much for your work on this!
Released in v3.7.1
.
I've been trying to debug a weird issue we encounter often: cohorts would sometimes sync fine, then lose all enrollments after a different sync processed. I finally started adding some debug messages to the code today to see if I could find what was happening.
What I've found is when the sync script attempts to add a large amount of users (I'm not exactly sure what the minimum amount is--I've successfully synchd groups with up to ~9000 users but 18,000 users will break--so it might be 10,000 which I think is some type of query limit for AD if I remember correctly) any remaining Cohorts will delete all users that are in them.
The script seems to break in
public function sync_cohorts_by_group()
at
line 630: $ldapmembers = $this->ldap_get_group_members($groupname);
which is where it builds an array of users that match between AD groups & Moodle Cohorts.
When the script finishes syncing (successfully) a super large group (i.e. greater than possibly 10k users), the above
ldap_get_group_members
function returns nothing which subsequently causes the following// Remove local Moodle users not present in LDAP.
foreach ($cohortmembers as $userid => $user) {
if (!isset($ldapmembers[$userid])) {
cohort_remove_member($cohortid, $userid);
}
}
to trigger, thereby removing all users who used to be in those cohorts.
I don't know what the fix could be since I'm not really a developer :(
Any ideas?