LafColITS / moodle-local_ldap

Various synchronization scripts between Moodle and LDAP directories (see https://tracker.moodle.org/browse/MDL-25011 )
9 stars 7 forks source link

Synching a large group causes remaining groups to remove all users #8

Closed upats closed 3 years ago

upats commented 6 years ago

I've been trying to debug a weird issue we encounter often: cohorts would sometimes sync fine, then lose all enrollments after a different sync processed. I finally started adding some debug messages to the code today to see if I could find what was happening.

What I've found is when the sync script attempts to add a large amount of users (I'm not exactly sure what the minimum amount is--I've successfully synchd groups with up to ~9000 users but 18,000 users will break--so it might be 10,000 which I think is some type of query limit for AD if I remember correctly) any remaining Cohorts will delete all users that are in them.

The script seems to break in

public function sync_cohorts_by_group()

at

line 630: $ldapmembers = $this->ldap_get_group_members($groupname);

which is where it builds an array of users that match between AD groups & Moodle Cohorts.

When the script finishes syncing (successfully) a super large group (i.e. greater than possibly 10k users), the above ldap_get_group_members function returns nothing which subsequently causes the following

// Remove local Moodle users not present in LDAP. foreach ($cohortmembers as $userid => $user) { if (!isset($ldapmembers[$userid])) { cohort_remove_member($cohortid, $userid); } }

to trigger, thereby removing all users who used to be in those cohorts.

I don't know what the fix could be since I'm not really a developer :(

Any ideas?

mackensen commented 6 years ago

It sounds like the script is having trouble querying the group members. The previous developer indicated that Active Directory had problems above 999 users and coded logic into ldap_get_group_members_ad() to query for 999 members at a time. The code looks sane, but I don't have access to an AD environment.

It would be interesting to know if AD is actually returning information. Is it possible for you to check what's set in $resultg after lines 299-302 in locallib.php, before the if block?

mackensen commented 6 years ago

As a test, I created a branch which syncs 10,000 users. The tests are passing: https://travis-ci.org/LafColITS/moodle-local_ldap/builds/328489722. Unfortunately the tests only run against OpenLDAP, so if there's some specific issue with the AD code it won't be revealed here.

upats commented 6 years ago

I turned debugging on when I run the scheduled task manually and I'm getting this:

Warning: ldap_search(): Search: Can't contact LDAP server in /var/www/html/moodle32/local/ldap/locallib.php on line 303

So there's definitely some type of timeout type thing going on with the LDAP server. Unfortunately, instead of the plugin checking to see if a connection was successful, it instead thinks it returned an empty array which then causes the remainder of the script to remove everyone in the remaining cohorts.

It seems like there should be a if connection_successful to see if the LDAP connection even worked before it decides to remove users.

As for our LDAP connection problem, I'll have to look into that further and report back what might be holding it up. Maybe there is a way we can get around it if I can get more info.

mackensen commented 6 years ago

@upats I've written a patch which adds more error handling; it ought to prevent the mass removals that you're seeing. It's on the 8-error-handling branch, or you can download it here: https://raw.githubusercontent.com/LafColITS/moodle-local_ldap/8-error-handling/locallib.php. Please let me know what results you see.

upats commented 6 years ago

@mackensen This seems to work as a failsafe but I wonder if we couldn't re-initiate the LDAP connection in case it disconnects due to this limitation? It's interesting that it actually populates the huge group but any remaining groups that it tries to build through that same LDAP connection will fail.

Do you think it'd be possible, if $ldapconnection returns false, to use ldap_close() and try another ldap_connect(), then resume through the list of cohorts? Maybe it could try up to 3 times and if it fails on the 3rd attempt then it really does failsafe out of the script because in that case it would be apparent the LDAP connection is actually down.

upats commented 6 years ago

Also, i've been using your MOODLE_32_STABLE branch and your changes were made to the master branch which is for 3.3 and has that dependency set. I changed the dependency to work with 3.2 and the plugin still seems to work fine... do you know if there were actually any 3.3-based functions you switched to that will cause it not to work in 3.2?

mackensen commented 6 years ago

@upats The differences aren't significant, but I've rolled a version against MOODLE_32_STABLE: https://raw.githubusercontent.com/LafColITS/moodle-local_ldap/8-error-handling-32/locallib.php.

gavinwa725 commented 4 years ago

I have just installed this plugin and like the original poster, we have a very large Active Directory domain (hundreds of thousands of users and groups). Not all users are in our Moodle, neither are all those groups, though.

The script consistently runs for 22 minutes. All/most users are added to their group based cohorts, and are then removed again. It's likely to be a time-out issue on the LDAP side as described above - what can I do to assist with debugging this?

mackensen commented 4 years ago

@vk6hgr I've rebased the old patch here for handling server failures more cleanly, which might at least help with the user removals. I'm thinking about ways to make the retrieval more resilient.

durzo commented 3 years ago

Is there any update on this? The fix branch is now way behind master. we are suffering from this problem and desperately need a fix.

mackensen commented 3 years ago

@durzo I've rebased the error-handling patch against current main; I was hoping to get some feedback (and possibly a clue as to the root issue) before including it in a release. Are you able to test it in your current environment?

durzo commented 3 years ago

I will get started on testing it and report back when I have any information

jtsafran commented 3 years ago

@mackensen This looks great! I was able to test it on a site that the LDAP connection went away with two syncs in a row and the intended behavior of this fix, to bail out and not remove users from cohorts, worked perfectly.

jtsafran commented 3 years ago

@mackensen any chance this can be merged into master?

mackensen commented 3 years ago

@jtsafran yes, sorry for the delay, I've been out of the office the last two weeks. I'll package a new release today.

jtsafran commented 3 years ago

@mackensen No worries! Thank you so much for your work on this!

mackensen commented 3 years ago

Released in v3.7.1.