389ds / 389-ds-base

The enterprise-class Open Source LDAP server for Linux
https://www.port389.org/

Online replica initialization broken with large, deep databases #2855

Open 389-ds-bot opened 4 years ago

389-ds-bot commented 4 years ago

Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/49796


Issue Description

Online replica initialization no longer works if you have a very large, deep database.

This appears to be related to changes made to fix this bug: https://pagure.io/389-ds-base/issue/48755

As a result, a new scheme of gathering the entries for a total update was introduced, keyed off the filter parentid >= 1 and a new operator, SLAPI_OP_RANGE_NO_IDL_SORT.

Despite the name, it seems more like a "tree depth order" sort, ensuring that all parent IDs are ahead of their children in the ID list. With that "tree depth" ordering, as the total update sends entries to the consumer, it can be confident that the parent has been sent and exists before any child entries are sent.

The sorting is done by adding the current ID to the array only if its parent ID is already present in the array (cf. idl_id_is_in_idlist()), else it is added to a special leftovers array, which is processed at the end.

The function that does the check, idl_id_is_in_idlist(), sequentially scans the ID list looking for the parent ID, and it performs this scan for every entry.

If the data is arranged in a shallow tree, this is probably OK, as the parent entry will always be present early on in the array. However, if you have a tree that is deep or leafy, or even where the entries are split (like a forked tree, with half the entries under ou=R and half under ou=B) then idl_id_is_in_idlist() may need to scan millions of IDs, and it will scan them over, and over, and over...
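The cost described above can be sketched in a few lines of Python (hypothetical names; the real code is C in idl_new_range_fetch()). Each ID is placed only once its parent is already in the output array, and every placement does a linear membership scan of everything placed so far, which is where the quadratic behavior comes from:

```python
def tree_depth_sort(ids, parent_of):
    """Sketch of the "tree depth order" sort: every parent ID ends up
    ahead of its children. `parent_of` maps an entry ID to its parent
    ID (0 meaning the suffix entry). IDs whose parent has not been
    placed yet go to a leftovers list and are retried later."""
    ordered = []
    leftovers = list(ids)
    while leftovers:
        remaining = []
        for eid in leftovers:
            pid = parent_of[eid]
            # This membership test is the hot spot: `in` on a plain
            # list is a sequential scan, like idl_id_is_in_idlist(),
            # so a deep or forked tree rescans millions of IDs.
            if pid == 0 or pid in ordered:
                ordered.append(eid)
            else:
                remaining.append(eid)
        leftovers = remaining
    return ordered

# A deep chain 1 <- 2 <- ... <- 7, presented worst-case (children first):
parent = {i: i - 1 for i in range(1, 8)}
print(tree_depth_sort(range(7, 0, -1), parent))  # → [1, 2, 3, 4, 5, 6, 7]
```

Swapping the list for a hash set would make each membership test O(1), but the sketch deliberately mirrors the sequential-scan behavior the report describes.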

At a certain point, replication can become impossible, because the connection that was opened when the total update was initiated is closed before the IDList is finished sorting and any entries are sent.

This is not just theoretical: it was observed with a multi-GB, deep and leafy database that spent about 19 hours (!) gathering and sorting the IDList in idl_new_range_fetch() before it was ready to send its first entry.

Package Version and Platform

Tested on CentOS 7 with 1.3.6.14.

Steps to reproduce

  1. Fill a database with a large, deep tree.
  2. Attempt to initialize a consumer online.

Actual results

The total update operation is begun, but a very long time is spent gathering the IDList before any entries are sent to the consumer. One indication of this is an import status message on the consumer indicating 0 or 1 entries, and other bogus figures:

[19/Jun/2018:11:50:55.340613439 -0400] - INFO - import_monitor_threads - import userRoot: Processed 1 entries (pass 2) -- average rate 214748364.8/sec, recent rate 0.0/sec, hit ratio 0%
[19/Jun/2018:11:51:16.487648096 -0400] - INFO - import_monitor_threads - import userRoot: Processed 1 entries (pass 2) -- average rate 104755299.9/sec, recent rate 0.0/sec, hit ratio 0%
[19/Jun/2018:11:51:36.779065427 -0400] - INFO - import_monitor_threads - import userRoot: Processed 1 entries (pass 2) -- average rate 70409299.9/sec, recent rate 0.0/sec, hit ratio 0%

If the database is big enough, it may even take long enough for the connection to close.

Other indications are 100% usage of a core on the supplier after initiating a total update, and excessive time spent in idl_id_is_in_idlist().

Expected results

Initialization should begin sending entries almost immediately.

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2018-06-19 20:56:28

If you do an offline init (ldif2db) is it faster?

For example:

Master:  db2ldif -r -n userroot -a /tmp/ldif
Replica: ldif2db -n userroot -i /tmp/ldif

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2018-06-19 20:56:29

Metadata Update from @mreynolds389:

389-ds-bot commented 4 years ago

Comment from telackey at 2018-06-20 00:06:55

@mreynolds389

Yes, in fact, it doesn't really hit this problem at all, because in this server the parentid is always less than the child's ID (i.e., no moddn). The db2ldif code uses a separate scheme for ensuring the order of the tree (cf. _export_or_index_parents() in ldap/servers/slapd/back-ldbm/ldif2ldbm.c) which seems better suited to the problem.
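As I read it, the db2ldif scheme amounts to the following (a rough Python interpretation, not the actual C code; function names other than the cited one are made up): walk the IDs in ascending order and, whenever an entry's parent has not been written yet, emit the parent first, recursively. Each entry costs only set lookups, with no repeated linear scans:

```python
def export_in_order(ids, parent_of, write):
    """Interpretation of the parent-first export idea behind
    _export_or_index_parents(): walk IDs in ascending order and emit
    any not-yet-written ancestors before the entry itself.
    `parent_of` maps an entry ID to its parent ID (0 = suffix)."""
    written = set()

    def emit(eid):
        pid = parent_of.get(eid, 0)
        if pid and pid not in written:
            emit(pid)              # make sure the parent goes out first
        if eid not in written:
            write(eid)
            written.add(eid)

    for eid in sorted(ids):
        emit(eid)

out = []
# Entry 2's parent is 5 (as after a moddn), so 5 must be pulled forward:
export_in_order([1, 2, 3, 5], {2: 5, 3: 1, 5: 1}, out.append)
print(out)  # → [1, 5, 2, 3]
```

In the common case (no moddn, so parentid < childid always holds), the ascending walk already sends parents first and the recursion never fires, which matches the zero-scan measurement below.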

I added a counter for the number of IDs scanned by idl_id_is_in_idlist(), using db2ldif and using online initialization with my small (compared to the real system) DB of 650K entries. With db2ldif, the answer was zero, since my tree was already ordered. In the online case, the loop in idl_id_is_in_idlist() was started 855,526 times and scanned 27,842,335,568 IDs in the process. That small database "only" took 11 minutes to scan the IDList, not 19 hours like the real DB, but it was still long enough to cause the total update to fail when it finally tried to send the first entry:

[19/Jun/2018:17:01:04.699080440 -0500] - DEBUG - NSMMReplicationPlugin - close_connection_internal - agmt="cn=389ds-build_to_test76" (test76:389) - Disconnected from the consumer
[19/Jun/2018:17:01:05.326634281 -0500] - ERR - NSMMReplicationPlugin - release_replica - agmt="cn=389ds-build_to_test76" (test76:389): Unable to send endReplication extended operation (Timed out)
[19/Jun/2018:17:01:05.327526742 -0500] - ERR - NSMMReplicationPlugin - repl5_tot_run - Total update failed for replica "agmt="cn=389ds-build_to_test76" (test76:389)", error (-1)

For comparison's sake, running db2ldif took only 3 minutes to export all 653,112 entries.

389-ds-bot commented 4 years ago

Comment from lkrispen (@elkris) at 2018-06-20 09:35:09

We need to do something here. In the discussion in ticket 48755 I was already concerned about the sorting overhead and made two other proposals; maybe we should investigate them again.

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2018-06-21 17:48:48

Metadata Update from @mreynolds389:

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2020-02-19 16:31:40

Could be a duplicate of #2974

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2020-06-24 16:31:48

Metadata Update from @mreynolds389:

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2020-06-24 16:32:34

Metadata Update from @mreynolds389:

mreynolds389 commented 3 years ago

Here was Ludwig's comment about this issue:

Why don't we do it like ldbm2ldif does it, which I think has very low overhead:

[1] is probably the best option

Firstyear commented 3 years ago

I think this may be an oversimplification of the algo, because the algo becomes:

sent_set = {}
for each entryid:
    if entryid in sent_set:
        continue
    if parentid < entryid:
        send entry
    else:
        parent_entry = get_entry(parentid)
        send parent_entry
        send entry
        sent_set.insert(parentid)

I agree that keeping it consistent with ldbm2ldif is a good choice, but we do need to consider that sent_set could grow large / complex for lookups in some situations.
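For concreteness, the pseudocode above can be rendered as runnable Python (the `send` callback and `parent_of` map are stand-ins for the real entry-fetching and wire-sending code). Backing sent_set with a hash set keeps each membership check O(1), which is exactly what the linear IDList scan lacks:

```python
def total_update(entry_ids, parent_of, send):
    """Runnable rendering of the sketch above. Entry IDs arrive in
    ascending order; only a moddn can make a parent's ID larger than
    its child's. `sent_set` records out-of-order parents already
    pushed, so they are skipped when their own ID comes around."""
    sent_set = set()
    for entryid in entry_ids:
        if entryid in sent_set:
            continue                 # already sent as someone's parent
        pid = parent_of.get(entryid, 0)
        if pid < entryid:
            send(entryid)            # normal case: parent went out earlier
        else:
            send(pid)                # push the late parent first
            send(entryid)
            sent_set.add(pid)

out = []
# Entry 3's parent is 4 (a late parent, as after a moddn):
total_update([1, 2, 3, 4], {2: 1, 3: 4, 4: 1}, out.append)
print(out)  # → [1, 2, 4, 3]
```

Note the simplification the comment alludes to: as written, a late parent whose own parent is also late would still be sent out of order, so a full implementation would need to chase the ancestor chain.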