LMDB - Deadlock when trying to write an update to the changelog during changelog purging

mreynolds389 commented 2 weeks ago

Issue Description

We have a deadlock when we try to write to the changelog and changelog trimming/purging is happening:

Thread 21 (Thread 0x7f2224a006c0 (LWP 2398444) "ns-slapd"):
#1  0x00007f27524afe41 in pthread_mutex_lock@@GLIBC_2.2.5 () at target:/lib64/libc.so.6
#2  0x00007f27507a0040 in cl5WriteOperationTxn (cldb=0x7f274d54d000, op=0x7f2214e05108, txn=0x7f2214e33080) at ../389-ds-base/ldap/servers/plugins/replication/cl5_api.c:860
#3  0x00007f27507d220d in write_changelog_and_ruv (pb=0x7f2214e01060) at ../389-ds-base/ldap/servers/plugins/replication/repl5_plugins.c:1079
#4  0x00007f27507d1b32 in multisupplier_be_betxnpostop_modify (pb=0x7f2214e01060) at ../389-ds-base/ldap/servers/plugins/replication/repl5_plugins.c:863
#5  0x00007f27507d1631 in multisupplier_mmr_postop (pb=0x7f2214e01060, flags=561) at ../389-ds-base/ldap/servers/plugins/replication/repl5_plugins.c:618
#6  0x00007f27528488fb in plugin_call_mmr_plugin_postop (pb=0x7f2214e01060, e=0x0, flags=561) at ../389-ds-base/ldap/servers/slapd/plugin_mmr.c:65
#7  0x00007f274daebc7f in ldbm_back_modify (pb=0x7f2214e01060) at ../389-ds-base/ldap/servers/slapd/back-ldbm/ldbm_modify.c:1029
#8  0x00007f27528280a3 in op_shared_modify (pb=0x7f2214e01060, pw_change=0, old_pw=0x0) at ../389-ds-base/ldap/servers/slapd/modify.c:1026
#9  0x00007f2752826b60 in modify_internal_pb (pb=0x7f2214e01060) at ../389-ds-base/ldap/servers/slapd/modify.c:572
#10 0x00007f27528266ac in slapi_modify_internal_pb (pb=0x7f2214e01060) at ../389-ds-base/ldap/servers/slapd/modify.c:442
#11 0x00007f27507d644d in replica_subentry_update (when=4510849, arg=0x7f274ce62d00) at ../389-ds-base/ldap/servers/plugins/replication/repl5_replica.c:562
#12 0x00007f27527ed8b9 in eq_call_all_rel () at ../389-ds-base/ldap/servers/slapd/eventq.c:278
#13 0x00007f27527eda86 in eq_loop_rel (arg=0x0) at ../389-ds-base/ldap/servers/slapd/eventq.c:324

Thread 2 (Thread 0x7f2212a006c0 (LWP 2398850) "ns-slapd"):
#0  0x00007f27524afa0e in __pthread_mutex_lock_full () at target:/lib64/libc.so.6
#1  0x00007f2750c8d812 in mdb_txn_renew0 () at target:/lib64/liblmdb.so.0.0.0
#2  0x00007f2750c8dfcc in mdb_txn_begin () at target:/lib64/liblmdb.so.0.0.0
#3  0x00007f274db3d4b4 in dbmdb_start_txn (funcname=0x7f274db91be0 <__FUNCTION__.9.lto_priv.1> "dbmdb_dbi_txn_begin", parent_txn=0x0, flags=0, txn=0x7f22129fed90) at ../389-ds-base/ldap/servers/slapd/back-ldbm/db-mdb/mdb_txn.c:169
#4  0x00007f274db43484 in dbmdb_dbi_txn_begin (env=0x7f2750af0e00, readonly=0, parent_txn=0x0, txn=0x7f22129fed90) at ../389-ds-base/ldap/servers/slapd/back-ldbm/db-mdb/mdb_layer.c:2620
#5  0x00007f274daaea2e in dblayer_dbi_txn_begin (be=0x7f274d503cc0, dbenv=0x7f2750af0e00, readonly=0, parent_txn=0x0, txn=0x7f22129fed90) at ../389-ds-base/ldap/servers/slapd/back-ldbm/dbimpl.c:342
#6  0x00007f27507a1d38 in _cl5Iterate (cldb=0x7f274d54d000, action_cb=0x7f27507a35ea <_cl5PurgeRidOnEntry>, dblcictx=0x7f22129fee10, readonly=0) at ../389-ds-base/ldap/servers/plugins/replication/cl5_api.c:1872
#7  0x00007f27507a371e in _cl5PurgeRID (data=0x7f22192145a0, cldb=0x7f274d54d000) at ../389-ds-base/ldap/servers/plugins/replication/cl5_api.c:2682
#8  0x00007f27507a346a in _cl5DoPurging (purge_data=0x7f22192145a0) at ../389-ds-base/ldap/servers/plugins/replication/cl5_api.c:2569
#9  0x00007f27507a85a5 in cldb_purge_rid (purge_data=0x7f22192145a0) at ../389-ds-base/ldap/servers/plugins/replication/cl5_api.c:4495
#10 0x00007f27507b9a6f in replica_cleanallruv_thread (arg=0x7f22192145a0) at ../389-ds-base/ldap/servers/plugins/replication/repl_cleanallruv.c:2371

The purging thread holds the changelog state lock and tries to start a transaction, but there is a thread trying to write a change to the changelog which is already holding the DB lock/txn and then it tries to take the changelog state lock ---> deadlock.

I'm not 100% sure this is LMDB specific, but it's happening pretty consistently while trying to test changelog purging. I did not see this happen once when testing changelog purging with bdb.

progier389 commented 2 weeks ago

IMHO the problem is that the purging thread should not keep holding the changelog state lock It should rather increase cldb->clThreads then release cldb->stLock and decrease cldb->clThreads with slapi_counter_decrement once it has finished using the changelog

mreynolds389 commented 2 weeks ago

IMHO the problem is that the purging thread should not keep holding the changelog state lock It should rather increase cldb->clThreads then release cldb->stLock and decrease cldb->clThreads with slapi_counter_decrement once it has finished using the changelog

Yeah you're right. I had not yet looked into this as I was wrapping up my other testing, but this does fix it. The state lock is just for checking if the changelog is open. So it was being misused and we were taking the main changelog lock in the next frame down anyway. It's been this way for a long time but the issue was only exposed with LMDB. I thought the trimming thread took the same approach but it does not - so only cleanAllRUV/purging would trigger this.

389ds / 389-ds-base

LMDB - Deadlock when trying to write an update to the changelog during changelog purging #6396