Closed 389-ds-bot closed 4 years ago
Comment from tbordaz (@tbordaz) at 2014-12-13 03:33:27
attachment stack.17379.2
Comment from tbordaz (@tbordaz) at 2014-12-13 03:33:43
attachment db_stat.out
Comment from tbordaz (@tbordaz) at 2014-12-13 03:34:28
attachment lot_mod.sh
Comment from nhosoi (@nhosoi) at 2015-03-03 03:20:05
Could this be still an issue on the latest versions? (When the deadlock was observed, what version of 389-ds-base was running?)
If the deadlock still occurs, I'm interested in reproducing this bug with the 389-ds-base only env...
What's the minimum condition to reproduce it?
Replication, relint, memberOf, linkedAttr, USN, DNA, and mep?
If you could attach the configurations for the plug-ins, I'd appreciate it.
Comment from tbordaz (@tbordaz) at 2015-03-05 15:31:24
Hello Noriko,
The VM where the problem occurred was no longer available. I prepared the test case on a new VM and attached the plugin config. The DS version is 1.3.2 (while I first hit the problem on 1.3.1). IMHO this is not that important as the DS plugin config depends on IPA setup.
Comment from tbordaz (@tbordaz) at 2015-03-05 15:31:53
attachment 47976_enabled_plugins.gz
Comment from tbordaz (@tbordaz) at 2015-09-24 15:06:54
Also reported in freeipa-users https://www.redhat.com/archives/freeipa-users/2015-September/msg00331.html
80003201 dd= 6 locks held 234 write locks 110 pid/thread 6134/140366703388416 (7FA9AB7E6700)flags 0 priority 100
..
80003201 WRITE 1 HELD userRoot/id2entry.db page 2
...
80003201 WRITE 4 HELD userRoot/objectclass.db page 2
80003201 READ 2 HELD userRoot/objectclass.db page 2
Thread 9 (Thread 0x7fa9ab7e6700 (LWP 6173)):
0 0x00007fa9dda41705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
1 0x00007fa9d860e2f3 in __db_hybrid_mutex_suspend () from /lib64/libdb-5.3.so
2 0x00007fa9d860d640 in __db_tas_mutex_lock () from /lib64/libdb-5.3.so
3 0x00007fa9d86b7cea in __lock_get_internal () from /lib64/libdb-5.3.so
4 0x00007fa9d86b87d0 in __lock_get () from /lib64/libdb-5.3.so
5 0x00007fa9d86e4112 in __db_lget () from /lib64/libdb-5.3.so
6 0x00007fa9d862b5f5 in __bam_search () from /lib64/libdb-5.3.so
7 0x00007fa9d8616256 in __bamc_search () from /lib64/libdb-5.3.so
8 0x00007fa9d8617d0f in __bamc_get () from /lib64/libdb-5.3.so
9 0x00007fa9d86d0c56 in __dbc_iget () from /lib64/libdb-5.3.so
10 0x00007fa9d86dd843 in __db_get () from /lib64/libdb-5.3.so
11 0x00007fa9d86e1123 in __db_get_pp () from /lib64/libdb-5.3.so
12 0x00007fa9d248949b in id2entry () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
13 0x00007fa9d24af7dd in ldbm_back_delete () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
14 0x00007fa9dfc60190 in op_shared_delete () from /usr/lib64/dirsrv/libslapd.so.0
15 0x00007fa9dfc60342 in delete_internal_pb () from /usr/lib64/dirsrv/libslapd.so.0
16 0x00007fa9d1da4739 in mep_del_post_op () from /usr/lib64/dirsrv/plugins/libmanagedentries-plugin.so
17 0x00007fa9dfcac280 in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0
18 0x00007fa9dfcac4d8 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0
19 0x00007fa9d24ae42e in ldbm_back_delete () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
20 0x00007fa9dfc60190 in op_shared_delete () from /usr/lib64/dirsrv/libslapd.so.0
21 0x00007fa9dfc60453 in do_delete () from /usr/lib64/dirsrv/libslapd.so.0
22 0x00007fa9e017a37e in connection_threadmain ()
23 0x00007fa9de09c7bb in _pt_root () from /lib64/libnspr4.so
24 0x00007fa9dda3ddf5 in start_thread () from /lib64/libpthread.so.0
25 0x00007fa9dd76b1ad in clone () from /lib64/libc.so.6
36 dd=127 locks held 0 write locks 0 pid/thread 6134/140366703388416 (7FA9AB7E6700) flags 0 priority 100
36 READ 1 WAIT userRoot/id2entry.db page 2
Comment from nhosoi (@nhosoi) at 2015-10-20 22:57:08
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1273555
Comment from tbordaz (@tbordaz) at 2015-11-03 22:19:29
attachment 0001-Ticket-47976-deadlock-in-mep-delete-post-op.patch
Comment from tbordaz (@tbordaz) at 2015-11-03 22:19:52
attachment 0002-Ticket-47976-test-case.patch
Comment from rmeggins (@richm) at 2015-11-04 03:55:02
I agree with Ludwig on the email list that you should not have to do anything to mep - the txn should already be set correctly in ldbm_delete.c
It looks like 2f4f74b3d38eeb38f0794dfb2e880a95e7e9bd49 introduced a problem - every database argument must pass in the transaction handle, but this commit removed it (and mea culpa for not noticing this when I reviewed the patch :P )
diff --git a/ldap/servers/slapd/back-ldbm/ldbm_delete.c b/ldap/servers/slapd/back-ldbm/ldbm_delete.c
index 61af3a6..3556658 100644
--- a/ldap/servers/slapd/back-ldbm/ldbm_delete.c
+++ b/ldap/servers/slapd/back-ldbm/ldbm_delete.c
...
@@ -449,12 +446,31 @@ ldbm_back_delete( Slapi_PBlock *pb )
...
+ * Although a rare case, multiple parents from repl conflict could exist.
+ * In such case, if a parent entry is found just by parentsdn
+ * (find_entry2modify_only_ext), a wrong parent could be found,
+ * and numsubordinate count could get confused.
+ */
+ ID pid = (ID)strtol(pid_str, (char **)NULL, 10);
+ parent = id2entry(be, pid ,NULL, &retval);
+ if (parent && cache_lock_entry(&inst->inst_cache, parent)) {
+ /* Failed to obtain parent entry's entry lock */
+ CACHE_RETURN(&(inst->inst_cache), &parent);
+ goto error_return;
id2entry here should have used the &txn.
Comment from firstyear (@Firstyear) at 2015-11-04 04:27:13
While py.test may order the test cases, it's not a guaranteed ordering behaviour. What you have in test_ticket47976_1 and test_ticket47976_2 might be better suited in test_ticket47976_init, and then that itself should become either a decorator, fixture or called at the start of test_ticket47976_3. You should be able to just decorate like this:
@pytest.fixture(scope="module")
def ds_topology(request):
...
return TopologyStandalone(standalone)
@pytest.fixture()
def topology(ds_topology):
...
# Do your setup here from _init, _1, _2
...
return ds_topology #This is the ds instance from the former fixture.
# Renamed from _3
test_ticket47976(topology):
...
I hope that helps!
Comment from tbordaz (@tbordaz) at 2015-11-04 20:03:55
attachment 0001-Ticket-47976-2-deadlock-in-mep-delete-post-op.patch
Comment from tbordaz (@tbordaz) at 2015-11-04 21:27:20
Thanks Ludwig and Rich for the reviews !!
git merge ticket47976
Updating 0d1ce9d..83c98d9
Fast-forward
ldap/servers/slapd/back-ldbm/ldbm_delete.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
commit 83c98d9d325ec38396e1fb5fa12a9694b05dd77f
Author: Thierry Bordaz <tbordaz@redhat.com>
Date: Tue Nov 3 15:59:54 2015 +0100
git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 983 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
0d1ce9d..83c98d9 master -> master
Comment from tbordaz (@tbordaz) at 2015-11-04 21:32:51
Replying to [comment:12 Firstyear]:
While py.test may order the test cases, it's not a guaranteed ordering behaviour. What you have in test_ticket47976_1 and test_ticket47976_2 might be better suited in test_ticket47976_init, and then that itself should become either a decorator, fixture or called at the start of test_ticket47976_3. You should be able to just decorate like this:
@pytest.fixture(scope="module") def ds_topology(request): ... return TopologyStandalone(standalone) @pytest.fixture() def topology(ds_topology): ... # Do your setup here from _init, _1, _2 ... return ds_topology #This is the ds instance from the former fixture. # Renamed from _3 test_ticket47976(topology): ...
I hope that helps!
Hi,
Thanks for the review. In fact I was not really confident with the order of execution of the test cases. This order is important and your suggested fix would work. Now I think that if we rename 'topology' fixture into ds_topology we would need to change the create-test.py script.
I will do your suggested modifications and retest.
Comment from tbordaz (@tbordaz) at 2015-11-05 18:27:21
'''1.3.4''' git push origin 389-ds-base-1.3.4 Counting objects: 13, done. Delta compression using up to 4 threads. Compressing objects: 100% (7/7), done. Writing objects: 100% (7/7), 983 bytes, done. Total 7 (delta 5), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git a8d30b3..55434d3 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
'''1.3.3''' git push origin 389-ds-base-1.3.3 Counting objects: 13, done. Delta compression using up to 4 threads. Compressing objects: 100% (7/7), done. Writing objects: 100% (7/7), 985 bytes, done. Total 7 (delta 5), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git 6844592..a4db5ec 389-ds-base-1.3.3 -> 389-ds-base-1.3.3
'''1.2.11''' git push origin 389-ds-base-1.2.11 Counting objects: 13, done. Delta compression using up to 4 threads. Compressing objects: 100% (7/7), done. Writing objects: 100% (7/7), 974 bytes, done. Total 7 (delta 5), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git bdaed4e..24a38f2 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
Comment from nhosoi (@nhosoi) at 2015-11-06 03:17:27
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1278585
Comment from nhosoi (@nhosoi) at 2016-05-13 01:18:21
Looks the fix is already pushed. Closing this ticket.
Comment from spichugi (@droideck) at 2016-08-02 21:34:25
Comment from spichugi (@droideck) at 2016-08-03 00:07:09
Thank you, Mark, for the review via email!
To ssh://git.fedorahosted.org/git/389/ds.git
Pushed to master: 7110db9..a12dc36 master -> master commit a12dc3690f49704d86f63a96bff81abe62267397 Author: Simon Pichugin droideck@redhat.com Date: Tue Aug 2 17:29:00 2016 +0200
Comment from tbordaz (@tbordaz) at 2017-02-11 23:09:56
Metadata Update from @tbordaz:
Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/47976
A deadlock occurs in a single thread processing a DEL operation. The deadlock is dynamic and there is no clear testcase.
The deadlock occurs in the database lock while the DEL operation under transaction holds a db_lock in write, then issue an internal DEL without the parent txn. The internal DEL needs the db_lock in read, but as it is not using the same transaction the lock is not granted
The delete is done under backend-transcation and owns in write a id2entry page (2):
Then the internal delete is hanging waiting for this same page lock to do a read.
The test case I did is the following but I was only able to reproduce once: 1 - install ipa server + replica 2 - stop the replica (I reproduced without any replication session) 3 - run the script lot_mods.sh note: I was testing on 1.3.1 where MOD/REPL kept the delete values in the entry. So the entry was growing. This explain why the test create/delete the test entry. The growth of the entry can help to reproduce. In 1.3.3 and after MOD/REPL do not keep deleted values. The deadlock occured during the 'ipa user-del'