389ds / 389-ds-base

The enterprise-class Open Source LDAP server for Linux
https://www.port389.org/
Other
211 stars 92 forks source link

deadlock in mep delete post op #1307

Closed 389-ds-bot closed 4 years ago

389-ds-bot commented 4 years ago

Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/47976


A deadlock occurs in a single thread processing a DEL operation. The deadlock is dynamic and there is no clear testcase.

The deadlock occurs in the database lock while the DEL operation under transaction holds a db_lock in write, then issue an internal DEL without the parent txn. The internal DEL needs the db_lock in read, but as it is not using the same transaction the lock is not granted

Thread 6 (Thread 0x7f52c1feb700 (LWP 17418)):
0  0x00007f52f2c0c705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
1  0x00007f52ed55d2f3 in __db_hybrid_mutex_suspend () from /lib64/libdb-5.3.so
2  0x00007f52ed55c640 in __db_tas_mutex_lock () from /lib64/libdb-5.3.so
3  0x00007f52ed606cea in __lock_get_internal () from /lib64/libdb-5.3.so
4  0x00007f52ed6077d0 in __lock_get () from /lib64/libdb-5.3.so
5  0x00007f52ed633112 in __db_lget () from /lib64/libdb-5.3.so
6  0x00007f52ed57a5f5 in __bam_search () from /lib64/libdb-5.3.so
7  0x00007f52ed565256 in __bamc_search () from /lib64/libdb-5.3.so
8  0x00007f52ed566d0f in __bamc_get () from /lib64/libdb-5.3.so
9  0x00007f52ed61fc56 in __dbc_iget () from /lib64/libdb-5.3.so
10 0x00007f52ed62c843 in __db_get () from /lib64/libdb-5.3.so
11 0x00007f52ed630123 in __db_get_pp () from /lib64/libdb-5.3.so
12 0x00007f52e97535fb in id2entry (be=0x1cc48b0, id=id@entry=4, txn=txn@entry=0x0, err=err@entry=0x7f52c1fe6638) at ldap/servers/slapd/back-ldbm/id2entry.c:323
13 0x00007f52e9777c20 in ldbm_back_delete (pb=0x7f5284061cf0) at ldap/servers/slapd/back-ldbm/ldbm_delete.c:463
14 0x00007f52f4c1d380 in op_shared_delete (pb=pb@entry=0x7f5284061cf0) at ldap/servers/slapd/delete.c:364
15 0x00007f52f4c1d532 in delete_internal_pb (pb=pb@entry=0x7f5284061cf0) at ldap/servers/slapd/delete.c:242
16 0x00007f52f4c1d7e3 in slapi_delete_internal_pb (pb=pb@entry=0x7f5284061cf0) at ldap/servers/slapd/delete.c:185
17 0x00007f52e907c3cd in mep_del_post_op (pb=<optimized out>) at ldap/servers/plugins/mep/mep.c:2556
18 0x00007f52f4c638c5 in plugin_call_func (list=0x1bcae30, operation=operation@entry=563, pb=pb@entry=0x3184f90, call_one=call_one@entry=0) at ldap/servers/slapd/plugin.c:1453
19 0x00007f52f4c63a78 in plugin_call_list (pb=0x3184f90, operation=563, list=<optimized out>) at ldap/servers/slapd/plugin.c:1415
20 plugin_call_plugins (pb=pb@entry=0x3184f90, whichfunction=whichfunction@entry=563) at ldap/servers/slapd/plugin.c:398
21 0x00007f52e9776f07 in ldbm_back_delete (pb=0x3184f90) at ldap/servers/slapd/back-ldbm/ldbm_delete.c:1091
22 0x00007f52f4c1d380 in op_shared_delete (pb=pb@entry=0x3184f90) at ldap/servers/slapd/delete.c:364
23 0x00007f52f4c1d643 in do_delete (pb=0x3184f90) at ldap/servers/slapd/delete.c:128
24 0x0000000000417688 in connection_dispatch_operation (pb=<optimized out>, op=0x2dbc8e0, conn=0x7f52e02b5190) at ldap/servers/slapd/connection.c:587
25 connection_threadmain () at ldap/servers/slapd/connection.c:2372
26 0x00007f52f32679eb in _pt_root () from /lib64/libnspr4.so
27 0x00007f52f2c08df3 in start_thread () from /lib64/libpthread.so.0
28 0x00007f52f293601d in clone () from /lib64/libc.so.6

The delete is done under backend-transcation and owns in write a id2entry page (2):

    800006b9 dd= 8 locks held 185  write locks 85   pid/thread 17379/139993418741504 flags 0    priority 100
    800006b9 READ         30 HELD    ipaca/ancestorid.db       page          1
    ...
    800006b9 WRITE         1 HELD    userRoot/id2entry.db      page          2
    ...

Then the internal delete is hanging waiting for this same page lock to do a read.

       2 dd=103 locks held 0    write locks 0    pid/thread 17379/139993418741504 flags 0    priority 100
       2 READ          1 WAIT    userRoot/id2entry.db      page          2

The test case I did is the following but I was only able to reproduce once: 1 - install ipa server + replica 2 - stop the replica (I reproduced without any replication session) 3 - run the script lot_mods.sh note: I was testing on 1.3.1 where MOD/REPL kept the delete values in the entry. So the entry was growing. This explain why the test create/delete the test entry. The growth of the entry can help to reproduce. In 1.3.3 and after MOD/REPL do not keep deleted values. The deadlock occured during the 'ipa user-del'

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2014-12-13 03:33:27

attachment stack.17379.2

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2014-12-13 03:33:43

attachment db_stat.out

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2014-12-13 03:34:28

attachment lot_mod.sh

389-ds-bot commented 4 years ago

Comment from nhosoi (@nhosoi) at 2015-03-03 03:20:05

Could this be still an issue on the latest versions? (When the deadlock was observed, what version of 389-ds-base was running?)

If the deadlock still occurs, I'm interested in reproducing this bug with the 389-ds-base only env...

What's the minimum condition to reproduce it? Replication, relint, memberOf, linkedAttr, USN, DNA, and mep?
If you could attach the configurations for the plug-ins, I'd appreciate it.

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2015-03-05 15:31:24

Hello Noriko,

The VM where the problem occurred was no longer available. I prepared the test case on a new VM and attached the plugin config. The DS version is 1.3.2 (while I first hit the problem on 1.3.1). IMHO this is not that important as the DS plugin config depends on IPA setup.

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2015-03-05 15:31:53

attachment 47976_enabled_plugins.gz

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2015-09-24 15:06:54

Also reported in freeipa-users https://www.redhat.com/archives/freeipa-users/2015-September/msg00331.html

80003201 dd= 6 locks held 234  write locks 110  pid/thread 6134/140366703388416 (7FA9AB7E6700)flags 0    priority 100
..
80003201 WRITE         1 HELD    userRoot/id2entry.db      page          2
...
80003201 WRITE         4 HELD    userRoot/objectclass.db   page          2
80003201 READ          2 HELD    userRoot/objectclass.db   page          2

Thread 9 (Thread 0x7fa9ab7e6700 (LWP 6173)):
0  0x00007fa9dda41705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
1  0x00007fa9d860e2f3 in __db_hybrid_mutex_suspend () from /lib64/libdb-5.3.so
2  0x00007fa9d860d640 in __db_tas_mutex_lock () from /lib64/libdb-5.3.so
3  0x00007fa9d86b7cea in __lock_get_internal () from /lib64/libdb-5.3.so
4  0x00007fa9d86b87d0 in __lock_get () from /lib64/libdb-5.3.so
5  0x00007fa9d86e4112 in __db_lget () from /lib64/libdb-5.3.so
6  0x00007fa9d862b5f5 in __bam_search () from /lib64/libdb-5.3.so
7  0x00007fa9d8616256 in __bamc_search () from /lib64/libdb-5.3.so
8  0x00007fa9d8617d0f in __bamc_get () from /lib64/libdb-5.3.so
9  0x00007fa9d86d0c56 in __dbc_iget () from /lib64/libdb-5.3.so
10 0x00007fa9d86dd843 in __db_get () from /lib64/libdb-5.3.so
11 0x00007fa9d86e1123 in __db_get_pp () from /lib64/libdb-5.3.so
12 0x00007fa9d248949b in id2entry () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
13 0x00007fa9d24af7dd in ldbm_back_delete () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
14 0x00007fa9dfc60190 in op_shared_delete () from /usr/lib64/dirsrv/libslapd.so.0
15 0x00007fa9dfc60342 in delete_internal_pb () from /usr/lib64/dirsrv/libslapd.so.0
16 0x00007fa9d1da4739 in mep_del_post_op () from /usr/lib64/dirsrv/plugins/libmanagedentries-plugin.so
17 0x00007fa9dfcac280 in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0
18 0x00007fa9dfcac4d8 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0
19 0x00007fa9d24ae42e in ldbm_back_delete () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
20 0x00007fa9dfc60190 in op_shared_delete () from /usr/lib64/dirsrv/libslapd.so.0
21 0x00007fa9dfc60453 in do_delete () from /usr/lib64/dirsrv/libslapd.so.0
22 0x00007fa9e017a37e in connection_threadmain ()
23 0x00007fa9de09c7bb in _pt_root () from /lib64/libnspr4.so
24 0x00007fa9dda3ddf5 in start_thread () from /lib64/libpthread.so.0
25 0x00007fa9dd76b1ad in clone () from /lib64/libc.so.6

      36 dd=127 locks held 0    write locks 0    pid/thread 6134/140366703388416 (7FA9AB7E6700) flags 0    priority 100
      36 READ          1 WAIT    userRoot/id2entry.db      page          2
389-ds-bot commented 4 years ago

Comment from nhosoi (@nhosoi) at 2015-10-20 22:57:08

Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1273555

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2015-11-03 22:19:29

attachment 0001-Ticket-47976-deadlock-in-mep-delete-post-op.patch

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2015-11-03 22:19:52

attachment 0002-Ticket-47976-test-case.patch

389-ds-bot commented 4 years ago

Comment from rmeggins (@richm) at 2015-11-04 03:55:02

I agree with Ludwig on the email list that you should not have to do anything to mep - the txn should already be set correctly in ldbm_delete.c

It looks like 2f4f74b3d38eeb38f0794dfb2e880a95e7e9bd49 introduced a problem - every database argument must pass in the transaction handle, but this commit removed it (and mea culpa for not noticing this when I reviewed the patch :P )

diff --git a/ldap/servers/slapd/back-ldbm/ldbm_delete.c b/ldap/servers/slapd/back-ldbm/ldbm_delete.c
index 61af3a6..3556658 100644
--- a/ldap/servers/slapd/back-ldbm/ldbm_delete.c
+++ b/ldap/servers/slapd/back-ldbm/ldbm_delete.c
...
@@ -449,12 +446,31 @@ ldbm_back_delete( Slapi_PBlock *pb )
...
+                                        * Although a rare case, multiple parents from repl conflict could exist.
+                                        * In such case, if a parent entry is found just by parentsdn
+                                        * (find_entry2modify_only_ext), a wrong parent could be found,
+                                        * and numsubordinate count could get confused.
+                                        */
+                                       ID pid = (ID)strtol(pid_str, (char **)NULL, 10);
+                                       parent = id2entry(be, pid ,NULL, &retval);
+                                       if (parent && cache_lock_entry(&inst->inst_cache, parent)) {
+                                               /* Failed to obtain parent entry's entry lock */
+                                               CACHE_RETURN(&(inst->inst_cache), &parent);
+                                               goto error_return;

id2entry here should have used the &txn.

389-ds-bot commented 4 years ago

Comment from firstyear (@Firstyear) at 2015-11-04 04:27:13

While py.test may order the test cases, it's not a guaranteed ordering behaviour. What you have in test_ticket47976_1 and test_ticket47976_2 might be better suited in test_ticket47976_init, and then that itself should become either a decorator, fixture or called at the start of test_ticket47976_3. You should be able to just decorate like this:

@pytest.fixture(scope="module") 
def ds_topology(request):
    ...
    return TopologyStandalone(standalone)

@pytest.fixture() 
def topology(ds_topology):
    ...
    # Do your setup here from _init, _1, _2
    ...
    return ds_topology #This is the ds instance from the former fixture.

# Renamed from _3
test_ticket47976(topology):
    ...

I hope that helps!

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2015-11-04 20:03:55

attachment 0001-Ticket-47976-2-deadlock-in-mep-delete-post-op.patch

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2015-11-04 21:27:20

Thanks Ludwig and Rich for the reviews !!

git merge ticket47976
Updating 0d1ce9d..83c98d9
Fast-forward
 ldap/servers/slapd/back-ldbm/ldbm_delete.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

commit 83c98d9d325ec38396e1fb5fa12a9694b05dd77f
Author: Thierry Bordaz <tbordaz@redhat.com>
Date:   Tue Nov 3 15:59:54 2015 +0100

git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 983 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   0d1ce9d..83c98d9  master -> master
389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2015-11-04 21:32:51

Replying to [comment:12 Firstyear]:

While py.test may order the test cases, it's not a guaranteed ordering behaviour. What you have in test_ticket47976_1 and test_ticket47976_2 might be better suited in test_ticket47976_init, and then that itself should become either a decorator, fixture or called at the start of test_ticket47976_3. You should be able to just decorate like this:

@pytest.fixture(scope="module") 
def ds_topology(request):
    ...
    return TopologyStandalone(standalone)

@pytest.fixture() 
def topology(ds_topology):
    ...
    # Do your setup here from _init, _1, _2
    ...
    return ds_topology #This is the ds instance from the former fixture.

# Renamed from _3
test_ticket47976(topology):
    ...

I hope that helps!

Hi,

Thanks for the review. In fact I was not really confident with the order of execution of the test cases. This order is important and your suggested fix would work. Now I think that if we rename 'topology' fixture into ds_topology we would need to change the create-test.py script.

I will do your suggested modifications and retest.

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2015-11-05 18:27:21

'''1.3.4''' git push origin 389-ds-base-1.3.4 Counting objects: 13, done. Delta compression using up to 4 threads. Compressing objects: 100% (7/7), done. Writing objects: 100% (7/7), 983 bytes, done. Total 7 (delta 5), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git a8d30b3..55434d3 389-ds-base-1.3.4 -> 389-ds-base-1.3.4

'''1.3.3''' git push origin 389-ds-base-1.3.3 Counting objects: 13, done. Delta compression using up to 4 threads. Compressing objects: 100% (7/7), done. Writing objects: 100% (7/7), 985 bytes, done. Total 7 (delta 5), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git 6844592..a4db5ec 389-ds-base-1.3.3 -> 389-ds-base-1.3.3

'''1.2.11''' git push origin 389-ds-base-1.2.11 Counting objects: 13, done. Delta compression using up to 4 threads. Compressing objects: 100% (7/7), done. Writing objects: 100% (7/7), 974 bytes, done. Total 7 (delta 5), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git bdaed4e..24a38f2 389-ds-base-1.2.11 -> 389-ds-base-1.2.11

389-ds-bot commented 4 years ago

Comment from nhosoi (@nhosoi) at 2015-11-06 03:17:27

Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1278585

389-ds-bot commented 4 years ago

Comment from nhosoi (@nhosoi) at 2016-05-13 01:18:21

Looks the fix is already pushed. Closing this ticket.

389-ds-bot commented 4 years ago

Comment from spichugi (@droideck) at 2016-08-02 21:34:25

attachment 0001-Ticket-47976-Add-fixed-CI-test-case.patch

389-ds-bot commented 4 years ago

Comment from spichugi (@droideck) at 2016-08-03 00:07:09

Thank you, Mark, for the review via email!

To ssh://git.fedorahosted.org/git/389/ds.git

Pushed to master: 7110db9..a12dc36 master -> master commit a12dc3690f49704d86f63a96bff81abe62267397 Author: Simon Pichugin droideck@redhat.com Date: Tue Aug 2 17:29:00 2016 +0200

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2017-02-11 23:09:56

Metadata Update from @tbordaz: