Comment from pspacek at 2012-10-15 21:59:10
attachment 389stacktrace.txt
Comment from pspacek at 2012-10-15 21:59:21
attachment packages
Comment from pspacek at 2012-10-15 21:59:28
attachment processes
Comment from pspacek at 2012-10-15 22:08:14
kill 6285
led to the message "slapd shutting down - waiting for 2 threads to terminate" in the errors log, and slapd started to spin on TWO cores. Before SIGTERM, 389 was spinning on a single core.
I captured a coredump after SIGTERM. It is too big for Trac, so here is a link: http://people.redhat.com/~pspacek/a/2012-10-15/core.6285.bz2
GDB produced an error message during core dump creation; see the gcore.report file.
Comment from pspacek at 2012-10-15 22:08:30
attachment gcore.report
Comment from pspacek at 2012-10-15 22:10:10
attachment logs.tar.bz2
Comment from pspacek at 2012-10-15 22:12:48
Slapd was killed after the message [15/Oct/2012:17:51:00 +0200]
Unfortunately, DS was restarted by RPM scripts within a few seconds, so the database dumps are potentially not exactly the same.
Comment from pspacek at 2012-10-15 22:54:52
I tried to export the whole DB and again ended up in an infinite loop:
# /var/lib/dirsrv/scripts-E-ORG/db2ldif -n userRoot
Exported ldif file: /var/lib/dirsrv/slapd-E-ORG/ldif/E-ORG-userRoot-2012_10_15_184047.ldif
[15/Oct/2012:18:40:47 +0200] - /etc/dirsrv/slapd-E-ORG/dse.ldif: nsslapd-maxdescriptors: nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit). Server will use a setting of 4096.
[15/Oct/2012:18:40:48 +0200] - Config Warning: - nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit). Server will use a setting of 4096.
ldiffile: /var/lib/dirsrv/slapd-E-ORG/ldif/E-ORG-userRoot-2012_10_15_184047.ldif
^C
[root@unused-4-107 ~]# /var/lib/dirsrv/scripts-E-ORG/db2ldif -n userRoot
Exported ldif file: /var/lib/dirsrv/slapd-E-ORG/ldif/E-ORG-userRoot-2012_10_15_184158.ldif
[15/Oct/2012:18:41:58 +0200] - /etc/dirsrv/slapd-E-ORG/dse.ldif: nsslapd-maxdescriptors: nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit). Server will use a setting of 4096.
[15/Oct/2012:18:41:58 +0200] - Config Warning: - nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit). Server will use a setting of 4096.
ldiffile: /var/lib/dirsrv/slapd-E-ORG/ldif/E-ORG-userRoot-2012_10_15_184158.ldif
At this moment ns-slapd uses 100 % CPU (again) and the LDIF file doesn't change its size.
A core captured from ns-slapd during the export can be downloaded from http://people.redhat.com/~pspacek/a/2012-10-15/core.12285.bz2 . Stacks and the produced LDIF are attached.
The file http://people.redhat.com/~pspacek/a/2012-10-15/dirsrv_data.tar.bz2 contains /etc/dirsrv and /var/lib/dirsrv.
Comment from pspacek at 2012-10-15 22:55:31
attachment 389stacktrace-core.12285.txt
Comment from pspacek at 2012-10-15 22:55:40
attachment core.12285.report
Comment from pspacek at 2012-10-15 22:56:32
attachment E-ORG-userRoot-2012_10_15_184158.ldif
Comment from nhosoi (@nhosoi) at 2012-10-16 01:34:47
Looking at the attached stacktrace "389stacktrace.txt (85.6 KB)": the main thread is waiting for the c_mutex lock in handle_pr_read_ready (Thread 12), and most likely the lock is held by connection_enter_leave_turbo (Thread 1). I'm wondering whether the stacktrace is just a snapshot or whether it is showing a deadlock. To answer that, it would be nice to take the stacktrace a couple of times and compare them, or to attach gdb to the hanging process.
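For example, a few snapshots can be grabbed non-interactively like this (an illustrative sketch only; the PID 6285 is taken from the report above and has to match the running ns-slapd):
gdb -p 6285 -batch -ex 'set pagination off' -ex 'thread apply all bt' > stacks.1.txt
sleep 10
gdb -p 6285 -batch -ex 'set pagination off' -ex 'thread apply all bt' > stacks.2.txt
If the same threads sit in the same frames in every snapshot, it points to a deadlock; if the frames keep changing, it is a busy loop.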
Do we still have the process? Is it difficult to reproduce the problem?
Comment from pspacek at 2012-10-16 19:57:37
Well, the CPU was at 100 % load all the time. I killed the process after capturing the stacktrace and coredump (see links in comment:1).
I'm able to reproduce the issue with the current DB as described in comment:3. Can you download http://people.redhat.com/~pspacek/a/2012-10-15/dirsrv_data.tar.bz2 and do the investigation?
Comment from nhosoi (@nhosoi) at 2012-10-18 03:18:34
Replying to [comment:5 pspacek]:
Well, the CPU was at 100 % load all the time. I killed the process after capturing the stacktrace and coredump (see links in comment:1).
I'm able to reproduce the issue with the current DB as described in comment:3. Can you download http://people.redhat.com/~pspacek/a/2012-10-15/dirsrv_data.tar.bz2 and do the investigation?
Thanks, Petr. It looks like you gave me a tarball containing your environment. Could you tell me how to "reproduce" the problem? I have my f17 VM; I can override my IPA/DS env with yours. To cause the infinite loop, what operation should I send to the server?
Or could it be possible for you to make the infinite loop happen again and then generate multiple core files with gcore in gdb (probably 5 or so)? That way, it'll be clearer which loop cannot finish...
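For instance, a loop along these lines would produce a series of cores a few seconds apart (an illustrative sketch; the output path, interval, and PID lookup are placeholders):
for i in 1 2 3 4 5; do
    gcore -o /var/tmp/ns-slapd.core.$i $(pidof ns-slapd)
    sleep 10
done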
Thank you so much!
Comment from pspacek at 2012-10-18 13:36:24
Hello Noriko,
comment:3 contains a simple reproducer: run /var/lib/dirsrv/scripts-E-ORG/db2ldif -n userRoot (with my database from the tarball).
I was forced to destroy the old DB and start with a new one (because this instance was installed directly on my laptop).
I can't simply reproduce it again on my HW. However, I took an LVM snapshot before destroying the DB, so I can go back in time if it's really necessary.
Did you look into both attached stack traces? I captured two stack snapshots (but from two different runs).
Comment from nhosoi (@nhosoi) at 2012-10-19 02:01:52
Thank you, Petr! I could see the infinite loop. The problem looks like it is caused by USN tombstone generation. I see an odd entry:
id 35547
rdn: nsuniqueid=b99eedd6-070411e2-ba7983cd-ea05094e,nsuniqueid=b99eedd6-070411e2-ba7983cd-ea05094e
nsUniqueId: b99eedd6-070411e2-ba7983cd-ea05094e
tXTRecord: "137"
dNSTTL: 666
objectClass: nsTombstone
[...]
idnsName: c178reproduce
parentid: 35028
entryid: 35547
nsParentUniqueId: 95dba6e0-070411e2-ba7983cd-ea05094e
nscpEntryDN: nsuniqueid=b99eedd6-070411e2-ba7983cd-ea05094e,idnsname=c178,idnsname=e.org,cn=dns,dc=e,dc=org
entryusn: 208228
This rdn should not be a repeated nsuniqueid... If you recognize this entry (e.g., tXTRecord: "137"), could you tell what operation could have generated it? I'm hoping that the tXTRecord value is associated with some test case...
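For reference, such tombstones can be searched for explicitly, since they are only returned when (objectClass=nsTombstone) is part of the filter (an illustrative command; the bind DN and search base are assumptions derived from the DNs above):
ldapsearch -x -D "cn=Directory Manager" -W -b "cn=dns,dc=e,dc=org" "(&(objectClass=nsTombstone)(idnsName=c178reproduce))" nscpEntryDN entryusn nsUniqueId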
Also, could there be more access logs which include the operations you ran? The particular entry idnsName=c178 is repeatedly deleted, but I cannot find a DEL in the access log you attached.
If it's easy to reproduce the problem, could you run the test with the audit log enabled in dse.ldif (nsslapd-auditlog-logging-enabled: on)? This way, all the update operations are logged in the audit log.
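For example, the setting can also be switched on over LDAP instead of editing dse.ldif by hand (an illustrative command; the bind credentials are assumptions):
ldapmodify -x -D "cn=Directory Manager" -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-auditlog-logging-enabled
nsslapd-auditlog-logging-enabled: on
EOF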
Comment from pspacek at 2012-10-19 15:02:35
Objects such as idnsname=c178 were generated by loop.sh + nsupdate_add_del_gen.py + e.org.ldif (attached).
These scripts are aimed at stress-testing the DNS dynamic update system, so objects and attributes are changed quickly in several parallel threads. As you can see, attributes with random values are added and deleted, so I can't point to an exact case and time.
Is there some modifyTimestamp? If there is, you can search my logs: http://people.redhat.com/~pspacek/a/2012-10-15/log-slapd-E-ORG.tar.bz2
I uploaded all the logs I have. Records from the last two days are from the new DS installation (after wiping the old DB). The stress tests were conducted 3-4 weeks ago, I think, but I'm not 100 % sure.
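For context, the traffic generated by these scripts looks roughly like the following (an approximation only, not the actual script output; the record name, TTL, and TXT value are copied from the odd entry quoted above):
nsupdate <<EOF
server 127.0.0.1
zone e.org.
update add c178reproduce.e.org. 666 TXT "137"
send
update delete c178reproduce.e.org. TXT "137"
send
EOF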
Comment from pspacek at 2012-10-19 15:03:41
attachment e.org.ldif
Comment from pspacek at 2012-10-19 15:03:48
attachment loop.sh
Comment from pspacek at 2012-10-19 15:03:54
attachment nsupdate_add_del_gen.py
Comment from nhosoi (@nhosoi) at 2012-10-23 03:48:13
Hi Petr,
Thanks for the info. I think I figured out the cause of the problem. There could be a small window in which the code fails to check whether the entry is already tombstoned. If it is, the operation to delete the entry should back off, but the current code goes ahead and tombstones the tombstone entry again, which breaks the dn... I've built a scratch build with the fix: http://koji.fedoraproject.org/koji/taskinfo?taskID=4616851
Could you please run your stress test again using this scratch build? Thank you so much, in advance...
(Just in case you wonder why I replaced the previous link with the one above: taskID=4616250 (http://koji.fedoraproject.org/koji/taskinfo?taskID=4616250) is based on 389-ds-base-1.3.0.a1, while I think the version you are testing is 389-ds-base-1.2.11.16. Thus I rebuilt taskID=4616851 from the 1.2.11 branch.)
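A rough sketch of the verification steps, assuming the scratch-build RPMs from the Koji task above have been downloaded (the package file names are illustrative and the installed set may differ):
rpm -Uvh 389-ds-base-1.2.11.17-0.1.scratch0.fc17.x86_64.rpm 389-ds-base-libs-1.2.11.17-0.1.scratch0.fc17.x86_64.rpm
# restart the dirsrv instance, then re-run the reproducer from comment:3
/var/lib/dirsrv/scripts-E-ORG/db2ldif -n userRoot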
Comment from nhosoi (@nhosoi) at 2012-10-25 00:24:21
git patch file (master) 0001-Trac-Ticket-494-slapd-entered-to-infinite-loop-durin.patch
Comment from nhosoi (@nhosoi) at 2012-10-25 00:24:53
Bug Description: An entry whose rdn is nsuniqueid=...,nsuniqueid=... was accidentally generated. It broke the entryrdn tree and put the reindex and the export into an infinite loop.
Fix Description: Added a missing check of whether the retrieved entry is already a tombstone. This tombstone is the one converted by the entryusn plugin. Please note that replication has urp code which handles such conflicts.
Comment from rmeggins (@richm) at 2012-10-25 00:48:08
patch looks good
Comment from nhosoi (@nhosoi) at 2012-10-25 22:40:45
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=870158
Comment from pspacek at 2012-10-26 16:42:55
Hello,
I did some tests with 389-ds-base-1.2.11.17-0.1.scratch0.fc17.x86_64: export to LDIF worked well. Sorry for the late message; I forgot to reply after running the test.
P.S. It is a race condition, you know, so my test results can't be taken as definitive ...
Comment from nhosoi (@nhosoi) at 2012-10-26 21:55:20
Replying to [comment:16 pspacek]:
Hello,
I did some tests with 389-ds-base-1.2.11.17-0.1.scratch0.fc17.x86_64: export to LDIF worked well. Sorry for the late message; I forgot to reply after running the test.
P.S. It is a race condition, you know, so my test results can't be taken as definitive ...
Thank you so much for testing my scratch build, Petr! Right, it's hard to "prove" that the patch fixes all the bad results from the race condition, but it is definitely a great indication that it passed your tests! Let us check in the patch and close the ticket for now. If you run into the issue again, please feel free to reopen this ticket.
Once again, thanks a lot!
Comment from nhosoi (@nhosoi) at 2012-10-26 23:06:56
Reviewed by Rich and tested by Petr. Thank you, both of you!!
Pushed to master.
$ git merge trac494
Updating 927b0ef..b963576
Fast-forward
ldap/servers/slapd/back-ldbm/ldbm_delete.c | 7 +++++++
1 file changed, 7 insertions(+)
$ git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 999 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
927b0ef..b963576 master -> master
Cherry-picked and pushed to origin/389-ds-base-1.2.11.
$ git push origin 389-ds-base-1.2.11-ext:389-ds-base-1.2.11
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.03 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
56ebbb2..31d1578 389-ds-base-1.2.11-ext -> 389-ds-base-1.2.11
Comment from pspacek at 2017-02-11 22:48:31
Metadata Update from @pspacek:
Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/494
During the IPA 2.2.0->3.0.0 upgrade, various indexes are added. During index addition, 389 locks up in an infinite loop. Stacktraces etc. are attached.