Comment from pspacek at 2012-10-15 21:59:10
attachment 389stacktrace.txt
Comment from pspacek at 2012-10-15 21:59:21
attachment packages
Comment from pspacek at 2012-10-15 21:59:28
attachment processes
Comment from pspacek at 2012-10-15 22:08:14
kill 6285
led to the message "slapd shutting down - waiting for 2 threads to terminate" in the errors log, and slapd started to spin on TWO cores. Before SIGTERM, 389 was spinning on a single core.
I captured a coredump after SIGTERM. It is too big for Trac, so here is a link: http://people.redhat.com/~pspacek/a/2012-10-15/core.6285.bz2
GDB produced an error message during core dump creation; see the gcore.report file.
Comment from pspacek at 2012-10-15 22:08:30
attachment gcore.report
Comment from pspacek at 2012-10-15 22:10:10
attachment logs.tar.bz2
Comment from pspacek at 2012-10-15 22:12:48
Slapd was killed after the message [15/Oct/2012:17:51:00 +0200]
Unfortunately, DS was restarted by RPM scripts within a few seconds, so the database dumps are potentially not exactly the same.
Comment from pspacek at 2012-10-15 22:54:52
I tried to export the whole DB and again ended up in an infinite loop:
# /var/lib/dirsrv/scripts-E-ORG/db2ldif -n userRoot
Exported ldif file: /var/lib/dirsrv/slapd-E-ORG/ldif/E-ORG-userRoot-2012_10_15_184047.ldif
[15/Oct/2012:18:40:47 +0200] - /etc/dirsrv/slapd-E-ORG/dse.ldif: nsslapd-maxdescriptors: nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit). Server will use a setting of 4096.
[15/Oct/2012:18:40:48 +0200] - Config Warning: - nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit). Server will use a setting of 4096.
ldiffile: /var/lib/dirsrv/slapd-E-ORG/ldif/E-ORG-userRoot-2012_10_15_184047.ldif
^C
[root@unused-4-107 ~]# /var/lib/dirsrv/scripts-E-ORG/db2ldif -n userRoot
Exported ldif file: /var/lib/dirsrv/slapd-E-ORG/ldif/E-ORG-userRoot-2012_10_15_184158.ldif
[15/Oct/2012:18:41:58 +0200] - /etc/dirsrv/slapd-E-ORG/dse.ldif: nsslapd-maxdescriptors: nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit). Server will use a setting of 4096.
[15/Oct/2012:18:41:58 +0200] - Config Warning: - nsslapd-maxdescriptors: invalid value "8192", maximum file descriptors must range from 1 to 4096 (the current process limit). Server will use a setting of 4096.
ldiffile: /var/lib/dirsrv/slapd-E-ORG/ldif/E-ORG-userRoot-2012_10_15_184158.ldif
At this moment ns-slapd uses 100 % CPU (again) and the LDIF file doesn't change its size.
A core captured from ns-slapd during the export can be downloaded from http://people.redhat.com/~pspacek/a/2012-10-15/core.12285.bz2 . Stacks and the produced LDIF are attached.
The file http://people.redhat.com/~pspacek/a/2012-10-15/dirsrv_data.tar.bz2 contains /etc/dirsrv and /var/lib/dirsrv.
Comment from pspacek at 2012-10-15 22:55:31
attachment 389stacktrace-core.12285.txt
Comment from pspacek at 2012-10-15 22:55:40
attachment core.12285.report
Comment from pspacek at 2012-10-15 22:56:32
attachment E-ORG-userRoot-2012_10_15_184158.ldif
Comment from nhosoi (@nhosoi) at 2012-10-16 01:34:47
Looking at the attached stacktrace "389stacktrace.txt (85.6 KB)": the main thread is waiting for the c_mutex lock in handle_pr_read_ready (Thread 12), and most likely the lock is held by connection_enter_leave_turbo (Thread 1). I'm wondering whether the stacktrace is just a snapshot or whether it is showing a deadlock. To answer that, it would be nice to take the stacktrace a couple of times and compare them, or to attach gdb to the hanging process.
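For example, a few snapshots can be grabbed non-interactively like this (an illustrative sketch only; the PID 6285 is taken from the report above and has to match the running ns-slapd):
gdb -p 6285 -batch -ex 'set pagination off' -ex 'thread apply all bt' > stacks.1.txt
sleep 10
gdb -p 6285 -batch -ex 'set pagination off' -ex 'thread apply all bt' > stacks.2.txt
If the same threads sit in the same frames in every snapshot, it points to a deadlock; if the frames keep changing, it is a busy loop.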
Do we still have the process? Is it difficult to reproduce the problem?
Comment from pspacek at 2012-10-16 19:57:37
Well, the CPU was at 100 % load all the time. I killed the process after capturing the stacktrace and coredump (see links in comment:1).
I'm able to reproduce the issue with the current DB as described in comment:3. Can you download http://people.redhat.com/~pspacek/a/2012-10-15/dirsrv_data.tar.bz2 and do the investigation?
Comment from nhosoi (@nhosoi) at 2012-10-18 03:18:34
Replying to [comment:5 pspacek]:
Well, the CPU was at 100 % load all the time. I killed the process after capturing the stacktrace and coredump (see links in comment:1).
I'm able to reproduce the issue with the current DB as described in comment:3. Can you download http://people.redhat.com/~pspacek/a/2012-10-15/dirsrv_data.tar.bz2 and do the investigation?
Thanks, Petr. It looks like you gave me a tarball containing your environment. Could you tell me how to "reproduce" the problem? I have my f17 VM; I can override my IPA/DS env with yours. To cause the infinite loop, what operation should I send to the server?
Or could it be possible for you to make the infinite loop happen again and then generate multiple core files with gcore in gdb (probably 5 or so)? That way, it'll be clearer which loop cannot finish...
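For instance, a loop along these lines would produce a series of cores a few seconds apart (an illustrative sketch; the output path, interval, and PID lookup are placeholders):
for i in 1 2 3 4 5; do
    gcore -o /var/tmp/ns-slapd.core.$i $(pidof ns-slapd)
    sleep 10
done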
Thank you so much!
Comment from pspacek at 2012-10-18 13:36:24
Hello Noriko,
comment:3 contains a simple reproducer: run /var/lib/dirsrv/scripts-E-ORG/db2ldif -n userRoot (with my database from the tarball).
I was forced to destroy the old DB and start with a new one (because this instance was installed directly on my laptop).
I can't simply reproduce it again on my HW. However, I took an LVM snapshot before destroying the DB, so I can go back in time if it's really necessary.
Did you look into both attached stack traces? I captured two stack snapshots (but from two different runs).
Comment from nhosoi (@nhosoi) at 2012-10-19 02:01:52
Thank you, Petr! I could see the infinite loop. The problem looks like it is caused by USN tombstone generation. I see an odd entry:
id 35547
rdn: nsuniqueid=b99eedd6-070411e2-ba7983cd-ea05094e,nsuniqueid=b99eedd6-070411e2-ba7983cd-ea05094e
nsUniqueId: b99eedd6-070411e2-ba7983cd-ea05094e
tXTRecord: "137"
dNSTTL: 666
objectClass: nsTombstone
[...]
idnsName: c178reproduce
parentid: 35028
entryid: 35547
nsParentUniqueId: 95dba6e0-070411e2-ba7983cd-ea05094e
nscpEntryDN: nsuniqueid=b99eedd6-070411e2-ba7983cd-ea05094e,idnsname=c178,idnsname=e.org,cn=dns,dc=e,dc=org
entryusn: 208228
This rdn should not be a repeated nsuniqueid... If you recognize this entry (e.g., tXTRecord: "137"), could you tell what operation could have generated it? I'm hoping that the tXTRecord value is associated with some test case...
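For reference, such tombstones can be searched for explicitly, since they are only returned when (objectClass=nsTombstone) is part of the filter (an illustrative command; the bind DN and search base are assumptions derived from the DNs above):
ldapsearch -x -D "cn=Directory Manager" -W -b "cn=dns,dc=e,dc=org" "(&(objectClass=nsTombstone)(idnsName=c178reproduce))" nscpEntryDN entryusn nsUniqueId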
Also, could there be more access logs which include the operations you ran? The particular entry idnsName=c178 is repeatedly deleted, but I cannot find a DEL in the access log you attached.
If it's easy to reproduce the problem, could you run the test with the audit log enabled in dse.ldif (nsslapd-auditlog-logging-enabled: on)? This way, all the update operations are logged in the audit log.
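For example, the setting can also be switched on over LDAP instead of editing dse.ldif by hand (an illustrative command; the bind credentials are assumptions):
ldapmodify -x -D "cn=Directory Manager" -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-auditlog-logging-enabled
nsslapd-auditlog-logging-enabled: on
EOF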
Comment from pspacek at 2012-10-19 15:02:35
Objects such as idnsname=c178 were generated by loop.sh + nsupdate_add_del_gen.py + e.org.ldif (attached).
These scripts are aimed at stress-testing the DNS dynamic update system, so objects and attributes are changed quickly in several parallel threads. As you can see, attributes with random values are added and deleted, so I can't point to an exact case and time.
Is there some modifyTimestamp? If there is, you can search my logs: http://people.redhat.com/~pspacek/a/2012-10-15/log-slapd-E-ORG.tar.bz2
I uploaded all the logs I have. Records from the last two days are from the new DS installation (after wiping the old DB). The stress tests were conducted 3-4 weeks ago, I think, but I'm not 100 % sure.
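For context, the traffic generated by these scripts looks roughly like the following (an approximation only, not the actual script output; the record name, TTL, and TXT value are copied from the odd entry quoted above):
nsupdate <<EOF
server 127.0.0.1
zone e.org.
update add c178reproduce.e.org. 666 TXT "137"
send
update delete c178reproduce.e.org. TXT "137"
send
EOF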
Comment from pspacek at 2012-10-19 15:03:41
attachment e.org.ldif
Comment from pspacek at 2012-10-19 15:03:48
attachment loop.sh
Comment from pspacek at 2012-10-19 15:03:54
attachment nsupdate_add_del_gen.py
Comment from nhosoi (@nhosoi) at 2012-10-23 03:48:13
Hi Petr,
Thanks for the info. I think I figured out the cause of the problem. There could be a small window in which the code fails to check whether the entry is already tombstoned. If it is, the operation to delete the entry should back off, but the current code goes ahead and tombstones the tombstone entry again, which breaks the dn... I've built a scratch build with the fix: http://koji.fedoraproject.org/koji/taskinfo?taskID=4616851
Could you please run your stress test again using this scratch build? Thank you so much, in advance...
(Just in case you wonder why I replaced the previous link with the one above: taskID=4616250 (http://koji.fedoraproject.org/koji/taskinfo?taskID=4616250) is based on 389-ds-base-1.3.0.a1, while I think the version you are testing is 389-ds-base-1.2.11.16. Thus I rebuilt taskID=4616851 from the 1.2.11 branch.)
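A rough sketch of the verification steps, assuming the scratch-build RPMs from the Koji task above have been downloaded (the package file names are illustrative and the installed set may differ):
rpm -Uvh 389-ds-base-1.2.11.17-0.1.scratch0.fc17.x86_64.rpm 389-ds-base-libs-1.2.11.17-0.1.scratch0.fc17.x86_64.rpm
# restart the dirsrv instance, then re-run the reproducer from comment:3
/var/lib/dirsrv/scripts-E-ORG/db2ldif -n userRoot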
Comment from nhosoi (@nhosoi) at 2012-10-25 00:24:21
git patch file (master) 0001-Trac-Ticket-494-slapd-entered-to-infinite-loop-durin.patch
Comment from nhosoi (@nhosoi) at 2012-10-25 00:24:53
Bug Description: An entry whose rdn is nsuniqueid=...,nsuniqueid=... was accidentally generated. It broke the entryrdn tree and put the reindex and the export into an infinite loop.
Fix Description: Added a missing check of whether the retrieved entry is already a tombstone. This tombstone is the one converted by the entryusn plugin. Please note that replication has urp code which handles such conflicts.
Comment from rmeggins (@richm) at 2012-10-25 00:48:08
patch looks good
Comment from nhosoi (@nhosoi) at 2012-10-25 22:40:45
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=870158
Comment from pspacek at 2012-10-26 16:42:55
Hello,
I did some tests with 389-ds-base-1.2.11.17-0.1.scratch0.fc17.x86_64: export to LDIF worked well. Sorry for the late message; I forgot to reply after running the test.
P.S. It is a race condition, you know, so my test results can't be taken as definitive ...
Comment from nhosoi (@nhosoi) at 2012-10-26 21:55:20
Replying to [comment:16 pspacek]:
Hello,
I did some tests with 389-ds-base-1.2.11.17-0.1.scratch0.fc17.x86_64: export to LDIF worked well. Sorry for the late message; I forgot to reply after running the test.
P.S. It is a race condition, you know, so my test results can't be taken as definitive ...
Thank you so much for testing my scratch build, Petr! Right, it's hard to "prove" that the patch fixes all the bad results from the race condition, but it is definitely a great indication that it passed your tests! Let us check in the patch and close the ticket for now. If you run into the issue again, please feel free to reopen this ticket.
Once again, thanks a lot!
Comment from nhosoi (@nhosoi) at 2012-10-26 23:06:56
Reviewed by Rich and tested by Petr. Thank you, both of you!!
Pushed to master.
$ git merge trac494
Updating 927b0ef..b963576
Fast-forward
ldap/servers/slapd/back-ldbm/ldbm_delete.c | 7 +++++++
1 file changed, 7 insertions(+)
$ git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 999 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
927b0ef..b963576 master -> master
Cherry-picked and pushed to origin/389-ds-base-1.2.11.
$ git push origin 389-ds-base-1.2.11-ext:389-ds-base-1.2.11
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.03 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
56ebbb2..31d1578 389-ds-base-1.2.11-ext -> 389-ds-base-1.2.11
Comment from pspacek at 2017-02-11 22:48:31
Metadata Update from @pspacek:
Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/494
During the IPA 2.2.0->3.0.0 upgrade, various indexes are added. During index addition, 389 locks up in an infinite loop. Stacktraces etc. are attached.