Closed: 389-ds-bot closed this issue 4 years ago
Comment from wetpaste at 2015-05-29 04:59:09
attachment stacktrace.1432332248-edit.txt
Comment from nkinder (@nkinder) at 2015-05-29 05:39:31
Thread 1 seems to indicate that there might be memory corruption (the DN value is messed up, which leads to the crash). If possible, running ns-slapd under valgrind would help to pinpoint where the memory corruption is occurring. Details on doing this are here:
http://www.port389.org/docs/389ds/FAQ/faq.html#memory-growth
Comment from abbra at 2015-05-29 11:45:00
It looks like you are using compat tree with ID Views in FreeIPA. What exact IPA and slapi-nis versions do you have?
Comment from nhosoi (@nhosoi) at 2015-06-11 21:54:45
Hello wetpaste, did you have any luck reproducing the crash and getting the valgrind output?
Comment from wetpaste at 2015-06-18 03:38:39
Sorry, I have not yet. I had turned off valgrind before I was able to replicate the crash because of some strange issues I was having right before I left for vacation. I will work on getting it back in place and hopefully have some results in the next few days. Thanks for your help!
Comment from wetpaste at 2015-06-18 23:35:55
I'm having trouble replicating this while valgrind is running. It seems to slow things down to the point where I cannot get enough throughput to the server to cause a crash. I'm testing this very indirectly, though, as the crash seems to happen when I am running high-throughput ansible playbooks (around 15 ssh threads at once through a single jump host). Each time ansible makes a sudo/ssh call, it does lookups against ldap. However, when valgrind is running it's so slow that I get all kinds of timeouts from ansible, so to get some relief I bumped it down to 5 threads.
If I could figure out how to mimic the traffic that is normally sent to the ldap server, I could get around having to figure out where in the stack [ansible -> ssh -> ssh jump host -> host -> sudo??] it's timing out.
If anyone has any ideas let me know.
Comment from rmeggins (@richm) at 2015-06-19 00:03:31
Replying to [comment:6 wetpaste]:
snip
Yeah, valgrind makes the server crawl.
If I could figure out how to mimic the traffic that is normally sent to the ldap server, I could get around having to figure out where in the stack [ansible -> ssh -> ssh jump host -> host -> sudo??] it's timing out.
If anyone has any ideas let me know.
The first step would be to use logconv.pl to get a report of the operations.
The next step would be to try to replay the same sequence of operations. IHAGHS (I have a github script) https://github.com/richm/scripts/blob/master/replayAccessAudit.py that can replay a sequence of operations from the access (and audit log, if there are updates).
Comment from nhosoi (@nhosoi) at 2015-07-08 03:17:53
Hello wetpaste,
Has there been any progress in your investigation/debugging?
I don't think we have heard your 389-ds-base revision yet. What is the output of "rpm -q 389-ds-base"?
Any logs in the error log? (/var/log/dirsrv/slapd-YOURSERVERID/errors)
It might be useful to see some operations served just before the server's crash. http://www.port389.org/docs/389ds/FAQ/faq.html#printing-access-log-buffer
In the access log, do you see lots of simple paged results requests? (notes=P)
Comment from rmeggins (@richm) at 2015-07-23 21:35:26
Any updates? If not, we are going to close this bug.
Comment from wetpaste at 2015-07-25 03:42:48
Current 389-ds-base:
$ rpm -q 389-ds-base
389-ds-base-1.3.3.1-16.el7_1.x86_64
I'm going to try using logconv and the replay script against the server running under valgrind. It looks like it crashed again a couple of times two days ago.
nhosoi: nothing is in the error logs except for the startup output.
Comment from wetpaste at 2015-07-25 04:01:16
Replying to [comment:7 richm]:
The first step would be to use logconv.pl to get a report of the operations.
Can you clarify what this is for? Do you want a type of statistics report?
The next step would be to try to replay the same sequence of operations. IHAGHS (I have a github script) https://github.com/richm/scripts/blob/master/replayAccessAudit.py that can replay a sequence of operations from the access (and audit log, if there are updates).
Trying this now. Thanks!
EDIT: one issue, I can't seem to get the script to work and am baffled as to why. I looked at the source and, with my limited python knowledge, it looks okay...
$ python replayAccessAudit.py -c access -b 4 -e 73773
Traceback (most recent call last):
File "replayAccessAudit.py", line 666, in <module>
if len(args.c) == 0 and len(args.u) == 0:
AttributeError: 'Namespace' object has no attribute 'c'
Comment from rmeggins (@richm) at 2015-07-30 21:44:46
Replying to [comment:11 wetpaste]:
Replying to [comment:7 richm]:
The first step would be to use logconv.pl to get a report of the operations.
Can you clarify what this is for? Do you want a type of statistics report?
Yes. logconv.pl can produce a lot of statistics/reports from the access log. See man logconv.pl.
The next step would be to try to replay the same sequence of operations. IHAGHS (I have a github script) https://github.com/richm/scripts/blob/master/replayAccessAudit.py that can replay a sequence of operations from the access (and audit log, if there are updates).
Trying this now. Thanks!
EDIT: one issue, I can't seem to get the script to work and am baffled as to why. I looked at the source and, with my limited python knowledge, it looks okay...
$ python replayAccessAudit.py -c access -b 4 -e 73773
Traceback (most recent call last):
File "replayAccessAudit.py", line 666, in <module>
if len(args.c) == 0 and len(args.u) == 0:
AttributeError: 'Namespace' object has no attribute 'c'
Fixed. Please try it again.
Comment from wetpaste at 2015-08-27 06:48:37
I'm kind of out of ideas. I have not been able to replicate the issue except when I am in production. If I use the script to replay the traffic, I cannot get the same crash out of the app, and I cannot get it to replicate while valgrind is running.
Comment from nhosoi (@nhosoi) at 2015-09-18 04:48:53
git patch file (master) 0001-Ticket-48188-segfault-in-ns-slapd-unknown-cause.patch
Comment from nhosoi (@nhosoi) at 2015-09-18 04:51:03
I revisited the stacktraces and noticed there is a possibility the dn/rawdn variables could be trashed by the IPA slapi-nis/compat plug-in.
Comment from nhosoi (@nhosoi) at 2015-09-18 04:53:53
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1264224
Comment from rmeggins (@richm) at 2015-09-18 07:58:37
In do_bind(), dn points inside of sdn, so we need to set dn = NULL after the last time it is used before calling the PRE_BIND plugins, and set dn back to slapi_sdn_get_dn(sdn) after the PRE_BIND plugins (or not at all, if it is not used after that).
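Roughly like this (a minimal sketch of that idea, not the actual patch; the plugin call and surrounding code are abbreviated):
/* Sketch: dn is a borrowed pointer into sdn, so it must not be held
 * across the PRE_BIND plugins, which may free and replace the Slapi_DN
 * stored in SLAPI_BIND_TARGET_SDN. */
const char *dn = slapi_sdn_get_dn(sdn);
/* ... last use of dn before the plugins ... */
dn = NULL;                                  /* drop the borrowed pointer */
plugin_call_plugins(pb, SLAPI_PLUGIN_PRE_BIND_FN);
/* re-fetch the (possibly replaced) target sdn from the pblock */
slapi_pblock_get(pb, SLAPI_BIND_TARGET_SDN, &sdn);
dn = slapi_sdn_get_dn(sdn);                 /* safe to borrow again */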
Comment from simo (@simo5) at 2015-09-19 00:04:22
LGTM
Comment from rmeggins (@richm) at 2015-09-19 00:16:45
if (pb_sdn != sdn) {
pb_sdn != sdn doesn't mean strcmp(pb_sdn.ndn, sdn.ndn). That is, although it doesn't make sense for the plugin to change the pointer without changing the actual dn, it is still possible. I suggest using slapi_sdn_compare to make sure they are really different.
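Something like the following, as a sketch of that suggestion (not the final code):
/* Compare the normalized values, not just the pointers: slapi_sdn_compare()
 * returns 0 when both Slapi_DNs normalize to the same DN. */
if (pb_sdn != sdn && slapi_sdn_compare(pb_sdn, sdn) != 0) {
    /* the pre-bind plugin substituted a genuinely different DN */
}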
Comment from nhosoi (@nhosoi) at 2015-09-19 00:17:14
git patch file (master) 0001-Ticket-48188-segfault-in-ns-slapd-due-to-accessing-S.patch
Comment from nhosoi (@nhosoi) at 2015-09-19 00:28:06
Replying to [comment:22 richm]:
if (pb_sdn != sdn) {
pb_sdn != sdn doesn't mean strcmp(pb_sdn.ndn, sdn.ndn). That is, although it doesn't make sense for the plugin to change the pointer without changing the actual dn, it is still possible. I suggest using slapi_sdn_compare to make sure they are really different.
But the issue is not the dn value, but the memory...
slapi_sdn_compare basically compares the normalized dns of the two Slapi_DNs.
If a pre-bind plug-in happens to recreate a new sdn from the same normalized dn, we crash there...
1845 backend_bind_cb(Slapi_PBlock *pb)
1914     slapi_pblock_get(pb, SLAPI_BIND_TARGET_SDN, &sdn);
1916     slapi_sdn_free(&sdn);
1918     sdn = slapi_sdn_new_dn_byref(ndn);
1919     slapi_pblock_set(pb, SLAPI_BIND_TARGET_SDN, (void *)sdn);
Comment from rmeggins (@richm) at 2015-09-19 00:48:56
There are two issues:
1) If the plugin changes the SLAPI_BIND_TARGET_SDN pointer, we need to use it instead of the current sdn and dn. For this, it is sufficient to do pb_sdn != sdn.
2) If the plugin changes the SLAPI_BIND_TARGET_SDN value, we need to select a different backend. It is possible (but not very useful) for the plugin to change the pointer but use the same value.
I'm also not sure if it is sufficient to call be = slapi_be_select(sdn). The bind code sets the be like this:
/* We could be serving multiple database backends. Select the appropriate one */
if (slapi_mapping_tree_select(pb, &be, &referral, errorbuf) != LDAP_SUCCESS) {
    send_nobackend_ldap_result( pb );
    be = NULL;
    goto free_and_return;
}
Do we need to have similar logic in order to change the backend? What if the plugin selects a non-existent backend that causes be == defbackend_get_backend()?
We also need to set the new backend in the pblock: slapi_pblock_set( pb, SLAPI_BACKEND, be );
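Putting those pieces together, the post-PRE_BIND handling might look roughly like this (an illustrative sketch only; variable names and error handling follow the snippets above, not the final patch):
/* Sketch: after the PRE_BIND plugins return, re-read the target sdn from
 * the pblock; if a plugin swapped it, adopt the new one and re-run
 * backend selection through the mapping tree. */
Slapi_DN *pb_sdn = NULL;
slapi_pblock_get(pb, SLAPI_BIND_TARGET_SDN, &pb_sdn);
if (pb_sdn != sdn) {
    /* the plugin replaced (and may have freed) the original sdn */
    sdn = pb_sdn;
    dn = slapi_sdn_get_dn(sdn);
    /* select the appropriate backend again, as the original bind code does */
    if (slapi_mapping_tree_select(pb, &be, &referral, errorbuf) != LDAP_SUCCESS) {
        send_nobackend_ldap_result(pb);
        be = NULL;
        goto free_and_return;
    }
    /* record the newly selected backend in the pblock */
    slapi_pblock_set(pb, SLAPI_BACKEND, be);
}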
Comment from nhosoi (@nhosoi) at 2015-09-19 04:28:08
git patch file (master) -- Additional fixes based upon the comments by Rich in comment 24 0001-Ticket-48188-segfault-in-ns-slapd-due-to-accessing-S.2.patch
Comment from nhosoi (@nhosoi) at 2015-09-19 05:58:24
Thanks to Simo for the patch, and thanks to Rich for the review and the suggestions.
Pushed to master: c41d36d..8212a89 master -> master
commit 40e0d0f80d6fd1271431e105580293747c43c327
commit 8212a8913b748cd1f5e986a754c37ef41db8272a
Pushed to 389-ds-base-1.3.4: d9f03f5..a215c00 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit 6871f4f6d14198563f7f3cb0646a00faa28d35ea
commit a215c006e0900caaa555def9e047e295844d8652
Comment from wetpaste at 2017-02-11 23:07:39
Metadata Update from @wetpaste:
Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/48188
Hopefully this is the appropriate place for this.
I'm running freeIPA on centos 7. Each of my client servers is doing authentication and group lookup via vanilla ldap. I am getting pretty consistent segfaults in the ns-slapd process, especially when I am doing mass changes via ansible. This triggers a flood of ldap requests that ends up crashing the service; I haven't been able to pinpoint the exact conditions, but it doesn't take long for it to buckle.
version 1.3.3.1 on centos 7