Closed by 389-ds-bot 4 years ago
Comment from twalkertwosigma at 2017-08-21 18:45:34
Unclear whether the two attachments uploaded. Let me know if they didn't and I'll try again. Thanks!
Comment from mreynolds (@mreynolds389) at 2017-08-21 18:55:45
> Unclear whether the two attachments uploaded. Let me know if they didn't and I'll try again. Thanks!
No attachments...
Comment from twalkertwosigma at 2017-08-21 19:15:06
Comment from twalkertwosigma at 2017-08-21 19:15:49
Comment from firstyear (@Firstyear) at 2017-08-22 01:09:39
Hi @twalkertwosigma
For future reference, issues like this are usually better suited to our mailing list than our bug tracker :)
I've seen similar cases recently. The last one I saw had a similar access log pattern and stack trace, and I believe it came down to system tuning.
From the stack trace it looks like you have a lot of backends configured on these instances, so can you show me the output of:
ldapsearch -H <LDAPURIHERE> -D 'cn=Directory Manager' -x -W -b 'cn=config' '(|(cn=config)(cn=monitor)(objectClass=nsBackendInstance))' entrycachehitratio cn dncachehitratio normalizeddncachehitratio nsslapd-dncachememsize nsslapd-cachememsize dbcachehitratio nsslapd-dbcachesize nsslapd-ndn-cache-max-size nsslapd-conntablesize nsslapd-maxdescriptors nsslapd-threadnumber nsslapd-ioblocktimeout
Thanks,
Comment from twalkertwosigma at 2017-08-22 15:13:54
Apologies, our security folks preferred that even the redacted logs/backtraces remain somewhat private... Each of these is a dedicated VM with 4 cores and 8G RAM, sitting behind a load balancer and serving ~40,000 entries. Any recommendations regarding ideal server sizing and/or tuning parameters would be greatly appreciated. Thanks!
dn: cn=config
cn: config
nsslapd-ndn-cache-max-size: 20971520
nsslapd-conntablesize: 16386
nsslapd-maxdescriptors: 16384
nsslapd-threadnumber: 16
nsslapd-ioblocktimeout: 1800000

dn: cn=config,cn=Account Policy Plugin,cn=plugins,cn=config
cn: config

dn: cn=config,cn=chaining database,cn=plugins,cn=config
cn: config

dn: cn=config,cn=ldbm database,cn=plugins,cn=config
cn: config
nsslapd-dbcachesize: 1258291200

dn: cn=monitor,cn=ldbm database,cn=plugins,cn=config
cn: monitor
dbcachehitratio: 653

dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
cn: userRoot
nsslapd-dncachememsize: 52776556
nsslapd-cachememsize: 3355443200
nsslapd-dbcachesize: 3355443200

dn: cn=monitor,cn=userRoot,cn=ldbm database,cn=plugins,cn=config
cn: monitor
entrycachehitratio: 99
dncachehitratio: 92
normalizeddncachehitratio: 99

search: 4
result: 0 Success
Comment from twalkertwosigma at 2017-08-30 15:38:46
Hi, just wanted to follow-up to see if you had any configuration/sizing suggestions? Thanks...
Comment from tbordaz (@tbordaz) at 2017-09-04 12:27:27
Hi @twalkertwosigma ,
Looking at the stack trace, the server appears to hang because all worker threads are busy with requests. The server can still accept new connections/requests but cannot process them. The worker threads are all waiting on db pages, and this is likely related to the checkpointing thread that is running (thread 3).
The checkpointing thread seems to be sleeping, but this should be confirmed with 'top -H -p <pid>'.
Looking at the code (though I am not a BDB expert), this seems to be related to reducing the disk write rate. The db cache looks quite high (nsslapd-dbcachesize: 1GB); did you change it recently? Would you try reducing this value (say, to 400MB) to see if it works around the hang?
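For reference, here is a minimal sketch of both checks, assuming the server process is ns-slapd, a Directory Manager bind as in the earlier ldapsearch, and the global BDB cache entry at cn=config,cn=ldbm database,cn=plugins,cn=config shown in your output (389-ds typically needs a restart for the new cache size to take effect; 419430400 bytes is just the illustrative ~400MB target):

# Per-thread view of the slapd process, to see what the checkpointing thread is doing
top -H -p "$(pidof ns-slapd)"

# Shrink the global BDB cache to roughly 400MB (illustrative value)
ldapmodify -H ldap://localhost -D 'cn=Directory Manager' -x -W <<EOF
dn: cn=config,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-dbcachesize
nsslapd-dbcachesize: 419430400
EOF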
Comment from tbordaz (@tbordaz) at 2017-09-04 13:44:35
@twalkertwosigma after a second look at the ticket, I wonder whether this is not another flavor of https://bugzilla.redhat.com/show_bug.cgi?id=1349779.
The problem seems to happen frequently (1-2 times a week) but without identified reproduction steps. It would be interesting to know whether it continues to happen with a smaller dbcache. Also, would it be acceptable to run a debug version (DS and/or BDB) to capture more traces?
Comment from twalkertwosigma at 2017-09-11 16:56:56
1-2 times a week total over ~30 user facing LDAP servers (so each server runs into it once every month or so). If we had a "frequent" reproducer, we could certainly try a debug version but I'm reluctant to roll out a debug version to the whole (production) plant.
What kind of debugging specifically did you have in mind? I've tried enabling the verbose connection logging on a few servers but that seems to change timing enough that we haven't had one deadlock.
We're hopefully rolling out the dbcache changes this week and will update once we've had them running for a bit.
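For context, this is roughly how we toggle that verbose logging, assuming the standard nsslapd-errorlog-level bitmask where 8 corresponds to connection management tracing (illustrative; the replace overwrites any level already set, so OR it into an existing value if one is configured):

# Enable connection-management tracing in the error log
ldapmodify -H ldap://localhost -D 'cn=Directory Manager' -x -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 8
EOF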
Comment from tbordaz (@tbordaz) at 2017-09-12 09:32:15
@twalkertwosigma thanks for your feedback.
Let's try a reduced dbcache to check if it is related to the hangs.
Regarding a debug version, I was thinking of getting more traces for https://bugzilla.redhat.com/show_bug.cgi?id=1349779. But I realize that we already have a reproducer for that BZ, so we can make progress without bothering you with a debug version.
Comment from twalkertwosigma at 2017-09-12 16:36:58
Caught another one... this time with all FDs exhausted.
checkpoint thread:
#0  0x00007fb4ee7dddf3 in select () at ../sysdeps/unix/syscall-template.S:82
No locals.
#1  0x00007fb4e527be15 in os_sleep (usecs=
Attaching gdb and sticking a few breakpoints in indicates that __memp_sync_int never returns (it looks like we never get out of the "Walk the array, writing buffers" loop).
time_of_last_comapctdb_completion is from Friday but time_of_last_checkpoint_completion was 4:44:27 AM GMT this morning. For comparison, the last replication began at 4:45:06, last RESULT returned was 4:45:09, and last SRCH logged was 4:45:10.
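For completeness, this is roughly how we capture those backtraces non-interactively (assuming debug symbols for 389-ds and libdb are installed and pidof resolves the running server):

# Dump backtraces from every thread of the hung server, then detach without killing it
gdb -p "$(pidof ns-slapd)" \
    -ex 'set pagination off' \
    -ex 'thread apply all bt full' \
    -ex 'detach' -ex 'quit' > slapd-hang-backtrace.txt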
Comment from twalkertwosigma at 2017-09-20 20:27:16
Well, finally got the dbcache shrunk on all servers Monday at ~5pm but just had one deadlock now (less than 48 hours later). time_of_last_comapctdb_completion was at dirsrv startup, time_of_last_checkpoint_completion was at 14:29:32 GMT, last SRCH and RESULT were logged at 14:30:10, and last replication began at 14:30:02 and was hung after 14:30:09.
Thoughts?
Comment from tbordaz (@tbordaz) at 2017-09-21 08:25:45
Thanks for your tests.
So my understanding is that the frequency of the hangs with the smaller dbcache looks identical to before (with the larger dbcache). You used to hit 1-2 deadlocks a week, and you just hit one 48h after starting the test. So the dbcache size seems to have no impact on the deadlock or on its frequency.
Do you have a production system where the hang does not happen?
As we suspect an external issue (not DB or DS), for example the file system, do you know if the servers are up to date?
Comment from twalkertwosigma at 2017-09-21 15:59:43
Yes, systems are up to date and using plain old ext4. The only systems on which this does not happen are the replication masters (which have no clients). When we brought a new datacenter online some months back we built out completely fresh VMs on new hypervisors (which have no other VMs running on them) and ran into the same deadlocks almost immediately. So not entirely sure it is external.
At this point, I think the next thing we're going to try is turning down the number of threads per server while monitoring throughput closely.
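If useful, a rough sketch of the change we have in mind, assuming the nsslapd-threadnumber attribute in cn=config shown earlier is the knob being lowered, that a restart is needed for it to apply, and that 8 is purely an illustrative target:

# Reduce the worker thread count (illustrative value only)
ldapmodify -H ldap://localhost -D 'cn=Directory Manager' -x -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-threadnumber
nsslapd-threadnumber: 8
EOF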
Comment from tbordaz (@tbordaz) at 2017-09-21 16:08:20
Are the replication masters running on VMs as well? Are they at the same OS level as the hanging systems?
Basically, if it does not hang on some systems, we may be able to identify the differences from the hanging ones. I understand that direct LDAP client requests are one of those differences. I agree it would be interesting to check whether reducing direct LDAP client requests, on a system that used to hang, prevents the problem or not.
Comment from twalkertwosigma at 2017-09-21 17:40:21
Yes to both questions. The replication masters are actually smaller (2 core) VMs running the same OS, libdb, and ds versions, and are on the same hypervisor as at least one, sometimes two, of the client-facing servers. The hypervisors themselves are large (usually a minimum of 20c/40t with 256G RAM) and are usually shared with other infra VMs, but half of the client-facing servers in the new DC have an over-spec'd machine to themselves and still encounter deadlocks.
Comment from lkrispen (@elkris) at 2017-09-21 17:59:58
If you say that the masters have no client load, do you have the same index and idl scan limits on the masters as on the consumers?
Comment from twalkertwosigma at 2017-09-21 23:13:09
Yes, configs on the replication masters (despite no clients) are the same as ones with clients, right on down to the number of threads. This appears to be something inherited from a prior owner. And I was incorrect, they're even the same size VMs as the servers with clients (4 core, not 2 as I had recalled).
Comment from firstyear (@Firstyear) at 2017-10-12 17:30:54
@tjaalton Can you comment about the available versions of the package on Debian? Is there something newer we can upgrade to that may resolve the issue?
Comment from firstyear (@Firstyear) at 2018-01-30 00:18:28
@twalkertwosigma https://bugzilla.redhat.com/show_bug.cgi?id=1349779 this issue could be related, and has been solved. Perhaps @tjaalton can ensure the patch is included in .deb based systems?
Comment from twalkertwosigma at 2018-01-30 17:54:48
Interesting. Does anyone have a link handy to the libdb bug or patch? Yes, ushering it along with Debian upstream would be helpful...
Comment from firstyear (@Firstyear) at 2018-01-31 01:03:06
Here is the required libdb patch: https://src.fedoraproject.org/rpms/libdb/blob/master/f/checkpoint-opd-deadlock.patch
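For anyone carrying this on Debian in the meantime, a rough sketch of rebuilding libdb5.3 with that patch, assuming a quilt-format db5.3 source package and the usual build tooling (package and directory names may differ per release):

apt-get source libdb5.3        # fetches the db5.3 source package
apt-get build-dep libdb5.3     # installs build dependencies
cd db5.3-5.3.28*
quilt import /path/to/checkpoint-opd-deadlock.patch
quilt push
dpkg-buildpackage -us -uc      # builds unsigned .deb packages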
Comment from mreynolds (@mreynolds389) at 2018-06-14 19:59:14
Any update on this after trying the libdb patch?
FYI, we are currently only supporting 389-ds-base-1.3.8 and up. Are you still on 1.3.3?
Comment from mreynolds (@mreynolds389) at 2019-01-10 18:01:38
Closing ticket, but please reopen if you encounter similar problems on the latest version of 389-ds-base
Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/49360
We've recently been encountering what appear to be deadlocks on our client-facing 389ds servers with some regularity. In all cases we see the server get into a state where it is registering new connections but not getting any further with any of them (no new BINDs, SRCHs, or RESULTs). Eventually it appears to stop even timing out existing connections and, as a result, fd usage spirals out of control and eventually hits the process limit, at which point it needs to be manually killed (systemctl stop/restart hangs) and restarted.
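For illustration, one way we watch that fd growth from outside the process (assuming /proc is available and ns-slapd is the process name):

# Count the slapd process's open file descriptors every 30 seconds
watch -n 30 'ls /proc/$(pidof ns-slapd)/fd | wc -l'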
In every case I've seen so far, an incoming replication connection is in progress during the hang and is usually one of the last successful connections logged. Each of the client facing servers is a slave (single master) and does not serve as a replication source itself.
We're currently running the servers on an old Debian Wheezy build using rebuilt Jessie packages (389ds 1.3.3.5-4 and libdb5.3 5.3.28-9).
While we run into this with some regularity, we're not able to reproduce it outside of our production environment. We have ~30 client-facing LDAP servers and tend to have 1-2 deadlocks a week. I've attempted enabling debug 'Connection management' logging on a couple of servers that seem to encounter this more frequently than others but have, so far, not encountered the problem with it enabled.
Attaching access logs (error logs show nothing) and backtraces from one such hang, with user/group/netgroup/host names pretty heavily redacted (buffer lengths in backtraces will be off, etc.). In this instance, the server was caught well before hitting the process fd limit but was pretty clearly on its way there. Multiple backtraces over a couple of minutes are always identical, although tracing some of the internal (e.g. bdb deadlock detection) threads still shows movement.
I'm pretty stumped at this point and would greatly appreciate any guidance.
[Attachments: access logs and backtraces from the hang; images not preserved in this export]