389ds / 389-ds-base

The enterprise-class Open Source LDAP server for Linux
https://www.port389.org/

Encountering server deadlocks on 1.3.3.5 #2419

Closed 389-ds-bot closed 4 years ago

389-ds-bot commented 4 years ago

Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/49360


We've recently been encountering what appear to be deadlocks on our client facing 389ds servers with some regularity. In all cases we see the server get into a state where it is registering new connections but not getting any further with any of them (no new BINDs, SRCHs or RESULTs). Eventually it appears to stop even timing out existing connections and, as a result, fd usage spirals out of control and eventually hits the process limit at which point it needs to be manually killed (systemctl stop/restart hangs) and restarted.
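
A minimal sketch of how the descriptor climb can be watched while waiting for a hang (assuming a single ns-slapd process per host and the standard /proc layout; adjust if you run multiple instances):

```
# Count open file descriptors for ns-slapd and show its limit; run periodically
# from cron/monitoring to catch the spiral early.
PID=$(pidof ns-slapd)
echo "open fds: $(ls /proc/${PID}/fd | wc -l)"
grep 'Max open files' /proc/${PID}/limits
```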

In every case I've seen so far, an incoming replication connection is in progress during the hang and is usually one of the last successful connections logged. Each of the client facing servers is a slave (single master) and does not serve as a replication source itself.

We're currently running the servers on an old Debian Wheezy build using rebuilt Jessie packages (389ds 1.3.3.5-4 and libdb5.3 5.3.28-9).

While we run into this with some regularity, we're not able to reproduce it outside of our production environment. We have some ~30 client facing LDAP servers and tend to have 1-2 a week deadlock. I've attempted enabling debug 'Connection management' logging on a couple of servers that seem to encounter this more frequently than others but have, so far, not encountered the problem with it enabled.
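
For reference, the connection-management debug logging mentioned above can be toggled roughly like this (a sketch; level 8 is the standard 389-ds connection-management error-log bit, and the URI/credentials are placeholders):

```
# Enable connection management debug logging (error log level 8) on one consumer.
ldapmodify -H ldap://localhost -D 'cn=Directory Manager' -x -W <<'EOF'
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 8
EOF
```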

Attaching access logs (error logs show nothing) and backtraces from one such hang, with user/group/netgroup/host names pretty heavily redacted (buffer lengths in backtraces will be off, etc). In this instance, the server was caught well before hitting the process fd limit but was pretty clearly on its way there. Multiple backtraces over a couple of minutes are always identical, although tracing some of the internal (eg. bdb deadlock detection) threads still shows movement.

I'm pretty stumped at this point and would greatly appreciate any guidance.


389-ds-bot commented 4 years ago

Comment from twalkertwosigma at 2017-08-21 18:45:34

Unclear whether the two attachments uploaded. Let me know if they didn't and I'll try again. Thanks!

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2017-08-21 18:55:45

> Unclear whether the two attachments uploaded. Let me know if they didn't and I'll try again. Thanks!

No attachments...

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2017-08-21 18:55:45

Metadata Update from @mreynolds389:

389-ds-bot commented 4 years ago

Comment from twalkertwosigma at 2017-08-21 19:15:06

389ds-bt.cleaned

389-ds-bot commented 4 years ago

Comment from twalkertwosigma at 2017-08-21 19:15:49

389ds-logs.cleaned

389-ds-bot commented 4 years ago

Comment from firstyear (@Firstyear) at 2017-08-22 01:09:39

Hi @twalkertwosigma

Normally issues like this are better on our mailing list than our bug tracker for the future :)

I've seen similar cases recently. The last one I saw had a similar access log pattern and stack trace; I think it was their system tuning.

From the stack trace it looks like you have a lot of backends configured on these instances, so can you show me the output of:

```
ldapsearch -H <LDAPURIHERE> -D 'cn=Directory Manager' -x -W -b 'cn=config' \
  '(|(cn=config)(cn=monitor)(objectClass=nsBackendInstance))' \
  entrycachehitratio cn dncachehitratio normalizeddncachehitratio \
  nsslapd-dncachememsize nsslapd-cachememsize dbcachehitratio nsslapd-dbcachesize \
  nsslapd-ndn-cache-max-size nsslapd-conntablesize nsslapd-maxdescriptors \
  nsslapd-threadnumber nsslapd-ioblocktimeout
```

Thanks,

389-ds-bot commented 4 years ago

Comment from twalkertwosigma at 2017-08-22 15:13:54

Apologies, our security folks preferred that even the redacted logs/backtraces remain somewhat private... These are each dedicated VMs with 4 cores and 8G RAM behind a load balancer, serving ~40000 entries. Any recommendations regarding ideal server sizing and/or tuning parameters would be greatly appreciated. Thanks!

```
# extended LDIF
#
# LDAPv3
# base <cn=config> with scope subtree
# filter: (|(cn=config)(cn=monitor)(objectClass=nsBackendInstance))
# requesting: entrycachehitratio cn dncachehitratio normalizeddncachehitratio nsslapd-dncachememsize nsslapd-cachememsize dbcachehitratio nsslapd-dbcachesize nsslapd-ndn-cache-max-size nsslapd-conntablesize nsslapd-maxdescriptors nsslapd-threadnumber nsslapd-ioblocktimeout
#

# config
dn: cn=config
cn: config
nsslapd-ndn-cache-max-size: 20971520
nsslapd-conntablesize: 16386
nsslapd-maxdescriptors: 16384
nsslapd-threadnumber: 16
nsslapd-ioblocktimeout: 1800000

# config, Account Policy Plugin, plugins, config
dn: cn=config,cn=Account Policy Plugin,cn=plugins,cn=config
cn: config

# config, chaining database, plugins, config
dn: cn=config,cn=chaining database,cn=plugins,cn=config
cn: config

# config, ldbm database, plugins, config
dn: cn=config,cn=ldbm database,cn=plugins,cn=config
cn: config
nsslapd-dbcachesize: 1258291200

# monitor, ldbm database, plugins, config
dn: cn=monitor,cn=ldbm database,cn=plugins,cn=config
cn: monitor
dbcachehitratio: 653

# userRoot, ldbm database, plugins, config
dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
cn: userRoot
nsslapd-dncachememsize: 52776556
nsslapd-cachememsize: 3355443200
nsslapd-dbcachesize: 3355443200

# monitor, userRoot, ldbm database, plugins, config
dn: cn=monitor,cn=userRoot,cn=ldbm database,cn=plugins,cn=config
entrycachehitratio: 99
cn: monitor
dncachehitratio: 92
normalizeddncachehitratio: 99

# search result
search: 4
result: 0 Success

# numResponses: 8
# numEntries: 7
```
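
A rough tally of the configured cache ceilings from the output above, next to the 8G of RAM mentioned; these are configured maxima rather than measured resident memory, and the byte values are the ones shown above:

```
# Configured cache ceilings from the ldapsearch output above (maxima, not RSS).
echo "entry cache (userRoot, nsslapd-cachememsize)      : $(( 3355443200 / 1024 / 1024 )) MiB"   # 3200 MiB
echo "db cache (ldbm config, nsslapd-dbcachesize)       : $(( 1258291200 / 1024 / 1024 )) MiB"   # 1200 MiB
echo "dn cache (userRoot, nsslapd-dncachememsize)       : $((   52776556 / 1024 / 1024 )) MiB"   # ~50 MiB
echo "ndn cache (cn=config, nsslapd-ndn-cache-max-size) : $((   20971520 / 1024 / 1024 )) MiB"   # 20 MiB
# Total configured: roughly 4.4 GiB on an 8G VM, before per-connection/thread overhead.
```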

389-ds-bot commented 4 years ago

Comment from twalkertwosigma at 2017-08-30 15:38:46

Hi, just wanted to follow-up to see if you had any configuration/sizing suggestions? Thanks...

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2017-09-04 12:27:27

Hi @twalkertwosigma ,

Looking at the backtrace, the server appears to hang because all worker threads are busy with requests, so the server can accept new connections/requests but cannot process them. The worker threads are all hanging on db pages, and it is likely related to the checkpointing thread that is running (thread 3). The checkpointing thread seems to be sleeping, but this should be confirmed with 'top -H -p <pid>' when the hang occurs.
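
A sketch of that check (the process lookup is an assumption; the checkpoint thread should show roughly 0% CPU if it really is sleeping in os_yield):

```
# Per-thread CPU view of the running ns-slapd during a hang, to confirm whether
# the checkpoint thread is sleeping or spinning.
top -H -b -n 1 -p "$(pidof ns-slapd)"
```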

Looking at the code (though I am not a BDB expert), this seems to be related to reducing the disk write rate. The db cache looks quite high (nsslapd-dbcachesize: ~1.2 GB); did you change it recently? Would you try reducing this value (say, to 400 MB) to see whether it works around the hang?
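
A sketch of the suggested change (the DN and attribute are the ones from the earlier ldapsearch output; the URI/credentials are placeholders, and nsslapd-dbcachesize typically needs a restart to take effect):

```
# Drop the BDB cache from ~1.2 GB to 400 MiB as an experiment.
ldapmodify -H ldap://localhost -D 'cn=Directory Manager' -x -W <<'EOF'
dn: cn=config,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-dbcachesize
nsslapd-dbcachesize: 419430400
EOF
systemctl restart dirsrv@<instance>   # instance name is deployment-specific
```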

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2017-09-04 13:44:35

@twalkertwosigma after a second look at the ticket, I wonder if it is not another flavor of https://bugzilla.redhat.com/show_bug.cgi?id=1349779.

The problem seems to happen frequently (1-2 times a week) but without identified steps to reproduce it. It would be interesting to know if it continues to happen with a smaller dbcache. Also, would it be acceptable to run a debug version (DS and/or BDB) to capture more traces?

389-ds-bot commented 4 years ago

Comment from twalkertwosigma at 2017-09-11 16:56:56

1-2 times a week total over ~30 user facing LDAP servers (so each server runs into it once every month or so). If we had a "frequent" reproducer, we could certainly try a debug version but I'm reluctant to roll out a debug version to the whole (production) plant.

What kind of debugging specifically did you have in mind? I've tried enabling the verbose connection logging on a few servers but that seems to change timing enough that we haven't had one deadlock.

Rolling out the dbcache changes this week hopefully and will update once we've had it running for a bit.

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2017-09-12 09:32:15

@twalkertwosigma thanks for your feedback.

Let's try a reduced dbcache to check if it is related to the hangs.

Regarding debug version, I was thinking to get more traces about https://bugzilla.redhat.com/show_bug.cgi?id=1349779. But I realize that we already have a reproducer for that BZ so we can progress without bothering you with a debug version.

389-ds-bot commented 4 years ago

Comment from twalkertwosigma at 2017-09-12 16:36:58

Caught another one... this time with all FDs exhausted.

checkpoint thread:

```
#0  0x00007fb4ee7dddf3 in select () at ../sysdeps/unix/syscall-template.S:82
No locals.
#1  0x00007fb4e527be15 in os_sleep (usecs=, secs=, env=0x560aeb039dd0) at ../src/os/os_yield.c:90
        t = {tv_sec = 0, tv_usec = 807842}
        ret =
#2  os_yield (env=0x560aeb039dd0, secs=, usecs=) at ../src/os/os_yield.c:48
No locals.
#3  0x00007fb4e527747e in memp_sync_int (env=0x560aeb039dd0, dbmfp=, trickle_max=446, flags=0, wrote_totalp=0x0, interruptedp=0x0) at ../src/mp/mp_sync.c:483
        bhp =
        bharray = 0x0
        dbmp = 0x560aeb033010
        hp =
        c_mp =
        mp = 0x7fb48cc52138
        mfp =
        mutex =
        last_mf_offset =
        ar_cnt = 2
        ar_max = 444
        i = 0
        n_cache = 0
        remaining = 2
        wrote_total =
        wrote_cnt =
        dirty =
        filecnt =
        maxopenfd = 0
        required_write = 0
        ret =
        t_ret =
#4  0x00007fb4e52878b4 in __txn_checkpoint (env=env@entry=0x560aeb039dd0, kbytes=kbytes@entry=0, minutes=, minutes@entry=0, flags=flags@entry=0) at ../src/txn/txn_chkpt.c:242
        dblp =
        ckp_lsn = {file = 1840, offset = 2259098}
        last_ckp = {file = 1840, offset = 767155}
        msg_lsn = {file = 1840, offset = 2268786}
        mgr =
        region = 0x7fb4db7afea8
        lp = 0x7fb4db68c328
        renv =
        infop =
        last_ckp_time =
        now = 94604892292976
        bytes = 906608
        id = 3106197615
        logflags =
        mbytes = 0
        op =
        ret = 0
#5  0x00007fb4e5287d44 in txn_checkpoint_pp (dbenv=, kbytes=0, minutes=0, flags=0) at ../src/txn/txn_chkpt.c:81
        rep_check = 0
        __t_ret = -514
        ip = 0x0
        env = 0x560aeb039dd0
        ret = 0
#6  0x00007fb4e5500d29 in checkpoint_threadmain (param=) at ../ldap/servers/slapd/back-ldbm/dblayer.c:4740
        time_of_last_checkpoint_completion = 1505191467
        interval = 250
        rval =
        priv = 0x560aeaeb29d0
        li =
        debug_checkpointing = 0
        checkpoint_interval =
        home_dir =
        list = 0x0
        listp =
        penv = 0x560aeb030860
        time_of_last_comapctdb_completion = 1504910533
        compactdb_interval = 2592000
        txn = {back_txn_txn = 0x0}
#7  0x00007fb4ef0f9583 in ?? () from /usr/lib/x86_64-linux-gnu/libnspr4.so
No symbol table info available.
#8  0x00007fb4eea99b50 in start_thread (arg=) at pthread_create.c:304
        res =
        pd = 0x7fb48c450700
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140413424174848, -4971012721935698668, 140736498165136, 140413424175552, 140415106166848, 3, 4929878969768162580, 4929954053154092308}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call =
        freesize =
        __PRETTY_FUNCTION__ = "start_thread"
#9  0x00007fb4ee7e430d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
No locals.
#10 0x0000000000000000 in ?? ()
```

Attaching gdb and setting a few breakpoints indicates that __memp_sync_int never returns (it looks like we never get out of the "Walk the array, writing buffers" loop).
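
Roughly the kind of gdb session that shows this (a sketch rather than the exact commands used; the frame number comes from the backtrace above and libdb debug symbols are assumed to be installed):

```
# Attach to the hung server, switch to the checkpoint thread, and sit on the
# __memp_sync_int frame; if 'finish' never returns, the write loop is stuck.
gdb -p "$(pidof ns-slapd)"
# (gdb) info threads        # find the checkpoint thread
# (gdb) thread <N>          # <N> is instance-specific
# (gdb) frame 3             # the memp_sync_int frame from the trace above
# (gdb) finish              # blocks for as long as __memp_sync_int does not return
```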

time_of_last_comapctdb_completion is from Friday but time_of_last_checkpoint_completion was 4:44:27 AM GMT this morning. For comparison, the last replication began at 4:45:06, last RESULT returned was 4:45:09, and last SRCH logged was 4:45:10.
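
For anyone comparing the numbers, the two epoch values from the backtrace convert like this (GNU date, rendered in UTC/GMT):

```
date -u -d @1505191467   # time_of_last_checkpoint_completion -> Tue Sep 12 04:44:27 UTC 2017
date -u -d @1504910533   # time_of_last_comapctdb_completion  -> Fri Sep  8 22:42:13 UTC 2017
```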

389-ds-bot commented 4 years ago

Comment from twalkertwosigma at 2017-09-20 20:27:16

Well, finally got the dbcache shrunk on all servers Monday at ~5pm but just had one deadlock now (less than 48 hours later). time_of_last_comapctdb_completion was at dirsrv startup, time_of_last_checkpoint_completion was at 14:29:32 GMT, last SRCH and RESULT were logged at 14:30:10, and last replication began at 14:30:02 and was hung after 14:30:09.

Thoughts?

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2017-09-21 08:25:45

Thanks for your tests.

So my understanding is that the frequency of the hangs with the smaller dbcache looks identical to before with the larger dbcache. You used to hit 1-2 deadlocks a week and you just hit one 48h after starting the test, so the dbcache size seems to have no impact on the deadlock or its frequency.

Do you have a production system where the hang does not happen?

As we suspect an external issue (not DB or DS), for example the file system, do you know if the servers are up to date?

389-ds-bot commented 4 years ago

Comment from twalkertwosigma at 2017-09-21 15:59:43

Yes, systems are up to date and using plain old ext4. The only systems on which this does not happen are the replication masters (which have no clients). When we brought a new datacenter online some months back we built out completely fresh VMs on new hypervisors (which have no other VMs running on them) and ran into the same deadlocks almost immediately. So not entirely sure it is external.

At this point, I think the next thing we're going to try is turning down the number of threads per server and monitoring throughput closely.
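
The knob in question is nsslapd-threadnumber (16 in the config shown earlier); a sketch of lowering it, with placeholder URI/credentials, an illustrative value of 8, and a restart on 1.3.x:

```
ldapmodify -H ldap://localhost -D 'cn=Directory Manager' -x -W <<'EOF'
dn: cn=config
changetype: modify
replace: nsslapd-threadnumber
nsslapd-threadnumber: 8
EOF
systemctl restart dirsrv@<instance>   # instance name is deployment-specific
```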

389-ds-bot commented 4 years ago

Comment from tbordaz (@tbordaz) at 2017-09-21 16:08:20

Are the replication masters running on VMs as well? Are they at the same OS level as the hanging systems?

Basically, if it does not hang on some systems, we may be able to identify the differences from the hanging ones. I understand that direct LDAP client requests are one of the differences. I agree it would be interesting to check whether reducing direct LDAP client requests on a system that used to hang prevents the problem.

389-ds-bot commented 4 years ago

Comment from twalkertwosigma at 2017-09-21 17:40:21

Yes to both questions. The replication masters are actually smaller (2-core) VMs running the same OS, libdb, and DS versions, and are on the same hypervisor as at least one, sometimes two, of the client facing servers. The hypervisors themselves are large, usually at minimum 20c/40t with 256G RAM, and usually shared with other infra VMs, but half of the client facing servers in the new DC have an over-spec'd machine to themselves and still encounter deadlocks.

389-ds-bot commented 4 years ago

Comment from lkrispen (@elkris) at 2017-09-21 17:59:58

If you say that the masters have no client load, do you have the same index and idl scan limits on the masters as on the consumers?
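
A hedged way to dump those limits on both a master and a consumer for diffing (standard ldbm attributes; URI/credentials are placeholders):

```
# Global lookthrough/ID-list scan limits plus per-index definitions, suitable
# for diffing a master against a consumer.
ldapsearch -H ldap://localhost -D 'cn=Directory Manager' -x -W -LLL \
  -b 'cn=ldbm database,cn=plugins,cn=config' \
  '(|(cn=config)(objectClass=nsIndex))' \
  cn nsslapd-lookthroughlimit nsslapd-idlistscanlimit nsIndexType nsMatchingRule
```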

389-ds-bot commented 4 years ago

Comment from twalkertwosigma at 2017-09-21 23:13:09

Yes, configs on the replication masters (despite no clients) are the same as ones with clients, right on down to the number of threads. This appears to be something inherited from a prior owner. And I was incorrect, they're even the same size VMs as the servers with clients (4 core, not 2 as I had recalled).

389-ds-bot commented 4 years ago

Comment from firstyear (@Firstyear) at 2017-10-12 17:30:54

@tjaalton Can you comment on the available versions of the package on Debian? Is there something newer we can upgrade to that may resolve the issue?

389-ds-bot commented 4 years ago

Comment from firstyear (@Firstyear) at 2018-01-30 00:18:28

@twalkertwosigma https://bugzilla.redhat.com/show_bug.cgi?id=1349779 this issue could be related, and has been solved. Perhaps @tjaalton can ensure the patch is included in .deb based systems?

389-ds-bot commented 4 years ago

Comment from twalkertwosigma at 2018-01-30 17:54:48

Interesting. Does anyone have a link handy to the libdb bug or patch? Yes, ushering it along with Debian upstream would be helpful...

389-ds-bot commented 4 years ago

Comment from firstyear (@Firstyear) at 2018-01-31 01:03:06

The required libdb patch is here: https://src.fedoraproject.org/rpms/libdb/blob/master/f/checkpoint-opd-deadlock.patch

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2018-06-14 19:59:14

Any update on this after trying the libdb patch?

FYI, we are currently only supporting 389-ds-base-1.3.8 and up. Are you still on 1.3.3?

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2019-01-10 18:01:38

Closing ticket, but please reopen if you encounter similar problems on the latest version of 389-ds-base

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2019-01-10 18:01:48

Metadata Update from @mreynolds389: