cyrusimap / cyrus-imapd

Cyrus IMAP is an email, contacts and calendar server
http://cyrusimap.org

Cyrus 2.5.10 - CentOS 7.3. Cyrus process hangs holding a lock, causing lots of other processes to hang. #2012

Open rendxq opened 7 years ago

rendxq commented 7 years ago

Today I've caught a bug like the one described in https://github.com/cyrusimap/cyrus-imapd/issues/1599. The behavior is almost the same: lots of imapd processes are stuck waiting in "fcntl(16, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0}" on the cyrus.index and cyrus.header files. And one process hangs in "write(1, "&=\263`\217v9\5\215\275\2243\32\32n 6\230\261\212\24U\277{\342\36\235\7\v\252\271\210"..., 2979" for cyrus.header. After killing that process, users regained access to the shared IMAP folder. How can I collect more information for a case like this?
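For reference, the fcntl(F_SETLKW, {type=F_RDLCK, ...}) seen in that strace output is a blocking request for a shared read lock over the whole file. A minimal standalone C sketch of the call each stuck imapd is effectively making (not Cyrus source; the file path is illustrative):

/* sketch of the blocking read-lock request seen in the strace output */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("cyrus.header", O_RDONLY);   /* illustrative path */
    if (fd < 0) { perror("open"); return 1; }

    struct flock fl = {
        .l_type   = F_RDLCK,   /* shared (read) lock */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,         /* 0 = lock the whole file */
    };

    /* F_SETLKW waits until the lock can be granted, so the process sits
     * here as long as another process holds a conflicting (write) lock. */
    if (fcntl(fd, F_SETLKW, &fl) < 0) perror("fcntl(F_SETLKW)");

    close(fd);
    return 0;
}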

elliefm commented 7 years ago

That sounds like some sort of lock contention for the shared mailbox. Do you have a lot of users that access it simultaneously? Is it on local or remote storage?

elliefm commented 7 years ago

And, is this a recurring problem, or has it just happened once so far?

The problem reported in #1599 is specifically for the 2.4.x version of idled, which was completely replaced in 2.5. But you could use a similar strategy to the one discussed in the comments there to identify which process is holding the lock and what it's blocked on, if it happens again.
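As a starting point for identifying the lock holder without a debugger, fcntl(F_GETLK) reports the PID of a process holding a conflicting lock. A rough standalone sketch (not Cyrus code; on Linux, lslocks or /proc/locks give similar information):

/* sketch: ask the kernel which process would block a write lock on a file
 * usage: ./findlock /path/to/cyrus.header   (argument is illustrative)   */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 2; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct flock fl = {
        .l_type   = F_WRLCK,   /* test for anything conflicting with a write lock */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,
    };

    if (fcntl(fd, F_GETLK, &fl) < 0) { perror("fcntl(F_GETLK)"); return 1; }

    if (fl.l_type == F_UNLCK)
        printf("no conflicting lock held\n");
    else
        printf("conflicting lock held by pid %ld\n", (long)fl.l_pid);

    close(fd);
    return 0;
}

Running something like this against the stuck mailbox's cyrus.header or cyrus.index should print the PID worth looking at more closely.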

rendxq commented 7 years ago

That sounds like some sort of lock contention for the shared mailbox. Do you have a lot of users that access it simultaneously? Is it on local or remote storage?

10-15 simultaneous user connections, plus two robots which read the headers of all messages twice a day. There are 2-10 new messages every minute. The shared folder has between 60,000 and 100,000 messages.

I don't know how to give a correct answer about storage... I think it's both (local and remote). I'll try to describe the storage layers. The mail server (Cyrus) is a "2-node cluster + DRBD + KVM" setup. The KVM VM is CentOS:

mail_spool <- xfs <- LVM <- vda (virtio driver)
Host system 1: RAID10 + 2GB cache + battery -> sda -> LVM -> DRBD
Host system 2: RAID10 + 2GB cache + battery -> sda -> LVM -> DRBD

Also, when the imapd processes hang I see lots of connections stuck in CLOSE_WAIT state for a very long time, for example:

tcp 32 0 x.x.x.115:993 y.y.y.12:55530 CLOSE_WAIT 22151/imapd off (0.00/0/0)
tcp 32 0 x.x.x.115:993 y.y.y.86:58811 CLOSE_WAIT 22294/imapd off (0.00/0/0)
tcp 38 0 x.x.x.115:993 y.y.y.119:50500 CLOSE_WAIT 22512/imapd off (0.00/0/0)
tcp 32 0 x.x.x.115:993 y.y.y.51:50809 CLOSE_WAIT 10606/imapd off (0.00/0/0)
tcp 32 0 x.x.x.115:993 y.y.y.26:52979 CLOSE_WAIT 22155/imapd off (0.00/0/0)
tcp 32 0 x.x.x.115:993 y.y.y.26:52976 CLOSE_WAIT 19452/imapd off (0.00/0/0)
tcp 38 0 x.x.x.115:993 y.y.y.160:58200 CLOSE_WAIT 18394/imapd off (0.00/0/0)
tcp 32 0 x.x.x.115:993 y.y.y.86:58813 CLOSE_WAIT 22987/imapd off (0.00/0/0)
tcp 38 0 x.x.x.115:993 y.y.y.24:57034 CLOSE_WAIT 22730/imapd off (0.00/0/0)

elliefm commented 7 years ago

Also, when the imapd processes hang I see lots of connections stuck in CLOSE_WAIT state for a very long time, for example:

I think these are connections where the client end has given up waiting for a response and disconnected. The Cyrus ends of those sockets haven't been closed yet because the imapd processes are blocked waiting for the lock to become available. When the lock becomes available, each one will wake up in turn, notice the client has disconnected, and shut itself down. But in the meantime, those sockets will hang around in CLOSE_WAIT state.

And one process hangs in "write(1, "&=\263`\217v9\5\215\275\2243\32\32n 6\230\261\212\24U\277{\342\36\235\7\v\252\271\210"..., 2979" for cyrus.header. After killing that process, users regained access to the shared IMAP folder. How can I collect more information for a case like this?

When this happens again, could you please find this process (i.e. the one that is holding the lock and is hanging in the write() call), connect to it with gdb, run the command bt, and paste the gdb output here? Thanks
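For reference, the steps would look roughly like this (the PID is whichever imapd is stuck in write(); if the backtrace shows only hex addresses, installing the cyrus-imapd debuginfo package, e.g. with debuginfo-install on CentOS 7, should fill in the symbol names):

gdb -p <PID>
(gdb) bt
(gdb) detach
(gdb) quit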