ceph-dovecot / dovecot-ceph-plugin

Dovecot plugin for storing mails in a Ceph cluster
Other
131 stars 23 forks source link

dovecot.index reset, view is inconsistent #205

Closed jrse closed 6 years ago

jrse commented 6 years ago

During Imaptest view is now inconsistent messages in dovecot log

During long time imaptests the following situation occurred. Log contains messages like:

Oct 30 07:23:01 imap(t382): Warning: rbox /opt/mail/rbox/t382/mailboxes/Trash/rbox-Mails: Rebuilding index, guid: 6d2093208ff7d75baa6e0000d561c877 , mailbox_name: Trash, alt_storage: (null) Oct 30 07:25:13 imap(t382): Error: /opt/mail/rbox/t382/mailboxes/INBOX/rbox-Mails/dovecot.index reset, view is now inconsistent

The index file look consistent, but there are unreferenced objects in the mail object store. There where errors with insufficient system limits -u which lead to log messages like:

Oct 30 07:21:01 lmtp: Error: Thread::try_create(): pthread_create failed with error 11/home/jrse/projects/ceph/src/common/Thread.cc: In function 'void Thread::create(const char, size_t)' thread 151948b76000 time 2018-10-30 07:21:01.090860 Oct 30 07:21:01 lmtp: Error: /home/jrse/projects/ceph/src/common/Thread.cc: 155: FAILED assert(ret == 0) Oct 30 07:21:01 lmtp: Error: ceph version 14.0.0-2359-g33bc68caf4 (33bc68caf4cbadbf0df86e82d6dbbf0b04540450) nautilus (dev) Oct 30 07:21:01 lmtp: Error: 1: (ceph::__ceph_assert_fail(char const, char const, int, char const)+0x152) [0x15193dede5f5] Oct 30 07:21:01 lmtp: Error: 2: (ceph::ceph_assertf_fail(char const, char const, int, char const, char const, ...)+0) [0x15193dede777] Oct 30 07:21:01 lmtp: Error: 3: (()+0x2a4012) [0x15193df52012] Oct 30 07:21:01 lmtp: Error: 4: (MgrClient::init()+0x30) [0x15193e1d6520] Oct 30 07:21:01 lmtp: Error: 5: (librados::RadosClient::connect()+0x1119) [0x151947198ac9] Oct 30 07:21:01 lmtp: Error: 6: (librmb::RadosClusterImpl::connect()+0x43) [0x151947479f0f] Oct 30 07:21:01 lmtp: Error: 7: (librmb::RadosClusterImpl::io_ctx_create(std::cxx11::basic_string<char, std::char_traits, std::allocator > const&, librados::IoCtx)+0x67) [0x15194747a18d] Oct 30 07:21:01 lmtp: Error: 8: (librmb::RadosStorageImpl::create_connection(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)+0x4b) [0x15194747b5a5] Oct 30 07:21:01 lmtp: Error: 9: (librmb::RadosStorageImpl::open_connection(std::cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)+0xa6) [0x15194747b506] Oct 30 07:21:01 lmtp: Error: 10: (rbox_open_rados_connection()+0x11e) [0x1519476d8f3a] Oct 30 07:21:01 lmtp: Error: 11: (rbox_mail_storage_copy(mail_save_context, mail)+0xab) [0x1519476c8af6] Oct 30 07:21:01 lmtp: Error: 12: (rbox_mail_copy(mail_save_context, mail*)+0xcb) [0x1519476c7225] Oct 30 07:21:01 lmtp: Error: 13: (()+0x5d49) [0x15194877bd49] Oct 30 07:21:01 lmtp: Error: 14: (()+0x4b4f1) [0x1519484764f1] Oct 30 07:21:01 lmtp: Error: 15: (mail_deliver_save()+0x190) [0x15194877c710] Oct 30 07:21:01 lmtp: Error: 16: (mail_deliver()+0x18b) [0x15194877cbfb] Oct 30 07:21:01 lmtp: Error: 17: (()+0x73ac) [0x55b5d76d23ac] Oct 30 07:21:01 lmtp: Error: 18: (io_loop_call_io()+0x52) [0x1519481add02] Oct 30 07:21:01 lmtp: Error: 19: (io_loop_handler_run_internal()+0x12e) [0x1519481af41e] Oct 30 07:21:01 lmtp: Error: 20: (io_loop_handler_run()+0x36) [0x1519481add96] Oct 30 07:21:01 lmtp: Error: 21: (io_loop_run()+0x38) [0x1519481adf48] Oct 30 07:21:01 lmtp: Error: 22: (master_service_run()+0x13) [0x1519481317c3] Oct 30 07:21:01 lmtp: Error: 23: (main()+0x1cc) [0x55b5d76d048c] Oct 30 07:21:01 lmtp: Error: 24: (__libc_start_main()+0xe7) [0x151947d26b97] Oct 30 07:21:01 lmtp: Error: 25: (_start()+0x2a) [0x55b5d76d055a]

To Reproduce Steps to reproduce the behavior:

  1. setup rbox plugin with dovecot
  2. run imaptest (https://github.com/ceph-dovecot/imaptest) clients=1000 standard mailbox and imap-profile.conf ../imaptest/src/
  3. wait.... (it took my system ~24h)
  4. See dovecot log.
  5. use "doveadm rmb ls -u - | more" to see if there are unreferenced objects.

Expected behavior No matter if a connection error occurs, index should be in sync with the objects in the mailstore.

Logfiles, screenshots If applicable, add Logfiles or screenshots to help explain your problem. dovecot.log.zip

$ doveadm rmb ls -u t591 -| more time elapsed loading objects: 26 INDEX: Check box: INBOX.Drafts mails total: 0, missing mails in objectstore: 0 box: INBOX mails total: 911, missing mails in objectstore: 0 box: INBOX.Trash mails total: 0, missing mails in objectstore: 0 box: INBOX.Spam mails total: 303, missing mails in objectstore: 0 box: INBOX.Sent mails total: 0, missing mails in objectstore: 0 mailbox_count: 2

MAILBOX: M(mailbox_guid)=2289751867bdd65b6b1f0000d561c877 B(mailbox_orig_name)=INBOX mail_total=310, mails_displayed=310 mailbox_size=1027685 bytes

     MAIL:   U(uid)=304
             oid = 015ade17e1ded75be9290000d561c877
             R(receive_time)=2018-10-30 05:32:30
             save_time=2018-10-30 07:22:38
             Z(phy_size)=1973 V(v_size) = 2022 stat_size=1973
             M(mailbox_guid)=2289751867bdd65b6b1f0000d561c877
             B(mailbox_orig_name)=INBOX
             G(mail_guid)=5a03eb18deded75b6f0a0000d561c877
             I(rbox_version): 0.1

     <<<   MAIL OBJECT HAS NO INDEX REFERENCE <<<<
     MAIL:   U(uid)=304
             oid = a9b4d01dacf8d75bf1430000d561c877
             R(receive_time)=2018-10-29 15:23:00
             save_time=2018-10-30 07:22:36
             Z(phy_size)=2820 V(v_size) = 2895 stat_size=2820
             M(mailbox_guid)=2289751867bdd65b6b1f0000d561c877
             B(mailbox_orig_name)=INBOX
             G(mail_guid)=aa5dca16c417d75b4d790000d561c877
             I(rbox_version): 0.1

     MAIL:   U(uid)=305
             oid = e1b17a2aa3dfd75b5a010000d561c877
             R(receive_time)=2018-10-30 05:35:44
             save_time=2018-10-30 07:22:38
             Z(phy_size)=773 V(v_size) = 791 stat_size=773
             M(mailbox_guid)=2289751867bdd65b6b1f0000d561c877
             B(mailbox_orig_name)=INBOX
             G(mail_guid)=329f220ea0dfd75be6520000d561c877
             I(rbox_version): 0.1

     <<<   MAIL OBJECT HAS NO INDEX REFERENCE <<<<
     MAIL:   U(uid)=305
             oid = 997c8826acf8d75bf1430000d561c877
             R(receive_time)=2018-10-30 02:22:55
             save_time=2018-10-30 07:22:36
             Z(phy_size)=2596 V(v_size) = 2661 stat_size=2596
             M(mailbox_guid)=2289751867bdd65b6b1f0000d561c877
             B(mailbox_orig_name)=INBOX
             G(mail_guid)=daaa1e186fb2d75ba80b0000d561c877
             I(rbox_version): 0.1

     <<<   MAIL OBJECT HAS NO INDEX REFERENCE <<<<
     MAIL:   U(uid)=306
             oid = 516aec2bacf8d75bf1430000d561c877
             R(receive_time)=2018-10-29 18:13:51
             save_time=2018-10-30 07:22:36
             Z(phy_size)=2754 V(v_size) = 2835 stat_size=2754
             M(mailbox_guid)=2289751867bdd65b6b1f0000d561c877
             B(mailbox_orig_name)=INBOX
             G(mail_guid)=92f6140acf3fd75bd6470000d561c877

Server (please complete the following information):

Client (please complete the following information):

Additional context Test ceph cluster: 4 osd (arm32), 4 mon(arm32), 1 mds (arm64), 1 mgr (arm64).

ulimits: stack size unlimited open files 52k max user processes: 63723

jrse commented 6 years ago

increasing user process limit and, applying the patch seemed to fix this issue. Problem didn't occur anymore.