djcb / mu

maildir indexer/searcher + emacs mail client + guile bindings
http://www.djcbsoftware.nl/code/mu
GNU General Public License v3.0
1.59k stars 384 forks source link

[mu bug] Repeated database corruption #2706

Open peterhoeg opened 1 month ago

peterhoeg commented 1 month ago

Describe the bug I have been experiencing repeated database corruption on 2 different machines (a desktop and a laptop). I don't know if this is a xapian issue or mu-specific but it's been happening for a while now since the 1.11.x version of mu.

The message about corruption is different from time to time. It just happened again which is why I'm opening this issue and this time the message is:

xapian-check xapian/
docdata:
blocksize=8K items=792095 firstunused=75954 revision=1474 levels=2 root=73078
B-tree checked okay
docdata table structure checked OK

termlist:
blocksize=8K items=1584190 firstunused=143755 revision=1474 levels=2 root=143383
B-tree checked okay
termlist table structure checked OK

postlist:
blocksize=8K items=3771042 firstunused=203935 revision=1474 levels=3 root=155464
xapian-check: DatabaseCorruptError: Block overwritten - run xapian-check on this database

As far as I recall, the error has always been with the postlist although the messages differ.

To Reproduce There are no specific steps to reproduce - this has been happening once a week or so for quite some time.

Environment NixOS 24.05, Emacs 29.3, latest mu (1.12.5 presently), Xapian 1.4.25

Checklist

djcb commented 1 month ago

I've seen this occasionally... what Xapian version are you using?

peterhoeg commented 1 month ago

I should have added that in earlier, apologies.

It's 1.4.25

All dependencies are shown below.

``` /nix/store/gm61h1y42pqyl6178g90x8zm22n6pyy5-libunistring-1.1 /nix/store/ddfzjdykw67s20c35i7a6624by3iz5jv-libidn2-2.3.7 /nix/store/nrwkk6ak3rgkrxbqhsscb01jpzmslf2r-xgcc-13.2.0-libgcc /nix/store/apab5i73dqa09wx0q27b6fbhd1r18ihl-glibc-2.39-31 /nix/store/k927dryij6k8ga8slxs39pjdsmsny8c8-gcc-13.2.0-libgcc /nix/store/p3ffjixpnfgkqh20nsrc13vrj3yfi0nj-gcc-13.2.0-lib /nix/store/0fn1lbdb83gx7n917akp76djfkkc00x2-cld2-unstable-2015-08-21 /nix/store/117avr6c47nih6gy28wid7yr89xvp94s-libgpg-error-1.48 /nix/store/14b6spahplg5gsk3qna7g2vanqrji3pk-libksba-1.6.6 /nix/store/2bbjyrnwajslpr4kyyf5k0snv3cs5kw7-libgcrypt-1.10.3 /nix/store/nw1bnjadmzk9hnd4nbvxmqlh074vr16j-zlib-1.3.1 /nix/store/2lzvm03ybzp2zbvhj1y283087ghjw0k8-kexec-tools-2.0.28 /nix/store/jrrnf6kak9fbfm910506ppws50d33p9f-ncurses-6.4 /nix/store/2xjpd6ikhjajmdik680cvd8md6cikkp3-readline-8.2p10 /nix/store/h3bhzvz9ipglcybbcvkxvm4vg9lwvqg4-bash-5.2p26 /nix/store/xs3y3icbd26gxr617m1alhjba12pkm92-util-linux-minimal-2.39.3-lib /nix/store/2zna3wb0sd0rx6xqjmihaqy1wjzczzyb-xapian-1.4.25 /nix/store/7mnvfrpsbivkxyfbmqczdq5pbb1n59j1-zstd-1.5.6 /nix/store/n87dnp3cw99gssql0xk9yycrhrg50gir-xz-5.4.6 /nix/store/33dl0i8mjh76kanigvkgmwyxp04mv436-kmod-31-lib /nix/store/3iiml9pwff4y6zsis1gp4apbry1h1wfr-xz-5.4.6-bin /nix/store/wwg6376ymkhpjxy6c3w9gkbhv6l2gmqy-libxcrypt-4.4.36 /nix/store/3q6f0k00yg2ddra93ybism3hy1ha3azk-util-linux-minimal-2.39.3-login /nix/store/8f10pxi30lvkdjrkzfw5gzz4rzia8163-openssl-3.0.13 /nix/store/q7640gaa4r0jmjsysa2i7ckai7zr9pkr-keyutils-1.6.3-lib /nix/store/pjmhai9as695ph4ykrx6990pj7xwhpyx-libkrb5-1.21.2 /nix/store/qkpxz2s5aq6lly1lbmaa2ggwmnlslwna-db-5.3.28 /nix/store/4mlz8w1m4420s89sy1rg6d997mbkznsb-cyrus-sasl-2.1.28 /nix/store/q4k8gmw2gpncglwdvz9fmqqwnynsb143-bzip2-1.0.8 /nix/store/5giwsjx504zrgzl7cjzxm0s1423wndif-bzip2-1.0.8-bin /nix/store/5gzyfqc5vzm4gyijkyxw8bl8i91zfk77-gmp-with-cxx-6.3.0 /nix/store/hffrnc162j2j1rvrdnfiy8jflcxzf9mn-json-c-0.17 /nix/store/xzmgdglbldim3k39pjn16vrgxb3i72as-publicsuffix-list-0-unstable-2024-01-07 /nix/store/g6as2fdsiar8wzkkyiaj62z4knmxaxwl-libpsl-0.21.5 /nix/store/nn2ri4n3klfz71f3aiwycykz9c8xqm9p-libssh2-1.11.0 /nix/store/qa2j4ql5vb7n1zyi8ai5acivsjkg0l8m-brotli-1.1.0-lib /nix/store/wi6w7jbnskypgc1dd3asv8p6rmw12ih8-nghttp2-1.61.0-lib /nix/store/his3qhkzxg1s2j34spm27w2f4mvh2y84-curl-8.7.1 /nix/store/5s0glc9fcb902kabb16yl1ak20xnp16w-tpm2-tss-4.1.0 /nix/store/5w1acngnkxqsbidwbfbrrrhrzyjik3d5-libffi-3.4.6 /nix/store/7fmplwm445l3qhp2ck9p8bcbz1aaj5qp-dns-root-data-2023-11-27 /nix/store/v4jrk736s8a8mk3vcf50hx1l0dwa4bdz-libcap-2.69-lib /nix/store/8f64rn26nd0cyy0brblzyn064fmcmy79-systemd-minimal-libs-255.4 /nix/store/rkj3k7pflfvjf7shraqvb1qd9s2dz1zs-gmp-with-cxx-6.3.0 /nix/store/adalw59bvrk2fg2fl662xnc4y2bqn438-nettle-3.9.1 /nix/store/sqgnfysp5n513gh4qqcs69yx9gxylzmv-libevent-2.1.12 /nix/store/dq7kpdqsxm7z7p26f7ppc8bjyfpwqm03-unbound-1.19.3-lib /nix/store/nqd74akpcgp1xfaqbdlfsim807ixnfbc-libtasn1-4.19.0 /nix/store/v48dbyr8cmci56xp91m50z6x65mqws9x-p11-kit-0.25.3 /nix/store/a8j5yhyqcqahvrm6p99gkqs1yjmv9wcn-gnutls-3.8.5 /nix/store/sik7a8dic5lxhm4cp9npibk97x9jcxwf-pcre2-10.43 /nix/store/d0amq09n85152751g6zspl60mrjbadvz-libselinux-3.6 /nix/store/bcms6vbac6jyffwyj8y630brpy1il4ks-glib-2.80.0 /nix/store/yb39i0a63bx8kx2bgmx1p4dx1hcwmnzh-attr-2.5.2 /nix/store/y4psv76i8pllhnsd1dg7zbqsj8a2sz64-acl-2.3.2 /nix/store/ddypgcb25zgi0gwvklxlzazjrh9s63dg-coreutils-9.5 /nix/store/wy37jk2hirzqzx0666w1849kjdgzdam6-gnugrep-3.11 /nix/store/g1zh8pzpgkyl7ja4bbi4syjnif18xasp-zstd-1.5.6-bin /nix/store/hragmx5p3izmwinv4441hj5706sxy4xs-gzip-1.13 /nix/store/bjzji701dz67bfrrwx9smqqyy1pnf2c9-kbd-2.6.4 /nix/store/bwcl7y02qj8mxhwpqflz56wjkaaxmq22-pcsclite-2.1.0-lib /nix/store/gbx1y99nzj2vj03vn23mbnsfx8dwf9kd-sqlite-3.45.3 /nix/store/hdqqlyrv0gwvxnwpayvl7y8glmx81m44-libsodium-1.0.19 /nix/store/in8kwd36w4fcqd8xizr2r8qy1dx0rbx6-libtool-2.4.7-lib /nix/store/fdiy11m9nkph9hrl2k0ibl4xmra9icmp-util-linux-minimal-2.39.3-swap /nix/store/q66l12afk5bly0aprvdxjkjrsz2dbma8-glibc-2.39-31-getent /nix/store/g11dc1qynwyvbqpf40d81f9205w48m76-getent-glibc-2.39-31 /nix/store/hx4sz22j1gij2gshi9w110x39zl7hpd8-libseccomp-2.5.5-lib /nix/store/nnvsjd3f3dh9wdl4s9mwg5cfri8kds5j-bash-interactive-5.2p26 /nix/store/wyk82rbsdshy57l4y5zl5bfnyq03q85f-kmod-31 /nix/store/yi18kda2q14sqkimq75qpn2w1kk2g6d3-util-linux-minimal-2.39.3-mount /nix/store/yhnhwqm09w9302rx24804i67jfp8ar4h-systemd-minimal-255.4 /nix/store/qvgz18lsxl233ph5kk3i1p7lv9wh6p8q-openldap-2.6.7 /nix/store/rs028f1bbm7i0qc3k2gd7j9rk6hywc45-libassuan-2.5.7 /nix/store/vlcly0k4fys9x3ad64faxcyz33a7c00h-libusb-1.0.27 /nix/store/zaf8aj5sl1ml9hcz9b32vyhvjs66wkxv-npth-1.7 /nix/store/fxyrm6pws83nmq89sbxlkaq3ln9kjbl6-gnupg-2.4.5 /nix/store/f59g2b8xbri3gplcai9p90y1p0i0c2yz-gpgme-1.23.2 /nix/store/qsd48ndjzzpy70y0clxjxs2f968lk0hq-gmime-3.2.14 /nix/store/z2ccqp8yipjdk8w2qw9fpgb88swa60h0-mu-1.12.5 ```
peterhoeg commented 1 month ago

Just happened again but with a different error:

Database couldn't be opened for reading: DatabaseCorruptError: Expected block 104050 to be level 3, not 0
Continuing check anyway
docdata:
blocksize=8K items=791426 firstunused=77160 revision=236 levels=2 root=6263
B-tree checked okay
docdata table structure checked OK

termlist:
blocksize=8K items=1582852 firstunused=142454 revision=236 levels=2 root=14600
B-tree checked okay
termlist table structure checked OK

postlist:
xapian-check: DatabaseCorruptError: Expected block 104050 to be level 3, not 0
peterhoeg commented 1 month ago

And a different error again:

docdata:
blocksize=8K items=790419 firstunused=77826 revision=1773 levels=2 root=6812
B-tree checked okay
docdata table structure checked OK

termlist:
blocksize=8K items=1580838 firstunused=143486 revision=1773 levels=2 root=13888
B-tree checked okay
termlist table structure checked OK

postlist:
blocksize=8K items=3734630 firstunused=203197 revision=1773 levels=3 root=22505
xapian-check: DatabaseError: Block 45602 item 51: key >= right dividing key in level above
djcb commented 1 month ago

... working on this.

peterhoeg commented 1 month ago

Is there anything I can do or any additional information that would be helpful?

djcb commented 1 month ago

I'll let you know when I have something for testing. Hopefully this week.

peterhoeg commented 1 month ago

I didn't mean to chase/rush you - I was just thinking if there was anything I could do to help.

There's a new error just now:

docdata:
blocksize=8K items=790760 firstunused=52993 revision=2499 levels=2 root=1389
B-tree checked okay
docdata table structure checked OK

termlist:
blocksize=8K items=1581520 firstunused=147135 revision=2499 levels=2 root=144033
B-tree checked okay
termlist table structure checked OK

postlist:
blocksize=8K items=3751217 firstunused=211301 revision=2499 levels=3 root=55552
xapian-check: DatabaseError: Block 206577 item 65: not in sorted order
djcb commented 1 month ago

I think the problem is with multiple writers / transactions to the Xapian database. It's not really documented exactly what Xapian allows, but likely mu broke a few rules.

Anyway, there's wip/djcb/store-worker which ensures all writes and mu4e commands are handled in a single thread. Works fine for me, with a few (cosmetic) things I want to fix before merging into master. But, should hopefully address the errors you're seeing.

peterhoeg commented 1 month ago

I am still seeing corruption with that branch (commit: 920dbb6a796b6b87da8d44fb491190f918efd7a9):

position: blocksize=8K items=147226877 firstunused=510123 revision=2056 levels=3 root=445762 B-tree checked okay docdata: blocksize=8K items=791953 firstunused=78160 revision=2056 levels=2 root=270 B-tree checked okay docdata table structure checked OK

termlist: blocksize=8K items=1583906 firstunused=144285 revision=2056 levels=2 root=294 B-tree checked okay termlist table structure checked OK

postlist: blocksize=8K items=3751260 firstunused=204576 revision=2056 levels=3 root=95984 B-tree checked okay Failed to unpack wdf termfreq 496 != # of entries 58751 postlist table errors found: 2

position: blocksize=8K items=147226877 firstunused=510123 revision=2056 levels=3 root=445762 B-tree checked okay position table structure checked OK

spelling: Lazily created, and not yet used.

synonym: Lazily created, and not yet used.

Total errors found: 2

On a separate note, due to how things work on NixOS, changing mu also causes a rebuild of mu4e and mu4e doesn’t like the version tag you used 1.12.6-α. I had to patch it to 1.12.5.1 for mu4e to not blow up.

peterhoeg commented 1 month ago

mu4e has also been hanging for with this branch when performing searches. Sending USR2 to the emacs process makes it respond again for a few seconds before hanging again, so I have taken a screenshot of the errors rather than copying out the text as I simply wasn’t fast enough to “get it out” before emacs would hang again.

I have confirmed no errors on the DB by running xapian-check.

EDIT: Let me just elaborate what happens when it hangs:

  1. I use mu4e as normal to search/read/write and everything is normal
  2. After a while I perform a search (either manual or via a bookmark - no difference)
  3. mu4e will say "Searching..." but never return anything and is completely unresponsive
  4. I send USR2 to the emacs process which gives me the debugger (as shown in the screenshot)
  5. emacs is now responsive for a few seconds
  6. emacs hangs again
  7. if I send USR2 and manage to quit mu4e, I can then launch it again which tells me that a stale mu server is being killed and I can search again, which works normally until it at some point stops again
djcb commented 4 weeks ago

Thanks for testing; yes, there's some issue the indexing in the indexing state machine, working on that. For now 1.12.5 is probably better for now...

However, did you see database corruption with the branch with xapian-check? Your two messages seems a bit a contradictory. Thanks!

peterhoeg commented 4 weeks ago

Sequence of events:

  1. re-indexed with 1.12.5 after the most recent corruption
  2. changed to your test branch
  3. after some time, I saw the corruption issue mentioned in https://github.com/djcb/mu/issues/2706#issuecomment-2152425141
  4. re-indexed with your test branch
  5. I haven't seen any corruption issues, but still get the hang mentioned in https://github.com/djcb/mu/issues/2706#issuecomment-2154492275
peterhoeg commented 3 weeks ago

I'm wondering if there is anything special about my setup that's triggering this considering nobody else is experiencing it. Anything you can suggest/any workarounds to try out?

djcb commented 23 hours ago

I don't know... I'm seeing it occasionally.

I only started seeing it recently, perhaps after some system upgrade. I suspect it's some unsafe access to the Xapian database, and some newer versions of Xapian are less forgiving. But.. this is all just speculation.

The "unsafest" time might be trying to use mu while indexing is underway; so not doing that may help. But who knows.