erthink / libmdbx

One of the fastest embeddable key-value ACID database without WAL. libmdbx surpasses the legendary LMDB in terms of reliability, features and performance.
https://erthink.github.io/libmdbx/
Other
1.16k stars 111 forks source link

Corrupted DB on Linux after closing down gracefully #188

Closed AlexeyAkhunov closed 3 years ago

AlexeyAkhunov commented 3 years ago

I have performed the full sync of turbo-geth node with the new version of MDBX (after ReadAhead issue was fixed), the sync started on 18th of April 2021. I have been running it until 21st of April and showdown gracefully. In our logs, these were the messages at shutdown:

INFO [04-21|13:17:28.278] Commit cycle                             in=4.607146565s
INFO [04-21|13:17:34.181] Got interrupt, shutting down...          sig=terminated
INFO [04-21|13:17:34.181] Got interrupt, shutting down...
INFO [04-21|13:17:34.188] Ethereum protocol stopped
INFO [04-21|13:17:34.189] Blockchain stopped
INFO [04-21|13:17:34.193] Transaction pool stopped
INFO [04-21|13:17:34.701] database closed (LMDB)                   lmdb=nodes
INFO [04-21|13:17:35.685] database closed (MDBX)                   mdbx=chaindata

MDBX version used for the above is #define MDBX_BUILD_SOURCERY ae952107e9ef79b10938353246db3c1ad3472520ce329559e432faf866794c89_v0_9_3_102_gb39cdcab

Then, I tried to start using the same database again, and it reported corruption:

INFO [04-25|09:51:36.070] Starting Turbo-Geth on Ethereum mainnet...
INFO [04-25|09:51:36.071] Bumping default cache on mainnet         provided=0 updated=4096
INFO [04-25|09:51:36.071] Dropping default light client cache      provided=0 updated=128
INFO [04-25|09:51:36.072] Maximum peer count                       ETH=50 LES=0 total=50
INFO [04-25|09:51:36.073] Set global gas cap                       cap=25000000
INFO [04-25|09:51:36.079] database closed (MDBX)                   mdbx=chaindata
INFO [04-25|09:51:36.080] Opening Database (MDBX)                  mapSize="0 B"
INFO [04-25|09:51:36.080] Writing default main-net genesis block
Fatal: Failed to register the Ethereum service: mdbx_cursor_put: MDBX_CORRUPTED: Database is corrupted

MDBX version used for the above is #define MDBX_BUILD_SOURCERY 8d84a8c94bf33a8a4b959516546f5a33458865e5e2d0f744710c43cf76e37cbd_v0_9_3_125_g2c717590

As @AskAlexSharov suggested, I turned on these extra debug things:

_ = env.SetDebug(mdbx.LogLvlExtra, mdbx.DbgAssert, mdbx.LoggerDoNotChange)

and tried again, and here is what I got:

INFO [04-26|09:29:49.172] Build info                               git_branch=mdbx_debug git_commit=df9b68b609678f92f61063250a6608c3bd964f5c
INFO [04-26|09:29:49.172] Starting Turbo-Geth on Ethereum mainnet...
INFO [04-26|09:29:49.173] Maximum peer count                       ETH=50 total=50
INFO [04-26|09:29:49.173] Set global gas cap                       cap=25000000
INFO [04-26|09:29:49.213] Opening Database (MDBX)                  mapSize="0 B"
mdbx_setup_dxb:14398 updating meta.geo: from l3-n273096532-u536870912/s65536-g65536 (txn#81721), to l3-n273096532-u536870912/s1048576-g524288 (txn#81722)
mdbx_set_readahead:8470 readahead OFF 0..273096532
badpage: corrupted branch-page #2326087, mod-txnid 81721
badpage: page mod-txnid (81721) > parent (81719)
panic: create bucket: snap, mdbx_dbi_open: MDBX_PAGE_NOTFOUND: Requested page not found

goroutine 1 [running]:
github.com/ledgerwatch/turbo-geth/turbo/node.RegisterEthService(...)
        github.com/ledgerwatch/turbo-geth/turbo/node/node.go:94
github.com/ledgerwatch/turbo-geth/turbo/node.New(0xc0002c82c0, 0xc000426850, 0x1f5ab20, 0x28, 0x1f235d8, 0xa, 0x0, 0x2)
        github.com/ledgerwatch/turbo-geth/turbo/node/node.go:83 +0x44b
main.runTurboGeth(0xc0002c82c0)
        github.com/ledgerwatch/turbo-geth/cmd/tg/main.go:55 +0x3e5
github.com/urfave/cli.HandleAction(0x13898a0, 0x1df03f8, 0xc0002c82c0, 0xc0002c82c0, 0x0)
        github.com/urfave/cli@v1.22.4/app.go:528 +0x59
github.com/urfave/cli.(*App).Run(0xc0002c6fc0, 0xc00003c0a0, 0x5, 0x5, 0x0, 0x0)
        github.com/urfave/cli@v1.22.4/app.go:288 +0x675
main.main()
        github.com/ledgerwatch/turbo-geth/cmd/tg/main.go:27 +0x87

For the step above, the version of MDBX was #define MDBX_BUILD_SOURCERY 8d84a8c94bf33a8a4b959516546f5a33458865e5e2d0f744710c43cf76e37cbd_v0_9_3_125_g2c717590

Also, as @AskAlexSharov suggested, I ran mdbx_chk on the database, and this is the output I got:

alexey@tgtest:~/turbo-geth$ ./build/bin/mdbx_chk /home/alexey/data1/tg/chaindata/
mdbx_chk v0.9.2-14-g166ed1c7 (2020-12-15T15:43:19+03:00, T-07621995c7dbb4f58175bed4a7b13928915f9744)
Running for /home/alexey/data1/tg/chaindata/ in 'read-only' mode...
Traversal b-tree by txn#81721...
Iterating DBIs...
 ! corrupted page #2326087, mod-txnid 81721
 ! page mod-txnid (81721) > parent (81719)
 ! mdbx_cursor_get failed, error -30796 MDBX_CORRUPTED: Database is corrupted
 - problems: different number of entries (1)
Total 4 errors are detected, elapsed 19529.329 seconds.

For the step above, version of MDBX was #define MDBX_BUILD_SOURCERY 8d84a8c94bf33a8a4b959516546f5a33458865e5e2d0f744710c43cf76e37cbd_v0_9_3_125_g2c717590

I can keep the database file around for further investigations, if required

Thanks a lot in advance

AlexeyAkhunov commented 3 years ago

Perhaps this was caused by incompatible versions we used (DB was created by version from devel branch, and reported corruption was by code from master branch. When I tried to same with the devel branch, I got no corruption on opening, but I could not write:

INFO [04-26|09:53:31.807] Opening Database (MDBX)                  mapSize="0 B"
panic: create bucket: clique-snapshots-, mdbx_dbi_open: operation not permitted

goroutine 1 [running]:
github.com/ledgerwatch/turbo-geth/turbo/node.RegisterEthService(...)

For the step above, the version of MDBX was #define MDBX_BUILD_SOURCERY bbcb0fd59ce19d2a8624d61216a54afc8ff2e62bc38e846a0ea2f96f3999bda5_v0_9_3_150_g31cfce4c

erthink commented 3 years ago

As I understand it this is the same problem that is specified in https://github.com/erthink/libmdbx/issues/187 However, some differences in information significantly change the essence. Please specify which version (i.e. the commit hash) you used for each run/step.

Nonetheless, the page mod-txnid (81721) > parent (81719) error cannot be due to incompatibility with previous versions of mdbx_chk.

For mdbx_dbi_open: operation not permitted - this is a very unexpected error. Are you sure that no other errors were returned before, that a read-write transaction is being used, or that the table being opened exists?

AlexeyAkhunov commented 3 years ago

Are you sure that no other errors were returned before, that a read-write transaction is being used, or that the table being opened exists?

It tries to create the table (it shows as bucket in our logs) when it hits is operation not permitted. I will debug to check if the tx is in fact read-write

AlexeyAkhunov commented 3 years ago

Please specify which version (i.e. the commit hash) you used for each run/step.

I've updated my comments with versions of MDBX used

erthink commented 3 years ago

Thanks for exactly version information.

  1. It looks like the main problem (DB corruption through the wrong mod-txnid of a parent-page) was a temporary regression that is already fixed between b39cdcab and 2c717590. In any way, it makes no sense to dig into the old version, but it is more rational to repeat the check with the new one, since there were many changes (both useful and significantly changing the picture). The current devel (3e272d339a336657d0032369efbaa54b965348b6) branch is quite suitable for such a check, even without an experimental a8b6a30a2450747272dca2b7c49bd3ca9b9c607d commit. In addition, the "Spilling with LRU policy" feature for https://github.com/erthink/libmdbx/issues/186 will be ready soon (probably today). So reasonable you should wait and check out the latest version.
  2. I guess that there is some eligible reason for the operation not allowed error. Otherwise I suggest create an separate issue for it.
AlexeyAkhunov commented 3 years ago

The not permitted error seems to happen because the corruption was detected:

INFO [04-26|13:08:35.248] Build info                               git_branch=mdbx_debug git_commit=790ee7663f80d2b4c52e46bfccb8617b8a57c28a
INFO [04-26|13:08:35.248] Starting Turbo-Geth on Ethereum mainnet...
INFO [04-26|13:08:35.249] Maximum peer count                       ETH=50 total=50
INFO [04-26|13:08:35.249] Set global gas cap                       cap=25000000
INFO [04-26|13:08:35.288] Opening Database (MDBX)                  mapSize="0 B"
mdbx_set_readahead:8674 readahead OFF 0..273096532
badpage: corrupted branch-page #2326087, mod-txnid 81721
badpage: invalid page txnid (81721) for parent-page' txnid (81719)

panic: create bucket 2: snap, mdbx_dbi_open: operation not permitted

Should I abandon this DB file and redo the sync?

AlexeyAkhunov commented 3 years ago

I have started the new sync with the version #define MDBX_BUILD_SOURCERY bbcb0fd59ce19d2a8624d61216a54afc8ff2e62bc38e846a0ea2f96f3999bda5_v0_9_3_150_g31cfce4c

If you think this should not contain the problem, please close the issue

erthink commented 3 years ago

The not permitted error seems to happen because the corruption was detected:

So there is a minor issue of losing the original error code. I think I'll fix this today.

Should I abandon this DB file and redo the sync?

Most likely, the error is only in the incorrect mod-txnid value(s) and there is actually no other damage in the database. So there are three options:

erthink commented 3 years ago

I have started the new sync with the version #define MDBX_BUILD_SOURCERY bbcb0fd59ce19d2a8624d61216a54afc8ff2e62bc38e846a0ea2f96f3999bda5_v0_9_3_150_g31cfce4c

If you think this should not contain the problem, please close the issue

I suggest close if there are no problems with the current version.

AlexeyAkhunov commented 3 years ago

Looks like it works now, closing