decred / dcrdata

Decred block explorer, with packages and apps for data collection and storage. Written in Go.
https://dcrdata.decred.org/
ISC License

About to connect the wrong block #1562

Open mkingori opened 4 years ago

mkingori commented 4 years ago

Hi,

When I launch dcrdata with the command ./dcrdata, I get a panic saying it is about to connect the wrong block. See the logs below.

2019-09-29 14:12:38.440 [INF] DATD: Log folder:  /home/kingmoses/.dcrdata/logs/mainnet
2019-09-29 14:12:38.440 [INF] DATD: Config file: /home/kingmoses/.dcrdata/dcrdata.conf
2019-09-29 14:12:38.441 [INF] DATD: dcrdata version 5.2.0-pre+dev (Go version go1.12.9)
2019-09-29 14:12:38.473 [INF] RPCC: Established connection to RPC server localhost:9109
2019-09-29 14:12:38.474 [INF] DATD: Connected to dcrd (JSON-RPC API v6.1.0) on MainNet
2019-09-29 14:12:38.474 [INF] SKDB: Loading ticket pool DB. This may take a minute...
2019-09-29 14:12:38.558 [INF] SKDB: badger: All 1 tables opened in 83ms
2019-09-29 14:12:38.561 [INF] SKDB: Loading all ticket pool diffs...
2019-09-29 14:12:55.277 [INF] SKDB: Successfully loaded 383908 ticket pool diffs
2019-09-29 14:12:55.390 [INF] SKDB: Advancing ticket pool DB to tip via diffs...
2019-09-29 14:12:55.828 [INF] SKDB: Pre-populating live ticket cache and computing pool value...
2019-09-29 14:12:59.121 [INF] DATD: Loaded StakeDatabase at height 383907
2019-09-29 14:12:59.122 [INF] DATD: Setting up the Politeia's proposals clone repository. Please wait...
2019-09-29 14:13:00.819 [INF] DATD: Address cache capacity: 986895 rows, 134217728 bytes
2019-09-29 14:13:00.826 [INF] PSQL: Switching PostgreSQL time zone to UTC for this session.
2019-09-29 14:13:00.828 [INF] PSQL: PostgreSQL 11.5 (Debian 11.5-2+b1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 9.2.1-8) 9.2.1 20190909, 64-bit
2019-09-29 14:13:00.837 [INF] PSQL: DB schema version 1.2.0
2019-09-29 14:13:02.161 [INF] PSQL: Pre-loading unspent ticket info for InsertVote optimization.
2019-09-29 14:13:02.442 [INF] PSQL: Storing data for 43413 unspent tickets in cache.
2019-09-29 14:13:02.451 [INF] PSQL: Setting PostgreSQL DB statement timeout to 1h0m0s.
2019-09-29 14:13:02.675 [INF] EXPR: Mean Voting Blocks calculated: 7860
2019-09-29 14:13:02.708 [INF] EXPR: Starting WebsocketHub run loop.
2019-09-29 14:13:02.711 [INF] PUBS: Starting WebsocketHub run loop.
2019-09-29 14:13:02.724 [INF] IAPI: Started Insight socket.io server for up to 16384 clients.
2019-09-29 14:13:02.726 [INF] DATD: Now serving the explorer and APIs on http://127.0.0.1:7777/
2019-09-29 14:13:02.976 [INF] DATD: Starting blockchain sync...
2019-09-29 14:13:02.977 [INF] PSQL: Current best block (dcrd):        383910
2019-09-29 14:13:02.977 [INF] PSQL: Current best block (primary db):  383908
2019-09-29 14:13:02.977 [INF] PSQL: Current best block (stakedb):     383907
2019-09-29 14:13:02.977 [INF] PSQL: Processing blocks 383909 to 383910...
panic: about to connect the wrong block: 383909, 383907

goroutine 383963 [running]:
github.com/decred/dcrdata/db/dcrpg/v4.(*ChainDB).SyncChainDB(0xc00037a2c0, 0x12fd500, 0xc00007bf00, 0x130c420, 0xc001839280, 0xc00a880000, 0x0, 0xc001d1f0e0, 0x0, 0x0, ...)
    /home/kingmoses/go/src/github.com/decred/dcrdata/db/dcrpg/sync.go:414 +0x4905
github.com/decred/dcrdata/db/dcrpg/v4.(*ChainDB).SyncChainDBAsync(0xc00037a2c0, 0x12fd500, 0xc00007bf00, 0xc001842660, 0x130c420, 0xc001839280, 0x0, 0x0, 0xc001d1f0e0)
    /home/kingmoses/go/src/github.com/decred/dcrdata/db/dcrpg/sync.go:112 +0xaf
created by main._main.func10
    /home/kingmoses/go/src/github.com/decred/dcrdata/main.go:812 +0x23f
chappjc commented 4 years ago

Some info that would help:

  1. Details of how you got to this point. Had you been running without trouble prior to this? Did the initial sync on a previous run complete all the way? Did dcrdata crash without a clean shutdown? Can you supply logs?
  2. Are you able to reproduce this from scratch? Not just from the broken data folder, but from an empty DB and data folder, and with a certain set of steps.

To recover, you can usually start dcrdata with the following switch added to your usual command: --purge-n-blocks N, where N is at least the number that "stakedb" is behind "primary db". In this case, just use 32 or more.
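
For illustration, a minimal recovery invocation based on the above might look like this (just a sketch; the binary location and the value of N depend on your setup):

    ./dcrdata --purge-n-blocks=32    # purge the most recent 32 blocks so they are re-processed on the next sync

Once the purged blocks have been re-processed, subsequent runs can go back to the usual command.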

mkingori commented 4 years ago

It seems dcrdata took too long to shut down, and I was forced to close the terminal instead. The --purge-n-blocks N switch helped me solve the issue.

chappjc commented 4 years ago

I'm glad that resolved it for you.

Since it seems to be relatively common that dcrdata gets killed forcibly without shutting down gracefully, and the panic message says nothing about how to recover by purging blocks (nor does dcrdata try to do it automatically), I'm going to reopen this issue so we can resolve it properly.

ukane-philemon commented 2 years ago

@chappjc I have been facing this issue for some time now, and purging didn't help. In my case, dcrdata was running on a dev server without any sort of interruption. I found a solution that worked for me, but I'm not certain whether it has consequences.

Also, I found that duplicate rows are not removed from the votes, misses, and tickets tables. https://github.com/decred/dcrdata/blob/fab8a7e7bde6f4382ae9613db706a41788e7a0f3/db/dcrpg/sync.go#L393-L402 https://github.com/decred/dcrdata/blob/fab8a7e7bde6f4382ae9613db706a41788e7a0f3/db/dcrpg/tables.go#L199-L312

Addressed HERE.

chappjc commented 2 years ago

Good finds.

Any idea how to reproduce that error? I literally never hit it, and I run several servers in production. An unclean shutdown is the only thing that can cause it, AFAIK.

ukane-philemon commented 2 years ago

I dropped the dcrdata db, deleted the data dir, and ran dcrdata afresh, uninterrupted (under pm2), but the panic still persisted until I applied a fast-forward to take stakedb to the main db height.

chappjc commented 2 years ago

You have to be dropping the wrong data. How else would it report any db heights?

chappjc commented 2 years ago

Wait, how does pm2 start and stop dcrdata? Also, what does that do for you? You can run it in a tmux session instead.
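
For example, a minimal tmux workflow would be something like this (a sketch; the session name is arbitrary):

    tmux new -s dcrdata      # create and attach a session
    ./dcrdata                # run dcrdata inside it; detach with Ctrl-b then d
    tmux attach -t dcrdata   # reattach later to check on it, or stop it cleanly with Ctrl-C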

ukane-philemon commented 2 years ago

You have to be dropping the wrong data. How else would it report any db heights?

I removed the ~/.dcrdata/data path and dropped the dcrdata db, so I am starting afresh. pm2 helps me keep the dcrdata process running on my dev server while I am offline. Maybe I will give tmux a try.

chappjc commented 2 years ago

What I don't understand about pm2 is how it stops dcrdata. You cannot just kill it.

I removed the ~/.dcrdata/data path and dropped the dcrdata db, so I am starting afresh.

If dcrdata reports any positive DB heights other than dcrd itself, you've not started fresh. Paste the part of the startup logs that looks like this:

[INF] PSQL: DB schema version 1.11.0
[DBG] PSQL: meta height 628189 / blocks height 628189
...
[DBG] DATD: chainDB height: 628189
...
[INF] DATD: Starting blockchain sync...
[INF] PSQL: Current best block (dcrd):        631276
[INF] PSQL: Current best block (primary db):  628189
[INF] PSQL: Current best block (stakedb):     628189

Also please paste the error message with the block heights.

chappjc commented 2 years ago

BTW, when I ask how to "reproduce that error", I'm asking for the steps that cause the error starting from a functioning setup. I understand you are getting it repeatedly, but a reproduction is how you got to that state.

ukane-philemon commented 2 years ago

First log output from dcrdata:

2022-01-19 20:59:07.531 [INF] PSQL: Empty database "dcrdata". Creating tables...
2022-01-19 20:59:07.532 [INF] PSQL: Creating the "meta" table.
2022-01-19 20:59:07.552 [INF] PSQL: Creating the "blocks" table.
2022-01-19 20:59:07.560 [INF] PSQL: Creating the "transactions" table.
2022-01-19 20:59:07.564 [INF] PSQL: Creating the "vins" table.
2022-01-19 20:59:07.578 [INF] PSQL: Creating the "vouts" table.
2022-01-19 20:59:07.583 [INF] PSQL: Creating the "block_chain" table.
2022-01-19 20:59:07.590 [INF] PSQL: Creating the "addresses" table.
2022-01-19 20:59:07.595 [INF] PSQL: Creating the "tickets" table.
2022-01-19 20:59:07.598 [INF] PSQL: Creating the "votes" table.
2022-01-19 20:59:07.602 [INF] PSQL: Creating the "misses" table.
2022-01-19 20:59:07.607 [INF] PSQL: Creating the "agendas" table.
2022-01-19 20:59:07.611 [INF] PSQL: Creating the "agenda_votes" table.
2022-01-19 20:59:07.613 [INF] PSQL: Creating the "testing" table.
2022-01-19 20:59:07.615 [INF] PSQL: Creating the "stats" table.
2022-01-19 20:59:07.620 [INF] PSQL: Creating the "treasury" table.
2022-01-19 20:59:07.622 [INF] PSQL: Creating the "swaps" table.
2022-01-19 20:59:07.630 [INF] PSQL: Pre-loading unspent ticket info for InsertVote optimization.
2022-01-19 20:59:07.631 [INF] PSQL: Setting PostgreSQL DB statement timeout to 20m0s.
2022-01-19 20:59:07.648 [INF] DATD: chainDB block summary table is empty.
2022-01-19 20:59:07.654 [INF] AGDB: agendas.db version 1.0.0 was set
2022-01-19 20:59:07.662 [INF] PRDB: proposals.db version 2.0.0 was set
2022-01-19 20:59:07.838 [INF] EXPR: Mean Voting Blocks calculated: 1006
2022-01-19 20:59:07.886 [INF] EXPR: Starting WebsocketHub run loop.
2022-01-19 20:59:07.887 [INF] PUBS: Starting WebsocketHub run loop.
2022-01-19 20:59:07.914 [INF] DATD: chainDB block summary table is empty.
2022-01-19 20:59:07.914 [INF] IAPI: Started Insight socket.io server.
2022-01-19 20:59:07.916 [INF] DATD: Now serving the explorer and APIs on http://localhost:17778/
2022-01-19 20:59:08.167 [INF] DATD: Starting blockchain sync...
2022-01-19 20:59:08.168 [INF] PSQL: Tables are empty, starting fresh.
2022-01-19 20:59:08.169 [INF] PSQL: Current best block (dcrd):        341542
2022-01-19 20:59:08.169 [INF] PSQL: Current best block (primary db):  -1
2022-01-19 20:59:08.169 [INF] PSQL: Current best block (stakedb):     0
2022-01-19 20:59:08.169 [INF] PSQL: Collecting all UTXO data prior to height 0...
2022-01-19 20:59:08.169 [INF] PSQL: Pre-warming UTXO cache with 0 UTXOs...
2022-01-19 20:59:08.169 [INF] PSQL: UTXO cache is ready.
2022-01-19 20:59:08.169 [INF] PSQL: Large bulk load: Removing indexes and disabling duplicate checks.
2022-01-19 20:59:08.172 [INF] PSQL: Creating temporary index on addresses(tx_vin_vout_row_id).
2022-01-19 20:59:08.174 [INF] PSQL: Beginning SYNC STAGE 1 of 5 (block data import).
2022-01-19 20:59:08.174 [INF] PSQL: Scanning genesis block into chain db.
2022-01-19 20:59:08.187 [INF] PSQL: Processing blocks 1 to 500...
2022-01-19 20:59:15.555 [INF] PSQL: Processing blocks 501 to 1000...
2022-01-19 20:59:25.511 [INF] PSQL: Processing blocks 1001 to 1500...
2022-01-19 20:59:28.193 [INF] PSQL: ( 53 blk/s,  480 tx/s,  562 vin/sec, 1697 vout/s)
2022-01-19 20:59:48.182 [INF] PSQL: ( 21 blk/s,  540 tx/s,  746 vin/sec, 2011 vout/s)
2022-01-19 20:59:48.635 [INF] PSQL: Processing blocks 1501 to 2000...
2022-01-19 21:00:07.542 [INF] PSQL: Processing blocks 2001 to 2500...

Everything went well until:

panic: About to connect the wrong block: 549218, 546806
The stake database is corrupted. Restart with --purge-n-blocks=4824 to recover.

goroutine 1 [running]:
github.com/decred/dcrdata/db/dcrpg/v7.(*ChainDB).SyncChainDB.func6(0x86162, 0x128cf48, 0x32, 0xc017afa1a0)
    /root/dcrdata/db/dcrpg/sync.go:308 +0x13bf
github.com/decred/dcrdata/db/dcrpg/v7.(*ChainDB).SyncChainDB(0xc0004c8000, 0x161d5f8, 0xc000357200, 0x161da20, 0xc017277180, 0x101, 0x0, 0xc017436540, 0x0, 0x0, ...)
    /root/dcrdata/db/dcrpg/sync.go:361 +0xfc3
main._main.func10(0x0, 0x0, 0x0)
    /root/dcrdata/cmd/dcrdata/main.go:812 +0xd7
main._main(0x161d5f8, 0xc000357200, 0x0, 0x0)
    /root/dcrdata/cmd/dcrdata/main.go:825 +0x3839
main.main()
    /root/dcrdata/cmd/dcrdata/main.go:60 +0x85
panic: About to connect the wrong block: 549217, 546806
The stake database is corrupted. Restart with --purge-n-blocks=4822 to recover.

Since then, the panic has persisted at different heights. At some point I did purge the main db (which took over 8 hours), but as soon as it had synced for some time, I saw the same panic error.

chappjc commented 2 years ago

Thank you for pasting that. I am surprised anything like that happened during the initial sync.

If the initial sync (pre-indexing) is interrupted, you really shouldn't bother trying to purge anything; just drop the whole table and datadir. So have you not been able to get a full sync of dcrdata so far because of this? Always panic during stage 1 of 5?

ukane-philemon commented 2 years ago

So have you not been able to get a full sync of dcrdata so far because of this? Always panic during stage 1 of 5?

Yes.

chappjc commented 2 years ago

That sucks. I don't know what to recommend. I assume you're not out of disk space? What did you mean by "fast forward"? Like this? https://github.com/decred/dcrdata/blob/fab8a7e7bde6f4382ae9613db706a41788e7a0f3/db/dcrpg/chkdcrpg/main.go#L197-L221

BTW, those blk/s rates are slow. What kind of disk are you using? Not that it would cause the error, just slowness.

chappjc commented 2 years ago

2022-01-19 20:59:08.169 [INF] PSQL: Current best block (dcrd): 341542

Do you know why this is not more like 870897? Is dcrd syncing at the same time?

ukane-philemon commented 2 years ago

Do you know why this is not more like 870897? Is dcrd syncing at the same time?

As of then, yes. dcrd finished syncing in less than 8 hours.

chappjc commented 2 years ago

OK, don't try to do the dcrdata sync until dcrd has finished its sync/IBD (initial block download). Does it still panic even if dcrd is idle and all synchronized with the network?
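
For reference, one quick way to confirm dcrd is done with its IBD before starting the dcrdata sync is to ask it directly via dcrctl (a sketch; it assumes dcrctl is configured with your dcrd RPC credentials, add --testnet if applicable):

    dcrctl getblockchaininfo
    # in the JSON result, "initialblockdownload" should be false and
    # "blocks" should match "headers"/"syncheight" before starting dcrdata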

ukane-philemon commented 2 years ago

What did you mean by "fast forward"? Like this?

Yeah, the "fast forward" loop runs in the main import loop in place of the panic.

Does it still panic even if dcrd is idle and all synchronized with the network?

It still panics.

chappjc commented 2 years ago

OK, I'm going to run a fresh sync on both mainnet and testnet. The fast-forward loop is not a solution, though. We need to get to the root cause.

ukane-philemon commented 2 years ago

I am running a dev server with Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-96-generic x86_64) and 154 GB of disk space.

chappjc commented 2 years ago

And how is disk utilization, then? The production testnet machine uses up almost all of that on just the PostgreSQL DB:

postgres=# SELECT datname, pg_size_pretty(pg_database_size(datname)) db_size FROM pg_database ORDER BY db_size;
  datname  | db_size 
-----------+---------
 dcrdata   | 145 GB

The dcrdata datadir will take about 1.5GB on top of that.

The dcrd datadir is about 21G.

df -H says the drive is at 221G utilization with everything else.

ukane-philemon commented 2 years ago

df -H says the drive is at 221G utilization with everything else.

That's a lot of space; I might be able to add more soon. But without the "fast-forward" I can't get past the initial block import.

chappjc commented 2 years ago

I think you're out of disk space... When you hit the error, check df -h and report back. Also check the dcrd logs and journalctl -e.
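
Concretely, something like the following right after the panic (a sketch; the dcrd log path assumes the default ~/.dcrd data directory on mainnet, use logs/testnet3 for testnet):

    df -h                                       # overall disk usage
    journalctl -e                               # recent system log (OOM killer, I/O errors, etc.)
    tail -n 100 ~/.dcrd/logs/mainnet/dcrd.log   # recent dcrd log entries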

ukane-philemon commented 2 years ago

I think you're out of disk space...

Yeah, I noticed it while setting spending info in the addresses table.

chappjc commented 2 years ago

That's a lot of space

Yeah, 120 GB is a surprising amount. On mainnet:

postgres=# SELECT datname, pg_size_pretty(pg_database_size(datname)) db_size FROM pg_database ORDER BY db_size;
     datname     | db_size 
-----------------+---------
 dcrdata_mainnet | 85 GB

I'm not sure why testnet surged so much recently.

ukane-philemon commented 2 years ago

df -h report:

udev            3.9G     0  3.9G   0% /dev
tmpfs           796M  1.1M  795M   1% /run
/dev/vda1       155G  143G   13G  93% /
tmpfs           3.9G   16K  3.9G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/vda15      105M  5.2M  100M   5% /boot/efi
/dev/loop2       68M   68M     0 100% /snap/lxd/21835
/dev/loop1       68M   68M     0 100% /snap/lxd/21545
/dev/loop3       44M   44M     0 100% /snap/snapd/14295
/dev/loop4       44M   44M     0 100% /snap/snapd/14549
/dev/loop5       62M   62M     0 100% /snap/core20/1270
tmpfs           796M     0  796M   0% /run/user/0
/dev/loop6       62M   62M     0 100% /snap/core20/1328

chappjc commented 2 years ago

You're scraping the limits of usable space there. You should check the PostgreSQL logs as well as the dcrd logs and journalctl. I'd be very surprised if at least one of those wasn't throwing messages about failure to allocate. Please try a machine with more space. However, you might be able to sync mainnet on that machine, since for whatever reason it needs less space than testnet. Even that will be tight with both dcrdata and dcrd on the same drive.
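
On a stock Ubuntu/Debian setup like the one described above, the PostgreSQL server log can usually be found in one of these ways (a sketch; the version number and cluster name in the path may differ):

    journalctl -u postgresql -e                               # if the cluster is managed by systemd
    tail -n 200 /var/log/postgresql/postgresql-12-main.log    # default Debian/Ubuntu log file location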

ukane-philemon commented 2 years ago

dcrd has been running smoothly. For PostgreSQL and the stake DB, space wasn't a concern during the block import stage. I will add more space soon, though, since a complete sync would require about 150 GB dedicated to dcrdata alone.

chappjc commented 2 years ago

Fresh mainnet sync, stage 1 passed:

2022-02-06 17:44:07.553 [INF] PSQL: Processing blocks 631001 to 631317...
2022-02-06 17:44:15.453 [INF] PSQL: ( 40 blk/s,  725 tx/s, 4134 vin/sec, 4953 vout/s)
2022-02-06 17:44:15.887 [INF] PSQL: Avg. speed: 1145 tx/s, 5152 vout/s
2022-02-06 17:44:15.887 [INF] PSQL: Block import elapsed: 162.01 minutes, 631318 blocks (64.95 blocks/s)
2022-02-06 17:44:15.887 [INF] PSQL: Beginning SYNC STAGE 2 of 5 (duplicate row removal).
...
2022-02-06 17:49:23.124 [INF] PSQL: Beginning SYNC STAGE 3 of 5 (table indexing and analyzing).
...
2022-02-06 17:56:38.334 [INF] PSQL: Beginning SYNC STAGE 4 of 5 (deep database ANALYZE).
2022-02-06 17:57:38.349 [INF] PSQL: Beginning SYNC STAGE 5 of 5 (setting spending info in addresses table). This will take a while.
2022-02-06 17:57:38.351 [INF] PSQL: Updating address rows for blocks [0,9999]...
2022-02-06 17:57:52.104 [INF] PSQL: Updating address rows for blocks [10000,19999]...
...
2022-02-06 18:17:49.033 [INF] PSQL: Updating address rows for blocks [630000,631317]...
2022-02-06 18:17:52.871 [INF] PSQL: Updated 33460663 rows of addresses table.
2022-02-06 18:17:52.872 [INF] PSQL: Indexing addresses table on matching_tx_hash...
2022-02-06 18:19:16.418 [INF] PSQL: Performing an ANALYZE(600) on addresses table...
2022-02-06 18:19:19.576 [INF] PSQL: Catching up with network at block height 631326 from 631317...
2022-02-06 18:19:20.610 [INF] PSQL: SYNC COMPLETED at height 631326. Delta:
                631327 blocks
              11138463 transactions
              37259547 ins
              50082487 outs
              72150839 addresses

Will update when I do testnet3.

chappjc commented 2 years ago

testnet:

2022-02-06 22:08:39.126 [INF] PSQL: Processing blocks 870501 to 871000...
2022-02-06 22:08:46.132 [INF] PSQL: ( 42 blk/s,  691 tx/s, 4080 vin/sec, 4828 vout/s)
2022-02-06 22:08:51.330 [INF] PSQL: Processing blocks 871001 to 871106...
2022-02-06 22:08:53.892 [INF] PSQL: Avg. speed: 961 tx/s, 4721 vout/s
2022-02-06 22:08:53.892 [INF] PSQL: Block import elapsed: 223.13 minutes, 871107 blocks (65.07 blocks/s)
2022-02-06 22:08:53.893 [INF] PSQL: Beginning SYNC STAGE 2 of 5 (duplicate row removal).
...
2022-02-06 22:15:37.713 [INF] PSQL: Beginning SYNC STAGE 3 of 5 (table indexing and analyzing).
...
2022-02-06 22:24:56.163 [INF] PSQL: Beginning SYNC STAGE 4 of 5 (deep database ANALYZE).
2022-02-06 22:26:11.787 [INF] PSQL: Beginning SYNC STAGE 5 of 5 (setting spending info in addresses table). This will take a while.
2022-02-06 22:26:11.794 [INF] PSQL: Updating address rows for blocks [0,9999]...
2022-02-06 22:26:33.473 [INF] PSQL: Updating address rows for blocks [10000,19999]...
...
2022-02-06 22:55:34.173 [INF] PSQL: Updating address rows for blocks [870000,871106]...
2022-02-06 22:55:38.609 [INF] PSQL: Updated 43088652 rows of addresses table.
2022-02-06 22:55:38.609 [INF] PSQL: Indexing addresses table on matching_tx_hash...
2022-02-06 22:57:45.107 [INF] PSQL: Performing an ANALYZE(600) on addresses table...
2022-02-06 22:57:52.259 [INF] PSQL: Catching up with network at block height 871135 from 871106...
2022-02-06 22:57:54.661 [INF] PSQL: SYNC COMPLETED at height 871135. Delta:
                871136 blocks
              12866327 transactions
              48427059 ins
              63206847 outs
              91843276 addresses

ukane-philemon commented 2 years ago

Mainnet stage 1 syncing in < 3 hrs and testnet in < 4 hrs is fast, and no connect-block error. I will run testnet once I top up my server space.