cculianu / Fulcrum

A fast & nimble SPV Server for BCH, BTC, and LTC

FATAL: Caught exception: It appears that Fulcrum was forcefully killed in the middle of committing a block... #41

Open ghost opened 4 years ago

ghost commented 4 years ago

Great SPV server! It is fast and just what I needed.

I've encountered a problem when restarting Fulcrum:

FATAL: Caught exception: It appears that Fulcrum was forcefully killed in the middle of committing a block to the db. We cannot figure out where exactly in the update process Fulcrum was killed, so we cannot undo the inconsistent state caused by the unexpected shutdown. Sorry!

I restarted the server (via systemd) and also created an image snapshot of the server, and then this error appeared.

Having used Electrum Cash for a while (for years, actually), I never had any problems like this before, even though I would restart/terminate instances abruptly.

Is there any way to prevent this problem from happening or gracefully recover? Thank you!

cculianu commented 4 years ago

Yeah it's a known issue with the way I did the data layout. I will have to redesign the data layout to avoid this in a future version. The recommended way to stop Fulcrum is to send it SIGINT and wait a good 60 seconds. (Usually it's done in 5-10s). See if you can configure systemd to send SIGINT or SIGTERM and have it wait for completion and not kill the process right away. I believe on most systems by default it does wait 30s or more...
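Something along these lines in the unit's [Service] section should do it (the path and user are just placeholders, adapt to your setup):

    [Service]
    # Placeholder path and user -- adjust to your install
    ExecStart=/usr/local/bin/Fulcrum /etc/fulcrum/fulcrum.conf
    User=fulcrum
    # Ask for a graceful shutdown and give Fulcrum plenty of time to finish
    KillSignal=SIGINT
    TimeoutStopSec=300
    Restart=on-failure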

You will have to resynch, unfortunately. :/ Sorry about that.

A future version will try to be ACID -- but for now I took speed shortcuts -- so hard shutdown runs the risk of this issue happening if you shut down in the middle of when a block arrived and the DB was being updated.

I understand that ElectrumX did not suffer from this. It was also slower. :)

I will see if I can do ACID without too much of a perf. hit in a future version. For now you will have to resynch from scratch though. Sorry...

If this makes you worried you can always also backup the synched DB (with Fulcrum stopped). That way you can always restore from backup. FWIW I have been running my server for months now and never had to restore from backup.

Sorry about that.

ghost commented 4 years ago

Awesome! Thanks for the quick reply.

Yeah it's a known issue with the way I did the data layout. I will have to redesign the data layout to avoid this in a future version.

Would you be able to give me a hint on where to look so that I can implement it?

The problem is not that I force-kill the process; the problem is that the server could be under high load and the OS terminates the process abruptly.

I'm curious about the locking here:

https://github.com/cculianu/Fulcrum/blob/master/src/Storage.cpp#L1232

Does this mean that while we are processing a block, that no one can query mempool/utxo's (ie: blocked threads) until the block is committed?

Thanks so much!

georgengelmann commented 4 years ago

I restarted the server (via systemd) and also created an image snapshot for the server and then this error appears.

What's in your systemd file? I don't have RestartSec=60s in it, but it never crashed.

cculianu commented 4 years ago

@atteeela --

Would you be able to give me a hint on where to look so that I can implement it?

It's not a trivial fix. It would require redesigning the database to use a single table with different "column families" for each of the logical pieces: utxo_set, scripthash_history, headers, etc. If all of the data lives in a single table it's possible to do "begin transaction" / "end transaction" pairs when updating the data as new blocks arrive, and it would be ACID -- in that case even yanking the power cord or whatever will not lead to any corruption (just a rollback). So, given that, it's more than a quick hack job -- a person with significant experience in rocksdb and C++ would be needed to do this. I can do this myself -- and the reason why I didn't do it initially was because it was slower than the data layout I have now. I wanted to design this server to be as fast as possible.
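Roughly, the idea would look something like this (just a sketch with made-up keys and paths, not Fulcrum's actual schema or code):

    #include <cassert>
    #include <string>
    #include <vector>
    #include <rocksdb/db.h>
    #include <rocksdb/write_batch.h>

    // Sketch: one DB, one column family per logical table, one atomic batch per block.
    int main() {
        rocksdb::Options opts;
        opts.create_if_missing = true;
        opts.create_missing_column_families = true;

        std::vector<rocksdb::ColumnFamilyDescriptor> cfs{
            {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions{}},
            {"utxo_set",           rocksdb::ColumnFamilyOptions{}},
            {"scripthash_history", rocksdb::ColumnFamilyOptions{}},
            {"headers",            rocksdb::ColumnFamilyOptions{}},
        };
        std::vector<rocksdb::ColumnFamilyHandle *> h;
        rocksdb::DB *db = nullptr;
        rocksdb::Status st = rocksdb::DB::Open(opts, "/tmp/acid_sketch_db", cfs, &h, &db);
        assert(st.ok());

        // Every per-block update goes into one WriteBatch. RocksDB commits the
        // batch atomically through its WAL, so a crash mid-commit either applies
        // the whole block or none of it -- no half-written state.
        rocksdb::WriteBatch blockBatch;
        blockBatch.Put(h[3], "height:700000",   "...serialized header...");
        blockBatch.Put(h[1], "outpoint:new",    "...created utxo...");
        blockBatch.Delete(h[1], "outpoint:spent");                 // spent utxo
        blockBatch.Put(h[2], "scripthash:abcd", "...history append...");

        rocksdb::WriteOptions wopts;
        wopts.sync = true;  // fsync the WAL before acknowledging the commit
        st = db->Write(wopts, &blockBatch);
        assert(st.ok());

        for (auto *cf : h) db->DestroyColumnFamilyHandle(cf);
        delete db;
        return 0;
    }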

Potentially the new data layout would be optional and only for users that prefer ACID over speed. So -- the database layer would need to be abstracted a bit to handle both data layouts.

It's not a small job; I can do it myself -- it just would take me a lot of time.

Does this mean that while we are processing a block, that no one can query mempool/utxo's (ie: blocked threads) until the block is committed?

Yes, that's correct. Typically the locks are held for less than 5ms (sometimes less than 1ms). And blocks arrive once every 10 minutes. It's not exactly a huge deal. This amounts to 0.00083% of the time that locks are being held exclusively, on average. You may get more of a burp and slowdown from your OS kernel than from this. The rest of the time everything is very much parallel.
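The pattern is roughly this (illustrative C++ only, not the real Storage.cpp): readers share the lock, and the block-commit path holds the exclusive lock only for the brief write.

    #include <cstdint>
    #include <shared_mutex>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Illustrative only: queries share the lock; committing a block excludes them briefly.
    class UtxoStore {
        mutable std::shared_mutex lock_;
        std::unordered_map<std::string, int64_t> utxos_;  // toy model: outpoint -> amount
    public:
        // Any number of client queries can hold the shared lock concurrently.
        bool lookup(const std::string &outpoint, int64_t &amountOut) const {
            std::shared_lock rlock(lock_);
            auto it = utxos_.find(outpoint);
            if (it == utxos_.end()) return false;
            amountOut = it->second;
            return true;
        }
        // Applying a new block takes the exclusive lock for the few ms it needs;
        // readers are blocked only for that short window, once per block.
        void applyBlock(const std::vector<std::pair<std::string, int64_t>> &adds,
                        const std::vector<std::string> &spends) {
            std::unique_lock wlock(lock_);
            for (const auto &op : spends) utxos_.erase(op);
            for (const auto &add : adds) utxos_[add.first] = add.second;
        }
    };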

vul-ture commented 4 years ago

I'm running into this issue. It's annoying that I have to start the sync over from the beginning. I tried tweaking the db_max_open_files, max_pending_connections, and bitcoind_throttle values down, but it's still happening around block 450,000. Can anyone recommend a workaround? Thanks

cculianu commented 4 years ago

@vul-ture Is it happening on initial synch? Or later?

Don't kill Fulcrum with kill -9 -- make sure you wait for it to gracefully shut down...

vul-ture commented 4 years ago

It happens on initial sync; I'm not even sending a signal to Fulcrum. It could be that my RPC connections are getting saturated? I'm enabling debugging and stats and will update.

cculianu commented 4 years ago

Huh? If you haven't synched yet RPC is not even up yet.

There are two possibilities here:

  1. You are out of disk space
  2. You are out of disk space

Please ensure that the directory you use is on a filesystem that has ~40GB free for mainnet.

cculianu commented 4 years ago

Also please ensure you are using a filesystem that supports >2GB files... (e.g. no FAT32 or other ancient filesystem).

vul-ture commented 4 years ago

I meant the RPC connections to the bitcoin daemon. I'm using an ext4 fs with 120GB free.

cculianu commented 4 years ago

RPC to bitcoind being too slow shouldn't lead to this error message -- you would instead see some warnings about connections dropped / reconnect to bitcoind -- but it would recover from that.

vul-ture commented 4 years ago

Right, the daemon connection is fast as well. It must be a DB write issue. Performance of this app is excellent, ~10k transactions/sec. If I can work around this error and sync, it should work fine. My specs are pretty good, so I don't think it's a hardware issue.

cculianu commented 4 years ago

Ok well without more info I can't help. I still think somehow your data dir is not on a filesystem with enough space ...

Maybe some verbose logging will elucidate things, one might hope.

vul-ture commented 4 years ago

OK I'm able to reproduce the problem, this might be a different issue than the original bug.

[2020-10-04 12:19:27.431] Verifying headers ...
[2020-10-04 12:19:27.431] (Debug) Verifying 481460 headers ...
[2020-10-04 12:19:28.518] (Debug) Read & verified 481460 headers from db in 1086.771 msec
[2020-10-04 12:19:28.518] Initializing header merkle cache ...
[2020-10-04 12:19:29.066] (Debug) Merkle cache initialized to length 481460
[2020-10-04 12:19:29.079] (Debug) Read TxNumNext from file: 248286376
[2020-10-04 12:19:29.079] Checking tx counts ...
[2020-10-04 12:19:30.933] 248286376 total transactions
[2020-10-04 12:19:30.933] UTXO set: 50838980 utxos, 4270.474 MB
[2020-10-04 12:19:30.990] (Debug) Storage starting thread
[2020-10-04 12:19:30.991] BitcoinDMgr: starting 3 bitcoin rpc clients ...
[2020-10-04 12:19:30.991] (Debug) Changed pingtime_ms: 10000
[2020-10-04 12:19:30.991] (Debug) BitcoinD.1 starting thread
[2020-10-04 12:19:30.991] (Debug) Changed pingtime_ms: 10000
[2020-10-04 12:19:30.991] (Debug) BitcoinD.2 starting thread
[2020-10-04 12:19:30.991] (Debug) Changed pingtime_ms: 10000
[2020-10-04 12:19:30.991] (Debug) BitcoinD.3 starting thread
[2020-10-04 12:19:30.991] (Debug) BitcoinDMgr starting thread
[2020-10-04 12:19:30.992] BitcoinDMgr: started ok
[2020-10-04 12:19:30.992] (Debug) Controller starting thread
[2020-10-04 12:19:30.991] (Debug) TCP BitcoinD.1 (id: 2) socket state: 1
[2020-10-04 12:19:30.991] (Debug) TCP BitcoinD.2 (id: 3) socket state: 1
[2020-10-04 12:19:30.991] (Debug) TCP BitcoinD.2 (id: 3) socket state: 2
[2020-10-04 12:19:30.991] (Debug) TCP BitcoinD.1 (id: 2) socket state: 2
[2020-10-04 12:19:30.992] (Debug) TCP BitcoinD.3 (id: 4) socket state: 1
[2020-10-04 12:19:30.992] (Debug) TCP BitcoinD.3 (id: 4) socket state: 2
[2020-10-04 12:19:31.002] (Debug) TCP BitcoinD.2 (id: 3) 10.10.1.2:8332 socket state: 3
[2020-10-04 12:19:31.002] (Debug) on_connected 3
[2020-10-04 12:19:31.002] (Debug) TCP BitcoinD.1 (id: 2) 10.10.1.2:8332 socket state: 3
[2020-10-04 12:19:31.002] (Debug) TCP BitcoinD.3 (id: 4) 10.10.1.2:8332 socket state: 3
[2020-10-04 12:19:31.002] (Debug) on_connected 2
[2020-10-04 12:19:31.002] (Debug) on_connected 4
[2020-10-04 12:19:31.004] (Debug) Auth recvd from bicoind with id: 3, proceeding with processing ...
[2020-10-04 12:19:31.006] (Debug) Refreshed version info from bitcoind, version: 0.16.3, subversion: /Satoshi:0.16.3/
[2020-10-04 12:19:31.007] (Debug) Refreshed genesis hash from bitcoind: 000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
[2020-10-04 12:19:31.101] Block height 651269, downloading new blocks ...
[2020-10-04 12:19:31.101] (Debug) Task.DL 481460 -> 651269 starting thread
[2020-10-04 12:19:31.101] (Debug) Task.DL 481461 -> 651269 starting thread
[2020-10-04 12:19:31.101] (Debug) Task.DL 481462 -> 651269 starting thread
[2020-10-04 12:19:31.101] (Debug) Task.DL 481463 -> 651269 starting thread
[2020-10-04 12:19:31.102] (Debug) Task.DL 481464 -> 651269 starting thread
[2020-10-04 12:19:31.102] (Debug) Task.DL 481465 -> 651269 starting thread
[2020-10-04 12:19:31.102] (Debug) Task.DL 481466 -> 651269 starting thread
[2020-10-04 12:20:32.059] <Task.DL 481460 -> 651269> [Qt Warning] Qt has caught an exception thrown from an event handler. Throwing exceptions from an event handler is not supported in Qt. You must not let any exception whatsoever propagate through Qt code. If that is not possible, in Qt 5 you must at least reimplement QCoreApplication::notify() and catch all exceptions there. (:0, )

The console reports: what(): ReadCompactSize(): size too large: iostream error

Looks like my bitcoind might have a corrupted block? investigating

cculianu commented 4 years ago

Oh man! You’re on bitcoin core (btc).

Fulcrum is for Bitcoin Cash.

However you piqued my curiosity. I will try and get it working with bitcoin btc (maybe) in the coming week or two.

Yeah fulcrum is for bch...

That explains it!

vul-ture commented 4 years ago

Lol oops, sorry about that! I will use it for BCH then, and thanks for taking a look at Core. Looks like I synched a little past the BCH fork block (478559) and then it died.

cculianu commented 3 years ago

Hey @vul-ture if you want you can try using Fulcrum with BTC if you start bitcoind on BTC with -rpcserialversion=0. I think in that case the bitcoind will "speak the same language" as Fulcrum and it may succeed in syncing.

I am going to test this myself but from my reading of the bitcoin core sourcecode it should work.

vul-ture commented 3 years ago

testing now with the rpcserialversion flag

cculianu commented 3 years ago

Hey! @vul-ture So actually latest Fulcrum release fully supports BTC. Don't use that flag -- it will give you problems later when you connect Electrum to it.

Just get rid of that flag, blow away your existing Fulcrum db, and resynch again from scratch. When it finishes synching you can serve up BTC. I have been doing so for the past week without problems. I have 222 users connected right now.

vul-ture commented 3 years ago

Working, thanks! I think this is the fastest electrum server I've seen.

cculianu commented 3 years ago

Wow man thanks for the compliment. :) Yeah I tried to make it fast FAST. That was my #1 design goal. (And of course correctness, too.)

Thanks!

cculianu commented 3 years ago

@vul-ture PS: Be sure to be on latest Fulcrum 1.4.0 since in the 1.3.x series BTC support was still beta and the mempool code wasn't as fast as it is in 1.4.x series... If you aren't already on Fulcrum 1.4.x, upgrading is simple and doesn't require a db resynch.

apemithrandir commented 2 years ago

Just tagging a comment to say I ran into this issue too. I had to forcibly kill the VM that Fulcrum was running on. Then when I brought the VM back up I got this error. Will have to re-sync now. Aside from this, very happy with the server.

cculianu commented 2 years ago

@apemithrandir I see. Sorry to hear that. I do plan on making Fulcrum more resilient to unfortunately-timed crashes in the future. It is a partially solvable problem, meaning that we can get it to a state where for 99% of crashes it should be able to auto-recover. It just requires more logic in the code to diagnose what went wrong and backtrack a bit. Thanks for the feedback. I have never had to re-sync fulcrum after hard system reset and I've had my server randomly lose power or be forced to randomly reset about 12 times in the last 3 years (I was running my server out of my apartment at one point, without battery backup, ha ha).

I sort of optimistically thought that such corruption issues would be rare. But this just proves that anything that can go wrong will, eventually, given a large enough install base.

apemithrandir commented 2 years ago

I think it was a combination of my VM being off for a few days and I was bringing it back online. Bitcoind was still grabbing blocks and fulcrum was also still updating. My VM was acting laggy/unresponsive and then I went to restart. When restarting the VM, it just hung on a black screen and I had to do a force turn off. It is the first time that my fulcrum has required a re-sync like this. My machine has crashed before, but normally when the chain was up to date. With an up to date chain, the chance that a block was being processed at the time of the crash is very low.

craigraw commented 2 years ago

Although it pains me to say it, I would give up a little of Fulcrum's stunning performance (if necessary) to have this implemented. Although it tends to happen more during initial indexing (often due to an overly optimistic fast-sync configuration), as you note it's inevitable that with a large enough user base there will be abnormal shutdowns during indexing, particularly with many RPi nodes running without UPS backup. This has been the main technical consideration I have come across from implementors looking to switch from Electrs.

cculianu commented 2 years ago

Yeah, it only happens if Fulcrum was in the process of writing out data for a new block -- so during a catch-up phase or a sync it can happen after a non-graceful exit. Under normal operation it's unlikely, since a block arrives once every 10 mins and only takes 20-100 msec to process, depending on CPU and disk speed.

But yes, this is solvable and I will focus on that in the future. 100% agreed.

apemithrandir commented 2 years ago

I am struggling to get through the re-sync without hitting this error. I'm on my 3rd attempt now. This is the most recent forceful kill log:

  1. Mar 19 XX:23:47 XXXX-ubuntu Fulcrum[3429]: [2022-03-19 XX:23:47.832] Processed height: 442000, 60.7%, 4.09 blocks/sec, 7561.0 txs/sec, 27767.0 addrs/sec
  2. Mar 19 XX:24:25 XXXX-ubuntu Fulcrum[3429]: [2022-03-19 XX:24:25.277] Storage UTXO Cache: Flushing to DB ...
  3. Mar 19 XX:25:23 XXXX-ubuntu kernel: [14761.013748] [ 3429] 1000 3429 3940306 2735552 28819456 63613 0 Fulcrum
  4. Mar 19 XX:25:23 XXXX-ubuntu kernel: [14761.013981] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/fulcrum.service,task=Fulcrum,pid=3429,uid=1000
  5. Mar 19 XX:25:23 XXXX-ubuntu kernel: [14761.014012] Out of memory: Killed process 3429 (Fulcrum) total-vm:15761224kB, anon-rss:10942208kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:28144kB oom_score_adj:0
  6. Mar 19 XX:25:24 XXXX-ubuntu kernel: [14762.247928] oom_reaper: reaped process 3429 (Fulcrum), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  7. Mar 19 XX:25:24 XXXX-ubuntu systemd[1]: Stopped Fulcrum.
  8. Mar 19 XX:25:24 XXXX-ubuntu systemd[1]: Started Fulcrum.
  9. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.198] Loaded SSL certificate: Internet Widgits Pty Ltd expires: Sun February 8 2032 XX:29:08
  10. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.200] Loaded key type: private algorithm: RSA
  11. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.202] Enabled JSON parser: simdjson
  12. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.202] simdjson implementations:
  13. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.202] haswell: Intel/AMD AVX2 [supported]
  14. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.202] westmere: Intel/AMD SSE4.2 [supported]
  15. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.202] fallback: Generic fallback implementation [supported]
  16. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.202] active implementation: haswell
  17. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] jemalloc: version 5.2.1-0-gea6b3e9
  18. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] Qt: version 5.15.2
  19. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] rocksdb: version 6.14.6-ed43161
  20. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] simdjson: version 0.6.0
  21. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] ssl: OpenSSL 1.1.1f 31 Mar 2020
  22. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] zmq: libzmq version: 4.3.3, cppzmq version: 4.7.1
  23. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] Fulcrum 1.6.0 (Release 5e95c0f) - Sat Mar 19, 2022 XX:25:25.205 XXXX - starting up ...
  24. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.205] Max open files: 8192
  25. Mar 19 XX:25:25 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:25:25.207] Loading database ...
  26. Mar 19 XX:26:09 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:26:09.186] DB memory: 1024.00 MiB
  27. Mar 19 XX:26:09 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:26:09.187] Coin: BTC
  28. Mar 19 XX:26:09 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:26:09.188] Chain: main
  29. Mar 19 XX:26:09 XXXX-ubuntu Fulcrum[3940]: [2022-03-19 XX:26:09.195] FATAL: Caught exception: It appears that Fulcrum was forcefully killed in the middle of committing a block to the db. We cannot figure out where exactly in the update process Fulcrum was killed, so we cannot undo the inconsistent state caused by the unexpected shutdown. Sorry!
  30. Mar 19 XX:26:09 XXXX-ubuntu Fulcrum[3940]: The database has been corrupted. Please delete the datadir and resynch to bitcoind.

I was using this in fulcrum.conf:

  bitcoind_timeout = 300
  bitcoind_clients = 1
  worker_threads = 1
  db_mem=1024
  db_max_open_files=200
  fast-sync = 1024

Any suggestions?

apemithrandir commented 2 years ago

Maybe it is a problem with my VM, but twice now when I rebooted my machine after Fulcrum failed I have been booted into an initramfs/busybox prompt and had to manually run fsck to recover my filesystem.

cculianu commented 2 years ago

Hmm. You specified 1GB for fast-sync, and 1GB for db_mem. However the process ended up using 15GB. Strange. Did you compile this yourself? Or use the static binary? How much RAM does the VM have? Also -- be sure to enable swap. Do you have swap enabled? Run cat /proc/swaps to see. I suspect the process just ran out of RAM and was OOM-killed.

As for the filesystem corruption -- that could be some kernel bug actually under low memory conditions. Fulcrum can't, even if it tried to, corrupt your filesystem. It runs as a user process and doesn't have that kind of access. So I suspect just the kernel or filesystem driver or whatever has some issues under low memory conditions.

Definitely enable swap if you can. Make a swapfile that's at least as large as RAM, or maybe 2x RAM to be paranoid.
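If you need to grow it, something like this creates and enables a swapfile on ext4 (the size is just an example, adjust to taste):

    sudo fallocate -l 16G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    # make it persistent across reboots
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
    # verify
    cat /proc/swaps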

apemithrandir commented 2 years ago

The VM has 12GB of RAM allocated. Initially my swap was too small (only 1GB). I added another 5GB swap file for my most recent attempt at re-sync, to give it 6GB total, but it still failed (the log is from that failure). I'm not sure why the process ends up using so much memory. I used a pre-built binary from your repository.

cculianu commented 2 years ago

Ok, so it definitely is using too much memory. Hmm. This is worrying. I wonder what's going on. I have never seen it use that much RAM, tbh, during a fast-sync. Can you try without fast-sync or with a smaller fast-sync? Give it like 200 MB: fast-sync=200. Even with 200MB there's a huge payoff in terms of speed. If all else fails, you can disable fast-sync altogether too.

This is worrying. I wonder what's going on. Hmm.

apemithrandir commented 2 years ago

I will try making the swap file 24G (!) and set the fast-sync=200. 🤞

apemithrandir commented 2 years ago

I will try making the swap file 24G (!) and set the fast-sync=200. 🤞

After nearly 24 hrs of syncing it ran out of memory again. I'm at a loss now. The PC didn't crash; it just kept accumulating memory without properly flushing.

  1. Mar 20 XX:51:10 XXXX-ubuntu Fulcrum[1010]: [2022-03-20 XX:51:10.154] Processed height: 640000, 87.9%, 1.83 blocks/sec, 4222.8 txs/sec, 17826.0 addrs/sec
  2. Mar 20 XX:51:39 XXXX-ubuntu Fulcrum[1010]: [2022-03-20 XX:51:39.626] Storage UTXO Cache: Flushing to DB ...
  3. Mar 20 XX:51:46 XXXX-ubuntu kernel: [62077.561639] [ 1010] 1000 1010 11111890 2592488 77901824 6277573 0 Fulcrum
  4. Mar 20 XX:51:46 XXXX-ubuntu kernel: [62077.563345] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/fulcrum.service,task=Fulcrum,pid=1010,uid=1000
  5. Mar 20 XX:51:46 XXXX-ubuntu kernel: [62077.563376] Out of memory: Killed process 1010 (Fulcrum) total-vm:44447560kB, anon-rss:10369952kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:76076kB oom_score_adj:0
  6. Mar 20 XX:51:51 XXXX-ubuntu kernel: [62082.277422] oom_reaper: reaped process 1010 (Fulcrum), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  7. Mar 20 XX:51:51 XXXX-ubuntu systemd[1]: Stopped Fulcrum.
  8. Mar 20 XX:51:51 XXXX-ubuntu systemd[1]: Started Fulcrum.
  9. Mar 20 XX:51:51 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.963] Loaded SSL certificate: Internet Widgits Pty Ltd expires: Sun February 8 2032 XX:29:08
  10. Mar 20 XX:51:51 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.965] Loaded key type: private algorithm: RSA
  11. Mar 20 XX:51:51 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.965] Enabled JSON parser: simdjson
  12. Mar 20 XX:51:51 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.965] simdjson implementations:
  13. Mar 20 XX:51:51 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.965] haswell: Intel/AMD AVX2 [supported]
  14. Mar 20 XX:51:51 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.965] westmere: Intel/AMD SSE4.2 [supported]
  15. Mar 20 XX:51:51 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.965] fallback: Generic fallback implementation [supported]
  16. Mar 20 XX:51:51 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.965] active implementation: haswell
  17. Mar 20 XX:51:52 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.967] jemalloc: version 5.2.1-0-gea6b3e9
  18. Mar 20 XX:51:52 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.967] Qt: version 5.15.2
  19. Mar 20 XX:51:52 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.967] rocksdb: version 6.14.6-ed43161
  20. Mar 20 XX:51:52 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.967] simdjson: version 0.6.0
  21. Mar 20 XX:51:52 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.967] ssl: OpenSSL 1.1.1f 31 Mar 2020
  22. Mar 20 XX:51:52 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.967] zmq: libzmq version: 4.3.3, cppzmq version: 4.7.1
  23. Mar 20 XX:51:52 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.967] Fulcrum 1.6.0 (Release 5e95c0f) - Sun Mar 20, 2022 XX:51:51.967 XXXX - starting up ...
  24. Mar 20 XX:51:52 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.967] Max open files: 8192
  25. Mar 20 XX:51:52 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:51:51.969] Loading database ...
  26. Mar 20 XX:52:23 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:52:23.439] DB memory: 1024.00 MiB
  27. Mar 20 XX:52:23 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:52:23.439] Coin: BTC
  28. Mar 20 XX:52:23 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:52:23.439] Chain: main
  29. Mar 20 XX:52:23 XXXX-ubuntu Fulcrum[5394]: [2022-03-20 XX:52:23.478] FATAL: Caught exception: It appears that Fulcrum was forcefully killed in the middle of committing a block to the db. We cannot figure out where exactly in the update process Fulcrum was killed, so we cannot undo the inconsistent state caused by the unexpected shutdown. Sorry!
  30. Mar 20 XX:52:23 XXXX-ubuntu Fulcrum[5394]: The database has been corrupted. Please delete the datadir and resynch to bitcoind.

apemithrandir commented 2 years ago

I will give it one more go, this time with fast-sync=200 commented out.

Update: CPU maxed out and the VM died.

cculianu commented 2 years ago

I am really sorry this is happening. So there must be a bug in either the jemalloc allocator Fulcrum uses or, alternatively, in the robin_hood::unordered_map (internal data structure used to store the UTXOs while synching). I suspect actually robin_hood may have issues since it has had some issues in the past. It's worrying that it takes memory and never gives it up... in some situations.

The strange thing is that I have synched recently on BTC just to test things out and I never observed this behavior. So it may be something specific that triggers it.

Yes, the more I think about it.. it could be that robin_hood is to blame. I will investigate this further. Thank you for the info.

cculianu commented 2 years ago

Follow-up: I predict without fast-sync it won't fail. This would be evidence that robin_hood is to blame since it's only used for synching.

apemithrandir commented 2 years ago

Ok. I assume commenting out the fast-sync line in my fulcrum.conf is how I run without fast-sync. I did run into my CPU maxing out during the last run, so I've set worker_threads back to 1 from 2. I'll give it one more go. Also let me know if you want me to DM you more logs or anything else you might need to bug hunt.

Edit: I got this when I started it this time: "<Controller> fast-sync: Not enabled", so I will see how I get on.

apemithrandir commented 2 years ago

After over 2 days of syncing (with fast-sync disabled), at block height 653,000 my CPU locked up again. Since I had worker_threads=1, the CPU locked at < 100%, but the VM was still unresponsive.

apemithrandir commented 2 years ago

Sorry to say, I was unable to get Fulcrum up and fully sync'd after a week of trying. My VM kept crashing or freezing during the re-sync, forcing me to do another re-sync. I'm not willing to re-build my VM from scratch at the moment so I will have to settle for the less performant ElectrumX server for now.

cculianu commented 2 years ago

I'm sorry to hear that, @apemithrandir . I have had little trouble synching it even on old Windows 7 boxes (yes, there is a windows .exe available) with like 4GB of RAM and HDD. It's perplexing that it would fail on what sounds like more generous hardware. I'm just curious -- can you provide more details about your setup? Like host os, guest os, VM software, VM configuration, host machine specs, and relevant parts of config file, etc. Anything helps. I want to see if I can reproduce the issues you experienced.

Sorry to hear you are going :(.

What about running on bare metal outside a VM? Or using Docker?

apemithrandir commented 2 years ago

I'm sorry to hear that, @apemithrandir . I have had little trouble synching it even on old Windows 7 boxes (yes, there is a windows .exe available) with like 4GB of RAM and HDD. It's perplexing that it would fail on what sounds like more generous hardware. I'm just curious -- can you provide more details about your setup? Like host os, guest os, VM software, VM configuration, host machine specs, and relevant parts of config file, etc. Anything helps. I want to see if I can reproduce the issues you experienced.

Sorry to hear you are going :(.

What about running on bare metal outside a VM? Or using Docker?

Happy to share any and all details with you one on one over private message/email, if it might help you with development.

caheredia commented 2 years ago

Yeah it's a known issue with the way I did the data layout. I will have to redesign the data layout to avoid this in a future version. The recommended way to stop Fulcrum is to send it SIGINT and wait a good 60 seconds. (Usually it's done in 5-10s). See if you can configure systemd to send SIGINT or SIGTERM and have it wait for completion and not kill the process right away. I believe on most systems by default it does wait 30s or more...

You will have to resynch, unfortunately. :/ Sorry about that.

A future version will try to be ACID -- but for now I took speed shortcuts -- so hard shutdown runs the risk of this issue happening if you shut down in the middle of when a block arrived and the DB was being updated.

I understand that ElectrumX did not suffer from this. It was also slower. :)

I will see if I can do ACID without too much of a perf. hit in a future version. For now you will have to resynch from scratch though. Sorry...

If this makes you worried you can always also backup the synched DB (with Fulcrum stopped). That way you can always restore from backup. FWIW I have been running my server for months now and never had to restore from backup.

Sorry about that.

I just experienced the same thing. My VM rebooted for updates.

[2022-05-21 20:06:24.266] Coin: BTC

[2022-05-21 20:06:24.266] Chain: main

[2022-05-21 20:06:24.267] FATAL: Caught exception: It appears that Fulcrum was forcefully killed in the middle of committing a block to the db. We cannot figure out where exactly in the update process Fulcrum was killed, so we cannot undo the inconsistent state caused by the unexpected shutdown. Sorry!

The database has been corrupted. Please delete the datadir and resynch to bitcoind.

[2022-05-21 20:06:24.268] Stopping Controller ... 

[2022-05-21 20:06:24.268] Closing storage ...

[2022-05-21 20:06:24.341] Shutdown complete

cculianu commented 2 years ago

Yes, I'm sorry. This happens if it's killed while it's busy processing a block. Perhaps you should set up Fulcrum as a systemd service; that way any reboot of the node will send it a SIGTERM or whatever to shut down gracefully?

I'm sorry. You must resynch now...

caheredia commented 2 years ago

Yes, I'm sorry. This happens if it's killed while it's busy processing a block. Perhaps you should set up Fulcrum as a systemd service; that way any reboot of the node will send it a SIGTERM or whatever to shut down gracefully?

I'm sorry. You must resynch now...

I'm running it inside of a docker container, so I'll have to figure out a graceful exit strategy. I appreciate the reply.
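For now I'll probably just make sure Docker delivers the stop signal and doesn't SIGKILL it mid-commit, something along these lines (the container/service name is just an example):

    # give the container plenty of time to flush before Docker escalates to SIGKILL
    docker stop --time 300 fulcrum

    # or, per service, in docker-compose.yml:
    #   stop_signal: SIGINT
    #   stop_grace_period: 5m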

cculianu commented 2 years ago

Yeah, I'm sorry. I will have to redo the data model to be fully ACID and then this can never happen. This is on the to-do list for Fulcrum 2.0. Sorry about that.

caheredia commented 2 years ago

Yeah, I'm sorry. I will have to redo the data model to be fully ACID and then this can never happen. This is on the to-do list for Fulcrum 2.0. Sorry about that.

Looking forward to it. Thanks for prioritizing it in the project. I'd offer to help, but I mostly code in python.

jonscoresby commented 2 years ago

Just wanted to say that I downloaded and synced Fulcrum about a month ago and had syncing problems. I tried syncing Fulcrum on Bitcoin 3 times and it would crash somewhere after block 400,000 after running out of memory. I tried disabling autosuspend as suggested here, but the sync still failed. After disabling fast-sync, however, the sync was successful.

cculianu commented 2 years ago

Thanks for the feedback @jonscoresby ... Perhaps I need to go back and see if I can make --fast-sync more resilient to such conditions. Just curious: were you using swap at all or did you have swap disabled?

RequestPrivacy commented 2 years ago

Just wanted to swing by and report that I seem to have the same problem: I tried to index on a Raspberry Pi with 4GB and it flooded my RAM to the point that it exited with the above error message (first try with fast-sync = 1024, second try with fast-sync = 512, and when I noticed it filling my RAM again I set it to 200MB). But it crashed once more.

So I disabled fast-sync and now it's humming away slowly but steadily at 2.7GB (baseload was something like 1.3GB of RAM).