Open pkoutsogiannis opened 7 months ago
We are using Fulcrum 1.9.7 (Release https://github.com/cculianu/Fulcrum/commit/f27fc28fa25f950bb4ada4361e05972fe183dd0c). We encountered the following issue 2 times in the past month:
Fulcrum 1.9.7 has only been out for ~1 week. There was indeed a hang bug back in version 1.9.4 or so.
I see from the log this hang happened today -- but were you for sure on 1.9.7?
The first occurrence was with 1.9.6 last month and this is why we upgraded to 1.9.7
The log is from today.
Darn. Ok.. I will investigate. I added some optimizations to make mempool synch much faster but they had a bunch of bugs. I thought I squashed them all but apparently maybe not. Will investigate.
In the meantime you could just go back to Fulcrum 1.9.3 I guess or.. hang in there.
We are now running fulcrum with -d so that we can catch any helpful information for you.
Yes, this is extremely helpful. Thank you.
I forgot to mention that we are using the windows binary on windows server 2016.
Keep up the good work.
Ahhh! That is helpful information! Thank you. I pray this is a Windows-specific problem (but it may not be).
Question: Were you running Fulcrum previous to 1.9.4 (1.9.3, etc.) for any extended periods, and if so did you ever notice this problem then?
It started after upgrading from 1.9.3 to 1.9.6
We had 1.9.3 running for an extended period indeed without this issue.
We had 1.9.3 running for at least a month on the windows 2016 machine.
We also have a 1.9.7 instance running on a Windows 11 machine and it is still error-free. We also had 1.9.6 running there without issues. The only differences are the Windows version and that we have fast-sync=4098 and db_max_open_files=500 set.
Yeah that shouldn't matter. I am curious if the Windows 11 machine ever has problems or not. Keep me updated. I will thoroughly review the code.
FWIW I actually have a Windows laptop here (Windows 10) that's been running BTC Fulcrum for a week now with no hang (and before that, 1.9.6 with no hang). I will continue to monitor the situation and also look for bugs in my code.
:/
Do let me know what happens. I'll investigate this further in the meantime.
Note: The windows 11 machine is much faster than the windows 2016 machine, I am mentioning this just in case of some race condition.
What are the specs on the slow machine? And.. is bitcoind running locally on both machines or is one connecting to the bitcoind process on the other?
They are 2 separate, unrelated machines, each running bitcoind and Fulcrum together locally.
Windows 2016:
cpu: Intel Xeon E5-2620 @ 2.10 GHz
memory: 64 GB
disk: 2 TB SSD
bitcoind config:
txindex=1
server=1
listen=0
rpcbind=127.0.0.1
rpcallowip=127.0.0.1
rpcuser=redacted
rpcpassword=redacted
rpcworkqueue=1000
zmqpubhashblock=tcp://127.0.0.1:8433
Windows 11:
cpu: AMD Ryzen 5 5560U
memory: 16 GB
disk: 2 TB SSD (Samsung 990 PRO NVMe M.2, PCIe 4.0)
bitcoind config:
txindex=1
server=1
listen=0
rpcbind=127.0.0.1
rpcallowip=127.0.0.1
rpcuser=redacted
rpcpassword=redacted
rpcworkqueue=1000
zmqpubhashblock=tcp://127.0.0.1:8433
You know, in my experience setting rpcworkqueue=1000 on bitcoind is asking for trouble. If bitcoind can't keep up with requests, it's best for it to error out early. Having a queue of 1000 requests lined up may lead to ridiculous timeouts. You are better off having bitcoind saturate its rpcworkqueue early. There is a reason why Core has this defaulting to 16... I am not sure what docs you read that recommended this be raised. Can you tell me where you read that you should raise this?
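(For reference, reverting is a one-line change in bitcoin.conf; 16 is Bitcoin Core's documented default for -rpcworkqueue.)

```ini
# bitcoin.conf -- revert to Bitcoin Core's default RPC queue depth so
# overload surfaces as an immediate error rather than a deep backlog
rpcworkqueue=16
```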
Question: Are you hitting bitcoind directly to do any processing outside of Fulcrum? For example: are you doing expensive calls to bitcoind (such as mining, scantxoutset, etc.) outside of Fulcrum via bitcoind's RPC?
The rpcworkqueue was set to 1000 for no particular reason. We found it as a recommendation from someone on the team a few months ago.
Both bitcoind instances are used solely by Fulcrum. The Fulcrum on Windows 2016 (the one which hung) is not even used by any client, since it serves as a backup service. It just sits there idle.
Shall we change the rpcworkqueue back to 16 and restart fulcrum in debug mode again?
Well, I actually don't think that was the problem -- since Fulcrum should have been able to exit in a timely manner anyway. It shouldn't hang like that either way. And if you say RPC is only used by Fulcrum... anyway, Fulcrum doesn't make "expensive" calls that eat a ton of time (such as mining or scantxoutset).
Your choice .. can leave it as-is.. or set it to default just to see if "that fixed it". Up to you.
Since there are no other RPC calls except Fulcrum's, I will leave it running as-is and will update you, with the debug log, if it hangs again.
Is no news good news? Has it been running smoothly all this time?
I am monitoring it everyday and till now there was no incident.
We got bad news. Unfortunately it stopped processing mempool txs. Also, after issuing a stop command, it got stuck at the "joining thread" log line and I had to kill the process.
[2024-01-19 05:37:21.127]
So there must be some issue, at least on Windows. You used the provided Windows binary, correct?
I’ll have to investigate this when I get some free time.
Correct.
I have reverted back to 1.9.3 and I will monitor this as well.
Kudos for the excellent work.
Yeah, if 1.9.3 never hangs I can just undo the optimization I added for a threaded prefetcher of coins. It only shaves a few seconds off synchmempool on large mempools (60k+ txns)... but if it means there is some instability with it, for whatever reason, it's gone. Do let me know how 1.9.3 works out.
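(For context, a "threaded prefetcher" of this kind generally follows the pattern below. This is an illustrative Python sketch only; Fulcrum itself is C++, its real code differs, and every name here is hypothetical.)

```python
# Sketch: prefetch coin lookups on a worker thread via a bounded queue.
import queue
import threading

_SENTINEL = object()  # marks end-of-stream

def prefetch_coins(txids, fetch, max_queued=16):
    """Fetch results for txids on a background thread; yield them in order."""
    q = queue.Queue(maxsize=max_queued)

    def worker():
        for txid in txids:
            q.put(fetch(txid))  # blocks when the consumer falls behind
        q.put(_SENTINEL)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            break
        yield item
    t.join()  # a worker stuck forever in q.put() would make this join
              # never return -- the kind of symptom described in this thread
```

The shutdown ordering is the classic hazard in this design: if the consumer ever stops draining the queue while the worker is blocked in put(), the final join() hangs, which matches the "stuck joining thread" behavior reported above.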
Fulcrum (1.9.3) hung, and we had to kill the process after it did not stop after issuing a stop command. Maybe the problem is with the specific OS (Windows Server 2012 R2), since the other instance running on Windows 11 has never hung so far.
[2024-01-23 02:47:42.621]
The instance (1.9.7) running on Windows 11 that has never hung has been up and running since Dec 6th, 2023.
And just to be clear — the one that hung was 1.9.3 right? So it definitely isn’t my new mempool changes.
Ok in a way this is good news but in another way it’s bad since if Fulcrum is triggering some OS specific issues that’s incredibly hard to troubleshoot.
Good to know it’s not my recent changes though. That’s a relief!
Is there any way you can install a service pack or somehow update the Windows Server 2012 box? Who knows maybe that magically fixes it?
I already have all service packs installed on Windows Server 2012. I will continue monitoring the Windows 11 instance, though, to confirm that the problem is OS-specific.
Keep up the good work!
Thanks, man. It was a relief, though, to learn that it's not specific to 1.9.5+, but some other unknown issue. Oh -- there is a new 1.9.8, FYI -- the major change is that it calculates fees more accurately for BTC.
I am starting to suspect the hang may somehow happen within RocksDB. One thing I could do is make a custom build of the Windows binary that uses the latest RocksDB 8.10.0 -- that's one option here (but that would require me to spend 3-4 hours mucking about with the docker builder to build it, and I am not sure I have that much free time this week for that).
(Original issue body:) We are using Fulcrum 1.9.7 (Release f27fc28). We encountered the following issue 2 times in the past month: Fulcrum stopped processing mempool txs without any log entry. We issued a stop command, but Fulcrum hung and we had to kill the process and restart it.
[2023-12-01 11:11:35.940] 51632 mempool txs involving 323803 addresses
[2023-12-01 11:12:45.967] 51897 mempool txs involving 324605 addresses
[2023-12-01 11:13:55.989] 52183 mempool txs involving 325474 addresses
[2023-12-01 11:15:05.989] 52451 mempool txs involving 326368 addresses
[2023-12-01 11:16:16.037] 52718 mempool txs involving 327421 addresses
[2023-12-01 11:17:26.076] 53005 mempool txs involving 328511 addresses
[2023-12-01 13:03:37.850] <AdminSrv 127.0.0.1:8000> New TCP Client.3419140 127.0.0.1:55881, 1 client total
[2023-12-01 13:03:37.959] Received 'stop' command from admin RPC, shutting down ...
[2023-12-01 13:03:37.959] Shutdown requested
[2023-12-01 13:03:37.959] Stopping Stats HTTP Servers ...
[2023-12-01 13:03:37.959] Stopping Controller ...
(we had to kill the process after 5 minutes)
The conf file:
datadir = d:\fulcrum_data
bitcoind = 127.0.0.1:8332
rpcuser = redacted
rpcpassword = redacted
tcp = 10.190.89.8:50001
peering = false
announce = false
public_tcp_port = 50001
admin = 8000
stats = 8081
db_mem = 1024