Closed by matthewdarwin 5 years ago
This came up yesterday in https://t.me/eosfullnodes as well:
EOSUSA Michael, [15.03.19 13:20]
```
13:14:15 kernel: nodeos[2511]: segfault at 8 ip 00000000004be5c2 sp 00007ffdfbb9dd30 error 4 in nodeos[400000+28b3000]
```
API node as well, or just p2p? Do you have a core file or stack trace?
The API nodes and BP were fine. The only API requests the p2p nodes handle are requests for status updates (like /v1/chain/get_info every few seconds).
Sorry, I don't have core file or stack trace. If it keeps happening, I will enable core file generation.
p2p machines have 32GB RAM.
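For others wanting to capture a core before the next crash, a minimal sketch of enabling core dump generation (the `core_pattern` path is just an example, not a project convention; a nodeos run under systemd needs `LimitCORE=infinity` in the unit file instead of `ulimit`):

```shell
# Allow core files of unlimited size in the shell that will launch nodeos.
ulimit -c unlimited

# Optional (root only): write cores to a predictable path. The path below
# is an assumption for illustration; adjust for your host.
# sysctl -w kernel.core_pattern=/var/crash/core.%e.%p

# Confirm the soft limit took effect.
ulimit -c
```

After the next segfault, the core file plus the matching binary is enough to produce the stack trace requested above.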
I also had it crash on 2 of my nodes immediately after upgrading and catching up to the current block. I've already rolled my nodes back to 1.6.3, so I can't provide any current diags/logs, but I can spin up a clone from the snapshot if you want additional information from it.
Both nodes are on Mainnet running API/P2P/StateHist plugins but no extras added in. At the time, neither server was exposed externally servicing requests. I also have another API/P2P node (externally exposed) that seems to be running along with no issues (fingers crossed).
Also, Todd had mentioned it might have been the OOM killer, so I disabled it on both crashing servers and tried them again, but they both failed almost immediately after syncing to the current block.
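Since the OOM killer was mentioned: a sketch of how to tell an OOM kill apart from a plain segfault in the kernel log (on a real host you would feed this from `journalctl -k` or `dmesg`; the sample lines below are taken from this thread and a generic OOM message):

```shell
# Returns success if a kernel log line looks like an OOM kill.
is_oom_kill() {
  echo "$1" | grep -qiE 'out of memory|oom-killer|killed process'
}

segv='kernel: nodeos[32730]: segfault at 8 ip 00000000004be5c2 sp 00007fffd90888b0 error 4 in nodeos[400000+28b3000]'
oom='kernel: Out of memory: Killed process 32730 (nodeos)'

is_oom_kill "$segv" && echo "segv line: OOM" || echo "segv line: not OOM"
is_oom_kill "$oom"  && echo "oom line: OOM"  || echo "oom line: not OOM"
```

The `segfault at 8` lines in this issue do not match the OOM patterns, which is consistent with ruling the OOM killer out.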
Nodes are Ubuntu 18.04 with 16GB RAM, although I monitored them and the crashes happened when not even 8GB was allocated (fresh boot). Here are the only two entries from the system log at the time of the crash, if they help:
```
13:14:15 kernel: nodeos[2511]: segfault at 8 ip 00000000004be5c2 sp 00007ffdfbb9dd30 error 4 in nodeos[400000+28b3000]
13:14:15 kernel: show_signal_msg: 24 callbacks suppressed
```
That is helpful. It seems we can probably rule out http threading; my guess is that it is the net_plugin threading. I actually found an issue Friday that I have a fix for in a PR. I can get that fix into a 1.7.1 next week.
A stack trace would help determine if that is indeed the case.
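For anyone able to capture a core, a sketch of pulling the requested stack trace out of it (the core and binary paths are assumptions; they depend on `kernel.core_pattern` and how nodeos was installed):

```shell
# Hypothetical locations; adjust for your host.
CORE=/var/crash/core.nodeos
BIN=/usr/bin/nodeos

if [ -f "$CORE" ]; then
  # Dump a backtrace of every thread non-interactively; attach this
  # output to the issue.
  gdb -batch -ex 'thread apply all bt' "$BIN" "$CORE"
else
  echo "no core file at $CORE yet"
fi
```

Since the suspicion is net_plugin threading, the all-threads backtrace is more useful here than a single-thread `bt`.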
Crashed again today; sorry, no stack trace. This is on a p2p node.
```
Mar 18 12:44:55 mainnet-public2 nodeos[32730]: info 2019-03-18T12:44:55.704 thread-0 producer_plugin.cpp:345 on_incoming_block ] Received block 23ecd532e0639c12... #48258609 @ 2019-03-18T12:44:55.500 signed by eosnewyorkio [trxs: 15, lib: 48258282, conf: 0, latency: 204 ms]
Mar 18 12:44:56 mainnet-public2 nodeos[32730]: info 2019-03-18T12:44:56.281 thread-0 producer_plugin.cpp:345 on_incoming_block ] Received block a1b2011de9f4236f... #48258610 @ 2019-03-18T12:44:56.000 signed by eosnewyorkio [trxs: 24, lib: 48258282, conf: 0, latency: 281 ms]
Mar 18 12:44:56 mainnet-public2 kernel: [576634.616974] nodeos[32730]: segfault at 8 ip 00000000004be5c2 sp 00007fffd90888b0 error 4 in nodeos[400000+28b3000]
```
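A note on reading these kernel lines: subtracting the mapping base in `nodeos[400000+28b3000]` from the `ip` gives the offset inside the binary, and that offset is identical across the reports above, suggesting a single crash site. A sketch of extracting it (the `addr2line` step assumes a binary built with debug symbols):

```shell
line='kernel: nodeos[32730]: segfault at 8 ip 00000000004be5c2 sp 00007fffd90888b0 error 4 in nodeos[400000+28b3000]'

# Pull the instruction pointer and the mapping base out of the log line.
ip=$(echo "$line"   | sed -n 's/.* ip \([0-9a-f]*\) .*/\1/p')
base=$(echo "$line" | sed -n 's/.*nodeos\[\([0-9a-f]*\)+.*/\1/p')

printf 'offset inside nodeos: 0x%x\n' $(( 0x$ip - 0x$base ))

# With a symbolized build, resolve the faulting instruction. The binary is
# non-PIE (base 0x400000), so addr2line can take the ip directly:
# addr2line -Cfe /usr/bin/nodeos 0x$ip
```

For the lines in this thread the offset comes out to `0xbe5c2`, matching the earlier reports against 1.7.0.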
I had a crash today. nodeos 1.7.0 from deb, Ubuntu 18.10 using eosio_1.7.0-1-ubuntu-18.04_amd64.deb
```ini
plugin = eosio::chain_plugin
plugin = eosio::chain_api_plugin
plugin = eosio::db_size_api_plugin
plugin = eosio::state_history_plugin
trace-history = true
chain-state-history = true
```
```
[1342852.265624] nodeos[4640]: segfault at 8 ip 00000000004be8f2 sp 00007f4b6c610970 error 4 in nodeos[400000+28c2000]
```
I upgraded to 1.7.1... let's see how it goes.
From the Jungle TestNet chat 2 hours ago (Dioni EOSMetal):
Crashed again with 1.7.1...
This is the latest:
```
<4>warn 2019-04-05T10:20:48.803 thread-0 controller.cpp:249 emit ] signal handler threw exception
```
From the EOS Mainnet chat 4 hours ago (Eric - sw/eden):
Two of my API nodes started lagging and then died, one on 1.7.0 and one on 1.7.1. My third, at 1.7.1, survived...
The Jungle issue referenced above is a non-issue; that person is still on 1.7.0.
I'm going to close this issue as 1.7.1 is out and it addresses the segfault originally reported in this issue. Please open a new issue with any 1.7.1 issues.
Thanks @matthewdarwin @cc32d9 @eosusa for reporting these issues and providing information about the crashes and for testing.
I had 2 machines crash within seconds of each other earlier today. They are both basically p2p machines with 200 concurrent connections, running EOS Mainnet.
Ubuntu 18.04, nodeos 1.7.0 (my own compiled binaries)
I am logging this in case someone else notices the same thing, so then there may be a pattern to investigate.
Server 1
Server 2
Other p2p nodes unaffected.
config.ini: