segfault on nodeos 1.7.0 Ubuntu 18.04

matthewdarwin commented 5 years ago

I had 2 machines crash within seconds of each other earlier today. They are both basically p2p machines with 200 concurrent connections, running EOS Mainnet.

Ubuntu 18.04, nodeos 1.7.0 (my own compiled binaries)

I am logging this in case someone else notices the same thing, so that there may be a pattern to investigate.

Server 1

Mar 16 15:40:45 mainnet-public1 nodeos[8420]: info  2019-03-16T15:40:45.685 thread-0  producer_plugin.cpp:345       on_incoming_block    ] Received block 2c347a62650cf516... #47934547 @ 2019-03-16T15:40:45.500 signed by zbeosbp11111 [trxs: 25, lib: 47934215, conf: 0, latency: 185 ms]
Mar 16 15:40:46 mainnet-public1 nodeos[8420]: info  2019-03-16T15:40:46.187 thread-0  producer_plugin.cpp:345       on_incoming_block    ] Received block ed3747cf523efade... #47934548 @ 2019-03-16T15:40:46.000 signed by zbeosbp11111 [trxs: 17, lib: 47934215, conf: 0, latency: 187 ms]
Mar 16 15:40:46 mainnet-public1 kernel: [414443.919280] nodeos[8420]: segfault at 0 ip 00000000004b5519 sp 00007fff3e25f900 error 4 in nodeos[400000+28b3000]

Server 2

Mar 16 15:40:51 mainnet-public2 nodeos[21980]: info  2019-03-16T15:40:51.625 thread-0  producer_plugin.cpp:345       on_incoming_block    ] Received block f8dcf8ec6b5f245d... #47934559 @ 2019-03-16T15:40:51.500 signed by atticlabeosb [trxs: 18, lib: 47934227, conf: 0, latency: 125 ms]
Mar 16 15:40:52 mainnet-public2 nodeos[21980]: info  2019-03-16T15:40:52.105 thread-0  producer_plugin.cpp:345       on_incoming_block    ] Received block 9e361d46e88c3ae8... #47934560 @ 2019-03-16T15:40:52.000 signed by atticlabeosb [trxs: 24, lib: 47934227, conf: 0, latency: 105 ms]
Mar 16 15:40:52 mainnet-public2 kernel: [414392.189321] nodeos[22011]: segfault at 8 ip 00000000004be5c2 sp 00007fe1282c3970 error 4 in nodeos[400000+28b3000]
Mar 16 15:40:52 mainnet-public2 kernel: [414392.189334] nodeos[21980]: segfault at 8 ip 00000000004be5c5 sp 00007ffc69fd3510 error 4 in nodeos[400000+28b3000]

Other p2p nodes unaffected.
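
For anyone comparing their own crashes against the kernel lines above, here is a minimal Python sketch that pulls the fields out of a `segfault at ...` line and decodes the x86 page-fault error code. The helper names (`parse_segfault`, `addr2line_argv`) are made up for illustration and are not part of nodeos. The kernel prints the error code in hex; `error 4` means a user-mode read of an address with no page mapped, which at fault addresses 0 and 8 is consistent with a null pointer dereference. The mapping base `400000` suggests a non-position-independent executable, so the `ip` value could probably be passed to `addr2line` as is, assuming a binary built with debug symbols is available.

```python
import re

# Matches the x86 kernel segfault report, e.g.
# "nodeos[8420]: segfault at 0 ip ... sp ... error 4 in nodeos[400000+28b3000]"
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<image>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return the decoded fields of a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)  # the kernel prints this value in hex
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "ip": ip,
        "offset": ip - base,              # offset of ip inside the mapped image
        "page_present": bool(err & 0x1),  # False: no page mapped at fault_addr
        "write": bool(err & 0x2),         # False: the access was a read
        "user_mode": bool(err & 0x4),     # True: fault happened in user space
    }


def addr2line_argv(binary_path, ip):
    """Build an addr2line command line; binary_path is a hypothetical path."""
    return ["addr2line", "-f", "-C", "-e", binary_path, hex(ip)]


line = (
    "Mar 16 15:40:46 mainnet-public1 kernel: [414443.919280] nodeos[8420]: "
    "segfault at 0 ip 00000000004b5519 sp 00007fff3e25f900 error 4 "
    "in nodeos[400000+28b3000]"
)
info = parse_segfault(line)
print(info)
print(" ".join(addr2line_argv("/path/to/nodeos", info["ip"])))
```

Two crashes with the same `ip` from the same binary very likely hit the same instruction, which is why the repeated `4be5c2` values in this thread are worth noting. Binaries from different builds will have different addresses, so only compare `ip` values across identical builds.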

config.ini:

chain-threads = 2
blocks-dir = "blocks"
wasm-runtime = wabt
chain-state-db-size-mb = 24576
reversible-blocks-db-size-mb = 2048
contracts-console = false
https-client-validate-peers = 1
http-server-address = 0.0.0.0:8888
http-validate-host = false
access-control-allow-credentials = false
max-body-size = 1048576
p2p-listen-endpoint = 0.0.0.0:9876
p2p-max-nodes-per-host = 1
agent-name = "EOS Nation"
allowed-connection = any
max-clients = 200
connection-cleanup-period = 30
network-version-match = 0
sync-fetch-span = 2500
enable-stale-production = false
pause-on-startup = false
max-transaction-time = 30
max-irreversible-block-age = -1
keosd-provider-timeout = 5
txn-reference-block-lag = 0
abi-serializer-max-time-ms = 2000
verbose-http-errors = true
plugin = eosio::http_plugin
plugin = eosio::chain_api_plugin
plugin = eosio::net_api_plugin
plugin = eosio::producer_api_plugin
p2p-peer-address = [omitted many lines]

matthewdarwin commented 5 years ago

This came up yesterday in https://t.me/eosfullnodes as well:

EOSUSA Michael, [15.03.19 13:20]
13:14:15 kernel: nodeos[2511]: segfault at 8 ip 00000000004be5c2 sp 00007ffdfbb9dd30 error 4 in nodeos[400000+28b3000]

heifner commented 5 years ago

Api node as well or just p2p? Do you have a core file or stacktrace?

matthewdarwin commented 5 years ago

The API nodes and BP were fine. The only API requests the p2p nodes handle are requests for status updates (like /v1/chain/get_info every few seconds).

Sorry, I don't have core file or stack trace. If it keeps happening, I will enable core file generation.
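
One possible way to enable core file generation, sketched in Python under some assumptions: nodeos is started from a wrapper script, and the hard core-size limit on the machine is not 0. The function name `enable_core_dumps` and the nodeos path in the comment are hypothetical. Where the core file ends up depends on the kernel's `core_pattern` setting (stock Ubuntu typically pipes cores to apport), so that is worth checking separately.

```python
import resource


def enable_core_dumps():
    """Raise this process's soft core-file size limit to its hard limit.

    Child processes inherit the limit, so calling this before starting
    nodeos lets the kernel write a core file when it crashes.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # Raising the soft limit up to the hard limit needs no extra privileges.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    soft, hard = enable_core_dumps()
    print("core limit (soft, hard):", soft, hard)
    # Hypothetical launch, left commented out; path and arguments are
    # placeholders, not taken from this thread:
    # import subprocess
    # subprocess.run(["/path/to/nodeos", "--config-dir", "/path/to/config"])
```

If the hard limit is 0, this changes nothing and the limit has to be raised by whatever starts the process (for example the service manager).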

p2p machines have 32GB RAM.

eosusa commented 5 years ago

I also had it crash on 2 of my nodes immediately after upgrading and catching my logs up to the current block. I've already rolled my nodes back to 1.6.3 so can't provide any current diags/logs but can spin off a clone from the snapshot if you want additional information from it.

Both nodes are on Mainnet running API/P2P/StateHist plugins but no extras added in. At the time, neither server was exposed externally servicing requests. I also have another API/P2P node (externally exposed) that seems to be running along with no issues (fingers crossed).

eosusa commented 5 years ago

Also, Todd had mentioned it might have been the OOM killer, so I disabled it on both crashing servers and tried them again, but they both failed almost immediately after syncing to current blocks.

Nodes are Ubuntu 18.04 and have 16GB RAM, although I monitored and the crashes happened when not even 8GB was being allocated (fresh boot). Here are the only 2 entries from the System log at the time of the crash if they help:

13:14:15 kernel: nodeos[2511]: segfault at 8 ip 00000000004be5c2 sp 00007ffdfbb9dd30 error 4 in nodeos[400000+28b3000]
13:14:15 kernel: show_signal_msg: 24 callbacks suppressed

heifner commented 5 years ago

That is helpful. It seems we can probably rule out http threading. My guess is that it is the net_plugin threading. I actually found an issue Friday that I have a fix for in a PR. I can get that fix into a 1.7.1 next week.

heifner commented 5 years ago

A stack trace would help determine if that is indeed the case.

matthewdarwin commented 5 years ago

Crashed again today, sorry, no stack trace. This is on a p2p node.

Mar 18 12:44:55 mainnet-public2 nodeos[32730]: info  2019-03-18T12:44:55.704 thread-0  producer_plugin.cpp:345       on_incoming_block    ] Received block 23ecd532e0639c12... #48258609 @ 2019-03-18T12:44:55.500 signed by eosnewyorkio [trxs: 15, lib: 48258282, conf: 0, latency: 204 ms]
Mar 18 12:44:56 mainnet-public2 nodeos[32730]: info  2019-03-18T12:44:56.281 thread-0  producer_plugin.cpp:345       on_incoming_block    ] Received block a1b2011de9f4236f... #48258610 @ 2019-03-18T12:44:56.000 signed by eosnewyorkio [trxs: 24, lib: 48258282, conf: 0, latency: 281 ms]
Mar 18 12:44:56 mainnet-public2 kernel: [576634.616974] nodeos[32730]: segfault at 8 ip 00000000004be5c2 sp 00007fffd90888b0 error 4 in nodeos[400000+28b3000]

cc32d9 commented 5 years ago

I had a crash today. nodeos 1.7.0 from deb, Ubuntu 18.10 using eosio_1.7.0-1-ubuntu-18.04_amd64.deb

plugin = eosio::chain_plugin
plugin = eosio::chain_api_plugin
plugin = eosio::db_size_api_plugin

plugin = eosio::state_history_plugin
trace-history = true
chain-state-history = true

[1342852.265624] nodeos[4640]: segfault at 8 ip 00000000004be8f2 sp 00007f4b6c610970 error 4 in nodeos[400000+28c2000]

matthewdarwin commented 5 years ago

I upgraded to 1.7.1... let's see how it goes.

matthewdarwin commented 5 years ago

From Jungle TestNet chat from 2 hours ago (Dioni EOSMetal):

Crashed again with 1.7.1...

This is the latest: <4>warn 2019-04-05T10:20:48.803 thread-0 controller.cpp:249 emit ] signal handler threw exception

matthewdarwin commented 5 years ago

From EOS Mainnet chat 4 hours ago (Eric - sw/eden):

Two of my api nodes started lagging and then died, one on 1.7.0 and one on 1.7.1. My third at 1.7.1 survived...

matthewdarwin commented 5 years ago

Jungle issue referenced above is non-issue... person is still on 1.7.0

heifner commented 5 years ago

I'm going to close this issue as 1.7.1 is out and it addresses the segfault originally reported in this issue. Please open a new issue with any 1.7.1 issues.

Thanks @matthewdarwin @cc32d9 @eosusa for reporting these issues and providing information about the crashes and for testing.

EOSIO / eos

segfault on nodeos 1.7.0 Ubuntu 18.04 #6954