AntelopeIO / leap

C++ implementation of the Antelope protocol

nodeos 3.x/4.x SHiP stops serving blocks #1358

Open · eosusa opened this issue 1 year ago

eosusa commented 1 year ago

I've had this issue many times in the past where one SHiP reader (Hyperion) is using the same SHiP node as another server/service (Atomic), and something triggers the SHiP node to just stop sending blocks to the readers for the requested range. It doesn't log disconnects, and even after restarting the reader the websocket connects (and is actually able to read the ABI), but the node refuses to send any requested block information. I thought this was just something with my nodes (either version or memory), but I then experienced the issue using one of Nation's 4.0.3 SHiP nodes for EOS (my issues tended to be mostly on WAX using 3.1.4).

So in this scenario, I was attempting to sync the AtomicAssets API on EOS, and it connected and indexed thousands of blocks properly. I needed to update the server's default gateway, so I stopped the reader, made the change, and then, no matter what, Nation's SHiP node would connect but refused to send any blocks. I tried everything to get it working again (changing the gateway back, rebooting the server), but it just refused to deliver any blocks (while still connecting successfully). Apparently at that same time, another of Nation's services that was reading from the same SHiP node immediately stopped receiving block information too. After checking the SHiP logs and seeing no errors, he eventually restarted the SHiP node, and immediately both his and my readers started receiving blocks again after reconnecting.

Here's a snippet of the Atomic indexer logs as it launches, connects, and then is unable to get any blocks:

0|eosio-contract-api-filler  | 2023-06-30T17:31:05.395Z [PID:4772] [info] : Init handler atomicmarket for reader atomic-1
0|eosio-contract-api-filler  | 2023-06-30T17:31:05.396Z [PID:4772] [info] : Init handler atomictools for reader atomic-1
0|eosio-contract-api-filler  | 2023-06-30T17:31:05.397Z [PID:4772] [info] : Starting reader: atomic-1
0|eosio-contract-api-filler  | 2023-06-30T17:31:05.397Z [PID:4772] [warn] : Reader (atomic-1) start block cannot be lower than the last processed block. Ignoring config.
0|eosio-contract-api-filler  | 2023-06-30T17:31:05.397Z [PID:4772] [info] : Reader atomic-1 starting on block #99177100
0|eosio-contract-api-filler  | 2023-06-30T17:31:05.402Z [PID:4772] [info] : Connecting to ship endpoint ws://x.x.x.x:9090
0|eosio-contract-api-filler  | 2023-06-30T17:31:05.402Z [PID:4772] [info] : Ship connect options {"start_block_num":99177100,"end_block_num":4294967295,"max_messages_in_flight":50,"have_positions":"removed","irreversible_only":false,"fetch_block":true,"fetch_traces":true,"fetch_deltas":true}
0|eosio-contract-api-filler  | 2023-06-30T17:31:15.404Z [PID:4772] [warn] : Reader atomic-1 - No blocks processed [DS:0|SH:0|JQ:0] - Stopping in 3590 seconds
0|eosio-contract-api-filler  | 2023-06-30T17:31:20.404Z [PID:4772] [warn] : Reader atomic-1 - No blocks processed [DS:0|SH:0|JQ:0] - Stopping in 3585 seconds
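
For reference, the "Ship connect options" line above maps directly onto the fields of the state history get_blocks_request_v0 message the reader sends over the websocket once it has received the ABI. Below is a minimal TypeScript sketch of that shape, using the field names from the log (the have_positions element type is an assumption based on the state history ABI, and the values are just the ones logged above):

// Shape of the state history block request, mirroring the
// "Ship connect options" logged by the reader above.
interface BlockPosition {
    block_num: number;
    block_id: string;
}

interface GetBlocksRequestV0 {
    start_block_num: number;
    end_block_num: number;
    max_messages_in_flight: number;
    have_positions: BlockPosition[];
    irreversible_only: boolean;
    fetch_block: boolean;
    fetch_traces: boolean;
    fetch_deltas: boolean;
}

// The request the reader above is effectively making ("have_positions"
// was redacted as "removed" in the log, so an empty list is assumed here).
const request: GetBlocksRequestV0 = {
    start_block_num: 99177100,
    end_block_num: 4294967295, // 0xffffffff, i.e. stream indefinitely
    max_messages_in_flight: 50,
    have_positions: [],
    irreversible_only: false,
    fetch_block: true,
    fetch_traces: true,
    fetch_deltas: true,
};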

So it seems that something related to stopping the Atomic reader was able both to prevent me from receiving blocks on reconnection and, even worse, to cause the SHiP node to stop sending blocks to ANY of its connected clients. But a simple restart of the SHiP nodeos allowed the clients to reconnect properly and receive the block information being requested.

matthewdarwin commented 1 year ago

Yes, I had a 4.0.3 SHiP on EOS for this. Our EOS-EVM stopped syncing when eosusa disconnected the AA reader from our SHiP node.

heifner commented 1 year ago

Do you have any process to repeatably reproduce this? I've been trying various combinations of connecting/disconnecting/killing clients and have not been able to reproduce this.

matthewdarwin commented 1 year ago

I don't have steps to reproduce. It only happened once for me, that I know of.

bhazzard commented 1 year ago

Keeping this open pending reproducible steps, but for now we can't reproduce it in order to fix it.

eosusa commented 1 year ago

Just tried the same reconnection process with our Atomic indexer and Nation's StateHist node, and it did NOT experience the same issue this time. So it's definitely not something easily reproducible, but I experience it from time to time on our internal nodes, so I will keep an eye out for it and reach out if/when I see it again (and try to leave it broken so we can troubleshoot it).

coachbjork commented 9 months ago

We have been seeing similar issues. We can replicate the issue in two different ways:

  1. SHiP loses its connection for a bit (a few minutes), and while replaying blocks it suddenly stops and no new blocks are accepted, while it keeps asking peers for the same blocks over and over again.
  2. The SHiP reader asks for a block range that the SHiP node doesn't have, which creates the same situation as above.

In both scenarios the SHiP reader has a healthy connection to nodeos, except that it doesn't get any blocks. Nodeos resumes syncing immediately when we kill the websocket connection to it.

The issue seems to be related to asking for the ABI, as that is the last call the SHiP reader makes. Potentially this is what nodeos gets stuck on, so it stops syncing any blocks and sits in some loop?

We are continuing to investigate and narrow down this issue.

But I've been able to replicate both situations, and nodeos refuses to sync until the websocket connection is terminated, and instantly recovers as if nothing happened.

I will ask @pnx if he can add any further information, as he has investigated the network traffic between the client (SHiP reader) and nodeos.

pnx commented 9 months ago

The issue seems to be (in this case) that a call to /v1/chain/get_abi hangs, and therefore my application stops reading TCP data, resulting in the TCP receive buffer filling up on the client side.

So I believe that either (or both) of these events causes nodeos to hang. I have attached a screenshot from Wireshark.

[Wireshark screenshot attachment]
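
To make the failure mode concrete, here is a minimal TypeScript sketch (not the actual thalos code) of the client pattern described above: stop draining the SHiP websocket while blocking on a /v1/chain/get_abi HTTP call against the same nodeos. The endpoints and account name are placeholders, it assumes Node 18+ for the global fetch and a ws version that exposes pause()/resume(), and the real serialization and acking are omitted (the threading detail is spelled out in spoonincode's reply further down):

import WebSocket from 'ws';

// Placeholders: SHiP websocket and chain HTTP API on the same nodeos.
const SHIP_WS = 'ws://127.0.0.1:8889';
const CHAIN_API = 'http://127.0.0.1:8888';

const ws = new WebSocket(SHIP_WS, { perMessageDeflate: false });

ws.on('message', async (data: Buffer) => {
    // Stop reading from the socket while this frame is handled; as data
    // keeps arriving, the client's TCP receive buffer fills up, and then
    // nodeos' send buffer fills as well.
    ws.pause();

    // PROBLEM: in leap 4.0/5.0, state history holds the main-thread mutex
    // while it is streaming a block, so this get_abi request (which needs
    // the main thread) never completes -- and since the socket is never
    // resumed, nodeos never finishes streaming either. Deadlock, until the
    // websocket connection is torn down.
    const res = await fetch(`${CHAIN_API}/v1/chain/get_abi`, {
        method: 'POST',
        body: JSON.stringify({ account_name: 'atomicassets' }), // placeholder account
    });
    await res.json();

    ws.resume(); // never reached once the deadlock has occurred
});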

bhazzard commented 8 months ago

@coachbjork and @pnx Thank you for documenting your findings related to this issue. Based on the information you shared, it appears to be an intermittent problem related to SHiP and ABI requests. To effectively address it, we'll need to determine the root cause of some requests causing /v1/chain/get_abi to hang.

pnx commented 2 months ago

Got this issue again while running thalos. nodeos still has connections to peers but won't respond to API requests and won't receive any blocks. When the client (thalos) disconnects from the SHiP websocket, everything starts working again.

nodes.log thalos.log

Config:

block-log-retain-blocks = 1209600
state-history-log-retain-blocks = 1209600

# Antelope 4 settings
read-only-read-window-time-us = 50001
max-transaction-time = 30

# Database Mode
database-map-mode = mapped_private
#blocks-dir = "blocks"
state-dir = "state-dir"

# Endpoints
agent-name = "shufflingpixels_ship1"
p2p-listen-endpoint = 0.0.0.0:8887
http-server-address = 0.0.0.0:8888
state-history-endpoint = 0.0.0.0:8889

p2p-accept-transactions = false
api-accept-transactions = false

wasm-runtime = eos-vm-jit
eos-vm-oc-compile-threads = 2
eos-vm-oc-enable = 1
chain-threads = 2
validation-mode = light

chain-state-db-size-mb = 130072
resource-monitor-space-threshold = 99

contracts-console = false
read-mode = head
access-control-allow-origin = *
access-control-allow-headers = Origin, X-Requested-With, Content-Type, Accept
http-max-response-time-ms = -1
abi-serializer-max-time-ms = 990000
verbose-http-errors = true
http-validate-host = false
max-clients = 20
sync-fetch-span = 1000

# Plugins
trace-history = true
chain-state-history = true
plugin = eosio::http_plugin
plugin = eosio::state_history_plugin
plugin = eosio::chain_plugin
plugin = eosio::chain_api_plugin
plugin = eosio::db_size_api_plugin
plugin = eosio::net_plugin
plugin = eosio::net_api_plugin

# waxswedenorg: SE, Sweden
#p2p-peer-address = p2p.waxsweden.org:35777

# waxhiveguild: FI, Finland
#p2p-peer-address = peer1.hivebp.io:9876

# waxhiveguild: DE, Germany
#p2p-peer-address = peer2.hivebp.io:9876

# sentnlagents: GB, United Kingdom
#p2p-peer-address = waxp2p.sentnl.io:9876

# guild.nefty: DE, Germany
#p2p-peer-address = p2p-node1.neftyblocks.com:9876

# amsterdamwax: NL, Amsterdam
#p2p-peer-address = wax.eu.eosamsterdam.net:9101

# TACO - EU
#p2p-peer-address = peer1.wax.tacocrypto.io:9999
# EOSDAC - NL
p2p-peer-address = wax-p2p.eosdac.io:29876
# wecan - NL
#p2p-peer-address = seed1-wax-mainnet.wecan.dev:14998
# Blacklusion - DE
p2p-peer-address = peer1.wax.blacklusion.io:4646
# Bountyblok - NL
p2p-peer-address = p2p.wax.bountyblok.io:29876

nodeos version: v5.0.1wax01

spoonincode commented 1 month ago

When the client (thalos) disconnects from the SHiP websocket, everything starts working again.

nodeos refuses to sync until the websocket connection is terminated, and instantly recovers as if nothing happened.

This sounds like you're experiencing a deadlock due to the state history threading approach used in leap 4.0 and 5.0. While state history is in the process of streaming a block out of its websocket, it holds a mutex on the nodeos main thread, which prevents nodeos from doing much of anything: applying blocks, servicing RPC requests that require running on the main thread (get_abi), etc.

So if the state history client decides to make an RPC request and wait for its response, without reading the entirety of a state history message (to the point that the send buffer in nodeos doesn't have any free space), there will be a deadlock.

There are no plans to fix this in 4.0 or 5.0. It is fixed in Spring 1.0.

As a potential workaround you might be able to set max_messages_in_flight to 1, ensure you read the entire get_blocks_result_v0 message, do any sort of RPC calls etc, and only then send get_blocks_ack_request_v0; repeat for each block. This will ensure that state history is not holding the main thread mutex when you perform your RPC requests.
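
Below is a minimal TypeScript sketch of that workaround, assuming a reader built on the ws and eosjs packages (this is not the code of any of the readers discussed here; the endpoint, start block, and handleBlock body are placeholders, and error handling is omitted). It requests one message at a time, fully decodes the get_blocks_result_v0, does any processing or RPC calls, and only then sends get_blocks_ack_request_v0:

import WebSocket from 'ws';
import { Serialize } from 'eosjs';

const SHIP_WS = 'ws://127.0.0.1:8889'; // placeholder endpoint

const textEncoder = new TextEncoder();
const textDecoder = new TextDecoder();
let types: Map<string, Serialize.Type> | undefined;

const ws = new WebSocket(SHIP_WS, { perMessageDeflate: false });

// Serialize a ['variant_name', value] pair with the 'request' variant type
// from the state history ABI and send it over the websocket.
function send(request: [string, unknown]): void {
    const buffer = new Serialize.SerialBuffer({ textEncoder, textDecoder });
    Serialize.getType(types!, 'request').serialize(buffer, request);
    ws.send(buffer.asUint8Array());
}

// Placeholder for whatever the reader does with a block: decode traces and
// deltas, write to a database, make chain API calls, etc.
async function handleBlock(type: string, result: any): Promise<void> {
    console.log(type, result.this_block?.block_num);
}

ws.on('message', (data: Buffer) => {
    if (!types) {
        // The first message from state history is its ABI as JSON text.
        const abi = JSON.parse(data.toString());
        types = Serialize.getTypesFromAbi(Serialize.createInitialTypes(), abi);
        // Ask for exactly one message in flight so nodeos is never stuck
        // waiting on its send buffer while we do other work.
        send(['get_blocks_request_v0', {
            start_block_num: 99177100, // placeholder start block
            end_block_num: 0xffffffff,
            max_messages_in_flight: 1,
            have_positions: [],
            irreversible_only: false,
            fetch_block: true,
            fetch_traces: true,
            fetch_deltas: true,
        }]);
        return;
    }

    // Fully read and decode the get_blocks_result_v0 we just received...
    const buffer = new Serialize.SerialBuffer({
        textEncoder, textDecoder, array: new Uint8Array(data),
    });
    const [type, result] = Serialize.getType(types, 'result').deserialize(buffer);

    // ...finish all processing and RPC calls, and only then acknowledge the
    // message so nodeos starts streaming the next one.
    handleBlock(type, result).then(() => {
        send(['get_blocks_ack_request_v0', { num_messages: 1 }]);
    });
});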