mindreader stalled on snapshot boot

NatPDeveloper commented 4 years ago

Brief:

Running partial sync, push guarantee node. Node is not passing below logs. Starting from snapshot. Previously when restarting regularly it wanted a snapshot or a blocks.log from genesis. I thought I could just restart it, wouldn't do a clean shutdown after 10m of waiting, perhaps too impatient.

Running in screen on z1d.12xlarge ec2 instance

CPU pinned

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                   
 4686 root      20   0  121036  23236  18636 R 100.0  0.0   1:50.03 nodeos

version

sha1sum /root/go/bin/dfuseeos
ca40c8b48b1ecb9b7dfbbf11cc6e022797413984  /root/go/bin/dfuseeos

yaml

start:
  args:
  - mindreader
  flags:
    log-to-file: false
    mindreader-log-to-zap: true
    mindreader-no-blocks-log: true
    mindreader-restore-snapshot-name: 0146083494-08b50ea69b029e54fa4fc03299e809a7a94184d1ae3fd7b587434b4f8f633cf8-snapshot.bin

logs

2020-10-09T13:14:54.845Z (mindreader) performing bootstrap from snapshot (operator/operator.go:554) 
2020-10-09T13:14:54.845Z (mindreader) restoring snapshot prior starting process (operator/operator.go:592) 
2020-10-09T13:14:54.845Z (mindreader) checking if snapshot exists, mayber performing local override if it doesn't (operator/operator.go:631) {"snapshot_name": "0146083494-08b50ea69b029e54fa4fc03299e809a7a94184d1ae3fd7b587434b4f8f633cf8-snapshot.bin"}
2020-10-09T13:14:54.845Z (mindreader) getting snapshot from store (superviser/snapshot.go:156) {"snapshot_name": "0146083494-08b50ea69b029e54fa4fc03299e809a7a94184d1ae3fd7b587434b4f8f633cf8-snapshot.bin"}
2020-10-09T13:14:59.539Z (mindreader) sending initial start command (operator/operator.go:136) 
2020-10-09T13:14:59.539Z (mindreader) operator ready to receive commands (operator/operator.go:141) 
2020-10-09T13:14:59.539Z (mindreader) received operator command (operator/operator.go:241) {"command": "start", "params": null}
2020-10-09T13:14:59.539Z (mindreader) preparing for start (operator/operator.go:493) 
2020-10-09T13:14:59.539Z (mindreader) preparing to start chain (operator/operator.go:499) 
2020-10-09T13:14:59.539Z (mindreader) updating nodeos arguments before starting binary (superviser/superviser.go:204) 
2020-10-09T13:14:59.540Z (mindreader) creating new command instance and launch read loop (superviser/superviser.go:117) {"binary": "nodeos", "arguments": ["--config-dir=./mindreader", "--data-dir=/root/workspace/dfuse-data/mindreader/data", "--snapshot=/root/workspace/dfuse-data/mindreader/data/snapshots/0146083494-08b50ea69b029e54fa4fc03299e809a7a94184d1ae3fd7b587434b4f8f633cf8-snapshot.bin", "--pause-on-startup"]}
2020-10-09T13:14:59.540Z (mindreader) successfully start service (operator/operator.go:514) 
2020-10-09T13:14:59.540Z (mindreader) operator ready to receive commands (operator/operator.go:141) 
2020-10-09T13:14:59.558Z (mindreader) APPBASE: Warning: The following configuration items in the config.ini file are redundantly set to (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.558Z (mindreader)          their default value: (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.558Z (mindreader)              contracts-console, http-server-address (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.558Z (mindreader)          Explicit values will override future changes to application defaults. Consider commenting out or (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.559Z (mindreader)          removing these items (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.559Z (mindreader) info  2020-10-09T13:14:59.558 nodeos    chain_plugin.cpp:635          plugin_initialize    ] initializing chain plugin (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.559Z (mindreader) info  2020-10-09T13:14:59.559 nodeos    chain_plugin.cpp:466          operator()           ] Support for builtin protocol feature 'PREACTIVATE_FEATURE' (with digest of '0ec7e080177b2c02b278d5088611686b49d739925a92d9bfcacd7fc6b74053bd') is enabled without activation restrictions (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.559Z (mindreader) info  2020-10-09T13:14:59.559 nodeos    chain_plugin.cpp:453          operator()           ] Support for builtin protocol feature 'ONLY_LINK_TO_EXISTING_PERMISSION' (with digest of '1a99a59d87e06e09ec5b028a9cbb7749b4a5ad8819004365d02dc4379a8b7241') is enabled with preactivation required (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.559Z (mindreader) info  2020-10-09T13:14:59.559 nodeos    chain_plugin.cpp:453          operator()           ] Support for builtin protocol feature 'FORWARD_SETCODE' (with digest of '2652f5f96006294109b3dd0bbde63693f55324af452b799ee137a81a905eed25') is enabled with preactivation required (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.559Z (mindreader) info  2020-10-09T13:14:59.559 nodeos    chain_plugin.cpp:453          operator()           ] Support for builtin protocol feature 'WTMSIG_BLOCK_SIGNATURES' (with digest of '299dcb6af692324b899b39f16d5a530a33062804e41f09dc97e9f156b4476707') is enabled with preactivation required (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.559Z (mindreader) info  2020-10-09T13:14:59.559 nodeos    chain_plugin.cpp:453          operator()           ] Support for builtin protocol feature 'REPLACE_DEFERRED' (with digest of 'ef43112c6543b88db2283a2e077278c315ae2c84719a8b25f25cc88565fbea99') is enabled with preactivation required (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.559Z (mindreader) info  2020-10-09T13:14:59.559 nodeos    chain_plugin.cpp:453          operator()           ] Support for builtin protocol feature 'NO_DUPLICATE_DEFERRED_ID' (with digest of '4a90c00d55454dc5b059055ca213579c6ea856967712a56017487886a4d4cc0f') is enabled with preactivation required (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.559Z (mindreader) info  2020-10-09T13:14:59.559 nodeos    chain_plugin.cpp:453          operator()           ] Support for builtin protocol feature 'RAM_RESTRICTIONS' (with digest of '4e7bf348da00a945489b2a681749eb56f5de00b900014e137ddae39f48f69d67') is enabled with preactivation required (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.559Z (mindreader) info  2020-10-09T13:14:59.559 nodeos    chain_plugin.cpp:453          operator()           ] Support for builtin protocol feature 'WEBAUTHN_KEY' (with digest of '4fca8bd82bbd181e714e283f83e1b45d95ca5af40fb89ad3977b653c448f78c2') is enabled with preactivation required (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.559Z (mindreader) info  2020-10-09T13:14:59.559 nodeos    chain_plugin.cpp:453          operator()           ] Support for builtin protocol feature 'DISALLOW_EMPTY_PRODUCER_SCHEDULE' (with digest of '68dcaa34c0517d19666e6b33add67351d8c5f69e999ca1e37931bc410a297428') is enabled with preactivation required (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.559Z (mindreader) info  2020-10-09T13:14:59.559 nodeos    chain_plugin.cpp:453          operator()           ] Support for builtin protocol feature 'ONLY_BILL_FIRST_AUTHORIZER' (with digest of '8ba52fe7a3956c5cd3a656a3174b931d3bb2abb45578befc59f283ecd816a405') is enabled with preactivation required (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.559Z (mindreader) info  2020-10-09T13:14:59.559 nodeos    chain_plugin.cpp:453          operator()           ] Support for builtin protocol feature 'RESTRICT_ACTION_TO_SELF' (with digest of 'ad9e3d8f650687709fd68f4b90b41f7d825a365b02c23a636cef88ac2ac00c43') is enabled with preactivation required (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.559Z (mindreader) info  2020-10-09T13:14:59.559 nodeos    chain_plugin.cpp:453          operator()           ] Support for builtin protocol feature 'FIX_LINKAUTH_RESTRICTION' (with digest of 'e0fb64b1085cc5538970158d05a009c24e276fb94e1a0bf6a528b48fbc4ff526') is enabled with preactivation required (log_plugin/to_zap_log_plugin.go:135) 
2020-10-09T13:14:59.559Z (mindreader) info  2020-10-09T13:14:59.559 nodeos    chain_plugin.cpp:453          operator()           ] Support for builtin protocol feature 'GET_SENDER' (with digest of 'f0af56d2c5a48d60a4a5b5c903edfb7db3a736a94ed589d0b797df33ff9d3e1d') is enabled with preactivation required (log_plugin/to_zap_log_plugin.go:135)

config.ini

# Plugins
plugin = eosio::producer_plugin      # for state snapshots
plugin = eosio::producer_api_plugin  # for state snapshots
plugin = eosio::chain_plugin
plugin = eosio::chain_api_plugin
plugin = eosio::http_plugin
plugin = eosio::db_size_api_plugin
plugin = eosio::net_api_plugin

# Chain
chain-state-db-size-mb = 16384
reversible-blocks-db-size-mb = 1024
max-transaction-time = 150000

read-mode = head
p2p-accept-transactions = false
api-accept-transactions = false

# P2P
agent-name = dfuse for EOSIO (mindreader)
p2p-server-address = 127.0.0.1:9877
p2p-listen-endpoint = 127.0.0.1:9877
p2p-max-nodes-per-host = 2
connection-cleanup-period = 60

# HTTP
access-control-allow-origin = *
http-server-address = 127.0.0.1:9888
http-max-response-time-ms = 1000
http-validate-host = false
verbose-http-errors = true

# Enable deep mind
deep-mind = true
contracts-console = true

wasm-runtime = eos-vm-jit
eos-vm-oc-enable = true
eos-vm-oc-compile-threads = 4

## Peers (choose your favorite ones, those are from https://github.com/cryptokylin/CryptoKylin-Testnet/blob/master/config/peer-config.ini)
p2p-peer-address = 18.234.6.119:80
p2p-peer-address = api-full1.eoseoul.io:9876
p2p-peer-address = api-full2.eoseoul.io:9876
p2p-peer-address = bp.cryptolions.io:9876
p2p-peer-address = bp.eosbeijing.one:8080
p2p-peer-address = br.eosrio.io:9876
p2p-peer-address = eos-seed-de.privex.io:9876
p2p-peer-address = fn001.eossv.org:443
p2p-peer-address = fullnode.eoslaomao.com:443
p2p-peer-address = mainnet.eoscalgary.io:5222
p2p-peer-address = node.eosflare.io:1883
p2p-peer-address = node1.eoscannon.io:59876
p2p-peer-address = p2p.eosdetroit.io:3018
p2p-peer-address = p2p.genereos.io:9876
p2p-peer-address = p2p.meet.one:9876
p2p-peer-address = peer.eosn.io:9876
p2p-peer-address = peer.main.alohaeos.com:9876
p2p-peer-address = peer1.mainnet.helloeos.com.cn:80
p2p-peer-address = peer2.mainnet.helloeos.com.cn:80
p2p-peer-address = publicnode.cypherglass.com:9876
p2p-peer-address = seed1.greymass.com:9876
p2p-peer-address = seed2.greymass.com:9876

sduchesneau commented 4 years ago

is your nodeos getting blocks ?

cleos -u http://localhost:9888 get info <-- is it advancing ? (you can also use curl localhost:9888/v1/chain/get_info if cleos not available)

does it have connected peers ?

cleos -u http://localhost:9888 net peers

If the nodeos endpoint does not listen yet, it could be just slow performance, are you using a SSD for storage ? do you have decent perf settings in your config.ini ?

eos-vm-oc-enable = true
eos-vm-oc-compile-threads = 4

If the blocks are advancing but mindreader is not producing anything, have you certain that the config.ini contains the deepmind flags ?

deep-mind = true
contracts-console = true

NatPDeveloper commented 4 years ago

nodeos not getting blocks cleos -u http://localhost:9888 get info Failed to connect to nodeos at http://localhost:9888; is nodeos running?
cleos -u http://localhost:9888 net peers Failed to connect to nodeos at http://localhost:9888; is nodeos running?
1TB gp2 ssd

df -h --total
...
total           1.6T  396G  1.2T  25% -

Turned console ON was off as not needed for scope of node
Oc threads adjusted to 4 from 2
changed http server address from 8888, tried for compatibility with other services, to 9888 once more
added mindreader-no-blocks-log: true to rm blocks log for snapshot
config.ini added to OP

Same results.

sduchesneau commented 4 years ago

Could you try grepping the nodeos command (ps aux |grep nodeos) and then running it directly, without dfuse ?

ex: nodeos -d /data -c /your-config --snapshot=... You will see the logs directly from nodeos, it could be faster to determine what's going on. I would expect it to take more than a few minutes to load the state from snapshot, but then you would know directly from nodeos and see if any error pop up

sduchesneau commented 4 years ago

( try it with deep-mind=false and contracts-console=false first )

NatPDeveloper commented 4 years ago

Tried deep-mind: false and contracts-console: false. Waited 10m then killed
Tried nodeos directly, same logs

nodeos --config-dir=./mindreader --data-dir=/root/workspace/dfuse-data/mindreader/data --snapshot=/root/workspace/dfuse-data/mindreader/data/snapshots/0146083494-08b50ea69b029e54fa4fc03299e809a7a94184d1ae3fd7b587434b4f8f633cf8-snapshot.bin --pause-on-startup

Usually it says something like "loading from snapshot this may take a long time" but not seeing that here.

sduchesneau commented 4 years ago

Maybe it takes more than 10 minutes for those big snapshots. Do you have logging.json in your config dir that may prevent some logs from appearing ?

NatPDeveloper commented 4 years ago

I'm logging from the command itself. I left it for hours, same thing.

NatPDeveloper commented 4 years ago

Attempting with c5.18xlarge instance for an hour, will report back.

NatPDeveloper commented 4 years ago

Just a bunch of

2020-10-09T18:30:24.672Z (mindreader) operator ready to receive commands (operator/operator.go:135) 
2020-10-09T18:45:24.672Z (mindreader) received operator command (operator/operator.go:235) {"command": "snapshot", "params": null}
2020-10-09T18:45:24.672Z (mindreader) preparing for snapshot (operator/operator.go:341) 
2020-10-09T18:45:24.672Z (mindreader) asking nodeos API to create a snapshot (superviser/snapshot.go:33) 
2020-10-09T18:45:24.672Z (mindreader) command failed (operator/operator.go:518) {"cmd": "snapshot", "error": "unable to take snapshot: api call failed: http://:9888/v1/producer/create_snapshot: Post \"http://:9888/v1/producer/create_snapshot\": dial tcp :9888: connect: connection refused"}
2020-10-09T18:45:24.672Z (mindreader) operator ready to receive commands (operator/operator.go:135)

I need the snapshots for this instance in case the node crashes.

sduchesneau commented 4 years ago

this is just logs because the operator is trying to trigger snapshots but the node is not ready yet. they don't say anything...

The real issue is: "with a given snapshot, nodeos never becomes ready, even without deep-mind enabled".

Since the command nodeos --config-dir=./mindreader --data-dir=/root/workspace/dfuse-data/mindreader/data --snapshot=/root/workspace/dfuse-data/mindreader/data/snapshots/0146083494-08b50ea69b029e54fa4fc03299e809a7a94184d1ae3fd7b587434b4f8f633cf8-snapshot.bin --pause-on-startup fails to get you a running nodeos, then the issue is in nodeos or in your snapshot file, not in dfuse code.

Does it work with a different snapshot ? Does it work with a different nodeos version ? How do you expect dfuse to react when the nodeos instance never becomes ready ?

NatPDeveloper commented 4 years ago

Yeah going to go get another snapshot, start from scratch if that's a no go. If the issue repeats will re-open.

NatPDeveloper commented 4 years ago

New snap looks good. Previous snap came from mindreader. Maybe an issue with how it was created or the sequence of events.

dfuse-io / dfuse-eosio

mindreader stalled on snapshot boot #167