consensus-shipyard / ipc

🌳 Spawn multi-level trees of customized, scalable, EVM-compatible networks with IPC. L2++ powered by FVM, Wasm, libp2p, IPFS/IPLD, and CometBFT.
https://ipc.space

metrics flatline, no more 'complete proposal block' messages seen on validators in federated mode after adding a 3rd #1135

Closed: gvelez17 closed this issue 2 months ago

gvelez17 commented 2 months ago

Issue type

Support

Have you reproduced the bug with the latest dev version?

Yes

Version

0.1.0

Custom code

Yes

OS platform and distribution

Ubuntu 22.04

Describe the issue

Created a federated subnet /r314159/t410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq on 9/13 with 2 validator nodes, power 10000 each. The abci_committed_block_height metric grew normally.

On 9/16 at approximately 16:40 UTC we added a 3rd validator with power 2. The Glif nodes may also have gone down around that time. From 9/16 17:15 UTC the abci_committed_block_height metric flatlined.

(screenshot: abci_committed_block_height metric flatlining after the 3rd validator was added)

(times in the image are in PST)

Restarted the fendermint and cometbft containers several hours later.

The last 'received complete proposal block' message in the cometbft logs appears at 2024-09-16|22:50:39.

Currently the nodes are not seeing any 'proposal block' messages in the cometbft logs, and abci_committed_block_height remains flat. We need the nodes to start working again, but it is not clear how.
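
For what it's worth, a quick way to re-check when the last complete proposal arrived is to grep the cometbft container logs (the container name here is a placeholder, substitute your own):

docker logs <cometbft-container> 2>&1 | grep 'received complete proposal block' | tail -n 1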

Repro steps

This has happened to us twice when adding a 3rd node in federated mode. The first time was on Friday, when the validators each had federated power of 2. We have since created a new subnet and it has happened again.
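
For context, the 3rd validator was added with set-federated-power along these lines (a sketch only; the addresses/keys are placeholders and exact flag names may differ by ipc-cli version):

ipc-cli subnet set-federated-power \
  --subnet /r314159/t410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq \
  --from $OWNER_ADDRESS \
  --validator-addresses $VALIDATOR_1 $VALIDATOR_2 $VALIDATOR_3 \
  --validator-pubkeys $PUBKEY_1 $PUBKEY_2 $PUBKEY_3 \
  --validator-power 10000 10000 2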

Relevant log output

Not sure if this is the most relevant snippet; this is around the time that the node stopped receiving proposals, but the flatline actually began several hours earlier.

I[2024-09-16|22:50:39.376] received proposal                            module=consensus proposal="Proposal{286202/0 (61217EF0483E665916F16DCA289CE5714BD6D989ECB37AABBAE60AF1AAAB427A:1:2BF617C15BBF, -1) 61AFE75411CC @ 2024-09-16T17:14:24.328154069Z}" proposer=51AEE530738E76C647B97E76305EEB8818C35E27
I[2024-09-16|22:50:39.376] Replay: BlockPart                            module=consensus height=286202 round=0 peer=53168b782187ff431d86849251ec28f274d9c703
I[2024-09-16|22:50:39.376] received complete proposal block             module=consensus height=286202 hash=61217EF0483E665916F16DCA289CE5714BD6D989ECB37AABBAE60AF1AAAB427A
I[2024-09-16|22:50:39.922] New websocket connection                     module=rpc-server protocol=websocket remote=172.18.0.6:48544
I[2024-09-16|22:50:39.922] service start                                module=rpc-server protocol=websocket remote=172.18.0.6:48544 msg="Starting wsConnection service" impl=wsConnection
I[2024-09-16|22:50:39.922] Subscribe to query                           module=rpc remote=172.18.0.6:48544 query="tm.event = 'NewBlock'"
I[2024-09-16|22:50:39.922] WSJSONRPC                                    module=rpc-server protocol=websocket remote=172.18.0.6:48544 method=subscribe
I[2024-09-16|22:50:40.164] Replay: New Step                             module=consensus height=286202 round=0 step=RoundStepPrevote
I[2024-09-16|22:50:40.164] Replay: Vote                                 module=consensus height=286202 round=0 type=SIGNED_MSG_TYPE_PREVOTE blockID=61217EF0483E665916F16DCA289CE5714BD6D989ECB37AABBAE60AF1AAAB427A:1:2BF617C15BBF peer=53168b782187ff431d86849251ec28f274d9c703
I[2024-09-16|22:50:40.165] Replay: Timeout                              module=consensus height=286202 round=0 step=RoundStepPropose dur=3s
I[2024-09-16|22:50:40.165] Replay: New Step                             module=consensus height=286202 round=0 step=RoundStepPropose
I[2024-09-16|22:50:40.165] Replay: New Step                             module=consensus height=286202 round=0 step=RoundStepPrevote
I[2024-09-16|22:50:40.165] Replay: Proposal                             module=consensus height=286202 round=0 header=1:2BF617C15BBF pol=-1 peer=53168b782187ff431d86849251ec28f274d9c703
I[2024-09-16|22:50:40.165] Replay: Vote                                 module=consensus height=286201 round=0 type=SIGNED_MSG_TYPE_PRECOMMIT blockID=F3411281509D72C63E5F26E6A7CD00B1481AABBEF205DC48B494BA79E4E16455:1:E2AFB8D15671 peer=53168b782187ff431d86849251ec28f274d9c703
I[2024-09-16|22:50:40.165] Replay: BlockPart                            module=consensus height=286202 round=0 peer=53168b782187ff431d86849251ec28f274d9c703
I[2024-09-16|22:50:40.165] Replay: Vote                                 module=consensus height=286202 round=0 type=SIGNED_MSG_TYPE_PREVOTE blockID=61217EF0483E665916F16DCA289CE5714BD6D989ECB37AABBAE60AF1AAAB427A:1:2BF617C15BBF peer=53168b782187ff431d86849251ec28f274d9c703
gvelez17 commented 2 months ago

@cryptoAtwill @raulk I put the issue here so others could see it; we have limited seats in the hoku-deploy repo.

gvelez17 commented 2 months ago

Fendermint logs

Key observation: ProposalProcessed events completely halted.

root@c2-xm-f37e35:~# grep 'ProposalProcessed' /tmp/f-all.err | grep '2024-09-14' | wc
 103219 1032190 41490853
root@c2-xm-f37e35:~# grep 'ProposalProcessed' /tmp/f-all.err | grep '2024-09-17' | wc
      0       0       0

On 9/14, when things were working well, the logs showed:

2024-09-14T00:00:04.172580Z  INFO fendermint/app/src/app.rs:698: event=ProposalProcessed is_accepted=true block_height=1802 block_hash="B7A907496E783088CDACEEF3EB631E393498B5A4787ABAF8C623120BE164BE18" num_txs=0 proposer="84D0DF30D7D27B0D4B1414D00A20C52B6DDEBB5D"
2024-09-14T00:00:04.426911Z  INFO fendermint/app/src/app.rs:891: event=NewBlock block_height=1802
2024-09-14T00:00:04.741345Z  INFO query{request=Query { data: b"\xa1jActorStateV\x04\n\x91\xc9/\xee\xc2\x85\xddV\xd5\x06\xf9\xebM\xdc.\xb7\xcdnq\xdd", path: "", height: block::Height(9223372036854775807), prove: false }}: fendermint/vm/interpreter/src/fvm/query.rs:66: query actor state height

On 9/17, no new blocks; the logs look like:

2024-09-17T00:01:13.899797Z  INFO fetch_data{height=1973485 block_hash=[197, 170, 93, 78, 100, 151, 154, 43, 181, 146, 8, 171, 67, 100, 185, 91, 77, 154, 81, 154, 222, 217, 157, 215, 178, 49, 52, 111, 230, 22, 170, 161]}:get_validator_changes{height=1973485}: ipc/provider/src/manager/evm/manager.rs:202: querying validator changes in evm subnet contract: 0xdc4e…a1ef
2024-09-17T00:01:14.355625Z  INFO fendermint/vm/topdown/src/sync/syncer.rs:259: event=NewParentView is_null=false block_height=1973485 block_hash="c5aa5d4e64979a2bb59208ab4364b95b4d9a519aded99dd7b231346fe616aaa1" num_msgs=0 num_validator_changes=0
2024-09-17T00:01:33.879929Z  INFO fetch_data{height=1973486 block_hash=[122, 112, 230, 197, 22, 29, 82, 162, 30, 195, 154, 160, 60, 250, 37, 147, 247, 176, 137, 22, 29, 253, 30, 13, 101, 254, 199, 24, 227, 56, 178, 175]}:get_validator_changes{height=1973486}: ipc/provider/src/manager/evm/manager.rs:202: querying validator changes in evm subnet contract: 0xdc4e…a1ef

Around 9/16 17:15, approximately when it broke:

2024-09-17T00:00:34.362664Z  INFO fendermint/vm/topdown/src/sync/syncer.rs:259: event=NewParentView is_null=false block_height=1973484 block_hash="b0f6a09ff7c5c6de2bef6b14fbecfaa02cb3f1afec92259b3bc776d28b78e14f" num_msgs=0 num_validator_changes=0
2024-09-17T00:01:03.181328Z ERROR fendermint/vm/topdown/src/sync/mod.rs:191: sync with parent encountered error error="failed to get Tendermint status"
2024-09-17T00:01:13.899797Z  INFO fetch_data{height=1973485 block_hash=[197, 170, 93, 78, 100, 151, 154, 43, 181, 146, 8, 171, 67, 100, 185, 91, 77, 154, 81, 154, 222, 217, 157, 215, 178, 49, 52, 111, 230, 22, 170, 161]}:get_validator_changes{height=1973485}: ipc/provider/src/manager/evm/manager.rs:202: querying validator changes in evm subnet contract: 0xdc4e…a1ef
2024-09-17T00:01:14.355625Z  INFO fendermint/vm/topdown/src/sync/syncer.rs:259: event=NewParentView is_null=false block_height=1973485 block_hash="c5aa5d4e64979a2bb59208ab4364b95b4d9a519aded99dd7b231346fe616aaa1" num_msgs=0 num_validator_changes=0
2024-09-17T00:01:33.879929Z  INFO fetch_data{height=1973486 block_hash=[122, 112, 230, 197, 22, 29, 82, 162, 30, 195, 154, 160, 60, 250, 37, 147, 247, 176, 137, 22, 29, 253, 30, 13, 101, 254, 199, 24, 227, 56, 178, 175]}:get_validator_changes{height=1973486}: ipc/provider/src/manager/evm/manager.rs:202: querying validator changes in evm subnet contract: 0xdc4e…a1ef
2024-09-17T00:01:34.333529Z  INFO fendermint/vm/topdown/src/sync/syncer.rs:259: event=NewParentView is_null=false block_height=1973486 block_hash="7a70e6c5161d52a21ec39aa03cfa2593f7b089161dfd1e0d65fec718e338b2af" num_msgs=0 num_validator_changes=0
2024-09-17T00:01:53.180739Z ERROR fendermint/vm/topdown/src/sync/mod.rs:191: sync with parent encountered error error="failed to get Tendermint status"

The 'failed to get Tendermint status' error happened occasionally even when things were working, but is far more frequent now:

root@c2-xm-f37e35:~# grep 'failed to get Tendermint status' /tmp/f-all.err | grep '2024-09-14' | wc
     16     208    3344
root@c2-xm-f37e35:~# grep 'failed to get Tendermint status' /tmp/f-all.err | grep '2024-09-17' | wc
    723    9399  151107
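
To check whether the local CometBFT RPC is reachable at all (assuming the error refers to the local node and the default RPC port 26657), something like:

curl -s localhost:26657/status | jq .result.sync_info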

gvelez17 commented 2 months ago

The cometbft addrbook shows a last_success time just prior to the error condition:

63505f0688d0:/cometbft# cat /cometbft/config/addrbook.json
{
    "key": "e9da010b4783a6485bf4007c",
    "addrs": [
        {
            "addr": {
                "id": "53168b782187ff431d86849251ec28f274d9c703",
                "ip": "170.205.62.124",
                "port": 26656
            },
            "src": {
                "id": "a918bb7532318af7ec00eef2ec3fa1341d8f78e3",
                "ip": "0.0.0.0",
                "port": 26656
            },
            "buckets": [
                25
            ],
            "attempts": 1,
            "bucket_type": 2,
            "last_attempt": "2024-09-16T21:42:33.928681864Z",
            "last_success": "2024-09-16T17:03:30.733503089Z",
            "last_ban_time": "0001-01-01T00:00:00Z"
        }
    ]
}
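
A quick way to list last_attempt/last_success for every peer in the addrbook (assuming jq is available in the container):

jq '.addrs[] | {id: .addr.id, last_attempt, last_success}' /cometbft/config/addrbook.json
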
gvelez17 commented 2 months ago

NOTE: I had erroneously NOT started the relayer until 9/16, so the successful period was without a relayer running.

The relayer logs are as follows:

root@c2-xm-f37e35:~# docker logs 5dbd5ae739e4
2024-09-16T21:25:58.920099Z  INFO ipc_provider::checkpoint: launching bottom-up relayer, parent: /r314159, child: /r314159/t410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq for t410fshes73wcqxovnvig7hvu3xbow7gw44o5e6a2lsa
2024-09-16T21:25:59.037570Z  INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:31:34.139495Z  INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:37:08.978150Z  INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:42:11.320313Z ERROR ipc_provider::checkpoint: cannot submit checkpoint for submitter: t410fshes73wcqxovnvig7hvu3xbow7gw44o5e6a2lsa due to (code: 0, message: Tendermint RPC error: response error

Caused by:
    Internal error: leveldb: closed (code: -32603)

Location:
    /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10:9, data: None)
2024-09-16T21:42:26.647738Z  INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:42:28.348755Z ERROR ipc_provider::checkpoint: cannot submit checkpoint for submitter: t410fshes73wcqxovnvig7hvu3xbow7gw44o5e6a2lsa due to (code: 0, message: Tendermint RPC error: HTTP error

Caused by:
    connection error: Connection reset by peer (os error 104)

Location:
    /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10:9, data: None)
2024-09-16T21:42:43.462467Z  INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:42:43.464295Z ERROR ipc_provider::checkpoint: cannot submit checkpoint for submitter: t410fshes73wcqxovnvig7hvu3xbow7gw44o5e6a2lsa due to (code: 0, message: Tendermint RPC error: HTTP error

Caused by:
    error trying to connect: tcp connect error: Connection refused (os error 111)

Location:
    /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10:9, data: None)
2024-09-16T21:42:58.580812Z  INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:48:36.238959Z  INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:54:10.901048Z  INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:59:47.961852Z  INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T22:05:22.559853Z  INFO ipc_provider::checkpoint: last submission height: 0
gvelez17 commented 2 months ago

One theory is that the problem is the wallet balances; there is also an error we do not understand about 'Sender invalid'.

testnet-ansible-v3@c2-xm-f37e35:/home$ ipc-cli subnet send-value --subnet /r314159/t410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq --from 0x91c92feec285dd56d506f9eb4ddc2eb7cd6e71dd --to 0x1ae0a65e99d41c74e85ae58e8eec65fe9f569fd0  .0002
2024-09-17T18:57:18.467241Z ERROR ipc_cli: main process failed: error processing command Some(Subnet(SubnetCommandsArgs { command: SendValue(SendValueArgs { from: Some("0x91c92feec285dd56d506f9eb4ddc2eb7cd6e71dd"), to: "0x1ae0a65e99d41c74e85ae58e8eec65fe9f569fd0", subnet: "/r314159/t410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq", amount: 0.0002 }) })): (code: 1, message: failed to estimate gas: pre-validation failed: Sender invalid
, data: Some(String("")))
testnet-ansible-v3@c2-xm-f37e35:/home$ ipc-cli wallet balances --subnet /r314159/t410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq --wallet-type evm
0x91c92feec285dd56d506f9eb4ddc2eb7cd6e71dd - Balance: 0.0
0x8ca38bf111456f4733e9ff55ef21f6e693f83baf - Balance: 0.0
0x731d8dcfd82a90b63e9fa0c49f0c43f9c3b5a4cf - Balance: 0.0
0x93b269addeefd6276d115a0816c4d841255788fb - Balance: 0.0
testnet-ansible-v3@c2-xm-f37e35:/home$ ipc-cli subnet list --parent /r314159
/r314159/t410fwwfc3my2jrejevghickfx66tjhubanqhvauwizi - collateral: 0.0 FIL, circ.supply: 0.0 FIL, genesis: 1974726
/r314159/t410fhcefljyminf6mgmcj7tqzef2fhppy7wg23r22ea - collateral: 0.0 FIL, circ.supply: 20110.0100001000000002 FIL, genesis: 1958160
/r314159/t410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq - collateral: 0.0 FIL, circ.supply: 1000.000002 FIL, genesis: 1964397
/r314159/t410fdt67sqg7fvju6z3y7jl3awvrvu445jihustvmwi - collateral: 0.0 FIL, circ.supply: 0.0030000000000001 FIL, genesis: 1952931
/r314159/t410fmnm74ynqjpkdmfxtakd4b7or2jbp4s2jsixretq - collateral: 0.0 FIL, circ.supply: 0.0000000100000001 FIL, genesis: 1957433

To attempt to get funds into the subnet, we performed both

ipc-cli cross-msg fund-with-token --subnet /r314159/t410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq --from 0x91c92feec285dd56d506f9eb4ddc2eb7cd6e71dd --approve 1000000000000

and

cast send $SOURCE_SUPPLY_ADDRESS \
  "mint(address,uint256)" \
  0x7287922457de4095e7b7e8cdcddf6be48c3f1f4a \
  1000000000000000 \
  --private-key $VALIDATOR_0_PRIVATE_KEY \
  --rpc-url $PARENT_CHAIN_RPC

on 9/16/2024.

On 9/13/2024, when the validator nodes appeared to be working, the only funding had been through the set-federated-power command.

gvelez17 commented 2 months ago

@sanderpick @raulk if you think I should start a new subnet, let me know. My concern is that this one has already been given to Matrixed.link, and we really don't know why it stopped processing proposals, so I'm not sure starting a new one is the answer.

gvelez17 commented 2 months ago

At 9/17/2024 21:18 UTC I simultaneously restarted all containers on all 3 validators (the 2 original and the new one), but it did not make any difference: same types of log messages, no new blocks.

gvelez17 commented 2 months ago

From @3benbox (Ben Wilson): these broadcast errors may be a root cause, namely that the checkpoints are not being signed.

docker logs testnet-belgrade-v1-fendermint 2>&1 | grep -i checkpoint
...
2024-09-17T14:40:07.979239Z  INFO fendermint/vm/interpreter/src/fvm/checkpoint.rs:124: event=NewBottomUpCheckpoint block_height=600 block_hash="48da220db674f65ad544b1b69feb2539252607ed64ede28dfeebb78f51391512" num_msgs=0 next_configuration_number=2
2024-09-17T14:40:09.988490Z ERROR fendermint/vm/interpreter/src/fvm/exec.rs:224: error broadcasting checkpoint signature error=failed to broadcast checkpoint signature

resulting in empty signatory lists:

[[4,10,181,138,45,179,26,76,72,146,84,199,64,148,91,251,211,73,232,16,54,7]]],"block_height":2400,"block_hash":"976d0261db46657080319252ee58c8d1dc62e3671fcd907f339ab3ce39599487","next_configuration_number":0,"msgs":[]},"signatures":[],"signatories":[]}
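
A quick way to see how often that broadcast error is firing (same container as above):

docker logs testnet-belgrade-v1-fendermint 2>&1 | grep -c 'error broadcasting checkpoint signature'
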
sanderpick commented 2 months ago

Those errors will happen if the validators don't have funds in the subnet. This will only affect bottom-up relayer messages, not block production.

From an initial look... this seems to be a misconfiguration with cometbft. I would recommend starting fresh with two working validators and then trying to add a third again.

I can try this on some separate infra just to sanity check.

cryptoAtwill commented 2 months ago

Btw @gvelez17, I also noticed a possible misconception: set-federated-power and fund-with-token are two different methods. set-federated-power grants voting power, while fund-with-token actually provides funds. So set-federated-power will not increase a balance in any way.
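
To make the distinction concrete, a rough sketch of the two calls (subnet and address are placeholders; the fund-with-token call mirrors the one shown earlier in the thread):

# grants voting power only; does not move any funds into the subnet
ipc-cli subnet set-federated-power --subnet $SUBNET --validator-addresses $ADDR --validator-pubkeys $PUBKEY --validator-power 2

# actually moves funds into the child subnet (token-based supply source)
ipc-cli cross-msg fund-with-token --subnet $SUBNET --from $ADDR --approve 1000000000000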

For the relayer, can you check if your gateway configuration in the child subnet is correct?

I remember I had a call with Ben and everything was working, so the only change since then is set-federated-power and now everything is broken?

gvelez17 commented 2 months ago

Noticed that an IP address from a previous testnet subnet kept showing up in the cometbft addrbook.json, apparently being gossiped between nodes (even if simultaneously removed and restarted).

Added the "bad" peer id to the private_peer_ids list in each node's CometBFT config: private_peer_ids = "bd5f0ca740089aeab7f16f00c3e8a7c69e99b41e"
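
For reference, that setting lives under the [p2p] section of CometBFT's config.toml; a minimal sketch (other keys omitted):

[p2p]
private_peer_ids = "bd5f0ca740089aeab7f16f00c3e8a7c69e99b41e"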

Cleaned and restarted again; this time it did not show up and proposals started again!

(screenshot: metrics recovering after the clean restart)

gvelez17 commented 2 months ago

> Btw @gvelez17, I also noticed a possible misconception: set-federated-power and fund-with-token are two different methods. set-federated-power grants voting power, while fund-with-token actually provides funds. So set-federated-power will not increase a balance in any way.
>
> For the relayer, can you check if your gateway configuration in the child subnet is correct?
>
> I remember I had a call with Ben and everything was working, so the only change since then is set-federated-power and now everything is broken?

Hey Will, I think since your call we had added a validator to that subnet (testnet-v2) with set-federated-power, using equal powers, and that subnet stopped functioning. On Friday 9/13/24 we started a fresh subnet, testnet-v3, which this ticket is about. Thanks for the clarification re funding!

gvelez17 commented 2 months ago

Still having some issues with checkpoints not being signed by all validators, but the original issue is resolved, so closing. Thanks for the help!