Closed: gvelez17 closed this issue 2 months ago
@cryptoAtwill @raulk I put the issue here so others could see it; we have limited seats in the hoku-deploy repo.
Fendermint logs
Key observation: ProposalProcessed events completely halted.
root@c2-xm-f37e35:~# grep 'ProposalProcessed' /tmp/f-all.err | grep '2024-09-14' | wc
103219 1032190 41490853
root@c2-xm-f37e35:~# grep 'ProposalProcessed' /tmp/f-all.err | grep '2024-09-17' | wc
0 0 0
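(For reference, a quick way to get the same comparison for every day in one pass, assuming the timestamp is the first field of each log line as in the excerpts below:)
grep 'ProposalProcessed' /tmp/f-all.err | cut -c1-10 | sort | uniq -c
# prints a count of ProposalProcessed events per calendar day, so the day block
# production halted stands out immediately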
On 9/14, when things were working well, the logs showed:
2024-09-14T00:00:04.172580Z INFO fendermint/app/src/app.rs:698: event=ProposalProcessed is_accepted=true block_height=1802 block_hash="B7A907496E783088CDACEEF3EB631E393498B5A4787ABAF8C623120BE164BE18" num_txs=0 proposer="84D0DF30D7D27B0D4B1414D00A20C52B6DDEBB5D"
2024-09-14T00:00:04.426911Z INFO fendermint/app/src/app.rs:891: event=NewBlock block_height=1802
2024-09-14T00:00:04.741345Z INFO query{request=Query { data: b"\xa1jActorStateV\x04\n\x91\xc9/\xee\xc2\x85\xddV\xd5\x06\xf9\xebM\xdc.\xb7\xcdnq\xdd", path: "", height: block::Height(9223372036854775807), prove: false }}: fendermint/vm/interpreter/src/fvm/query.rs:66: query actor state height
On 9/17, with no new blocks being produced, the logs looked like:
2024-09-17T00:01:13.899797Z INFO fetch_data{height=1973485 block_hash=[197, 170, 93, 78, 100, 151, 154, 43, 181, 146, 8, 171, 67, 100, 185, 91, 77, 154, 81, 154, 222, 217, 157, 215, 178, 49, 52, 111, 230, 22, 170, 161]}:get_validator_changes{height=1973485}: ipc/provider/src/manager/evm/manager.rs:202: querying validator changes in evm subnet contract: 0xdc4e…a1ef
2024-09-17T00:01:14.355625Z INFO fendermint/vm/topdown/src/sync/syncer.rs:259: event=NewParentView is_null=false block_height=1973485 block_hash="c5aa5d4e64979a2bb59208ab4364b95b4d9a519aded99dd7b231346fe616aaa1" num_msgs=0 num_validator_changes=0
2024-09-17T00:01:33.879929Z INFO fetch_data{height=1973486 block_hash=[122, 112, 230, 197, 22, 29, 82, 162, 30, 195, 154, 160, 60, 250, 37, 147, 247, 176, 137, 22, 29, 253, 30, 13, 101, 254, 199, 24, 227, 56, 178, 175]}:get_validator_changes{height=1973486}: ipc/provider/src/manager/evm/manager.rs:202: querying validator changes in evm subnet contract: 0xdc4e…a1ef
At approximately 17:15 on 9/16 (local time; the UTC log timestamps below fall in early 9/17), around when things broke, the logs showed:
2024-09-17T00:00:34.362664Z INFO fendermint/vm/topdown/src/sync/syncer.rs:259: event=NewParentView is_null=false block_height=1973484 block_hash="b0f6a09ff7c5c6de2bef6b14fbecfaa02cb3f1afec92259b3bc776d28b78e14f" num_msgs=0 num_validator_changes=0
2024-09-17T00:01:03.181328Z ERROR fendermint/vm/topdown/src/sync/mod.rs:191: sync with parent encountered error error="failed to get Tendermint status"
2024-09-17T00:01:13.899797Z INFO fetch_data{height=1973485 block_hash=[197, 170, 93, 78, 100, 151, 154, 43, 181, 146, 8, 171, 67, 100, 185, 91, 77, 154, 81, 154, 222, 217, 157, 215, 178, 49, 52, 111, 230, 22, 170, 161]}:get_validator_changes{height=1973485}: ipc/provider/src/manager/evm/manager.rs:202: querying validator changes in evm subnet contract: 0xdc4e…a1ef
2024-09-17T00:01:14.355625Z INFO fendermint/vm/topdown/src/sync/syncer.rs:259: event=NewParentView is_null=false block_height=1973485 block_hash="c5aa5d4e64979a2bb59208ab4364b95b4d9a519aded99dd7b231346fe616aaa1" num_msgs=0 num_validator_changes=0
2024-09-17T00:01:33.879929Z INFO fetch_data{height=1973486 block_hash=[122, 112, 230, 197, 22, 29, 82, 162, 30, 195, 154, 160, 60, 250, 37, 147, 247, 176, 137, 22, 29, 253, 30, 13, 101, 254, 199, 24, 227, 56, 178, 175]}:get_validator_changes{height=1973486}: ipc/provider/src/manager/evm/manager.rs:202: querying validator changes in evm subnet contract: 0xdc4e…a1ef
2024-09-17T00:01:34.333529Z INFO fendermint/vm/topdown/src/sync/syncer.rs:259: event=NewParentView is_null=false block_height=1973486 block_hash="7a70e6c5161d52a21ec39aa03cfa2593f7b089161dfd1e0d65fec718e338b2af" num_msgs=0 num_validator_changes=0
2024-09-17T00:01:53.180739Z ERROR fendermint/vm/topdown/src/sync/mod.rs:191: sync with parent encountered error error="failed to get Tendermint status"
The 'failed to get Tendermint status' error happened occasionally while things were working, but much more frequently now:
root@c2-xm-f37e35:~# grep 'failed to get Tendermint status' /tmp/f-all.err | grep '2024-09-14' | wc
16 208 3344
root@c2-xm-f37e35:~# grep 'failed to get Tendermint status' /tmp/f-all.err | grep '2024-09-17' | wc
723 9399 151107
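As a quick diagnostic, this checks whether the local CometBFT RPC (presumably what the topdown syncer is hitting when it reports 'failed to get Tendermint status') is responsive at all, assuming the default RPC port 26657 and jq being available:
curl -s http://localhost:26657/status | jq '.result.sync_info.latest_block_height, .result.sync_info.catching_up'
# if this hangs or errors out, it is consistent with the "failed to get Tendermint status" errors above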
The cometbft addrbook shows a last_success time just prior to the error condition:
63505f0688d0:/cometbft# cat /cometbft/config/addrbook.json
{
  "key": "e9da010b4783a6485bf4007c",
  "addrs": [
    {
      "addr": {
        "id": "53168b782187ff431d86849251ec28f274d9c703",
        "ip": "170.205.62.124",
        "port": 26656
      },
      "src": {
        "id": "a918bb7532318af7ec00eef2ec3fa1341d8f78e3",
        "ip": "0.0.0.0",
        "port": 26656
      },
      "buckets": [
        25
      ],
      "attempts": 1,
      "bucket_type": 2,
      "last_attempt": "2024-09-16T21:42:33.928681864Z",
      "last_success": "2024-09-16T17:03:30.733503089Z",
      "last_ban_time": "0001-01-01T00:00:00Z"
    }
  ]
}
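A compact way to list every known peer with its last_success time (assuming jq is available in the cometbft container):
jq -r '.addrs[] | "\(.addr.id) \(.addr.ip):\(.addr.port) last_success=\(.last_success)"' /cometbft/config/addrbook.json
# one line per peer, which makes it easy to spot peers that have not been reachable since the incident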
NOTE: I had erroneously NOT started the relayer until 9/16, so the successful period was without a relayer running.
The relayer logs are as follows:
root@c2-xm-f37e35:~# docker logs 5dbd5ae739e4
2024-09-16T21:25:58.920099Z INFO ipc_provider::checkpoint: launching bottom-up relayer, parent: /r314159, child: /r314159/t410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq for t410fshes73wcqxovnvig7hvu3xbow7gw44o5e6a2lsa
2024-09-16T21:25:59.037570Z INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:31:34.139495Z INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:37:08.978150Z INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:42:11.320313Z ERROR ipc_provider::checkpoint: cannot submit checkpoint for submitter: t410fshes73wcqxovnvig7hvu3xbow7gw44o5e6a2lsa due to (code: 0, message: Tendermint RPC error: response error
Caused by:
Internal error: leveldb: closed (code: -32603)
Location:
/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10:9, data: None)
2024-09-16T21:42:26.647738Z INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:42:28.348755Z ERROR ipc_provider::checkpoint: cannot submit checkpoint for submitter: t410fshes73wcqxovnvig7hvu3xbow7gw44o5e6a2lsa due to (code: 0, message: Tendermint RPC error: HTTP error
Caused by:
connection error: Connection reset by peer (os error 104)
Location:
/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10:9, data: None)
2024-09-16T21:42:43.462467Z INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:42:43.464295Z ERROR ipc_provider::checkpoint: cannot submit checkpoint for submitter: t410fshes73wcqxovnvig7hvu3xbow7gw44o5e6a2lsa due to (code: 0, message: Tendermint RPC error: HTTP error
Caused by:
error trying to connect: tcp connect error: Connection refused (os error 111)
Location:
/root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10:9, data: None)
2024-09-16T21:42:58.580812Z INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:48:36.238959Z INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:54:10.901048Z INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T21:59:47.961852Z INFO ipc_provider::checkpoint: last submission height: 0
2024-09-16T22:05:22.559853Z INFO ipc_provider::checkpoint: last submission height: 0
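Given the 'Connection refused' / 'Connection reset by peer' errors above, a minimal sanity check is whether the CometBFT RPC the relayer submits through is reachable at all. The host below is a placeholder, and curl may need to be run from the host machine if it is not installed in the relayer container:
curl -s http://<cometbft-rpc-host>:26657/health
# CometBFT answers with a JSON-RPC response carrying an empty result when healthy;
# a refused connection here matches the relayer errors above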
One theory is that the problem is the wallet balances; there is also an error we do not understand about 'Sender invalid'.
testnet-ansible-v3@c2-xm-f37e35:/home$ ipc-cli subnet send-value --subnet /r314159/t410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq --from 0x91c92feec285dd56d506f9eb4ddc2eb7cd6e71dd --to 0x1ae0a65e99d41c74e85ae58e8eec65fe9f569fd0 .0002
2024-09-17T18:57:18.467241Z ERROR ipc_cli: main process failed: error processing command Some(Subnet(SubnetCommandsArgs { command: SendValue(SendValueArgs { from: Some("0x91c92feec285dd56d506f9eb4ddc2eb7cd6e71dd"), to: "0x1ae0a65e99d41c74e85ae58e8eec65fe9f569fd0", subnet: "/r314159/t410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq", amount: 0.0002 }) })): (code: 1, message: failed to estimate gas: pre-validation failed: Sender invalid
, data: Some(String("")))
testnet-ansible-v3@c2-xm-f37e35:/home$ ipc-cli wallet balances --subnet /r314159/t410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq --wallet-type evm
0x91c92feec285dd56d506f9eb4ddc2eb7cd6e71dd - Balance: 0.0
0x8ca38bf111456f4733e9ff55ef21f6e693f83baf - Balance: 0.0
0x731d8dcfd82a90b63e9fa0c49f0c43f9c3b5a4cf - Balance: 0.0
0x93b269addeefd6276d115a0816c4d841255788fb - Balance: 0.0
testnet-ansible-v3@c2-xm-f37e35:/home$ ipc-cli subnet list --parent /r314159
/r314159/t410fwwfc3my2jrejevghickfx66tjhubanqhvauwizi - collateral: 0.0 FIL, circ.supply: 0.0 FIL, genesis: 1974726
/r314159/t410fhcefljyminf6mgmcj7tqzef2fhppy7wg23r22ea - collateral: 0.0 FIL, circ.supply: 20110.0100001000000002 FIL, genesis: 1958160
/r314159/t410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq - collateral: 0.0 FIL, circ.supply: 1000.000002 FIL, genesis: 1964397
/r314159/t410fdt67sqg7fvju6z3y7jl3awvrvu445jihustvmwi - collateral: 0.0 FIL, circ.supply: 0.0030000000000001 FIL, genesis: 1952931
/r314159/t410fmnm74ynqjpkdmfxtakd4b7or2jbp4s2jsixretq - collateral: 0.0 FIL, circ.supply: 0.0000000100000001 FIL, genesis: 1957433
To attempt to get funds into the subnet, we performed both
ipc-cli cross-msg fund-with-token --subnet /r314159/t410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq --from 0x91c92feec285dd56d506f9eb4ddc2eb7cd6e71dd --approve 1000000000000
and
cast send $SOURCE_SUPPLY_ADDRESS \
"mint(address,uint256)" \
0x7287922457de4095e7b7e8cdcddf6be48c3f1f4a \
1000000000000000 \
--private-key $VALIDATOR_0_PRIVATE_KEY \
--rpc-url $PARENT_CHAIN_RPC
on 9/16/2024.
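As a sanity check that the mint on the parent chain actually landed, a sketch using the same variables as the cast send above (the address is the mint recipient used there):
cast call $SOURCE_SUPPLY_ADDRESS \
  "balanceOf(address)(uint256)" \
  0x7287922457de4095e7b7e8cdcddf6be48c3f1f4a \
  --rpc-url $PARENT_CHAIN_RPC
# a nonzero balance confirms the mint succeeded on the parent chain; it does not by
# itself mean the funds were bridged into the child subnet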
On 9/13/2024, when the validator nodes appeared to be working, the only funding had been through the set-federated-power command.
@sanderpick @raulk if you think I should start a new subnet, let me know. My concern is that this one was already given to Matrixed.link, and we really don't know why it stopped processing proposals, so I'm not sure starting a new one is the answer.
At 21:18 UTC on 9/17/2024 I simultaneously restarted all containers on all 3 validators (the 2 original and the new one), but it made no difference: same types of log messages, no new blocks.
From @3benbox (Ben Wilson): these broadcast errors may be a root cause, i.e. the checkpoints are not being signed.
docker logs testnet-belgrade-v1-fendermint 2>&1 | grep -i checkpoint
...
2024-09-17T14:40:07.979239Z INFO fendermint/vm/interpreter/src/fvm/checkpoint.rs:124: event=NewBottomUpCheckpoint block_height=600 block_hash="48da220db674f65ad544b1b69feb2539252607ed64ede28dfeebb78f51391512" num_msgs=0 next_configuration_number=2
2024-09-17T14:40:09.988490Z ERROR fendermint/vm/interpreter/src/fvm/exec.rs:224: error broadcasting checkpoint signature error=failed to broadcast checkpoint signature
resulting in empty signatory lists
[[4,10,181,138,45,179,26,76,72,146,84,199,64,148,91,251,211,73,232,16,54,7]]],"block_height":2400,"block_hash":"976d0261db46657080319252ee58c8d1dc62e3671fcd907f339ab3ce39599487","next_configuration_number":0,"msgs":[]},"signatures":[],"signatories":[]}
Those errors will happen if the validators don't have funds in the subnet. This will only affect bottom-up relayer messages, not block production.
From an initial look... this seems to be a misconfiguration with cometbft. I would recommend starting fresh with two working validators and then trying to add a third again.
I can try this on some separate infra just to sanity check.
Btw, @gvelez17, I also noticed a misconception (maybe): set-federated-power and fund-with-token are two different methods. set-federated-power gives voting power, while fund-with-token actually provides funds, so set-federated-power will not increase the balance in any way.
For the relayer, can you check if your gateway configuration in the child subnet is correct?
I remember I had a call with Ben and everything was working, so is the only change set-federated-power, and now everything is broken?
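For the gateway question, a rough way to inspect the child-subnet entry, assuming the usual ~/.ipc/config.toml layout used by ipc-cli (key names below are from memory and may differ slightly between versions):
grep -A6 't410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq' ~/.ipc/config.toml
# expect a [[subnets]] block for the child whose provider_http points at the child
# node's ETH API and whose gateway_addr / registry_addr match the contracts actually
# deployed in the child subnet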
Noticed that an IP address from a previous testnet subnet kept showing up in the cometbft addrbook.json, apparently being gossiped between nodes (even after removing it on all nodes simultaneously and restarting).
Added the "bad" peer ID to the private_peer_ids list on each node:
private_peer_ids = "bd5f0ca740089aeab7f16f00c3e8a7c69e99b41e"
Cleaned and restarted again; this time it did not show up and proposals started again!
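To double-check the fix took effect, a sketch (the cometbft container name is assumed here, mirroring the fendermint container name used above):
docker exec testnet-belgrade-v1-cometbft grep 'private_peer_ids' /cometbft/config/config.toml
# should show the bad peer id in the [p2p] section, e.g.
# private_peer_ids = "bd5f0ca740089aeab7f16f00c3e8a7c69e99b41e"
docker exec testnet-belgrade-v1-cometbft grep -c 'bd5f0ca740089aeab7f16f00c3e8a7c69e99b41e' /cometbft/config/addrbook.json
# 0 (or no match) means the peer is no longer being written back into the addrbook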
Hey Will, I think since your call we had added a validator to that subnet (testnet-v2) with set-federated-power, using equal powers, and that subnet stopped functioning. On Friday 9/13/24 we started a fresh subnet, testnet-v3, which this ticket is about. Thanks for the clarification re funding!
Still having some issues with checkpoints not being signed by all validators, but the original issue is resolved, so closing. Thanks for the help!
Issue type
Support
Have you reproduced the bug with the latest dev version?
Yes
Version
0.1.0
Custom code
Yes
OS platform and distribution
Ubuntu 22.04
Describe the issue
Created a federated subnet /r314159/t410f3rhmoqizadf6kws5sup6y3g77swnrippfhvdxbq on 9/13 with 2 validator nodes, power 10000 each. The abci_committed_block_height metric grew normally. On 9/16 at approx 16:40 UTC we added a 3rd validator with power 2; the glif nodes may also have gone down around that time. From 9/16 17:15 UTC the abci_committed_block_height metric flatlined. (Times in the image are in PST.)
Restarted the fendermint and cometbft containers several hours later.
In the logs, the last 'received complete proposal block' message in the cometbft logs appears at 2024-09-16|22:50:39.
Currently the nodes are not seeing any 'proposal block' messages in the cometbft logs, and the abci_committed_block_height remains flat. We need the nodes to start working again, but it is not clear how.
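For reference, the kind of check used to find that last message (the cometbft container name is assumed, mirroring the fendermint container name above):
docker logs testnet-belgrade-v1-cometbft 2>&1 | grep 'received complete proposal block' | tail -n 1
# if this line stops advancing while peers are still connected, consensus has stalled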
Repro steps
This has happened to us twice when adding a 3rd node in federated mode. The first time was on Friday, when they each had federated power of 2; we have since created a new subnet and it has happened again.
Relevant log output