Open LesnyRumcajs opened 1 year ago
Most likely the send command is not working as we expected. The test setup should be good, only incomplete regarding message behaviour.
From Lotus doc:
If messages are not deemed attractive enough by storage providers to be included in new blocks, they may become stuck in the message pool. This is usually a consequence of the GasFeeCap being too low, for example, when the Network’s BaseFee is high. It can also be a consequence of the GasPremium being too low if the network is congested.
This might be our issue. As a start, we should print the message ID and check whether it is stuck in the message pool. https://calibration.filscan.io/tipset/pool-message-list
In that case, resending the message with an increased gas premium could be a solution.
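A minimal sketch of the "is it stuck in the pool" check, assuming we already have the list of pending signed messages as JSON shaped like the payloads quoted in this thread. The `find_pending` helper and the example CIDs are hypothetical; note that a signed message carries two CIDs (one for the inner message, one for the signed envelope), so the lookup checks both.

```python
from typing import Dict, List, Optional


def find_pending(pending: List[Dict], cid: str) -> Optional[Dict]:
    """Return the pending signed message whose inner or outer CID equals `cid`."""
    for signed in pending:
        outer = signed.get("CID", {}).get("/")
        inner = signed.get("Message", {}).get("CID", {}).get("/")
        if cid in (outer, inner):
            return signed
    return None


# Hypothetical pool content; the CIDs are placeholders, not real messages.
pending = [
    {
        "Message": {
            "Nonce": 53,
            "GasPremium": "99320",
            "CID": {"/": "bafy-inner-example"},
        },
        "CID": {"/": "bafy-outer-example"},
    }
]

stuck = find_pending(pending, "bafy-outer-example")
print("stuck in pool" if stuck else "not in pool")
```

If the message is found, its Nonce, GasFeeCap and GasPremium can be compared against messages that were mined around the same time.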
I've been trying to reproduce the issue and it was pretty easy. I found that between two subsequent messages, where the first one was mined and the second is stuck in the Forest message pool, the only significant differences were a lower GasFeeCap and GasPremium:
First:
"Nonce": 2,
"Value": "300",
"GasLimit": 1330063,
"GasFeeCap": "100864",
"GasPremium": "99810",
"CID": {
"/": "bafy2bzacecttsvw2yhxvuhgtqe2o5n7yawreawlh7g5tp5yba2fyhzc6yf6ee"
}
Second:
"Nonce": 1,
"Value": "550",
"GasLimit": 1330063,
"GasFeeCap": "101191",
"GasPremium": "100137",
"CID": {
"/": "bafy2bzacedcyrn74duk6sbxmaqdwk7zkqlufv2e2tjat5xnxqikvnbxp6jk6c"
}
That can't really explain why one was mined and the other was not. But if we look at the first one and another message (coming from Lotus), there's a big difference between gas values:
A Forest message:
CID: bafy2bzacecttsvw2yhxvuhgtqe2o5n7yawreawlh7g5tp5yba2fyhzc6yf6ee
Nonce: 2
Gas Fee Cap: 0.000100864 nanoFIL
Gas Premium: 99810 attoFIL
Gas Limit: 1,330,063
Gas Used: 1,219,763
BaseFee: 100 attoFIL
Gas Fee: 132.87556433 nanoFIL
A Lotus one:
CID: bafy2bzacebhh2y6jgimhxaiqweyk7fujueqmmvf2qscclse5gsiwzajjtrylq
Nonce: 11797
Gas Fee Cap: 0.000100504 nanoFIL
Gas Premium: 99450 attoFIL
Gas Limit: 70,178,532
Gas Used: 56,207,826
BaseFee: 100 attoFIL
Gas Fee: 6985.083331 nanoFIL
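To make the "Gas Fee" figures above easier to compare, here is a sketch of how the total can be derived from the other fields. It assumes the usual Filecoin fee split (base fee burn, overestimation burn, miner tip) and an 11/10 overestimation allowance; the function names are hypothetical and this is not taken from Forest's code. With these assumptions, both totals shown by the explorer are reproduced.

```python
def overestimation_burn(gas_used: int, gas_limit: int) -> int:
    """Gas units burned as a penalty for setting GasLimit far above GasUsed."""
    if gas_used == 0:
        return gas_limit
    # Assumption: 10% headroom over the gas actually used is free of charge.
    over = gas_limit - (11 * gas_used) // 10
    if over < 0:
        return 0
    over = min(over, gas_used)
    return (gas_limit - gas_used) * over // gas_used


def total_gas_fee(base_fee: int, fee_cap: int, premium: int,
                  gas_limit: int, gas_used: int) -> int:
    """Total fee in attoFIL: base fee burn + overestimation burn + miner tip."""
    base_fee_to_pay = min(base_fee, fee_cap)
    # The tip per gas unit cannot push the price above GasFeeCap.
    tip_per_gas = min(premium, fee_cap - base_fee_to_pay)
    base_burn = base_fee_to_pay * gas_used
    over_burn = base_fee_to_pay * overestimation_burn(gas_used, gas_limit)
    miner_tip = tip_per_gas * gas_limit
    return base_burn + over_burn + miner_tip


forest_fee = total_gas_fee(100, 100864, 99810, 1_330_063, 1_219_763)
lotus_fee = total_gas_fee(100, 100504, 99450, 70_178_532, 56_207_826)
print(forest_fee, "attoFIL")
print(lotus_fee, "attoFIL")
```

The per-gas prices of the two messages are nearly identical; the large gap in total fee comes from the much larger GasLimit on the Lotus message, since the miner tip scales with GasLimit rather than GasUsed.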
I think that a bad message Nonce could cause the transaction to fail.
In Forest node I can see:
WARN forest_message_pool::msg_chain: encountered message from actor t1ekkzekiozleakm4jauekqqc6uhn3dcwzcrrqrry with nonce 53 less than the current nonce 54
Let's sum up what is happening here:
The sending actor has its Nonce with a value of 53.
We're sending 500 afil to t1mljyv5pzvu7jgzuvzfog2ycsd5fdh4y7c2z6t6q. Forest creates the message using the sending actor's Nonce (53).
The transaction is mined successfully in block bafy2bzaceacr42ctk3z7i7vuqkdyfzydscvcv47dgzzdiunbh3qmcc6dp65su.
The sending actor's Nonce is somehow updated on the blockchain to 54 (todo: find who does that).
We're sending a new transaction (500 afil to t1i63uuuqrgqvntdekae4wmomz5abgnwltm3buyia). The Nonce used here is still 53 (this could be our issue):
{
"Message": {
"Version": 0,
"To": "t1i63uuuqrgqvntdekae4wmomz5abgnwltm3buyia",
"From": "t1ekkzekiozleakm4jauekqqc6uhn3dcwzcrrqrry",
"Nonce": 53,
"Value": "500",
"GasLimit": 5866637,
"GasFeeCap": "100374",
"GasPremium": "99320",
"Method": 0,
"Params": "",
"CID": {
"/": "bafy2bzacedw3qgd4dy3qyizg3ian44i5yefkjtzknetdksqqwi7khc2cuvpbq"
}
},
"Signature": {
"Type": 1,
"Data": "1drZdRSXUCMEv0niS/3XSCd8MhE3UZkFei4EYpjh+ismkBpVDGqNwAyXRxccVy8Uqhoj2mFkmOMI6GliekqC0wA="
},
"CID": {
"/": "bafy2bzacecmeugxrjtaaxkp5nci7oqblt2bccsqcrc57dqx3do6kfvymhgqa6"
}
}
The transaction is rejected because a Nonce of 54 is expected.
It's not even reaching the Lotus node running the explorer (probably because it is detected as invalid by network nodes during p2p message broadcast).
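The sequence above can be sketched with a toy model. This is not Forest's message pool; `ToyChain` and its methods are hypothetical names, and the only rule modelled is the one from the warning in the daemon log: a message whose Nonce is lower than the actor's current Nonce is rejected.

```python
class ToyChain:
    """Tracks the next expected nonce per actor, as the chain state would."""

    def __init__(self):
        self.next_nonce = {}

    def submit(self, sender: str, nonce: int) -> str:
        expected = self.next_nonce.get(sender, 0)
        if nonce < expected:
            return f"rejected: nonce {nonce} less than the current nonce {expected}"
        if nonce > expected:
            return f"pending: waiting for nonce {expected}"
        # Mining the message bumps the actor's nonce in the state.
        self.next_nonce[sender] = expected + 1
        return "mined"


chain = ToyChain()
sender = "t1ekkzekiozleakm4jauekqqc6uhn3dcwzcrrqrry"
chain.next_nonce[sender] = 53

cached_nonce = 53                             # what the sending side believes
first = chain.submit(sender, cached_nonce)    # mined, state nonce becomes 54
second = chain.submit(sender, cached_nonce)   # stale nonce is reused
print(first)
print(second)
```

In this model the second send can only succeed if the sending side refreshes its Nonce after the first message is mined, which is the step that appears to be skipped here.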
Generally speaking, even if we didn't have this issue here, there could be a race between two CI jobs since we use the same sending actor.
I've found a new error while sending FIL:
Error: {"code":0,"message":"gas_limit cannot be greater than block gas limit"}
Something is going really wrong here. I will add the offending value to the error message.
To sum up:
The gas_limit error was due to some mutation bug and I've fixed it in 5f090e1.
The Nonce issue was fixed in bba7724 (I was using this branch for troubleshooting because it has support for mpool pending, which is very useful).
I still don't get why it is happening. I would like to reproduce it again on the CI and see the JSON message (we need to merge the PR with mpool pending first).
It may also be worth looking at how Lotus handles this message when it reaches its mempool.
@LesnyRumcajs did you try to run the script locally? Like sending to an address 40 times.
I can do it today. Should I use the latest main?
Yes, just run forest-cli send a couple of times in a loop and see if you can reproduce. Did not try on my Fedora, only macOS.
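A small harness for such a loop could look like the sketch below. It assumes the command takes the form `forest-cli send <target> <amount>` and that a failed send prints a line starting with "Error"; both are assumptions to check against the installed version. The runner is injectable so the loop logic can be exercised without a running node.

```python
import subprocess
from typing import Callable, List, Tuple


def run_send(target: str, amount: str) -> Tuple[int, str]:
    """Invoke the CLI once. The exact argument layout is an assumption."""
    proc = subprocess.run(
        ["forest-cli", "send", target, amount],
        capture_output=True,
        text=True,
    )
    return proc.returncode, (proc.stdout + proc.stderr).strip()


def send_loop(target: str, amount: str, runs: int,
              runner: Callable[[str, str], Tuple[int, str]] = run_send
              ) -> List[Tuple[int, str]]:
    """Send `runs` times and return (iteration, output) for every failure."""
    failures = []
    for i in range(1, runs + 1):
        code, output = runner(target, amount)
        if code != 0 or output.startswith("Error"):
            failures.append((i, output))
    return failures
```

Calling `send_loop(address, "500", 40)` against a synced node would then return only the iterations worth inspecting. A successful return code only means the message was accepted into the pool, so the printed CIDs still need to be checked for inclusion.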
@elmattic Can't reproduce it locally. All of the messages arrived. I sent:
I'll try doing it in parallel with different nodes.
We already discussed this in Slack, so just for the record: we managed to reproduce it by sending FIL from two separate locations (Germany and Poland).
Sending 500atto from t1es57egc3chszrnlprkkfl2x6g36zlxvrdcdz76i to t1gbwtv52gqukzhe5rmhu4n5jclpf3lc6bdn3m5eq for the 8417 time
Error: {"code":0,"message":"gas price is lower than min gas price"}
Sending 500atto from t1es57egc3chszrnlprkkfl2x6g36zlxvrdcdz76i to t1gbwtv52gqukzhe5rmhu4n5jclpf3lc6bdn3m5eq for the 8418 time
bafy2bzaceclc3dxyhmyfmnr76gvswlu6fwcaq5xgr2kwrtttsrc53oadiice4
Sending 500atto from t1es57egc3chszrnlprkkfl2x6g36zlxvrdcdz76i to t1gbwtv52gqukzhe5rmhu4n5jclpf3lc6bdn3m5eq for the 8419 time
This is a great finding. Unfortunately, it's not the same error as the one in the CI, where the message was successfully added to the pool and its CID was printed.
@LesnyRumcajs I had another try reproducing this issue. The issue is still there but not easy to reproduce.
Using only one node, I managed to reproduce the issue twice in around 1500 runs.
Here is how I'm doing it. First, create a new sending account and fund it using a calibnet faucet.
Then:
Looking at the message payloads in send_log.txt, the send is failing because the message Nonce wasn't correctly incremented: it should be 1451 but it is still 1450.
Looking at the forest daemon logs around 03:09:34, we can see a fork happening, so that is likely related to our issue.
Thanks. Do we know if Lotus has the same issue? If not, how does Lotus handle it?
Not 100% sure but they do have some code that can react to HEAD changes and republish previously mined messages.
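As an illustration of what "reacting to HEAD changes" could mean, here is a toy sketch, not Lotus or Forest code: when tipsets are reverted by a fork, their messages go back into the pending set unless the newly applied tipsets also contain them. The `head_change` function and the data shapes are hypothetical.

```python
from typing import Dict, List, Tuple

Tipset = List[Tuple[str, dict]]  # list of (message CID, message)


def head_change(pending: Dict[str, dict],
                reverted: List[Tipset],
                applied: List[Tipset]) -> Dict[str, dict]:
    """Return the new pending set after a chain reorganisation."""
    applied_cids = {cid for tipset in applied for cid, _ in tipset}
    result = dict(pending)
    # Messages that were mined on the abandoned fork become pending again.
    for tipset in reverted:
        for cid, msg in tipset:
            if cid not in applied_cids:
                result[cid] = msg
    # Messages included on the new chain are no longer pending.
    for cid in applied_cids:
        result.pop(cid, None)
    return result


msg_a = {"Nonce": 1450}
msg_b = {"Nonce": 1451}
new_pending = head_change(
    pending={"cid-b": msg_b},
    reverted=[[("cid-a", msg_a)]],   # cid-a was mined on the losing fork
    applied=[[("cid-b", msg_b)]],    # cid-b was mined on the winning chain
)
print(new_pending)
```

In this toy run the message with Nonce 1450 becomes pending again after the fork, so a Nonce derived from the pre-fork state would no longer line up with what the new chain expects.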
@elmattic Would you like to contact Lotus devs to validate that this is the solution to the same problem?
@elmattic do you recall what's the status of this? Is sending FIL still failing every now and then?
@LesnyRumcajs I will check our CI. But, AFAIK, the pesky bug is still present.
https://github.com/ChainSafe/forest/actions/runs/9204884823/job/25319573657
We can now update the status to "still broken".
Describe the bug
The wallet test failed, and the FIL was not sent. The target wallet didn't receive anything and the transaction is not visible in the source wallet. https://calibration.filscan.io/address/general?address=t1ac6ndwj6nghqbmtbovvnwcqo577p6ox2pt52q2y (it should've happened around 2023-03-23 17:47:17, target t1fye76pfymru4y2fqbgdxswpwqt3ukeems2nzfna)
This means that either we have a bug in the test setup or the send command doesn't always send the funds, which would be a significant issue.
To Reproduce
It happened in this PR https://github.com/ChainSafe/forest/pull/2709 (which didn't introduce any logic changes)
The job failed: https://github.com/ChainSafe/forest/actions/runs/4503553182/jobs/7926889785
Log output
Expected behaviour
The test passes.
Screenshots
Environment (please complete the following information):
rustc --version
Other information and links