ChainSafe / forest

🌲 Rust Filecoin Node Implementation
https://forest.chainsafe.io
Apache License 2.0
625 stars 149 forks source link

Send FIL failed #2726

Open LesnyRumcajs opened 1 year ago

LesnyRumcajs commented 1 year ago

Describe the bug

The wallet test failed, and the FIL was not sent. The target wallet didn't receive anything and the transaction is not visible in the source wallet. https://calibration.filscan.io/address/general?address=t1ac6ndwj6nghqbmtbovvnwcqo577p6ox2pt52q2y (it should've happened around 2023-03-23 17:47:17, target t1fye76pfymru4y2fqbgdxswpwqt3ukeems2nzfna)

This means that either we have a bug in the test setup or the send command doesn't always send the funds, which would be a significant issue.

To Reproduce

It happened in this PR https://github.com/ChainSafe/forest/pull/2709 (which didn't introduce any logic changes)

The job failed: https://github.com/ChainSafe/forest/actions/runs/4503553182/jobs/7926889785

Log output

Wallet tests
Importing preloaded wallet key
t1ac6ndwj6nghqbmtbovvnwcqo577p6ox2pt52q2y
Exporting key
Fetching metrics
--2023-03-23 17:47:17--  http://localhost:6116/metrics
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:6116... failed: Connection refused.
Connecting to localhost (localhost)|127.0.0.1|:6116... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28751 (28K) [text/plain]
Saving to: ‘metrics.log’

     0K .......... .......... ........                        100%  404M=0s

2023-03-23 17:47:17 (404 MB/s) - ‘metrics.log’ saved [28751/28751]

Listing wallet balances
Address                                   Default Balance
t1ac6ndwj6nghqbmtbovvnwcqo577p6ox2pt52q2y           ~100 FIL
Creating a new address to send FIL to
t1fye76pfymru4y2fqbgdxswpwqt3ukeems2nzfna
Listing wallet balances
Address                                   Default Balance
t1ac6ndwj6nghqbmtbovvnwcqo577p6ox2pt52q2y  X        ~100 FIL
t1fye76pfymru4y2fqbgdxswpwqt3ukeems2nzfna           0 atto FIL
Sending FIL to the above address
Checking balance of t1fye76pfymru4y2fqbgdxswpwqt3ukeems2nzfna...
Checking balance 1/10
Checking balance 2/10
Checking balance 3/10
Checking balance 4/10
Checking balance 5/10
Checking balance 6/10
Checking balance 7/10
Checking balance 8/10
Checking balance 9/10
Checking balance 10/10
Address                                   Default Balance
t1ac6ndwj6nghqbmtbovvnwcqo577p6ox2pt52q2y  X        ~100 FIL
t1fye76pfymru4y2fqbgdxswpwqt3ukeems2nzfna           0 atto FIL
FIL amount should match
Shutting down Forest node

Expected behaviour

The test passes.

Screenshots

Environment (please complete the following information):

Other information and links

elmattic commented 1 year ago

Most likely the send command is not working as we expected. The test setup should be good, only incomplete regarding message behaviour.

From Lotus doc:

If messages are not deemed attractive enough by storage providers to be included in new blocks, they may become stuck in the message pool. This is usually a consequence of the GasFeeCap being too low, for example, when the Network’s BaseFee is high. It can also be a consequence of the GasPremium being too low if the network is congested.

This might be our issue. As a start we should print message ID and check if it is not stuck in the message pool. https://calibration.filscan.io/tipset/pool-message-list

In that case, maybe trying resending a new message with increased gas premium could be a solution.

elmattic commented 1 year ago

I've been trying to reproduce the issue and it was pretty easy. Found out that between two subsequent messages where the first one was mined and the second is stuck in Forest message pool, the only significant differences were a lower GasFee cap and GasPremium:

First:

    "Nonce": 2,
    "Value": "300",
    "GasLimit": 1330063,
    "GasFeeCap": "100864",
    "GasPremium": "99810",
    "CID": {
      "/": "bafy2bzacecttsvw2yhxvuhgtqe2o5n7yawreawlh7g5tp5yba2fyhzc6yf6ee"
    }

Second:

    "Nonce": 1,
    "Value": "550",
    "GasLimit": 1330063,
    "GasFeeCap": "101191",
    "GasPremium": "100137",
    "CID": {
      "/": "bafy2bzacedcyrn74duk6sbxmaqdwk7zkqlufv2e2tjat5xnxqikvnbxp6jk6c"
    }

That can't really explain why one was mined and the other was not. But if we looked at the first one and another message (coming from Lotus) there's a big difference between gas values:

A Forest message:

CID: bafy2bzacecttsvw2yhxvuhgtqe2o5n7yawreawlh7g5tp5yba2fyhzc6yf6ee
Nonce: 2
Gas Fee Cap: 0.000100864 nanoFIL
Gas Premium: 99810 attoFIL
Gas Limit: 1,330,063
Gas Used: 1,219,763
BaseFee: 100 attoFIL
Gas Fee: 132.87556433 nanoFIL

A Lotus one:

CID: bafy2bzacebhh2y6jgimhxaiqweyk7fujueqmmvf2qscclse5gsiwzajjtrylq
Nonce: 11797
Gas Fee Cap: 0.000100504 nanoFIL
Gas Premium: 99450 attoFIL
Gas Limit: 70,178,532
Gas Used: 56,207,826
BaseFee: 100 attoFIL
Gas Fee: 6985.083331 nanoFIL
elmattic commented 1 year ago

I think that a bad message Nonce could cause the transaction to fail.

In Forest node I can see: WARN forest_message_pool::msg_chain: encountered message from actor t1ekkzekiozleakm4jauekqqc6uhn3dcwzcrrqrry with nonce 53 less than the current nonce 54

Let's sum up what is happening here: The sending actor has its Nonce with a value of 53.

We're sending to t1mljyv5pzvu7jgzuvzfog2ycsd5fdh4y7c2z6t6q 500 afil. Forest creates the message using the Nonce from sending actor Nonce (53).

Transaction is mined successfully in block bafy2bzaceacr42ctk3z7i7vuqkdyfzydscvcv47dgzzdiunbh3qmcc6dp65su.

Sending actor Nonce is somewhat being updated in the blockchain to 54 (todo: find who does that).

We're sending a new transaction (500 afill to t1i63uuuqrgqvntdekae4wmomz5abgnwltm3buyia). Nonce used here is still 53 (this could be our issue):

{
  "Message": {
    "Version": 0,
    "To": "t1i63uuuqrgqvntdekae4wmomz5abgnwltm3buyia",
    "From": "t1ekkzekiozleakm4jauekqqc6uhn3dcwzcrrqrry",
    "Nonce": 53,
    "Value": "500",
    "GasLimit": 5866637,
    "GasFeeCap": "100374",
    "GasPremium": "99320",
    "Method": 0,
    "Params": "",
    "CID": {
      "/": "bafy2bzacedw3qgd4dy3qyizg3ian44i5yefkjtzknetdksqqwi7khc2cuvpbq"
    }
  },
  "Signature": {
    "Type": 1,
    "Data": "1drZdRSXUCMEv0niS/3XSCd8MhE3UZkFei4EYpjh+ismkBpVDGqNwAyXRxccVy8Uqhoj2mFkmOMI6GliekqC0wA="
  },
  "CID": {
    "/": "bafy2bzacecmeugxrjtaaxkp5nci7oqblt2bccsqcrc57dqx3do6kfvymhgqa6"
  }
}

Transaction is rejected because it would expect a Nonce of 54.

It's not even reaching the Lotus node running the explorer (probably because detected as invalid by network nodes during p2p message broadcast).

Generally speaking, even if we didn't had this issue here, there could be a race between two CI jobs since we use the same sending actor.

elmattic commented 1 year ago

I've found a new error while sending FIL:

Error: {"code":0,"message":"gas_limit cannot be greater than block gas limit"}

Something is going really wrong here. Will add display of value in error message.

elmattic commented 1 year ago

To sum up:

I still don't get why it is happening. I would like to reproduce in again on the CI and see the json message (we need to merge the PR with mpool pending first).

Maybe also looking at how Lotus handles this message when reaching its Mempool.

elmattic commented 1 year ago

@LesnyRumcajs did you try to run the script locally? Like sending to an address 40 times.

LesnyRumcajs commented 1 year ago

I can do it today. Should I use the latest main?

elmattic commented 1 year ago

I can do it today. Should I use the latest main?

Yes, just run forest-cli send a couple of times in a loop and see if you can reproduce. Did not try on my Fedora, only macOS.

LesnyRumcajs commented 1 year ago

@elmattic Can't reproduce it locally. All of the messages arrived. I sent:

I'll try doing it in parallel with different nodes.

LesnyRumcajs commented 1 year ago

https://calibration.filfox.info/en/address/t1gbwtv52gqukzhe5rmhu4n5jclpf3lc6bdn3m5eq

LesnyRumcajs commented 1 year ago

We already discussed this in Slack, so just for the record. We managed to reproduce it by running sending FIL in two separate locations (Germany and Poland).

Sending 500atto from t1es57egc3chszrnlprkkfl2x6g36zlxvrdcdz76i to t1gbwtv52gqukzhe5rmhu4n5jclpf3lc6bdn3m5eq for the 8417 time
Error: {"code":0,"message":"gas price is lower than min gas price"}
Sending 500atto from t1es57egc3chszrnlprkkfl2x6g36zlxvrdcdz76i to t1gbwtv52gqukzhe5rmhu4n5jclpf3lc6bdn3m5eq for the 8418 time
bafy2bzaceclc3dxyhmyfmnr76gvswlu6fwcaq5xgr2kwrtttsrc53oadiice4
Sending 500atto from t1es57egc3chszrnlprkkfl2x6g36zlxvrdcdz76i to t1gbwtv52gqukzhe5rmhu4n5jclpf3lc6bdn3m5eq for the 8419 time
elmattic commented 1 year ago

This is a great finding, unfortunately it's not the same error than the one in the CI where the message was successfully added to the pool and its CID was printed.

elmattic commented 11 months ago

@LesnyRumcajs I had another try reproducing this issue. The issue is still there but not easy to reproduce.

Using only one node, I managed to reproduce the issue two times on around 1500 runs.

The way I'm doing it. First create a new sending account and fill it using some calibnet faucet.

Then:

  1. This account will send each time 500 attoFIL to a newly created account.
  2. Print Forest pending mpool and wait for the sent message to be mined.
  3. If successful, rince and repeat from step 1.

Looking at the message payloads in send_log.txt, the send is failing because the message Nonce wasn't correctly incremented, it should be 1451 but it still equals to 1450.

Looking at the forest daemon logs around 03:09:34 we can see a fork happening, so that's likely be related to our issue.

send_log.txt forest_log.txt

LesnyRumcajs commented 11 months ago

Thanks. Do we know if Lotus has the same issue? If not, how does Lotus handle it?

elmattic commented 11 months ago

Not 100% sure but they do have some code that can react to HEAD changes and republish previously mined messages.

See https://github.com/filecoin-project/lotus/blob/3fd57ff7d90a3f4c51975b89dfc98135116d52de/chain/messagepool/messagepool.go#L1262

LesnyRumcajs commented 11 months ago

@elmattic Would you like to contact Lotus devs to validate that this is the solution to the same problem?

LesnyRumcajs commented 5 months ago

@elmattic do you recall what's the status of this? Is sending FIL still failing every now and then?

elmattic commented 5 months ago

@LesnyRumcajs I will check our CI. But, AFAIK, the pesky bug is still present.

elmattic commented 3 months ago

@elmattic do you recall what's the status of this? Is sending FIL still failing every now and then?

https://github.com/ChainSafe/forest/actions/runs/9204884823/job/25319573657

We can now update the status to "still broken".