filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

Lotus miner loses connection / events for 'mineOne' / misses blocks to mine #10313

Open RobQuistNL opened 1 year ago

RobQuistNL commented 1 year ago

Checklist

Lotus component

Lotus Version

1.19.0

Describe the Bug

Every once in a while, a miner will "lose" its connection to the daemon: the mineOne message stops appearing in the miner log, and the miner never mines another block until it is restarted.

{"level":"info","ts":"2023-02-18T14:36:40.043Z","logger":"miner","caller":"miner/miner.go:478","msg":"completed mineOne","tookMilliseconds":7,"forRound":2614154,"baseEpoch":2614153,"baseDeltaSeconds":10,"nullRounds":0,"lateStart":false,"beaconEpoch":2709999,"lookbackEpochs":900,"networkPowerAtLookback":"21666446240811155456","minerPowerAtLookback":"2768526650933248","isEligible":true,"isWinner":false,"error":null}
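
One way to notice this condition early is to scan the miner log for gaps between consecutive "completed mineOne" lines. The sketch below is only an illustration, not part of Lotus: the `entry` type, the `longestGap` helper, and the 60-second threshold are assumptions, and the sample lines are trimmed to the two fields the code reads.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// entry holds the two JSON log fields this sketch needs (hypothetical type).
type entry struct {
	TS  string `json:"ts"`
	Msg string `json:"msg"`
}

// longestGap returns the largest interval between consecutive
// "completed mineOne" lines. Lines with any other message are skipped.
func longestGap(lines []string) (time.Duration, error) {
	var prev time.Time
	var longest time.Duration
	for _, line := range lines {
		var e entry
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			return 0, err
		}
		if e.Msg != "completed mineOne" {
			continue
		}
		ts, err := time.Parse(time.RFC3339Nano, e.TS)
		if err != nil {
			return 0, err
		}
		if !prev.IsZero() && ts.Sub(prev) > longest {
			longest = ts.Sub(prev)
		}
		prev = ts
	}
	return longest, nil
}

func main() {
	// Trimmed sample lines; real miner log lines carry many more fields.
	lines := []string{
		`{"ts":"2023-02-20T10:45:03.086Z","msg":"completed mineOne"}`,
		`{"ts":"2023-02-20T10:45:10.034Z","msg":"completed mineOne"}`,
	}
	gap, err := longestGap(lines)
	if err != nil {
		panic(err)
	}
	fmt.Println(gap)                  // 6.948s
	fmt.Println(gap > 60*time.Second) // false: no stall in this sample
}
```

A gap much longer than the expected round interval would suggest that mineOne has stopped completing.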

Logging Information

I am still investigating and trying to find the relevant log entries, but this has been happening since 1.19.

Repo Steps

  1. Run lotus-miner
  2. See mineOne messages stop

RobQuistNL commented 1 year ago

Related: https://filecoinproject.slack.com/archives/CEGN061C5/p1673609087243379

beck-8 commented 1 year ago

Are there any missing logs? Are your lotus and lotus-miner on the same machine?

RobQuistNL commented 1 year ago

I might be on to something here: the WinningPoSt worker seems to hang at

{"level":"info","ts":"2023-02-19T22:27:11.215+0000","logger":"bellperson::groth16::prover","caller":"/home/lotus/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.22.0/src/groth16/prover.rs:668","msg":"starting proof timer"}

When I restarted this WinningPoSt worker, the mineOne messages started coming through again:

{"level":"error","ts":"2023-02-20T10:45:03.086Z","logger":"miner","caller":"miner/miner.go:474","msg":"completed mineOne","tookMilliseconds":44272753,"forRound":2617975,"baseEpoch":2617974,"baseDeltaSeconds":10,"nullRounds":0,"lateStart":false,"beaconEpoch":2713820,"lookbackEpochs":900,"networkPowerAtLookback":"21747599217924407296","minerPowerAtLookback":"xxx","isEligible":true,"isWinner":true,"error":"failed to compute winning post proof: RPC client error: sendRequest failed: Post \"http://xxxx:45801/rpc/v0\": EOF","errorVerbose":"failed to compute winning post proof:\n    github.com/filecoin-project/lotus/miner.(*Miner).mineOne\n        /home/lotus/lotus/miner/miner.go:544\n  - RPC client error: sendRequest failed: Post \"http://xxxx:45801/rpc/v0\": EOF"}
{"level":"info","ts":"2023-02-20T10:45:10.034Z","logger":"miner","caller":"miner/miner.go:478","msg":"completed mineOne","tookMilliseconds":6,"forRound":2619451,"baseEpoch":2619450,"baseDeltaSeconds":10,"nullRounds":0,"lateStart":false,"beaconEpoch":2715296,"lookbackEpochs":900,"networkPowerAtLookback":"21760063759690235904","minerPowerAtLookback":"xxx","isEligible":true,"isWinner":false,"error":null}
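
The numbers in the error line can be cross-checked against the worker log. The sketch below subtracts `tookMilliseconds` from the error line's timestamp to estimate when that mineOne call began, and counts the rounds between the two `forRound` values. The helper names `roundStart` and `roundsBetween` are made up for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// roundStart subtracts tookMilliseconds from the timestamp of a
// "completed mineOne" line to estimate when that call began.
func roundStart(completedTS string, tookMs int64) (string, error) {
	completed, err := time.Parse(time.RFC3339Nano, completedTS)
	if err != nil {
		return "", err
	}
	start := completed.Add(-time.Duration(tookMs) * time.Millisecond)
	return start.UTC().Format(time.RFC3339Nano), nil
}

// roundsBetween counts the rounds separating two forRound values.
func roundsBetween(first, next int64) int64 {
	return next - first
}

func main() {
	// Values copied from the two log lines above.
	took := time.Duration(44272753) * time.Millisecond
	fmt.Println(took) // 12h17m52.753s

	start, err := roundStart("2023-02-20T10:45:03.086Z", 44272753)
	if err != nil {
		panic(err)
	}
	fmt.Println(start) // 2023-02-19T22:27:10.333Z

	fmt.Println(roundsBetween(2617975, 2619451)) // 1476
}
```

The estimated start, 22:27:10 on 2023-02-19, falls about one second before the worker's last "starting proof timer" line at 22:27:11, which is consistent with a single mineOne call blocking for the whole 12 hours while roughly 1476 rounds went by.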

beck-8 commented 1 year ago

You have to check the environment first. Why is there an EOF? Is there an additional proxy configured somewhere? No valid conclusions can be drawn from this information.

RobQuistNL commented 1 year ago

No, the WinningPoSt service hangs at that very line (it is the last line in its logs). When I restart it, the EOF comes through, because restarting the service terminates the connection.
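
To illustrate the failure mode described here: a caller that waits on a remote call with no deadline stays blocked for as long as the remote side hangs. The sketch below shows how a per-call deadline bounds that wait. It is not how Lotus implements mineOne; `computeProof`, `proveWithDeadline`, and `simulate` are hypothetical stand-ins, and the durations are arbitrary.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// computeProof is a hypothetical stand-in for the remote proof call.
// It returns after `work` has elapsed or when ctx is cancelled.
func computeProof(ctx context.Context, work time.Duration) (string, error) {
	select {
	case <-time.After(work):
		return "proof", nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// proveWithDeadline bounds one call so a stuck worker cannot block
// the caller beyond `limit`.
func proveWithDeadline(parent context.Context, limit, work time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(parent, limit)
	defer cancel()
	return computeProof(ctx, work)
}

// simulate runs one bounded call and summarises the outcome.
func simulate(limitMs, workMs int) string {
	limit := time.Duration(limitMs) * time.Millisecond
	work := time.Duration(workMs) * time.Millisecond
	_, err := proveWithDeadline(context.Background(), limit, work)
	if errors.Is(err, context.DeadlineExceeded) {
		return "timed out"
	}
	if err != nil {
		return "error: " + err.Error()
	}
	return "ok"
}

func main() {
	fmt.Println(simulate(20, 60000)) // timed out: the "worker" hangs past the limit
	fmt.Println(simulate(1000, 5))   // ok: the "worker" answers in time
}
```

With a deadline in place, the caller would get an error for the stuck round and could move on to the next one, instead of waiting until the worker is restarted.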