filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/
Other
2.84k stars 1.26k forks source link

Winning_post problem: block production took longer than the block delay. Did it crash? #7487

Closed William8Work closed 2 years ago

William8Work commented 3 years ago

Checklist

Lotus component

Lotus Version

$ lotus-miner version
Daemon:  1.13.0-rc2+mainnet+git.530011fd0+api1.2.0
Local: lotus-miner version 1.13.0-rc2+mainnet+git.530011fd0

$ lotus version
Daemon:  1.13.0-rc2+mainnet+git.530011fd0+api1.4.0
Local: lotus version 1.13.0-rc2+mainnet+git.530011fd0

Since this is related to gpu, here are additional information:

CPU: AMD EPYC 7601 32-Core Processor x 2
GPU: RTX 2080 Ti

Describe the Bug

Found out today that the miner won a block but the block production took longer than expected.

The miner reported an issue "CAUTION: block production took longer than the block delay. Your computer may not be fast enough to keep up". and then said "mined block in the past". It said took about 43 seconds.

I know it's not the computer performance because it mined another block later took 3 seconds. The question is: what was the mining doing causing 43 seconds to work on the mined block.

So I searched the proofing logs but found the winning_post start but didn't see the winning_post finish:

$ journalctl -n 200000 -u lotusminer.service | grep -ai generate_winning_post

Oct 09 11:42:35 g0lotus01 lotus-miner[49281]: 2021-10-09T11:42:35.518 INFO filcrypto::proofs::api > generate_winning_post: start
Oct 09 11:42:35 g0lotus01 lotus-miner[49281]: 2021-10-09T11:42:35.591 INFO filecoin_proofs::api::winning_post > generate_winning_post:start
Oct 09 11:43:15 g0lotus01 lotus-miner[49281]: 2021-10-09T11:43:15.511 INFO filecoin_proofs::api::winning_post > generate_winning_post:finish
Oct 09 11:43:15 g0lotus01 lotus-miner[49281]: 2021-10-09T11:43:15.511 INFO filcrypto::proofs::api > generate_winning_post: finish
Oct 09 14:02:22 g0lotus01 lotus-miner[95415]: 2021-10-09T14:02:22.838 INFO filcrypto::proofs::api > generate_winning_post: start
Oct 09 14:02:22 g0lotus01 lotus-miner[95415]: 2021-10-09T14:02:22.882 INFO filecoin_proofs::api::winning_post > generate_winning_post:start
Oct 09 14:02:25 g0lotus01 lotus-miner[95415]: 2021-10-09T14:02:25.128 INFO filecoin_proofs::api::winning_post > generate_winning_post:finish
Oct 09 14:02:25 g0lotus01 lotus-miner[95415]: 2021-10-09T14:02:25.128 INFO filcrypto::proofs::api > generate_winning_post: finish
**Oct 10 00:37:36 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:36.023 INFO filcrypto::proofs::api > generate_winning_post: start
Oct 10 00:37:36 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:36.025 INFO filecoin_proofs::api::winning_post > generate_winning_post:start**
Oct 10 02:51:06 g0lotus01 lotus-miner[95415]: 2021-10-10T02:51:06.015 INFO filcrypto::proofs::api > generate_winning_post: start
Oct 10 02:51:06 g0lotus01 lotus-miner[95415]: 2021-10-10T02:51:06.018 INFO filecoin_proofs::api::winning_post > generate_winning_post:start
Oct 10 02:51:08 g0lotus01 lotus-miner[95415]: 2021-10-10T02:51:08.446 INFO filecoin_proofs::api::winning_post > generate_winning_post:finish
Oct 10 02:51:08 g0lotus01 lotus-miner[95415]: 2021-10-10T02:51:08.451 INFO filcrypto::proofs::api > generate_winning_post: finish

please note: the highlighted winning_post didn't have the finish statement. Did the winning_post crash?

Logging Information

Here is the miner log:

`
$ grep -ai '2021-10-10T00:3[7-8]' /media/lotus/logs/miner.log 
2021-10-10T00:37:06.023-0700    INFO    miner   miner/miner.go:479      completed mineOne       {"tookMilliseconds": 16, "forRound": 1184835, "baseEpoch": 1184834, "baseDeltaSeconds": 6, "nullRounds": 0, "lateStart": false, "beaconEpoch": 1280679, "lookbackEpochs": 900, "networkPowerAtLookback": "14061140796126724096", "minerPowerAtLookback": "747961964625920", "isEligible": true, "isWinner": false, "error": null}
2021-10-10T00:37:30.790-0700    WARN    storageminer    storage/wdpost_run.go:227       Checked sectors {"checked": 2068, "good": 2068}
2021-10-10T00:37:30.878-0700    INFO    storageminer    storage/wdpost_run.go:626       running window post     {"chain-random": "rqEizwOxHdl0BsjwpgjHV7UVTaTsOiwjKB/Dou23O8A=", "deadline": {"CurrentEpoch":1184835,"PeriodStart":1184845,"Index":0,"Open":1184845,"Close":1184905,"Challenge":1184825,"FaultCutoff":1184775,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70}, "height": "1184835", "skipped": 0}
2021-10-10T00:37:36.023-0700    INFO    storageminer    storage/miner.go:290    Computing WinningPoSt ;[{SealProof:8 SectorNumber:7300 SealedCID:bagboea4b5abcagjom5i3zfcokr5w3f6wrozvqad2jglqrrlvf2medqnkwnphh2i3}]; [2 83 180 144 117 91 52 91 206 64 51 135 81 193 11 78 1 187 162 255 230 30 207 249 69 169 58 52 25 177 206 127]
2021-10-10T00:38:19.150-0700    INFO    storageminer    storage/miner.go:297    GenerateWinningPoSt took 43.127771627s
2021-10-10T00:38:19.886-0700    INFO    miner   miner/miner.go:569      mined new block {"cid": "bafy2bzacedsq5dhh3ryosynffryntjdvhjwdtui53reymbew7w6fcvxzwc4pg", "height": 1184836, "miner": "f08399", "parents": ["f0500003","f01218989","f0127595","f0724216","f061666","f0152337","f0755327"], "parentTipset": "{bafy2bzacechnbaxbfotrm6w2spqc4lmja5ijfr56cu4duosuiboo5cvub34cs,bafy2bzaceau3r7aa7fsxxmkmslckbpkc4eeinfhxum344bsb7gqd76ybelnrm,bafy2bzaceahxl6oftzbwkpjc2rkyj4kglmjjunbpn5pygbzwcnnw7hofmumk6,bafy2bzacecz6s2gk5b5uzctekb77a3vin5l7jimdkgg525gtevjut7nduye7o,bafy2bzacea6k7mzct2vgmmq5qzj3p5qrvirby2jq3osnp6ajiiqnei3ypz3gs,bafy2bzacebyu7btk3xi5sm2kx63qkafo62sliihgjkoltukiblczj7ibnz52c,bafy2bzaceamm7k52yrbmgxjroakqx2opi63nb7zlnvkrt3i36exumdoipfuma}", "took": 43.876861466}
2021-10-10T00:38:19.886-0700    WARN    miner   miner/miner.go:571      CAUTION: block production took longer than the block delay. Your computer may not be fast enough to keep up     {"tPowercheck ": 0.01028246, "tTicket ": 0.002941062, "tSeed ": 0.000008957, "tProof ": 43.128152113, "tPending ": 0.263120308, "tCreateBlock ": 0.472356566}
2021-10-10T00:38:19.886-0700    INFO    miner   miner/miner.go:479      completed mineOne       {"tookMilliseconds": 43877, "forRound": 1184836, "baseEpoch": 1184835, "baseDeltaSeconds": 6, "nullRounds": 0, "lateStart": false, "beaconEpoch": 1280680, "lookbackEpochs": 900, "networkPowerAtLookback": "14061140796126724096", "minerPowerAtLookback": "747961964625920", "isEligible": true, "isWinner": true, "error": null}
2021-10-10T00:38:19.886-0700    WARN    miner   miner/miner.go:324      mined block in the past {"block-time": "2021-10-10T00:38:00.000-0700", "time": "2021-10-10T00:38:19.886-0700", "difference": 19.88698674}
2021-10-10T00:38:20.256-0700    WARN    miner   miner/miner.go:477      completed mineOne       {"tookMilliseconds": 17, "forRound": 1184837, "baseEpoch": 1184836, "baseDeltaSeconds": 20, "nullRounds": 0, "lateStart": true, "beaconEpoch": 1280681, "lookbackEpochs": 900, "networkPowerAtLookback": "14061140761766985728", "minerPowerAtLookback": "747961964625920", "isEligible": true, "isWinner": false, "error": null}
2021-10-10T00:38:36.021-0700    INFO    miner   miner/miner.go:479      completed mineOne       {"tookMilliseconds": 15, "forRound": 1184838, "baseEpoch": 1184837, "baseDeltaSeconds": 6, "nullRounds": 0, "lateStart": false, "beaconEpoch": 1280682, "lookbackEpochs": 900, "networkPowerAtLookback": "14061140761766985728", "minerPowerAtLookback": "747961964625920", "isEligible": true, "isWinner": false, "error": null}
`

Here is the proving logs:
`
$ journalctl -n 200000 -u lotusminer.service | grep -ai 'Oct 10 00:3[7-8]' | grep bellperson
Hint: You are currently not seeing messages from other users and the system.
      Users in groups 'adm', 'systemd-journal' can see all messages.
      Pass -q to turn off this notice.
Oct 10 00:37:36 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:36.601 INFO bellperson::groth16::prover > Bellperson 0.17.0 is being used!
Oct 10 00:37:36 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:36.746 INFO bellperson::groth16::prover > synthesis time: 145.59913ms
Oct 10 00:37:36 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:36.747 INFO bellperson::groth16::prover > starting proof timer
Oct 10 00:37:36 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:36.756 DEBUG bellperson::gpu::locks > Acquiring priority lock at "/media/cache/tmpdir/bellman.priority.lock" ...
Oct 10 00:37:36 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:36.756 DEBUG bellperson::gpu::locks > Priority lock acquired!
Oct 10 00:37:36 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:36.757 DEBUG bellperson::groth16::prover > get h
Oct 10 00:37:36 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:36.839 INFO bellperson::gpu::locks > GPU is available for FFT!
Oct 10 00:37:36 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:36.839 DEBUG bellperson::gpu::locks > Acquiring GPU lock at "/media/cache/tmpdir/bellman.gpu.lock" ...
Oct 10 00:37:36 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:36.839 DEBUG bellperson::gpu::locks > GPU lock acquired!
Oct 10 00:37:36 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:36.839 INFO bellperson::gpu::program > Using kernel on OpenCL.
Oct 10 00:37:37 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:37.016 INFO bellperson::gpu::fft > FFT: 1 working device(s) selected.
Oct 10 00:37:37 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:37.016 INFO bellperson::gpu::fft > FFT: Device 0: GeForce RTX 2080 Ti
Oct 10 00:37:37 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:37.016 INFO bellperson::domain > GPU FFT kernel instantiated!
Oct 10 00:37:37 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:37.252 DEBUG bellperson::gpu::locks > GPU lock released!
Oct 10 00:37:37 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:37.252 DEBUG bellperson::groth16::prover > multiexp h
Oct 10 00:37:37 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:37.252 INFO bellperson::gpu::locks > GPU is available for Multiexp!
Oct 10 00:37:37 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:37.252 DEBUG bellperson::gpu::locks > Acquiring GPU lock at "/media/cache/tmpdir/bellman.gpu.lock" ...
Oct 10 00:37:37 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:37.252 DEBUG bellperson::groth16::prover > get l
Oct 10 00:37:37 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:37.252 DEBUG bellperson::gpu::locks > GPU lock acquired!
Oct 10 00:37:37 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:37.252 INFO bellperson::gpu::program > Using kernel on OpenCL.
Oct 10 00:37:37 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:37.399 INFO bellperson::gpu::multiexp > Multiexp: 1 working device(s) selected. (CPU utilization: 0)
Oct 10 00:37:37 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:37.399 INFO bellperson::gpu::multiexp > Multiexp: Device 0: GeForce RTX 2080 Ti (Chunk-size: 7488999)
Oct 10 00:37:37 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:37.399 INFO bellperson::multiexp > GPU Multiexp kernel instantiated!
Oct 10 00:37:37 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:37.595 DEBUG bellperson::groth16::prover > multiexp l
Oct 10 00:37:37 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:37.595 DEBUG bellperson::groth16::prover > get_a b_g1 b_g2
Oct 10 00:37:38 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:38.956 DEBUG bellperson::groth16::prover > get_a b_g1 b_g2
Oct 10 00:37:38 g0lotus01 lotus-miner[95415]: 2021-10-10T00:37:38.956 DEBUG bellperson::groth16::prover > multiexp a b_g1 b_g2
Oct 10 00:38:39 g0lotus01 lotus-miner[95415]: 2021-10-10T00:38:39.256 INFO bellperson::groth16::prover > synthesis time: 19.286769958s
Oct 10 00:38:39 g0lotus01 lotus-miner[95415]: 2021-10-10T00:38:39.256 INFO bellperson::groth16::prover > starting proof timer
Oct 10 00:38:43 g0lotus01 lotus-miner[95415]: 2021-10-10T00:38:43.584 DEBUG bellperson::gpu::locks > Acquiring priority lock at "/media/cache/tmpdir/bellman.priority.lock" ...
Oct 10 00:38:43 g0lotus01 lotus-miner[95415]: 2021-10-10T00:38:43.584 DEBUG bellperson::gpu::locks > Priority lock acquired!
Oct 10 00:38:43 g0lotus01 lotus-miner[95415]: 2021-10-10T00:38:43.584 DEBUG bellperson::groth16::prover > get h
Oct 10 00:38:44 g0lotus01 lotus-miner[95415]: 2021-10-10T00:38:44.130 INFO bellperson::gpu::locks > GPU is available for FFT!
Oct 10 00:38:44 g0lotus01 lotus-miner[95415]: 2021-10-10T00:38:44.130 DEBUG bellperson::gpu::locks > Acquiring GPU lock at "/media/cache/tmpdir/bellman.gpu.lock" ...
Oct 10 00:38:44 g0lotus01 lotus-miner[95415]: 2021-10-10T00:38:44.130 DEBUG bellperson::gpu::locks > GPU lock acquired!
Oct 10 00:38:44 g0lotus01 lotus-miner[95415]: 2021-10-10T00:38:44.130 INFO bellperson::gpu::program > Using kernel on OpenCL.
Oct 10 00:38:44 g0lotus01 lotus-miner[95415]: 2021-10-10T00:38:44.356 INFO bellperson::gpu::fft > FFT: 1 working device(s) selected.
Oct 10 00:38:44 g0lotus01 lotus-miner[95415]: 2021-10-10T00:38:44.356 INFO bellperson::gpu::fft > FFT: Device 0: GeForce RTX 2080 Ti
Oct 10 00:38:44 g0lotus01 lotus-miner[95415]: 2021-10-10T00:38:44.356 INFO bellperson::domain > GPU FFT kernel instantiated!
`

Repo Steps

  1. Run '...'
  2. Do '...'
  3. See error '...' ...
Reiers commented 2 years ago

Hi @William8Work

Thanks for the report. The issue you are experiencing lies elsewhere.

Moving this issue to Lotus Discussions for help and troubleshooting.

Closing ticket.