filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/
Other
2.83k stars 1.25k forks source link

[BUG] Random WindowPost fails due late vanilla starts. #6720

Closed cryptowhizzard closed 2 years ago

cryptowhizzard commented 3 years ago

We are experiencing late WindowPost fails due the late start of vanilla.

These issues occur randomly. ( Not only on the first window , but also on others ). Sometimes a few days everything goes alright and then suddenly it fails again.

There are no visible disk errors ( SmartCTL reports all ok ) . The miner is connected with NFS to sector storage ... that storage has been recently scrubbed and clean of errors. Hardware 256 GB / Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz

The machine is running lotus daemon and lotus miner ( All PC1 / PC2 is outsourced to workers ).

Screenshot_20210709_113600

The Grafana screenshot shows the basics that nothing is happening untill 15:33 PM. After that the IO starts caused by NFS / Vanilla start.

2021-07-08T15:16:33.594 INFO filcrypto::proofs::api > generate_window_post: start 2021-07-08T15:16:34.312 INFO filecoin_proofs::api::window_post > generate_window_post:start 2021-07-08T15:16:34.312 INFO filecoin_proofs::caches > trying parameters memory cache for: Window_POST[34359738368] 2021-07-08T15:16:34.312 INFO filecoin_proofs::caches > found params in memory cache for Window_POST[34359738368] 2021-07-08T15:33:25.090 INFO storage_proofs_core::compound_proof > vanilla_proofs:start 2021-07-08T15:37:06.342+0200 ^[[34mINFO^[[0m storagemarket_impl impl/provider.go:211 Handling storage deal proposal! 2021-07-08T15:37:06.488+0200 ^[[34mINFO^[[0m markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventOpen", "proposal CID": "bafyreicodzeytwvuxhw4azhrd24gtgycrry7eicbormgor6hlp3iprfk6y", "state": "StorageDealValidating", "message": ""} 2021-07-08T15:37:06.524+0200 ^[[34mINFO^[[0m markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDealDeciding", "proposal CID": "bafyreicodzeytwvuxhw4azhrd24gtgycrry7eicbormgor6hlp3iprfk6y", "state": "StorageDealAcceptWait", "message": ""} 2021-07-08T15:37:06.658+0200 ^[[34mINFO^[[0m markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDataRequested", "proposal CID": "bafyreicodzeytwvuxhw4azhrd24gtgycrry7eicbormgor6hlp3iprfk6y", "state": "StorageDealWaitingForData", "message": ""} 2021-07-08T15:37:07.945+0200 ^[[34mINFO^[[0m dt-impl impl/events.go:298 received new channel request from 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw 2021-07-08T15:37:07.946+0200 ^[[34mINFO^[[0m dt_graphsync graphsync/graphsync.go:189 Opening graphsync request to 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw for root QmRiZ5o7s2PcRXxbrShQWcfgdQPBBE4LqgEdfcHfG54p2c with 0 CIDs already received 2021-07-08T15:37:07.946+0200 ^[[34mINFO^[[0m dt-impl impl/events.go:19 channel 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f-1625711864047272673: opened 2021-07-08T15:37:07.946+0200 ^[[34mINFO^[[0m markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDataTransferInitiated", "proposal CID": "bafyreicodzeytwvuxhw4azhrd24gtgycrry7eicbormgor6hlp3iprfk6y", "state": "StorageDealTransferring", "message": ""} 2021-07-08T15:37:30.943+0200 ^[[34mINFO^[[0m badger v2@v2.2007.2/db.go:1030 Storing value log head: {Fid:164 Len:34 Offset:1073669551} 2021-07-08T15:37:30.973+0200 ^[[34mINFO^[[0m badger v2@v2.2007.2/levels.go:1000 [Compactor: 0] Running compaction: {level:0 score:1 dropPrefixes:[]} for level: 0 2021-07-08T15:37:31.332+0200 ^[[34mINFO^[[0m badger v2@v2.2007.2/levels.go:962 LOG Compact 0->1, del 6 tables, add 1 tables, took 358.4993ms 2021-07-08T15:37:31.332+0200 ^[[34mINFO^[[0m badger v2@v2.2007.2/levels.go:1010 [Compactor: 0] Compaction for level: 0 DONE 2021-07-08T15:37:58.679+0200 ^[[31mERROR^[[0m data-transfer channels/channel_state.go:105 unexpected EOF 2021-07-08T15:39:31.987+0200 ^[[34mINFO^[[0m badger v2@v2.2007.2/db.go:1030 Storing value log head: {Fid:166 Len:34 Offset:1073523473} 2021-07-08T15:42:09.192+0200 ^[[34mINFO^[[0m badger v2@v2.2007.2/db.go:1030 Storing value log head: {Fid:168 Len:34 Offset:1073520030} 2021-07-08T15:42:10.626+0200 ^[[34mINFO^[[0m dt-impl impl/events.go:249 channel 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f-1625711864047272673: sending completion message to initiator 2021-07-08T15:50:29.275 INFO storage_proofs_core::compound_proof > vanilla_proofs:finish 2021-07-08T15:50:30.418 INFO storage_proofs_core::compound_proof > snark_proof:start 2021-07-08T15:50:30.532 INFO bellperson::groth16::prover > Bellperson 0.14.2 is being used! 2021-07-08T15:51:11.674 INFO bellperson::groth16::prover > starting proof timer 2021-07-08T15:51:14.166 INFO bellperson::gpu::locks > GPU is available for FFT! 2021-07-08T15:51:14.272 INFO bellperson::gpu::fft > FFT: 1 working device(s) selected. 2021-07-08T15:51:14.272 INFO bellperson::gpu::fft > FFT: Device 0: GeForce RTX 3090 2021-07-08T15:51:14.272 INFO bellperson::domain > GPU FFT kernel instantiated! 2021-07-08T15:51:33.627+0200 ^[[33mWARN^[[0m storageminer storage/wdpost_run.go:227 Checked sectors {"checked": 2347, "good": 2347} 2021-07-08T15:51:33.666+0200 ^[[34mINFO^[[0m storageminer storage/wdpost_run.go:626 running window post {"chain-random": "Wsr4fahMoGCtl98WwaKJOD74MGgMiqPzswqCZm05aBI=", "deadline": {"CurrentEpoch":914863,"PeriodStart":914803,"Index":1,"Open":914863,"Close":914923,"Challenge":914843,"FaultCutoff":914793,"WPoStPeriodDeadlines":48,"WPoStProvingPeriod":2880,"WPoStChallengeWindow":60,"WPoStChallengeLookback":20,"FaultDeclarationCutoff":70}, "height": "914863", "skipped": 0} 2021-07-08T15:51:33.687 INFO filcrypto::proofs::api > generate_window_post: start 2021-07-08T15:51:34.401 INFO filecoin_proofs::api::window_post > generate_window_post:start 2021-07-08T15:51:34.401 INFO filecoin_proofs::caches > trying parameters memory cache for: Window_POST[34359738368] 2021-07-08T15:51:34.401 INFO filecoin_proofs::caches > found params in memory cache for Window_POST[34359738368] 2021-07-08T15:51:43.779 INFO bellperson::gpu::locks > GPU is available for Multiexp! 2021-07-08T15:51:43.861 INFO bellperson::gpu::multiexp > Multiexp: 1 working device(s) selected. (CPU utilization: 0) 2021-07-08T15:51:43.861 INFO bellperson::gpu::multiexp > Multiexp: Device 0: GeForce RTX 3090 (Chunk-size: 18061702) 2021-07-08T15:51:43.861 INFO bellperson::multiexp > GPU Multiexp kernel instantiated! 2021-07-08T15:53:56.999 INFO bellperson::groth16::prover > prover time: 165.324432236s 2021-07-08T15:53:57.209 INFO storage_proofs_core::compound_proof > snark_proof:finish 2021-07-08T15:53:57.209 INFO filecoin_proofs::api::window_post > generate_window_post:finish 2021-07-08T15:53:57.239 INFO filcrypto::proofs::api > generate_window_post: finish 2021-07-08T15:53:57.240+0200 ^[[34mINFO^[[0m storageminer storage/wdpost_run.go:642 computing window post {"batch": 0, "elapsed": 2243.666505939}

cryptowhizzard commented 3 years ago

Please see attached log of today. The miner was started hours before W-Post. The first one took a bit long as i needed to load it' s params, understandable. The second and third started fine and are well in time. Params cache is working as expected. The fourth window hanged 25 minutes between W-Post start van Vanilla Start. It does do reads on the NFS

todaylog.zip

Reiers commented 2 years ago

Hi @cryptowhizzard

Thanks for the report. The issue is outdated, and have not seen any reports since. With the new PoSt workers and an improved scheduler - I guess this will be even better from now on.

But if this is resurface in current release or later - please open a new ticket.

Thanks again !

-- TSE @Reiers footer