filecoin-project / bellperson

zk-SNARK library

GPU preemption failure #291

Closed Elhorses closed 1 year ago

Elhorses commented 1 year ago

When the wdpost and winningpost calculations run at the same time, winningpost fails to preempt the GPU even though its priority is true and that of wdpost is false. The winningpost computation then times out.

vmx commented 1 year ago

Could you please provide some logs with log level debug, or even better trace (by setting RUST_LOG=trace)?

Do you have some way to reproduce the problem?

Elhorses commented 1 year ago

2022-11-24T23:07:17.607 INFO storage_proofs_core::compound_proof > snark_proof:start
2022-11-24T23:07:17.750 INFO bellperson::groth16::prover > Bellperson 0.22.0 is being used!
2022-11-24T23:07:40.287Z INFO miner miner/miner.go:548 completed mineOne {"tookMilliseconds": 12, "forRound": 2367496, "baseEpoch": 2367495, "baseDeltaSeconds": 10, "nullRounds": 0, "lateStart": false, "beaconEpoch": 2463341, "lookbackEpochs": 900, "networkPowerAtLookback": "21798573920164216832", "minerPowerAtLookback": "11159012229775360", "isEligible": true, "isWinner": false, "error": null}
2022-11-24T23:07:55.019 INFO bellperson::groth16::prover > synthesis time: 37.268746346s
2022-11-24T23:07:55.019 INFO bellperson::groth16::prover > starting proof timer
2022-11-24T23:07:59.294 INFO bellperson::gpu::locks > GPU is available for FFT!
2022-11-24T23:07:59.295 INFO ec_gpu_gen::program > Using kernel on CUDA.
2022-11-24T23:07:59.317 INFO ec_gpu_gen::fft > FFT: 1 working device(s) selected.
2022-11-24T23:07:59.318 INFO ec_gpu_gen::fft > FFT: Device 0: GeForce RTX 3090
2022-11-24T23:07:59.318 INFO bellperson::gpu::locks > GPU FFT kernel instantiated!
2022-11-24T23:08:10.074Z INFO miner miner/miner.go:548 completed mineOne {"tookMilliseconds": 51, "forRound": 2367497, "baseEpoch": 2367495, "baseDeltaSeconds": 40, "nullRounds": 1, "lateStart": false, "beaconEpoch": 2463342, "lookbackEpochs": 900, "networkPowerAtLookback": "21798574221660848128", "minerPowerAtLookback": "11159012229775360", "isEligible": true, "isWinner": false, "error": null}
2022-11-24T23:08:17.474Z ERROR storagemarket_impl impl/provider.go:205 failed to connect index provider host with the full node: failed to call NetProtectAdd on the full node, err: missing permission to invoke 'NetProtectAdd' (need 'admin')
2022-11-24T23:08:27.245 INFO bellperson::gpu::locks > GPU is available for Multiexp!
2022-11-24T23:08:27.245 INFO bellperson::gpu::multiexp > Multiexp: CPU utilization: 0.
2022-11-24T23:08:27.246 INFO ec_gpu_gen::program > Using kernel on CUDA.
2022-11-24T23:08:27.248 INFO ec_gpu_gen::multiexp > Multiexp: 1 working device(s) selected.
2022-11-24T23:08:27.248 INFO ec_gpu_gen::multiexp > Multiexp: Device 0: GeForce RTX 3090 (Chunk-size: 18061702)
2022-11-24T23:08:27.248 INFO bellperson::gpu::locks > GPU Multiexp kernel instantiated!
2022-11-24T23:08:40.246Z INFO miner miner/miner.go:548 completed mineOne {"tookMilliseconds": 17, "forRound": 2367498, "baseEpoch": 2367497, "baseDeltaSeconds": 10, "nullRounds": 0, "lateStart": false, "beaconEpoch": 2463343, "lookbackEpochs": 900, "networkPowerAtLookback": "21798572294495338496", "minerPowerAtLookback": "11159012229775360", "isEligible": true, "isWinner": false, "error": null}
2022-11-24T23:09:10.044Z INFO miner miner/miner.go:590 round winner, will mine new block, for {"height": "2367499"}
2022-11-24T23:09:10.045Z INFO storageminer storage/winning_prover.go:70 Computing WinningPoSt ;[{SealProof:9 SectorNumber:152313 SectorKey: SealedCID:bagboea4b5abcadqi47tmsbyg24t463o4u2nkb5dbhzn24ndpv7mb7ywjosqc4e27}]; [114 34 174 18 207 210 176 171 229 24 20 68 99 184 137 67 14 147 227 93 13 156 156 207 168 200 156 1 88 161 142 198]
2022-11-24T23:09:10.045Z INFO advmgr sealer/manager_post.go:23 GenerateWinningPoSt run at lotus-miner
2022-11-24T23:09:10.054 INFO filecoin_proofs::api::winning_post > generate_winning_post:start
2022-11-24T23:09:10.065 INFO filecoin_proofs::caches > trying parameters memory cache for: WINNING_POST[68719476736]
2022-11-24T23:09:10.065 INFO filecoin_proofs::caches > found params in memory cache for WINNING_POST[68719476736]
2022-11-24T23:09:10.191 INFO storage_proofs_core::compound_proof > vanilla_proofs:start
2022-11-24T23:09:10.488 INFO storage_proofs_core::compound_proof > vanilla_proofs:finish
2022-11-24T23:09:10.493 INFO storage_proofs_core::compound_proof > snark_proof:start
2022-11-24T23:09:10.494 INFO bellperson::groth16::prover > Bellperson 0.22.0 is being used!
2022-11-24T23:09:10.594 INFO bellperson::groth16::prover > synthesis time: 100.372952ms
2022-11-24T23:09:10.594 INFO bellperson::groth16::prover > starting proof timer
2022-11-24T23:09:10.610 INFO bellperson::gpu::locks > GPU is available for FFT!
2022-11-24T23:09:10.610 INFO ec_gpu_gen::program > Using kernel on CUDA.
2022-11-24T23:09:11.044 INFO ec_gpu_gen::fft > FFT: 1 working device(s) selected.
2022-11-24T23:09:11.044 INFO ec_gpu_gen::fft > FFT: Device 0: GeForce RTX 3090
2022-11-24T23:09:11.044 INFO bellperson::gpu::locks > GPU FFT kernel instantiated!
2022-11-24T23:09:17.476Z ERROR storagemarket_impl impl/provider.go:205 failed to connect index provider host with the full node: failed to call NetProtectAdd on the full node, err: missing permission to invoke 'NetProtectAdd' (need 'admin')
2022-11-24T23:10:17.477Z ERROR storagemarket_impl impl/provider.go:205 failed to connect index provider host with the full node: failed to call NetProtectAdd on the full node, err: missing permission to invoke 'NetProtectAdd' (need 'admin')
2022-11-24T23:11:00.420 INFO bellperson::gpu::locks > GPU is available for Multiexp!
2022-11-24T23:11:00.420 INFO bellperson::gpu::multiexp > Multiexp: CPU utilization: 0.
2022-11-24T23:11:00.420 INFO ec_gpu_gen::program > Using kernel on CUDA.
2022-11-24T23:11:00.421 INFO bellperson::groth16::prover > prover time: 185.401883711s
2022-11-24T23:11:01.763 INFO storage_proofs_core::compound_proof > snark_proof:finish
2022-11-24T23:11:01.763 INFO filecoin_proofs::api::window_post > generate_window_post:finish
2022-11-24T23:11:01.764 INFO ec_gpu_gen::multiexp > Multiexp: 1 working device(s) selected.
2022-11-24T23:11:01.764 INFO ec_gpu_gen::multiexp > Multiexp: Device 0: GeForce RTX 3090 (Chunk-size: 18061702)
2022-11-24T23:11:01.764 INFO bellperson::gpu::locks > GPU Multiexp kernel instantiated!
2022-11-24T23:11:02.383 INFO bellperson::groth16::prover > prover time: 111.788384309s
2022-11-24T23:11:02.387 INFO storage_proofs_core::compound_proof > snark_proof:finish
2022-11-24T23:11:02.387 INFO filecoin_proofs::api::winning_post > generate_winning_post:finish
2022-11-24T23:11:02.388Z INFO storageminer storage/winning_prover.go:77 GenerateWinningPoSt took 1m52.342815192s
2022-11-24T23:11:04.035Z INFO wdpost wdpost/wdpost_run.go:732 computing window post {"batch": 0, "elapsed": 301.29480264, "skip": 0, "err": null}
2022-11-24T23:11:04.047 INFO filecoin_proofs::api::window_post > verify_window_post:start
2022-11-24T23:11:04.047 INFO filecoin_proofs::caches > trying parameters memory cache for: WINDOW_POST[68719476736]-verifying-key
2022-11-24T23:11:04.047 INFO filecoin_proofs::caches > found params in memory cache for WINDOW_POST[68719476736]-verifying-key
2022-11-24T23:11:04.081 INFO filecoin_proofs::api::window_post > verify_window_post:finish
2022-11-24T23:11:06.994Z INFO miner miner/miner.go:645 mined new block {"cid": "bafy2bzaceaukxlerk4p4rjtq6x6yfdn764vjmtbkmr2a7wunn3mnaqov2if2a", "height": 2367499, "miner": "f0502198", "parents": ["f0230861","f01886704","f01702940","f01171513","f01852363","f01680940","f089180","f01926802"], "parentTipset": "{bafy2bzaceduwoeqccvx34e6qosnrzjteg4qelshpxnnmdzqcl33axfjyng3qy,bafy2bzaceabyjgwvovms7vdayfuhcqfhyv4ufbvgtnzj66dt7ksdyakzky7ms,bafy2bzacebnwfa24777iglbvuebudz3ockjqvu3adoqmwqd52dtcbgevmeaqo,bafy2bzacea4b5n6olk6khgla6ntji3oacebljbfywakph6jv2buev2ygbterm,bafy2bzacebl76ealt5wv2nzmxrzbb3saoynuvc3brfwuldltbdiyb2kwu42mi,bafy2bzacec3qgh53icxqezayboosxobtoznilaemn47qqqb5hre7corkqyns6,bafy2bzacedyn2wahnuyaaahtv5nmoub2orhwsfob7ovk5gxn6lqfkcynnmby4,bafy2bzaceasooknnkcphdfsuigvbeet3dzrzhmmcx4v6g335liahl3su73ywm}", "took": 116.967937239}
2022-11-24T23:11:06.994Z WARN miner miner/miner.go:647 CAUTION: block production took longer than the block delay. Your computer may not be fast enough to keep up {"tPowercheck ": 0.016822184, "tTicket ": 0.0015335, "tSeed ": 0.00000244, "tProof ": 112.343196209, "tPending ": 4.507330629, "tCreateBlock ": 0.099052277}

vmx commented 1 year ago

From the log messages it's hard to tell which lines come from which process/thread. It could well be that the WinningPoSt one got priority. Why are you sure it didn't?

Are you able to reproduce the issue? Are you compiling the Rust parts from source? I'm asking because, if you are, I might be able to provide you with a version that also logs the thread ID, so that we can distinguish them.
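
To illustrate what logging the thread ID could look like, here is a minimal sketch using only the Rust standard library. It is not the patched bellperson build offered above; the helper `tag_with_thread` and the thread names `window-post` and `winning-post` are hypothetical, chosen only to show how two interleaved proofs could be told apart in the output.

```rust
use std::thread;

/// Hypothetical helper: prefix a message with the ID and name of the
/// thread that emits it, so interleaved log lines can be attributed.
fn tag_with_thread(msg: &str) -> String {
    let current = thread::current();
    format!(
        "[{:?} {}] {}",
        current.id(),
        current.name().unwrap_or("unnamed"),
        msg
    )
}

fn main() {
    let mut handles = Vec::new();
    // Assumption: one thread per proof type; the names are illustrative only.
    for name in ["window-post", "winning-post"].iter() {
        let handle = thread::Builder::new()
            .name(name.to_string())
            .spawn(|| tag_with_thread("GPU is available for FFT!"))
            .expect("failed to spawn thread");
        handles.push(handle);
    }
    for handle in handles {
        // Each line now carries the identity of the thread that produced it.
        println!("{}", handle.join().expect("thread panicked"));
    }
}
```

With a prefix like this, two identical messages such as "GPU is available for FFT!" no longer look the same when they come from different threads.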

Elhorses commented 1 year ago

You can run "cargo test test_parallel_prover --features "cuda" -- --nocapture" with v0.21.0 and v0.22.0 and then compare the Rust DEBUG logs. With v0.21.0, when a conflict happens, we get "[2022-11-28T13:26:12Z WARN bellperson::gpu::locks] GPU acquired by a high priority process! Freeing up Multiexp kernels...", but v0.22.0 never prints this message. On my lotus-miner, when the wdpost and winningpost calculations run at the same time, winningpost still fails to preempt the GPU even though its priority is true and that of wdpost is false, and the winningpost computation then times out.
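
The behaviour expected here, a low-priority job giving up the GPU once a high-priority job asks for it, can be sketched as follows. This is a simplified, in-process model under stated assumptions and not bellperson's implementation: `PriorityFlag`, `Outcome`, and `low_priority_job` are hypothetical names, and an atomic flag stands in for whatever cross-process lock the library actually uses.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a priority lock shared between jobs.
#[derive(Clone, Default)]
struct PriorityFlag(Arc<AtomicBool>);

impl PriorityFlag {
    fn acquire(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn release(&self) {
        self.0.store(false, Ordering::SeqCst);
    }
    fn is_held(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Finished,
    Preempted { chunks_done: usize },
}

/// Low-priority job: check the flag before every chunk and stop as soon
/// as a high-priority job has claimed the GPU.
fn low_priority_job(
    flag: &PriorityFlag,
    chunks: usize,
    mut on_chunk: impl FnMut(usize),
) -> Outcome {
    for i in 0..chunks {
        if flag.is_held() {
            // This is the point where a "GPU acquired by a high priority
            // process" style warning would be logged.
            return Outcome::Preempted { chunks_done: i };
        }
        on_chunk(i);
    }
    Outcome::Finished
}

fn main() {
    let flag = PriorityFlag::default();

    // No contention: the low-priority job runs to completion.
    println!("{:?}", low_priority_job(&flag, 4, |_| {}));

    // Simulate a high-priority job arriving while chunk 1 is processed.
    let high = flag.clone();
    let outcome = low_priority_job(&flag, 4, |i| {
        if i == 1 {
            high.acquire();
        }
    });
    println!("{:?}", outcome);
    flag.release();
}
```

In this model the second call stops after two chunks. If the check before each chunk never observes the flag, the low-priority job always runs to completion, which matches the symptom described for v0.22.0.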

vmx commented 1 year ago

Thanks @Elhorses for providing the command to run. I think I can reproduce it, I'm having a look.

Elhorses commented 1 year ago

OK, thank you! I have solved the problem; you can look at https://github.com/Elhorses/bellperson/tree/v0.22.0, commit: , and now my lotus-miner is working fine.

vmx commented 1 year ago

Thanks, that'll save me a lot of time!

vmx commented 1 year ago

@Elhorses here's my version of a fix: https://github.com/filecoin-project/bellperson/pull/293. It's for the master branch, but it should be easily applicable to older bellperson versions as well. The patch I've done for ec-gpu-gen that is referenced from my PR isn't needed for correctness, it just makes sure the output doesn't contain any messages about panics.

Elhorses commented 1 year ago

OK, thanks for your help, I'll use it.

Elhorses commented 1 year ago

Hello, can we use bellperson on AMD GPUs?

vmx commented 1 year ago

The OpenCL version should run on AMD GPUs. If it doesn't, it's a bug. Please report if you run into problems.

Elhorses commented 1 year ago

OK, thanks for your help!