Open weikengchen opened 3 years ago
new numbers after the FFT change. Seems to be still similar @ValarDragon
Note: the result is one iteration, so there could be statistical fluctuation.
===== Groth16 =====
1 core
R1CS to QAP witness map ... 31.945s
Compute C ................. 55.35s
Total ..................... 92.371s
4 cores
R1CS to QAP witness map ... 12.143s
Compute C ................. 16.867s
Total ..................... 34.371s
20 cores
R1CS to QAP witness map ... 7.408s
Compute C ................. 4.489s
Total ..................... 17.283s
40 cores
R1CS to QAP witness map ... 7.754s
Compute C ................. 4.484s
Total ..................... 17.738s
48 cores
R1CS to QAP witness map ... ~8.40s~ (rerun: 7.807s)
Compute C ................. ~4.715s~ (rerun: 4.421s)
Total ..................... ~18.349s~ (rerun: 17.680s)
Oh that PR didn't actually change the FFT algorithm, just documented the existing one and identified the problem
I'll let you know once I update the algorithm
what was the exact command that you ran, @weikengchen ?
was this ever completed? Can we move this forward?
More updates will be added to this note.
I measured the cost for Groth16 to prove a constraint system with two million constraints, with a different number of cores, using
cargo bench
in this repo with appropriate command line parameters.Below I focus on the main cost in my constraint system, which turns out to be in the witness map and computation of C.
Going from 1 core to 4 cores, the improvement is significant. But later when more cores are added, the time for the witness map seems to not change a lot.
Since the witness map involves a lot of FFT, it may suggest that the current implementation of FFT has a barrier toward many-many-core parallelism.
Such a barrier, maybe avoidable, maybe unavoidable. I will take a look at the detailed breakdown of the cost of the witness map.