[VID] batch commit with GPU is unexpectedly slow

Check out this branch: https://github.com/EspressoSystems/jellyfish/tree/cl/gpu-profiling

Running `cargo test --features gpu-vid,kzg-print-trace,print-trace -p jf-primitives -- profile_gpu_commit --nocapture` gives the trace below. Performance degrades as the batch size increases. However, according to `cargo bench --bench kzg-gpu --features "test-srs icicle"`, the MSM should only cost about 28 ms: `[28.107 ms 28.438 ms 28.988 ms]`.

Start:   KZG10::Setup with prover degree 1048576 and verifier degree 1
··Start:   Generating powers of G
··End:     Generating powers of G ..................................................8.384s
End:     KZG10::Setup with prover degree 1048576 and verifier degree 1 .............8.769s
Start:   Type Conversion: ark->ICICLE: Group
End:     Type Conversion: ark->ICICLE: Group .......................................9.590ms
Start:   Load group elements: CPU->GPU
End:     Load group elements: CPU->GPU .............................................7.521ms
Start:   Batch commit 1048576 total elements, batch size 1
··Start:   Type Conversion: ark->ICICLE: Scalar
··End:     Type Conversion: ark->ICICLE: Scalar ....................................23.156ms
··Start:   Load scalars: CPU->GPU
··End:     Load scalars: CPU->GPU ..................................................2.502ms
··Start:   GPU-accelerated MSM
··End:     GPU-accelerated MSM .....................................................22.853ms
··Start:   Sync MSM result
··End:     Sync MSM result .........................................................11.730ms
··Start:   Load MSM result GPU->CPU
··End:     Load MSM result GPU->CPU ................................................52.750µs
··Start:   Type Conversion: ICICLE->ark: Group
··End:     Type Conversion: ICICLE->ark: Group .....................................182.968µs
End:     Batch commit 1048576 total elements, batch size 1 .........................61.846ms
Start:   Batch commit 1048576 total elements, batch size 8
··Start:   Type Conversion: ark->ICICLE: Scalar
··End:     Type Conversion: ark->ICICLE: Scalar ....................................24.932ms
··Start:   Load scalars: CPU->GPU
··End:     Load scalars: CPU->GPU ..................................................2.627ms
··Start:   GPU-accelerated MSM
··End:     GPU-accelerated MSM .....................................................22.863ms
··Start:   Sync MSM result
··End:     Sync MSM result .........................................................27.982ms
··Start:   Load MSM result GPU->CPU
··End:     Load MSM result GPU->CPU ................................................49.570µs
··Start:   Type Conversion: ICICLE->ark: Group
··End:     Type Conversion: ICICLE->ark: Group .....................................682.948µs
End:     Batch commit 1048576 total elements, batch size 8 .........................80.681ms
Start:   Batch commit 1048576 total elements, batch size 16
··Start:   Type Conversion: ark->ICICLE: Scalar
··End:     Type Conversion: ark->ICICLE: Scalar ....................................22.681ms
··Start:   Load scalars: CPU->GPU
··End:     Load scalars: CPU->GPU ..................................................5.194ms
··Start:   GPU-accelerated MSM
··End:     GPU-accelerated MSM .....................................................98.494ms
··Start:   Sync MSM result
··End:     Sync MSM result .........................................................49.478ms
··Start:   Load MSM result GPU->CPU
··End:     Load MSM result GPU->CPU ................................................109.749µs
··Start:   Type Conversion: ICICLE->ark: Group
··End:     Type Conversion: ICICLE->ark: Group .....................................865.120µs
End:     Batch commit 1048576 total elements, batch size 16 ........................178.481ms
Start:   Batch commit 1048576 total elements, batch size 256
··Start:   Type Conversion: ark->ICICLE: Scalar
··End:     Type Conversion: ark->ICICLE: Scalar ....................................23.140ms
··Start:   Load scalars: CPU->GPU
··End:     Load scalars: CPU->GPU ..................................................10.269ms
··Start:   GPU-accelerated MSM
··End:     GPU-accelerated MSM .....................................................180.192ms
··Start:   Sync MSM result
··End:     Sync MSM result .........................................................61.028ms
··Start:   Load MSM result GPU->CPU
··End:     Load MSM result GPU->CPU ................................................260.128µs
··Start:   Type Conversion: ICICLE->ark: Group
··End:     Type Conversion: ICICLE->ark: Group .....................................3.137ms
End:     Batch commit 1048576 total elements, batch size 256 .......................279.902ms
Start:   Batch commit 1048576 total elements, batch size 1024
··Start:   Type Conversion: ark->ICICLE: Scalar
··End:     Type Conversion: ark->ICICLE: Scalar ....................................24.463ms
··Start:   Load scalars: CPU->GPU
··End:     Load scalars: CPU->GPU ..................................................2.960ms
··Start:   GPU-accelerated MSM
··End:     GPU-accelerated MSM .....................................................60.377ms
··Start:   Sync MSM result
··End:     Sync MSM result .........................................................64.456ms
··Start:   Load MSM result GPU->CPU
··End:     Load MSM result GPU->CPU ................................................159.259µs
··Start:   Type Conversion: ICICLE->ark: Group
··End:     Type Conversion: ICICLE->ark: Group .....................................12.733ms
End:     Batch commit 1048576 total elements, batch size 1024 ......................167.020ms
Start:   Batch commit 1048576 total elements, batch size 4096
··Start:   Type Conversion: ark->ICICLE: Scalar
··End:     Type Conversion: ark->ICICLE: Scalar ....................................23.588ms
··Start:   Load scalars: CPU->GPU
··End:     Load scalars: CPU->GPU ..................................................5.343ms
··Start:   GPU-accelerated MSM
··End:     GPU-accelerated MSM .....................................................198.382ms
··Start:   Sync MSM result
··End:     Sync MSM result .........................................................39.863ms
··Start:   Load MSM result GPU->CPU
··End:     Load MSM result GPU->CPU ................................................200.608µs
··Start:   Type Conversion: ICICLE->ark: Group
··End:     Type Conversion: ICICLE->ark: Group .....................................41.532ms
End:     Batch commit 1048576 total elements, batch size 4096 ......................311.326ms
test pcs::univariate_kzg::tests::icicle::profile_gpu_commit ... ok
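
To make the trace easier to compare against the benchmark figure, here is a small standalone sketch that tabulates the numbers above. It is not part of jellyfish or ICICLE: the `Run` struct, its helper methods, and the choice to treat "GPU-accelerated MSM" plus "Sync MSM result" as the MSM cost are assumptions made for illustration. All timings are copied by hand from the trace, and the baseline is the middle value of the `cargo bench` estimate.

```rust
/// Per-stage timings in milliseconds, transcribed by hand from the trace.
/// This struct is hypothetical; it does not exist in jellyfish.
struct Run {
    batch_size: u32,
    scalar_conv: f64,  // Type Conversion: ark->ICICLE: Scalar
    load_scalars: f64, // Load scalars: CPU->GPU
    msm: f64,          // GPU-accelerated MSM
    sync: f64,         // Sync MSM result
    load_result: f64,  // Load MSM result GPU->CPU
    group_conv: f64,   // Type Conversion: ICICLE->ark: Group
    total: f64,        // enclosing "Batch commit" span
}

impl Run {
    /// Assumption: the MSM is only finished once the sync returns,
    /// so both spans are counted as MSM cost.
    fn gpu_ms(&self) -> f64 {
        self.msm + self.sync
    }

    /// Sum of all nested spans; the remainder of `total` is untraced time.
    fn accounted_ms(&self) -> f64 {
        self.scalar_conv + self.load_scalars + self.msm + self.sync
            + self.load_result + self.group_conv
    }
}

/// Middle value of the `cargo bench --bench kzg-gpu` estimate, in ms.
const BENCH_MSM_MS: f64 = 28.438;

fn run(batch_size: u32, t: [f64; 7]) -> Run {
    Run {
        batch_size,
        scalar_conv: t[0],
        load_scalars: t[1],
        msm: t[2],
        sync: t[3],
        load_result: t[4],
        group_conv: t[5],
        total: t[6],
    }
}

fn runs() -> Vec<Run> {
    vec![
        run(1, [23.156, 2.502, 22.853, 11.730, 0.052750, 0.182968, 61.846]),
        run(8, [24.932, 2.627, 22.863, 27.982, 0.049570, 0.682948, 80.681]),
        run(16, [22.681, 5.194, 98.494, 49.478, 0.109749, 0.865120, 178.481]),
        run(256, [23.140, 10.269, 180.192, 61.028, 0.260128, 3.137, 279.902]),
        run(1024, [24.463, 2.960, 60.377, 64.456, 0.159259, 12.733, 167.020]),
        run(4096, [23.588, 5.343, 198.382, 39.863, 0.200608, 41.532, 311.326]),
    ]
}

fn main() {
    println!("{:>6} {:>12} {:>10} {:>14} {:>10}",
             "batch", "msm+sync ms", "vs bench", "group conv ms", "total ms");
    for r in runs() {
        println!("{:>6} {:>12.3} {:>9.2}x {:>14.3} {:>10.3}",
                 r.batch_size,
                 r.gpu_ms(),
                 r.gpu_ms() / BENCH_MSM_MS,
                 r.group_conv,
                 r.total);
    }
}
```

Counted this way, the scalar conversion stays roughly flat at 23 to 25 ms for every batch size, while the MSM and sync spans account for most of the growth. The ICICLE->ark group conversion also grows with the batch size, from about 0.18 ms at batch size 1 to about 41.5 ms at batch size 4096.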
EspressoSystems/jellyfish issue #526