Open mrain opened 7 months ago
The other weird observation is that if you do not warm up a new CUDA stream per run, the performance is also poor, even with 1 batch. cc @alxiong
Also, with a single batch, the type conversion takes one-third of the time. We could eliminate that (#516).
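For anyone trying to reproduce the warm-up effect, here is a minimal sketch of keeping warm-up calls out of the measured timings. It assumes a generic closure standing in for the actual GPU commit; `time_with_warmup` and the dummy workload are hypothetical and not part of jellyfish or icicle.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: calls `f` `warmup` times without timing it,
/// then calls it `runs` more times and returns one duration per timed call.
fn time_with_warmup<F: FnMut()>(mut f: F, warmup: usize, runs: usize) -> Vec<Duration> {
    // Untimed calls, so one-off setup cost does not land in the measurements.
    for _ in 0..warmup {
        f();
    }
    // Timed calls.
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect()
}

fn main() {
    // Stand-in workload (assumption): replace with the real commit call.
    let workload = || {
        let sum: u64 = (0..100_000u64).sum();
        std::hint::black_box(sum);
    };

    let timings = time_with_warmup(workload, 1, 3);
    for (i, t) in timings.iter().enumerate() {
        println!("run {i}: {t:?}");
    }
}
```

Comparing `warmup = 0` against `warmup = 1` on the same workload should show whether the first call is the one paying the extra cost.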
Check this branch: https://github.com/EspressoSystems/jellyfish/tree/cl/gpu-profiling
Running with
cargo test --features gpu-vid,kzg-print-trace,print-trace -p jf-primitives -- profile_gpu_commit --nocapture
gives you the following result. You can see the performance degrading with increased batch size. However, according to
cargo bench --bench kzg-gpu --features "test-srs icicle"
MSM should only cost you [28.107 ms 28.438 ms 28.988 ms].