fma_poly_scale_slice_hexl performance times does not increase linearly with mod_size

Janmajayamall commented 1 year ago

fma_poly_scale_slice_hexl performance times does not increase linearly with mod_size (ie moduli count in poly). Instead as mod_size increases time blows up.

Poly stores its coefficients in row major form and all fma_poly_scale_slice_hexl does is that it calls hexl_rs::elwise_fma_mod for each row in poly. This means fma_poly_scale_slice_hexl calls elwise_fma_mod mod_size times (row count equals mod_size, since there is a single row for each moduli). Hence, time taken when mod_size = 15 must be 15x of time taken when mod_size = 1. But this isn't the case.

For example on r6i.8xlarge instance, following are benchmarks for fma_poly_scale_slice_hexl:

range_fn/fma_poly_scale_slice_hexl/n=32768/mod_size=1
                        time:   [8.5271 µs 8.5275 µs 8.5278 µs]

range_fn/fma_poly_scale_slice_hexl/n=32768/mod_size=3
                        time:   [36.679 µs 36.695 µs 36.709 µs]

range_fn/fma_poly_scale_slice_hexl/n=32768/mod_size=7
                        time:   [107.51 µs 107.51 µs 107.52 µs]

range_fn/fma_poly_scale_slice_hexl/n=32768/mod_size=15
                        time:   [254.32 µs 254.35 µs 254.37 µs]

Time taken clearly does not increase linearly with mod_size.

Janmajayamall commented 1 year ago

Benchmarks of elwise_fma_mod_2d (a function that mimics fma_poly_scale_slice_hexl) in hexl-rs display same behaviour. https://github.com/Janmajayamall/hexl-rs/issues/1 tracks this.

Janmajayamall commented 1 year ago

Note that this issue renders benchamarks of optimised_range_fn_fma_hexl useless since all optimised_range_fn_fma_hexl does is to call fma_poly_scale_slice_hexl 127*2 times and mul_poly_scalar_slice_hexl 2 times.

Janmajayamall / ObliviousMessageRetrieval

fma_poly_scale_slice_hexl performance times does not increase linearly with mod_size #2