flintlib / flint

FLINT (Fast Library for Number Theory)
http://www.flintlib.org
GNU Lesser General Public License v3.0

Improve nfloat squaring code #2009

Closed: fredrik-johansson closed this 4 weeks ago

fredrik-johansson commented 4 weeks ago

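Performance of vectorized squaring is currently as follows. Vector addition and multiplication are included for comparison. Each entry gives the time in seconds for one call on a vector of length n, followed in parentheses by the speedup of nfloat over arf (p-vs_arf) or over acf (p-vs_acf):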

$ build/nfloat/profile/p-vs_arf 

                   _gr_vec_add          _gr_vec_mul       _gr_vec_sqr
prec = 64
n =   10        3.030e-08 (4.290x)   1.750e-08 (7.086x)   1.130e-08 (10.885x)
n =  100        2.680e-07 (4.627x)   1.410e-07 (8.723x)   8.320e-08 (15.024x)
prec = 128
n =   10        4.120e-08 (4.806x)   2.530e-08 (5.375x)   1.920e-08 (7.031x)  
n =  100        3.860e-07 (5.207x)   2.310e-07 (5.931x)   1.620e-07 (8.457x)  
prec = 192
n =   10        4.970e-08 (4.950x)   5.380e-08 (2.937x)   4.420e-08 (3.371x)  
n =  100        4.790e-07 (5.052x)   5.190e-07 (3.083x)   4.210e-07 (3.610x)  
prec = 256
n =   10        6.170e-08 (3.987x)   7.030e-08 (2.518x)   5.490e-08 (2.987x)  
n =  100        5.630e-07 (4.405x)   6.900e-07 (2.623x)   5.330e-07 (3.189x)  
prec = 512
n =   10        1.260e-07 (2.087x)   1.580e-07 (1.981x)   1.000e-07 (3.050x)  
n =  100        1.220e-06 (2.164x)   1.570e-06 (2.051x)   9.840e-07 (3.272x)  
prec = 1024
n =   10        1.710e-07 (1.819x)   7.230e-07 (1.285x)   4.490e-07 (1.584x)  
n =  100        1.660e-06 (1.855x)   7.460e-06 (1.323x)   4.910e-06 (1.670x)  
prec = 2048
n =   10        2.260e-07 (1.739x)   2.320e-06 (1.263x)   1.340e-06 (1.619x)  
n =  100        2.320e-06 (1.681x)   2.370e-05 (1.342x)   1.350e-05 (1.800x)  
prec = 4096
n =   10        3.660e-07 (1.533x)   7.650e-06 (1.199x)   4.480e-06 (1.478x)  
n =  100        3.900e-06 (1.503x)   7.960e-05 (1.256x)   4.500e-05 (1.633x)  
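One standard reason squaring can be cheaper than a general multiplication, as in the `_gr_vec_sqr` versus `_gr_vec_mul` columns, is symmetry. In a schoolbook product a·a, each cross term a[i]·a[j] with i ≠ j appears twice, so it can be computed once and doubled. The Python sketch below illustrates this general technique on integers split into 64-bit limbs and counts limb products: n^2 for multiplication versus n(n+1)/2 for squaring. It is a hypothetical illustration with made-up helper names, not the nfloat C code, which may proceed differently.

```python
# Hypothetical illustration (not FLINT's nfloat code): schoolbook multiplication
# vs. squaring of integers stored as little-endian lists of 64-bit limbs.

LIMB_BITS = 64
MASK = (1 << LIMB_BITS) - 1


def to_limbs(x, n):
    """Split a nonnegative integer x < 2**(64*n) into n little-endian limbs."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n)]


def from_limbs(cols):
    """Rebuild an integer from column sums; Python ints absorb the carries."""
    return sum(c << (LIMB_BITS * i) for i, c in enumerate(cols))


def mul_schoolbook(a, b):
    """Return (column sums of a*b, number of limb products): n*n products."""
    n = len(a)
    cols = [0] * (2 * n)
    count = 0
    for i in range(n):
        for j in range(n):
            cols[i + j] += a[i] * b[j]
            count += 1
    return cols, count


def sqr_schoolbook(a):
    """Return (column sums of a*a, number of limb products): n*(n+1)/2 products.

    Each cross term a[i]*a[j] with i < j occurs twice in a*a, so it is computed
    once and doubled (typically a shift in real code); the n diagonal terms
    a[i]**2 are added separately.
    """
    n = len(a)
    cols = [0] * (2 * n)
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            cols[i + j] += 2 * (a[i] * a[j])
            count += 1
        cols[2 * i] += a[i] * a[i]
        count += 1
    return cols, count


x = 3**200  # about 317 bits, so it fits in 5 limbs
a = to_limbs(x, 5)
prod, n_mul = mul_schoolbook(a, a)
sq, n_sqr = sqr_schoolbook(a)
assert from_limbs(prod) == from_limbs(sq) == x * x
print(n_mul, n_sqr)  # 25 15
```

The ratio of limb products, 2n/(n+1), tends to 2 as n grows, so this trick alone can save at most half of the multiplication work.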

$ build/nfloat/profile/p-vs_acf 
                   _gr_vec_add          _gr_vec_mul       _gr_vec_sqr
prec = 64
n =   10        5.900e-08 (4.085x)   1.380e-07 (3.580x)   9.040e-08 (4.347x)  
n =  100        5.540e-07 (4.458x)   1.390e-06 (3.561x)   9.010e-07 (4.306x)  
prec = 128
n =   10        8.240e-08 (5.024x)   2.100e-07 (3.324x)   1.430e-07 (3.294x)  
n =  100        7.770e-07 (5.225x)   2.010e-06 (3.463x)   1.420e-06 (3.232x)  
prec = 192
n =   10        1.030e-07 (4.893x)   3.720e-07 (2.129x)   2.180e-07 (2.486x)  
n =  100        9.750e-07 (5.026x)   3.680e-06 (2.149x)   2.170e-06 (2.502x)  
prec = 256
n =   10        1.220e-07 (4.082x)   4.390e-07 (2.009x)   2.600e-07 (2.338x)  
n =  100        1.160e-06 (4.310x)   4.330e-06 (2.007x)   2.570e-06 (2.354x)  
prec = 512
n =   10        2.650e-07 (2.038x)   8.490e-07 (1.802x)   5.120e-07 (2.129x)  
n =  100        2.660e-06 (2.056x)   8.530e-06 (1.758x)   5.100e-06 (2.078x)  
prec = 1024
n =   10        3.450e-07 (1.835x)   2.850e-06 (1.544x)   1.760e-06 (1.580x)  
n =  100        3.460e-06 (1.855x)   2.920e-05 (1.462x)   1.810e-05 (1.530x)  
prec = 2048
n =   10        4.570e-07 (1.670x)   8.150e-06 (1.914x)   4.660e-06 (1.820x)  
n =  100        4.720e-06 (1.676x)   8.220e-05 (1.788x)   4.670e-05 (1.754x)  
prec = 4096
n =   10        7.550e-07 (1.589x)   2.690e-05 (1.457x)   1.500e-05 (1.707x)  
n =  100        8.410e-06 (1.463x)   2.730e-04 (1.407x)   1.500e-04 (1.660x)  
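For complex numbers (the acf comparison), squaring also saves real multiplications: (a + bi)^2 = (a + b)(a - b) + 2ab·i needs two real multiplications, whereas a general product (a + bi)(c + di) needs four done naively, or three with Gauss's trick. The Python sketch below counts them using exact integers. It illustrates the identities only and is not claimed to match what the nfloat complex code does; the class and function names are hypothetical. In floating-point arithmetic these rearranged formulas round differently from the naive ones, so a real implementation has to account for that.

```python
# Hypothetical illustration (not the nfloat/acf code): real multiplications
# needed for a complex product versus a complex square, using exact integers.


class MulCounter:
    """Performs real multiplications and counts them."""

    def __init__(self):
        self.count = 0

    def mul(self, x, y):
        self.count += 1
        return x * y


def cmul(ctr, a, b, c, d):
    """(a + bi)(c + di) with 3 real multiplications (Gauss's trick)."""
    ac = ctr.mul(a, c)
    bd = ctr.mul(b, d)
    s = ctr.mul(a + b, c + d)  # = ac + ad + bc + bd
    return ac - bd, s - ac - bd


def csqr(ctr, a, b):
    """(a + bi)^2 = (a + b)(a - b) + 2ab i with 2 real multiplications."""
    re = ctr.mul(a + b, a - b)  # = a^2 - b^2
    im = 2 * ctr.mul(a, b)      # doubling is not counted as a multiplication
    return re, im


ctr = MulCounter()
print(cmul(ctr, 3, 4, 3, 4), ctr.count)  # (-7, 24) 3
ctr = MulCounter()
print(csqr(ctr, 3, 4), ctr.count)        # (-7, 24) 2
```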
albinahlback commented 4 weeks ago

Nice performance!

Off-topic, but the CI appears to be working now.

fredrik-johansson commented 4 weeks ago

Indeed.