dalek-cryptography / curve25519-dalek

A pure-Rust implementation of group operations on Ristretto and Curve25519

Runtime backend autodetection #523

Closed koute closed 1 year ago

koute commented 1 year ago

This PR makes the following changes:

Since the code in src/backend/vector/scalar_mul had to be nested in an extra mod the diff is a little messy; I suggest reviewing commit-by-commit and just ignoring the commit which rustfmts those files.

tarcieri commented 1 year ago

FWIW, in lieu of unsafe_target_feature in @RustCrypto project we've used #[inline(always)] in the context of things like trait impls where the code is one or two lines long, though that can slow compile times.

In this PR though, it seems like you're applying it to larger code blocks.

koute commented 1 year ago

> FWIW, in lieu of unsafe_target_feature in @RustCrypto project we've used #[inline(always)] in the context of things like trait impls where the code is one or two lines long, though that can slow compile times.
>
> In this PR though, it seems like you're applying it to larger code blocks.

The #[inline(always)] is only used for the thin wrapper functions though, e.g. this:

#[unsafe_target_feature("sse2")]
fn function() {
    /* ... */
}

gets turned into this:

#[inline(always)]
fn function() {
    #[target_feature(enable = "sse2")]
    unsafe fn _impl_function() {
        /* ... */
    }
    unsafe { _impl_function() }
}

So the inner function which actually contains the body of the function is not marked with #[inline(always)].
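For readers unfamiliar with the pattern, here is a minimal sketch of how a safe wrapper around a `#[target_feature]` function might be combined with a runtime CPU check. The names `sum`, `sum_avx2`, and `sum_scalar` are hypothetical and are not taken from this PR or from curve25519-dalek; only the `#[target_feature]` attribute and the standard library's `is_x86_feature_detected!` macro are real.

```rust
// Hypothetical illustration only: these functions are not part of curve25519-dalek.

/// AVX2-enabled body. The compiler may emit AVX2 instructions anywhere in here,
/// so calling it is only sound on CPUs that support AVX2.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(values: &[u32]) -> u32 {
    values.iter().fold(0u32, |acc, v| acc.wrapping_add(*v))
}

/// Portable fallback that is compiled without any extra target features.
fn sum_scalar(values: &[u32]) -> u32 {
    values.iter().fold(0u32, |acc, v| acc.wrapping_add(*v))
}

/// Safe entry point: picks a backend at runtime.
pub fn sum(values: &[u32]) -> u32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime on the line above.
            return unsafe { sum_avx2(values) };
        }
    }
    sum_scalar(values)
}
```

Calling `sum(&[1, 2, 3])` returns the same result whichever backend is selected; only the instructions used to compute it differ.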

koute commented 1 year ago

I've also run the benchmarks for this PR; here are the results for anyone interested.

The benches were run on a Xeon Platinum 8481C 2.70GHz on Google Cloud.
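As a reading aid, the `%` column in the tables in this comment appears consistent with a plain relative change between the two measured columns, so negative values mean the second column is faster. A minimal sketch, assuming that formula (the helper name `percent_change` is hypothetical, and rows may differ in the last digit because the displayed timings are rounded):

```rust
/// Relative change of `candidate` with respect to `baseline`, in percent.
/// Assumed formula; negative means `candidate` took less time than `baseline`.
fn percent_change(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}
```

For example, `percent_change(5896.6, 3886.29)` is roughly -34.09, which matches the scalar vs AVX2 row for the constant-time variable-base multiscalar multiplication of size 384.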

Scalar vs AVX2

| testcase | scalar | avx2 | % |
|---|---|---|---|
| multiscalar benches/Constant-time variable-base multiscalar multiplication/384 | 5896.6 | 3886.29 | -34.09% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/768 | 11792.0 | 7778.8 | -34.03% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/512 | 7843.0 | 5194.8 | -33.77% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/16 | 271.05 | 179.74 | -33.69% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/1024 | 15724.0 | 10432.0 | -33.66% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/64 | 1005.10 | 669.14 | -33.43% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/128 | 1973.4 | 1315.19 | -33.35% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/256 | 3902.8 | 2601.79 | -33.34% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/32 | 512.99 | 342.09 | -33.31% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/8 | 151.03 | 102.31 | -32.26% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/4 | 91.36 | 63.62 | -30.36% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/2 | 61.51 | 44.24 | -28.07% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/1024 | 6677.2 | 4906.0 | -26.53% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/512 | 3775.3 | 2794.1 | -25.99% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/384 | 3034.5 | 2246.9 | -25.95% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/1 | 46.62 | 34.53 | -25.92% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/256 | 2184.5 | 1621.3 | -25.78% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/768 | 5258.20 | 3903.9 | -25.76% |
| multiscalar benches/Variable-time mixed-base/(size: 8) (50pct dyn) | 99.18 | 77.43 | -21.93% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/64 | 616.72 | 481.67 | -21.90% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/128 | 1201.6 | 938.67 | -21.88% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/16 | 177.74 | 139.12 | -21.73% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/32 | 324.31 | 253.95 | -21.70% |
| multiscalar benches/Variable-time mixed-base/(size: 4) (50pct dyn) | 65.02 | 50.92 | -21.68% |
| multiscalar benches/Variable-time mixed-base/(size: 16) (50pct dyn) | 166.7 | 130.68 | -21.61% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/8 | 104.01 | 81.73 | -21.42% |
| multiscalar benches/Variable-time mixed-base/(size: 256) (50pct dyn) | 2172.5 | 1712.1 | -21.19% |
| multiscalar benches/Variable-time mixed-base/(size: 64) (50pct dyn) | 566.45 | 446.53 | -21.17% |
| multiscalar benches/Variable-time mixed-base/(size: 32) (50pct dyn) | 301.09 | 237.38 | -21.16% |
| multiscalar benches/Variable-time mixed-base/(size: 384) (50pct dyn) | 3249.5 | 2562.60 | -21.14% |
| multiscalar benches/Variable-time mixed-base/(size: 512) (50pct dyn) | 4320.2 | 3409.60 | -21.08% |
| multiscalar benches/Variable-time mixed-base/(size: 128) (50pct dyn) | 1097.6 | 866.8 | -21.03% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/4 | 67.03 | 52.98 | -20.96% |
| multiscalar benches/Variable-time mixed-base/(size: 2) (50pct dyn) | 47.33 | 37.52 | -20.73% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/2 | 48.53 | 38.59 | -20.49% |
| multiscalar benches/Variable-time mixed-base/(size: 8) (20pct dyn) | 92.14 | 73.28 | -20.47% |
| edwards benches/Constant-time variable-base scalar mul | 44.45 | 35.43 | -20.27% |
| multiscalar benches/Variable-time mixed-base/(size: 16) (20pct dyn) | 155.31 | 124.24 | -20.01% |
| multiscalar benches/Variable-time mixed-base/(size: 4) (20pct dyn) | 59.81 | 47.89 | -19.92% |
| multiscalar benches/Variable-time mixed-base/(size: 4) (0pct dyn) | 59.79 | 47.89 | -19.89% |
| multiscalar benches/Variable-time mixed-base/(size: 8) (0pct dyn) | 89.27 | 71.62 | -19.77% |
| multiscalar benches/Variable-time mixed-base/(size: 2) (20pct dyn) | 44.73 | 35.99 | -19.54% |
| multiscalar benches/Variable-time mixed-base/(size: 2) (0pct dyn) | 44.72 | 35.98 | -19.52% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/1 | 39.15 | 31.54 | -19.44% |
| multiscalar benches/Variable-time mixed-base/(size: 512) (20pct dyn) | 3986.4 | 3218.6 | -19.26% |
| multiscalar benches/Variable-time mixed-base/(size: 256) (20pct dyn) | 2007.3 | 1620.9 | -19.25% |
| multiscalar benches/Variable-time mixed-base/(size: 384) (20pct dyn) | 2995.6 | 2419.79 | -19.22% |
| multiscalar benches/Variable-time mixed-base/(size: 64) (20pct dyn) | 521.72 | 421.92 | -19.13% |
| multiscalar benches/Variable-time mixed-base/(size: 128) (20pct dyn) | 1015.09 | 821.04 | -19.12% |
| multiscalar benches/Variable-time mixed-base/(size: 32) (20pct dyn) | 277.22 | 224.61 | -18.98% |
| multiscalar benches/Variable-time mixed-base/(size: 768) (50pct dyn) | 6329.90 | 5141.4 | -18.78% |
| multiscalar benches/Variable-time mixed-base/(size: 16) (0pct dyn) | 146.78 | 119.67 | -18.47% |
| multiscalar benches/Variable-time mixed-base/(size: 1) (20pct dyn) | 37.08 | 30.24 | -18.45% |
| multiscalar benches/Variable-time mixed-base/(size: 1) (50pct dyn) | 37.08 | 30.24 | -18.44% |
| multiscalar benches/Variable-time mixed-base/(size: 1) (0pct dyn) | 37.07 | 30.24 | -18.42% |
| multiscalar benches/Variable-time mixed-base/(size: 1024) (50pct dyn) | 8438.0 | 6894.3 | -18.29% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/2 | 43.61 | 35.93 | -17.60% |
| multiscalar benches/Variable-time mixed-base/(size: 384) (0pct dyn) | 2808.39 | 2320.3 | -17.38% |
| multiscalar benches/Variable-time mixed-base/(size: 256) (0pct dyn) | 1879.1 | 1553.39 | -17.33% |
| multiscalar benches/Variable-time mixed-base/(size: 512) (0pct dyn) | 3736.60 | 3090.6 | -17.29% |
| multiscalar benches/Variable-time mixed-base/(size: 32) (0pct dyn) | 260.52 | 215.51 | -17.28% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/4 | 57.79 | 47.82 | -17.25% |
| multiscalar benches/Variable-time mixed-base/(size: 128) (0pct dyn) | 949.81 | 786.7 | -17.17% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/1 | 36.41 | 30.16 | -17.16% |
| multiscalar benches/Variable-time mixed-base/(size: 64) (0pct dyn) | 488.12 | 404.41 | -17.15% |
| multiscalar benches/Variable-time mixed-base/(size: 768) (20pct dyn) | 5834.1 | 4864.0 | -16.63% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/8 | 85.91 | 71.62 | -16.63% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/16 | 141.72 | 119.42 | -15.74% |
| multiscalar benches/Variable-time mixed-base/(size: 1024) (20pct dyn) | 7773.79 | 6565.1 | -15.55% |
| edwards benches/Variable-time aA+bB A variable B fixed | 40.69 | 34.62 | -14.91% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/32 | 252.32 | 215.53 | -14.58% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/64 | 473.06 | 405.01 | -14.39% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/256 | 1819.19 | 1558.6 | -14.32% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/384 | 2715.5 | 2328.6 | -14.25% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/512 | 3614.5 | 3099.8 | -14.24% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/128 | 918.51 | 788.72 | -14.13% |
| multiscalar benches/Variable-time mixed-base/(size: 768) (0pct dyn) | 5413.2 | 4667.6 | -13.77% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/768 | 5424.29 | 4683.5 | -13.66% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/1024 | 7260.2 | 6339.5 | -12.68% |
| multiscalar benches/Variable-time mixed-base/(size: 1024) (0pct dyn) | 7231.8 | 6320.5 | -12.60% |
| edwards benches/Constant-time fixed-base scalar mul | 11.56 | 12.72 | 9.99% |
| montgomery benches/Montgomery pseudomultiplication | 50.93 | 55.64 | 9.25% |
| montgomery benches/Constant-time fixed-base scalar mul | 15.57 | 16.56 | 6.36% |
| scalar benches/Scalar addition | 23511000.0 | 22964000.0 | -2.33% |
| scalar benches/Batch scalar inversion/1 | 13.10 | 13.21 | 0.89% |
| scalar benches/Scalar inversion | 12.81 | 12.92 | 0.88% |
| scalar benches/Batch scalar inversion/4 | 13.65 | 13.76 | 0.85% |
| scalar benches/Batch scalar inversion/8 | 14.36 | 14.48 | 0.81% |
| scalar benches/Batch scalar inversion/16 | 15.80 | 15.92 | 0.75% |
| scalar benches/Batch scalar inversion/2 | 13.28 | 13.38 | 0.74% |
| ristretto benches/Batch Ristretto double-and-encode/16 | 13.30 | 13.21 | -0.68% |
| ristretto benches/Batch Ristretto double-and-encode/8 | 8.70 | 8.66 | -0.41% |
| scalar benches/Scalar subtraction | 21408000.0 | 21485000.0 | 0.36% |
| scalar benches/Scalar multiplication | 91928000.0 | 91662000.0 | -0.29% |
| ristretto benches/Batch Ristretto double-and-encode/4 | 6.35 | 6.34 | -0.23% |
| edwards benches/EdwardsPoint compression | 3.95 | 3.95 | 0.12% |
| ristretto benches/RistrettoPoint decompression | 4.61 | 4.61 | 0.07% |
| ristretto benches/Batch Ristretto double-and-encode/1 | 4.58 | 4.58 | -0.06% |
| edwards benches/EdwardsPoint decompression | 4.31 | 4.32 | 0.04% |
| ristretto benches/Batch Ristretto double-and-encode/2 | 5.18 | 5.18 | -0.01% |
| ristretto benches/RistrettoPoint compression | 4.57 | 4.57 | 0.00% |

Scalar vs AVX512

| testcase | scalar | avx512 | % |
|---|---|---|---|
| multiscalar benches/Constant-time variable-base multiscalar multiplication/768 | 11792.0 | 5123.8 | -56.55% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/384 | 5896.6 | 2567.0 | -56.47% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/512 | 7843.0 | 3427.9 | -56.29% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/1024 | 15724.0 | 6908.5 | -56.06% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/256 | 3902.8 | 1718.6 | -55.96% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/128 | 1973.4 | 872.49 | -55.79% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/64 | 1005.10 | 449.67 | -55.26% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/32 | 512.99 | 236.33 | -53.93% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/16 | 271.05 | 128.61 | -52.55% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/768 | 5258.20 | 2529.79 | -51.89% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/384 | 3034.5 | 1475.7 | -51.37% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/1024 | 6677.2 | 3251.79 | -51.30% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/512 | 3775.3 | 1854.60 | -50.88% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/8 | 151.03 | 75.82 | -49.79% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/256 | 2184.5 | 1104.0 | -49.46% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/4 | 91.36 | 49.58 | -45.72% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/2 | 61.51 | 35.94 | -41.56% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/1 | 46.62 | 29.46 | -36.80% |
| edwards benches/Constant-time variable-base scalar mul | 44.45 | 29.97 | -32.56% |
| multiscalar benches/Variable-time mixed-base/(size: 512) (50pct dyn) | 4320.2 | 2954.0 | -31.62% |
| multiscalar benches/Variable-time mixed-base/(size: 384) (50pct dyn) | 3249.5 | 2223.6 | -31.57% |
| multiscalar benches/Variable-time mixed-base/(size: 256) (50pct dyn) | 2172.5 | 1490.89 | -31.37% |
| multiscalar benches/Variable-time mixed-base/(size: 128) (50pct dyn) | 1097.6 | 763.06 | -30.48% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/128 | 1201.6 | 835.7 | -30.45% |
| multiscalar benches/Variable-time mixed-base/(size: 512) (20pct dyn) | 3986.4 | 2785.29 | -30.13% |
| multiscalar benches/Variable-time mixed-base/(size: 384) (20pct dyn) | 2995.6 | 2095.1 | -30.06% |
| multiscalar benches/Variable-time mixed-base/(size: 64) (50pct dyn) | 566.45 | 396.4 | -30.02% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/64 | 616.72 | 431.84 | -29.98% |
| multiscalar benches/Variable-time mixed-base/(size: 768) (50pct dyn) | 6329.90 | 4435.2 | -29.93% |
| multiscalar benches/Variable-time mixed-base/(size: 256) (20pct dyn) | 2007.3 | 1406.9 | -29.91% |
| multiscalar benches/Variable-time mixed-base/(size: 1024) (50pct dyn) | 8438.0 | 5921.59 | -29.82% |
| multiscalar benches/Variable-time mixed-base/(size: 32) (50pct dyn) | 301.09 | 212.54 | -29.41% |
| multiscalar benches/Variable-time mixed-base/(size: 128) (20pct dyn) | 1015.09 | 719.19 | -29.15% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/32 | 324.31 | 230.12 | -29.04% |
| multiscalar benches/Variable-time mixed-base/(size: 512) (0pct dyn) | 3736.60 | 2666.6 | -28.64% |
| multiscalar benches/Variable-time mixed-base/(size: 384) (0pct dyn) | 2808.39 | 2007.1 | -28.53% |
| multiscalar benches/Variable-time mixed-base/(size: 16) (50pct dyn) | 166.7 | 119.31 | -28.43% |
| multiscalar benches/Variable-time mixed-base/(size: 768) (20pct dyn) | 5834.1 | 4180.5 | -28.34% |
| multiscalar benches/Variable-time mixed-base/(size: 256) (0pct dyn) | 1879.1 | 1347.39 | -28.30% |
| multiscalar benches/Variable-time mixed-base/(size: 64) (20pct dyn) | 521.72 | 374.48 | -28.22% |
| multiscalar benches/Variable-time mixed-base/(size: 1024) (20pct dyn) | 7773.79 | 5599.5 | -27.97% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/16 | 177.74 | 129.05 | -27.39% |
| multiscalar benches/Variable-time mixed-base/(size: 128) (0pct dyn) | 949.81 | 689.65 | -27.39% |
| multiscalar benches/Variable-time mixed-base/(size: 32) (20pct dyn) | 277.22 | 201.88 | -27.18% |
| multiscalar benches/Variable-time mixed-base/(size: 8) (50pct dyn) | 99.18 | 72.67 | -26.73% |
| multiscalar benches/Variable-time mixed-base/(size: 768) (0pct dyn) | 5413.2 | 3998.3 | -26.14% |
| multiscalar benches/Variable-time mixed-base/(size: 16) (20pct dyn) | 155.31 | 114.76 | -26.11% |
| multiscalar benches/Variable-time mixed-base/(size: 64) (0pct dyn) | 488.12 | 360.78 | -26.09% |
| multiscalar benches/Variable-time mixed-base/(size: 1024) (0pct dyn) | 7231.8 | 5371.8 | -25.72% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/1024 | 7260.2 | 5398.5 | -25.64% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/8 | 104.01 | 77.50 | -25.48% |
| multiscalar benches/Variable-time mixed-base/(size: 32) (0pct dyn) | 260.52 | 194.85 | -25.21% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/768 | 5424.29 | 4059.99 | -25.15% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/512 | 3614.5 | 2708.5 | -25.07% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/384 | 2715.5 | 2039.8 | -24.88% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/256 | 1819.19 | 1368.4 | -24.78% |
| multiscalar benches/Variable-time mixed-base/(size: 4) (50pct dyn) | 65.02 | 48.93 | -24.75% |
| multiscalar benches/Variable-time mixed-base/(size: 8) (20pct dyn) | 92.14 | 69.44 | -24.64% |
| multiscalar benches/Variable-time mixed-base/(size: 16) (0pct dyn) | 146.78 | 111.45 | -24.07% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/128 | 918.51 | 699.73 | -23.82% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/4 | 67.03 | 51.08 | -23.79% |
| multiscalar benches/Variable-time mixed-base/(size: 8) (0pct dyn) | 89.27 | 68.24 | -23.56% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/64 | 473.06 | 365.51 | -22.73% |
| multiscalar benches/Variable-time mixed-base/(size: 4) (20pct dyn) | 59.81 | 46.43 | -22.36% |
| multiscalar benches/Variable-time mixed-base/(size: 4) (0pct dyn) | 59.79 | 46.43 | -22.34% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/2 | 48.53 | 37.73 | -22.25% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/32 | 252.32 | 196.43 | -22.15% |
| multiscalar benches/Variable-time mixed-base/(size: 2) (50pct dyn) | 47.33 | 36.85 | -22.14% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/1 | 39.15 | 30.78 | -21.38% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/16 | 141.72 | 111.78 | -21.13% |
| scalar benches/Scalar inversion | 12.81 | 16.17 | 26.21% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/8 | 85.91 | 68.35 | -20.44% |
| multiscalar benches/Variable-time mixed-base/(size: 2) (20pct dyn) | 44.73 | 35.68 | -20.23% |
| multiscalar benches/Variable-time mixed-base/(size: 2) (0pct dyn) | 44.72 | 35.68 | -20.20% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/4 | 57.79 | 46.42 | -19.68% |
| multiscalar benches/Variable-time mixed-base/(size: 1) (0pct dyn) | 37.07 | 29.93 | -19.27% |
| multiscalar benches/Variable-time mixed-base/(size: 1) (20pct dyn) | 37.08 | 29.94 | -19.27% |
| multiscalar benches/Variable-time mixed-base/(size: 1) (50pct dyn) | 37.08 | 29.95 | -19.23% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/2 | 43.61 | 35.52 | -18.54% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/1 | 36.41 | 29.77 | -18.25% |
| edwards benches/Variable-time aA+bB A variable B fixed | 40.69 | 33.53 | -17.60% |
| montgomery benches/Constant-time fixed-base scalar mul | 15.57 | 17.22 | 10.62% |
| scalar benches/Scalar multiplication | 91928000.0 | 99047000.0 | 7.74% |
| edwards benches/Constant-time fixed-base scalar mul | 11.56 | 12.40 | 7.29% |
| ristretto benches/Batch Ristretto double-and-encode/16 | 13.30 | 14.25 | 7.14% |
| montgomery benches/Montgomery pseudomultiplication | 50.93 | 54.34 | 6.68% |
| scalar benches/Scalar addition | 23511000.0 | 22191000.0 | -5.61% |
| ristretto benches/Batch Ristretto double-and-encode/8 | 8.70 | 9.21 | 5.90% |
| ristretto benches/Batch Ristretto double-and-encode/4 | 6.35 | 6.60 | 3.86% |
| scalar benches/Batch scalar inversion/16 | 15.80 | 16.32 | 3.29% |
| ristretto benches/RistrettoPoint compression | 4.57 | 4.69 | 2.67% |
| scalar benches/Batch scalar inversion/8 | 14.36 | 14.72 | 2.50% |
| ristretto benches/Batch Ristretto double-and-encode/2 | 5.18 | 5.31 | 2.47% |
| ristretto benches/RistrettoPoint decompression | 4.61 | 4.71 | 2.23% |
| scalar benches/Batch scalar inversion/4 | 13.65 | 13.95 | 2.18% |
| edwards benches/EdwardsPoint decompression | 4.31 | 4.39 | 1.72% |
| scalar benches/Batch scalar inversion/1 | 13.10 | 13.31 | 1.61% |
| scalar benches/Batch scalar inversion/2 | 13.28 | 13.47 | 1.45% |
| ristretto benches/Batch Ristretto double-and-encode/1 | 4.58 | 4.64 | 1.35% |
| scalar benches/Scalar subtraction | 21408000.0 | 21139000.0 | -1.26% |
| edwards benches/EdwardsPoint compression | 3.95 | 3.95 | 0.05% |

AVX2 vs AVX512

| testcase | avx2 | avx512 | % |
|---|---|---|---|
| multiscalar benches/Variable-time variable-base multiscalar multiplication/768 | 3903.9 | 2529.79 | -35.20% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/384 | 2246.9 | 1475.7 | -34.32% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/768 | 7778.8 | 5123.8 | -34.13% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/512 | 5194.8 | 3427.9 | -34.01% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/384 | 3886.29 | 2567.0 | -33.95% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/256 | 2601.79 | 1718.6 | -33.95% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/1024 | 10432.0 | 6908.5 | -33.78% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/1024 | 4906.0 | 3251.79 | -33.72% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/128 | 1315.19 | 872.49 | -33.66% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/512 | 2794.1 | 1854.60 | -33.62% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/64 | 669.14 | 449.67 | -32.80% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/256 | 1621.3 | 1104.0 | -31.91% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/32 | 342.09 | 236.33 | -30.92% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/16 | 179.74 | 128.61 | -28.45% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/8 | 102.31 | 75.82 | -25.88% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/4 | 63.62 | 49.58 | -22.06% |
| scalar benches/Scalar inversion | 12.92 | 16.17 | 25.11% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/2 | 44.24 | 35.94 | -18.76% |
| edwards benches/Constant-time variable-base scalar mul | 35.43 | 29.97 | -15.41% |
| multiscalar benches/Variable-time mixed-base/(size: 1024) (0pct dyn) | 6320.5 | 5371.8 | -15.01% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/1024 | 6339.5 | 5398.5 | -14.84% |
| multiscalar benches/Variable-time mixed-base/(size: 1024) (20pct dyn) | 6565.1 | 5599.5 | -14.71% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/1 | 34.53 | 29.46 | -14.68% |
| multiscalar benches/Variable-time mixed-base/(size: 768) (0pct dyn) | 4667.6 | 3998.3 | -14.34% |
| multiscalar benches/Variable-time mixed-base/(size: 1024) (50pct dyn) | 6894.3 | 5921.59 | -14.11% |
| multiscalar benches/Variable-time mixed-base/(size: 768) (20pct dyn) | 4864.0 | 4180.5 | -14.05% |
| multiscalar benches/Variable-time mixed-base/(size: 768) (50pct dyn) | 5141.4 | 4435.2 | -13.74% |
| multiscalar benches/Variable-time mixed-base/(size: 512) (0pct dyn) | 3090.6 | 2666.6 | -13.72% |
| multiscalar benches/Variable-time mixed-base/(size: 384) (0pct dyn) | 2320.3 | 2007.1 | -13.50% |
| multiscalar benches/Variable-time mixed-base/(size: 512) (20pct dyn) | 3218.6 | 2785.29 | -13.46% |
| multiscalar benches/Variable-time mixed-base/(size: 384) (20pct dyn) | 2419.79 | 2095.1 | -13.42% |
| multiscalar benches/Variable-time mixed-base/(size: 512) (50pct dyn) | 3409.60 | 2954.0 | -13.36% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/768 | 4683.5 | 4059.99 | -13.31% |
| multiscalar benches/Variable-time mixed-base/(size: 256) (0pct dyn) | 1553.39 | 1347.39 | -13.26% |
| multiscalar benches/Variable-time mixed-base/(size: 384) (50pct dyn) | 2562.60 | 2223.6 | -13.23% |
| multiscalar benches/Variable-time mixed-base/(size: 256) (20pct dyn) | 1620.9 | 1406.9 | -13.20% |
| multiscalar benches/Variable-time mixed-base/(size: 256) (50pct dyn) | 1712.1 | 1490.89 | -12.92% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/512 | 3099.8 | 2708.5 | -12.62% |
| multiscalar benches/Variable-time mixed-base/(size: 128) (20pct dyn) | 821.04 | 719.19 | -12.40% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/384 | 2328.6 | 2039.8 | -12.40% |
| multiscalar benches/Variable-time mixed-base/(size: 128) (0pct dyn) | 786.7 | 689.65 | -12.34% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/256 | 1558.6 | 1368.4 | -12.20% |
| multiscalar benches/Variable-time mixed-base/(size: 128) (50pct dyn) | 866.8 | 763.06 | -11.97% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/128 | 788.72 | 699.73 | -11.28% |
| multiscalar benches/Variable-time mixed-base/(size: 64) (20pct dyn) | 421.92 | 374.48 | -11.24% |
| multiscalar benches/Variable-time mixed-base/(size: 64) (50pct dyn) | 446.53 | 396.4 | -11.23% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/128 | 938.67 | 835.7 | -10.97% |
| multiscalar benches/Variable-time mixed-base/(size: 64) (0pct dyn) | 404.41 | 360.78 | -10.79% |
| multiscalar benches/Variable-time mixed-base/(size: 32) (50pct dyn) | 237.38 | 212.54 | -10.46% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/64 | 481.67 | 431.84 | -10.35% |
| multiscalar benches/Variable-time mixed-base/(size: 32) (20pct dyn) | 224.61 | 201.88 | -10.12% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/64 | 405.01 | 365.51 | -9.75% |
| multiscalar benches/Variable-time mixed-base/(size: 32) (0pct dyn) | 215.51 | 194.85 | -9.59% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/32 | 253.95 | 230.12 | -9.38% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/32 | 215.53 | 196.43 | -8.86% |
| multiscalar benches/Variable-time mixed-base/(size: 16) (50pct dyn) | 130.68 | 119.31 | -8.70% |
| multiscalar benches/Variable-time mixed-base/(size: 16) (20pct dyn) | 124.24 | 114.76 | -7.63% |
| scalar benches/Scalar multiplication | 91662000.0 | 99047000.0 | 8.06% |
| ristretto benches/Batch Ristretto double-and-encode/16 | 13.21 | 14.25 | 7.88% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/16 | 139.12 | 129.05 | -7.24% |
| multiscalar benches/Variable-time mixed-base/(size: 16) (0pct dyn) | 119.67 | 111.45 | -6.87% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/16 | 119.42 | 111.78 | -6.40% |
| multiscalar benches/Variable-time mixed-base/(size: 8) (50pct dyn) | 77.43 | 72.67 | -6.15% |
| ristretto benches/Batch Ristretto double-and-encode/8 | 8.66 | 9.21 | 6.33% |
| multiscalar benches/Variable-time mixed-base/(size: 8) (20pct dyn) | 73.28 | 69.44 | -5.24% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/8 | 81.73 | 77.50 | -5.17% |
| multiscalar benches/Variable-time mixed-base/(size: 8) (0pct dyn) | 71.62 | 68.24 | -4.72% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/8 | 71.62 | 68.35 | -4.58% |
| ristretto benches/Batch Ristretto double-and-encode/4 | 6.34 | 6.60 | 4.10% |
| multiscalar benches/Variable-time mixed-base/(size: 4) (50pct dyn) | 50.92 | 48.93 | -3.92% |
| montgomery benches/Constant-time fixed-base scalar mul | 16.56 | 17.22 | 4.01% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/4 | 52.98 | 51.08 | -3.58% |
| scalar benches/Scalar addition | 22964000.0 | 22191000.0 | -3.37% |
| edwards benches/Variable-time aA+bB A variable B fixed | 34.62 | 33.53 | -3.16% |
| multiscalar benches/Variable-time mixed-base/(size: 4) (0pct dyn) | 47.89 | 46.43 | -3.06% |
| multiscalar benches/Variable-time mixed-base/(size: 4) (20pct dyn) | 47.89 | 46.43 | -3.05% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/4 | 47.82 | 46.42 | -2.93% |
| ristretto benches/RistrettoPoint compression | 4.57 | 4.69 | 2.67% |
| scalar benches/Batch scalar inversion/16 | 15.92 | 16.32 | 2.52% |
| edwards benches/Constant-time fixed-base scalar mul | 12.72 | 12.40 | -2.45% |
| ristretto benches/Batch Ristretto double-and-encode/2 | 5.18 | 5.31 | 2.48% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/1 | 31.54 | 30.78 | -2.40% |
| montgomery benches/Montgomery pseudomultiplication | 55.64 | 54.34 | -2.35% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/2 | 38.59 | 37.73 | -2.22% |
| ristretto benches/RistrettoPoint decompression | 4.61 | 4.71 | 2.16% |
| multiscalar benches/Variable-time mixed-base/(size: 2) (50pct dyn) | 37.52 | 36.85 | -1.78% |
| edwards benches/EdwardsPoint decompression | 4.32 | 4.39 | 1.69% |
| scalar benches/Batch scalar inversion/8 | 14.48 | 14.72 | 1.67% |
| scalar benches/Scalar subtraction | 21485000.0 | 21139000.0 | -1.61% |
| ristretto benches/Batch Ristretto double-and-encode/1 | 4.58 | 4.64 | 1.41% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/1 | 30.16 | 29.77 | -1.32% |
| scalar benches/Batch scalar inversion/4 | 13.76 | 13.95 | 1.32% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/2 | 35.93 | 35.52 | -1.14% |
| multiscalar benches/Variable-time mixed-base/(size: 1) (0pct dyn) | 30.24 | 29.93 | -1.04% |
| multiscalar benches/Variable-time mixed-base/(size: 1) (20pct dyn) | 30.24 | 29.94 | -1.00% |
| multiscalar benches/Variable-time mixed-base/(size: 1) (50pct dyn) | 30.24 | 29.95 | -0.96% |
| multiscalar benches/Variable-time mixed-base/(size: 2) (20pct dyn) | 35.99 | 35.68 | -0.86% |
| multiscalar benches/Variable-time mixed-base/(size: 2) (0pct dyn) | 35.98 | 35.68 | -0.84% |
| scalar benches/Batch scalar inversion/1 | 13.21 | 13.31 | 0.72% |
| scalar benches/Batch scalar inversion/2 | 13.38 | 13.47 | 0.70% |
| edwards benches/EdwardsPoint compression | 3.95 | 3.95 | -0.07% |

Before this PR vs after this PR (AVX2)

| testcase | before PR | after PR | % |
|---|---|---|---|
| multiscalar benches/Constant-time variable-base multiscalar multiplication/16 | 191.57 | 179.74 | -6.18% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/32 | 363.33 | 342.09 | -5.85% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/8 | 108.53 | 102.31 | -5.73% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/256 | 2754.6 | 2601.79 | -5.55% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/128 | 1392.4 | 1315.19 | -5.54% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/64 | 708.22 | 669.14 | -5.52% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/384 | 4111.90 | 3886.29 | -5.49% |
| edwards benches/Constant-time fixed-base scalar mul | 13.45 | 12.72 | -5.45% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/768 | 8200.8 | 7778.8 | -5.15% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/512 | 5475.79 | 5194.8 | -5.13% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/4 | 67.04 | 63.62 | -5.09% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/1024 | 10969.0 | 10432.0 | -4.90% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/2 | 46.28 | 44.24 | -4.41% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/1 | 35.90 | 34.53 | -3.81% |
| edwards benches/Variable-time aA+bB A variable B fixed | 35.93 | 34.62 | -3.63% |
| scalar benches/Scalar addition | 23745000.0 | 22964000.0 | -3.29% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/1024 | 5061.0 | 4906.0 | -3.06% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/1 | 32.53 | 31.54 | -3.03% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/16 | 143.46 | 139.12 | -3.03% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/1 | 31.10 | 30.16 | -3.01% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/2 | 39.78 | 38.59 | -3.00% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/4 | 54.57 | 52.98 | -2.92% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/8 | 84.14 | 81.73 | -2.87% |
| multiscalar benches/Variable-time mixed-base/(size: 1) (20pct dyn) | 31.12 | 30.24 | -2.83% |
| multiscalar benches/Variable-time mixed-base/(size: 1) (0pct dyn) | 31.12 | 30.24 | -2.81% |
| multiscalar benches/Variable-time mixed-base/(size: 1) (50pct dyn) | 31.11 | 30.24 | -2.79% |
| multiscalar benches/Variable-time mixed-base/(size: 2) (20pct dyn) | 37.01 | 35.99 | -2.77% |
| multiscalar benches/Variable-time mixed-base/(size: 2) (0pct dyn) | 37.01 | 35.98 | -2.77% |
| multiscalar benches/Variable-time mixed-base/(size: 2) (50pct dyn) | 38.58 | 37.52 | -2.74% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/32 | 261.0 | 253.95 | -2.70% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/64 | 494.95 | 481.67 | -2.68% |
| multiscalar benches/Variable-time mixed-base/(size: 16) (50pct dyn) | 134.23 | 130.68 | -2.64% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/128 | 963.84 | 938.67 | -2.61% |
| multiscalar benches/Variable-time mixed-base/(size: 8) (20pct dyn) | 75.22 | 73.28 | -2.58% |
| multiscalar benches/Variable-time mixed-base/(size: 384) (0pct dyn) | 2381.5 | 2320.3 | -2.57% |
| multiscalar benches/Variable-time mixed-base/(size: 512) (20pct dyn) | 3303.4 | 3218.6 | -2.57% |
| multiscalar benches/Variable-time mixed-base/(size: 512) (0pct dyn) | 3171.7 | 3090.6 | -2.56% |
| multiscalar benches/Variable-time mixed-base/(size: 8) (0pct dyn) | 73.49 | 71.62 | -2.55% |
| multiscalar benches/Variable-time mixed-base/(size: 64) (20pct dyn) | 432.92 | 421.92 | -2.54% |
| multiscalar benches/Variable-time mixed-base/(size: 768) (20pct dyn) | 4990.70 | 4864.0 | -2.54% |
| multiscalar benches/Variable-time mixed-base/(size: 64) (50pct dyn) | 458.09 | 446.53 | -2.52% |
| multiscalar benches/Variable-time mixed-base/(size: 4) (20pct dyn) | 49.13 | 47.89 | -2.52% |
| multiscalar benches/Variable-time mixed-base/(size: 16) (20pct dyn) | 127.44 | 124.24 | -2.51% |
| multiscalar benches/Variable-time mixed-base/(size: 256) (0pct dyn) | 1593.39 | 1553.39 | -2.51% |
| multiscalar benches/Variable-time mixed-base/(size: 128) (20pct dyn) | 842.11 | 821.04 | -2.50% |
| multiscalar benches/Variable-time mixed-base/(size: 64) (0pct dyn) | 414.72 | 404.41 | -2.49% |
| multiscalar benches/Variable-time mixed-base/(size: 4) (0pct dyn) | 49.11 | 47.89 | -2.49% |
| multiscalar benches/Variable-time mixed-base/(size: 4) (50pct dyn) | 52.21 | 50.92 | -2.47% |
| multiscalar benches/Variable-time mixed-base/(size: 768) (0pct dyn) | 4785.70 | 4667.6 | -2.47% |
| multiscalar benches/Variable-time mixed-base/(size: 256) (20pct dyn) | 1661.7 | 1620.9 | -2.46% |
| multiscalar benches/Variable-time mixed-base/(size: 32) (50pct dyn) | 243.34 | 237.38 | -2.45% |
| multiscalar benches/Variable-time mixed-base/(size: 128) (0pct dyn) | 806.45 | 786.7 | -2.45% |
| multiscalar benches/Variable-time mixed-base/(size: 384) (20pct dyn) | 2480.10 | 2419.79 | -2.43% |
| multiscalar benches/Variable-time mixed-base/(size: 1024) (0pct dyn) | 6477.40 | 6320.5 | -2.42% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/512 | 2863.4 | 2794.1 | -2.42% |
| multiscalar benches/Variable-time mixed-base/(size: 512) (50pct dyn) | 3493.9 | 3409.60 | -2.41% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/384 | 2302.29 | 2246.9 | -2.41% |
| multiscalar benches/Variable-time mixed-base/(size: 128) (50pct dyn) | 888.06 | 866.8 | -2.39% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/2 | 36.81 | 35.93 | -2.39% |
| multiscalar benches/Variable-time mixed-base/(size: 1024) (20pct dyn) | 6725.3 | 6565.1 | -2.38% |
| multiscalar benches/Variable-time mixed-base/(size: 8) (50pct dyn) | 79.32 | 77.43 | -2.37% |
| multiscalar benches/Variable-time mixed-base/(size: 384) (50pct dyn) | 2624.89 | 2562.60 | -2.37% |
| multiscalar benches/Variable-time mixed-base/(size: 32) (0pct dyn) | 220.72 | 215.51 | -2.36% |
| multiscalar benches/Variable-time mixed-base/(size: 768) (50pct dyn) | 5264.9 | 5141.4 | -2.35% |
| multiscalar benches/Variable-time mixed-base/(size: 32) (20pct dyn) | 229.98 | 224.61 | -2.33% |
| multiscalar benches/Variable-time mixed-base/(size: 256) (50pct dyn) | 1752.69 | 1712.1 | -2.32% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/256 | 1659.7 | 1621.3 | -2.31% |
| multiscalar benches/Variable-time mixed-base/(size: 16) (0pct dyn) | 122.44 | 119.67 | -2.26% |
| multiscalar benches/Variable-time mixed-base/(size: 1024) (50pct dyn) | 7049.1 | 6894.3 | -2.20% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/4 | 48.82 | 47.82 | -2.03% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/64 | 413.36 | 405.01 | -2.02% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/512 | 3163.7 | 3099.8 | -2.02% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/1024 | 6469.59 | 6339.5 | -2.01% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/256 | 1589.9 | 1558.6 | -1.97% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/128 | 804.5 | 788.72 | -1.96% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/32 | 219.82 | 215.53 | -1.95% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/384 | 2374.89 | 2328.6 | -1.95% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/8 | 72.99 | 71.62 | -1.87% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/768 | 3978.0 | 3903.9 | -1.86% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/768 | 4771.7 | 4683.5 | -1.85% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/16 | 121.62 | 119.42 | -1.81% |
| edwards benches/Constant-time variable-base scalar mul | 35.77 | 35.43 | -0.95% |
| ristretto benches/RistrettoPoint compression | 4.61 | 4.57 | -0.90% |
| scalar benches/Scalar multiplication | 92256000.0 | 91662000.0 | -0.64% |
| edwards benches/EdwardsPoint decompression | 4.34 | 4.32 | -0.59% |
| scalar benches/Scalar subtraction | 21571000.0 | 21485000.0 | -0.40% |
| ristretto benches/Batch Ristretto double-and-encode/16 | 13.25 | 13.21 | -0.29% |
| scalar benches/Batch scalar inversion/8 | 14.46 | 14.48 | 0.18% |
| scalar benches/Batch scalar inversion/16 | 15.89 | 15.92 | 0.18% |
| ristretto benches/Batch Ristretto double-and-encode/2 | 5.17 | 5.18 | 0.17% |
| montgomery benches/Constant-time fixed-base scalar mul | 16.53 | 16.56 | 0.15% |
| scalar benches/Batch scalar inversion/4 | 13.75 | 13.76 | 0.10% |
| scalar benches/Scalar inversion | 12.93 | 12.92 | -0.07% |
| scalar benches/Batch scalar inversion/2 | 13.37 | 13.38 | 0.07% |
| scalar benches/Batch scalar inversion/1 | 13.21 | 13.21 | 0.06% |
| montgomery benches/Montgomery pseudomultiplication | 55.62 | 55.64 | 0.04% |
| edwards benches/EdwardsPoint compression | 3.95 | 3.95 | -0.04% |
| ristretto benches/Batch Ristretto double-and-encode/4 | 6.34 | 6.34 | -0.03% |
| ristretto benches/RistrettoPoint decompression | 4.62 | 4.61 | -0.03% |
| ristretto benches/Batch Ristretto double-and-encode/8 | 8.67 | 8.66 | -0.01% |
| ristretto benches/Batch Ristretto double-and-encode/1 | 4.58 | 4.58 | 0.01% |


Before this PR vs after this PR (AVX512)

| testcase | before PR | after PR | % |
|---|---|---|---|
| scalar benches/Scalar inversion | 12.63 | 16.17 | 28.06% |
| scalar benches/Scalar addition | 25469000.0 | 22191000.0 | -12.87% |
| scalar benches/Scalar subtraction | 24187000.0 | 21139000.0 | -12.60% |
| montgomery benches/Constant-time fixed-base scalar mul | 16.33 | 17.22 | 5.42% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/1024 | 6698.0 | 6908.5 | 3.14% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/32 | 229.61 | 236.33 | 2.93% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/768 | 4990.2 | 5123.8 | 2.68% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/2 | 36.87 | 35.94 | -2.50% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/8 | 74.11 | 75.82 | 2.32% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/1 | 30.14 | 29.46 | -2.24% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/384 | 2511.7 | 2567.0 | 2.20% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/512 | 3354.9 | 3427.9 | 2.18% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/64 | 440.6 | 449.67 | 2.06% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/256 | 1684.0 | 1718.6 | 2.05% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/256 | 1082.0 | 1104.0 | 2.03% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/128 | 857.79 | 872.49 | 1.71% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/64 | 359.36 | 365.51 | 1.71% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/128 | 688.29 | 699.73 | 1.66% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/384 | 2006.80 | 2039.8 | 1.64% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/512 | 2667.60 | 2708.5 | 1.53% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/64 | 425.4 | 431.84 | 1.51% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/256 | 1348.0 | 1368.4 | 1.51% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/128 | 823.29 | 835.7 | 1.51% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/768 | 4004.2 | 4059.99 | 1.39% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/16 | 127.28 | 129.05 | 1.39% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/8 | 76.45 | 77.50 | 1.38% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/32 | 227.04 | 230.12 | 1.36% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/4 | 50.44 | 51.08 | 1.27% |
| multiscalar benches/Variable-time mixed-base/(size: 1) (0pct dyn) | 30.26 | 29.93 | -1.09% |
| multiscalar benches/Variable-time mixed-base/(size: 1) (20pct dyn) | 30.27 | 29.94 | -1.09% |
| multiscalar benches/Variable-time mixed-base/(size: 1) (50pct dyn) | 30.27 | 29.95 | -1.05% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/4 | 49.11 | 49.58 | 0.96% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/1 | 29.49 | 29.77 | 0.95% |
| scalar benches/Scalar multiplication | 98142000.0 | 99047000.0 | 0.92% |
| multiscalar benches/Constant-time variable-base multiscalar multiplication/16 | 127.44 | 128.61 | 0.92% |
| multiscalar benches/Variable-time mixed-base/(size: 2) (0pct dyn) | 35.92 | 35.68 | -0.68% |
| multiscalar benches/Variable-time mixed-base/(size: 2) (20pct dyn) | 35.91 | 35.68 | -0.66% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/32 | 195.15 | 196.43 | 0.66% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/2 | 37.49 | 37.73 | 0.65% |
| multiscalar benches/Variable-time mixed-base/(size: 128) (50pct dyn) | 758.23 | 763.06 | 0.64% |
| multiscalar benches/Variable-time mixed-base/(size: 256) (0pct dyn) | 1355.9 | 1347.39 | -0.63% |
| multiscalar benches/Variable-time mixed-base/(size: 512) (0pct dyn) | 2683.1 | 2666.6 | -0.61% |
| multiscalar benches/Variable-time mixed-base/(size: 16) (50pct dyn) | 118.58 | 119.31 | 0.62% |
| multiscalar benches/Variable-time mixed-base/(size: 128) (0pct dyn) | 693.79 | 689.65 | -0.60% |
| multiscalar benches/Variable-time mixed-base/(size: 32) (0pct dyn) | 196.0 | 194.85 | -0.59% |
| multiscalar benches/Variable-time mixed-base/(size: 768) (0pct dyn) | 4021.70 | 3998.3 | -0.58% |
| multiscalar benches/Variable-time mixed-base/(size: 1024) (0pct dyn) | 5402.4 | 5371.8 | -0.57% |
| multiscalar benches/Variable-time mixed-base/(size: 8) (0pct dyn) | 68.63 | 68.24 | -0.56% |
| multiscalar benches/Variable-time mixed-base/(size: 384) (0pct dyn) | 2018.1 | 2007.1 | -0.55% |
| multiscalar benches/Variable-time mixed-base/(size: 512) (50pct dyn) | 2938.0 | 2954.0 | 0.54% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/384 | 1483.5 | 1475.7 | -0.53% |
| multiscalar benches/Variable-time mixed-base/(size: 64) (50pct dyn) | 394.37 | 396.4 | 0.51% |
| multiscalar benches/Variable-time mixed-base/(size: 2) (50pct dyn) | 37.04 | 36.85 | -0.51% |
| multiscalar benches/Variable-time mixed-base/(size: 384) (50pct dyn) | 2212.5 | 2223.6 | 0.50% |
| multiscalar benches/Variable-time mixed-base/(size: 256) (50pct dyn) | 1483.5 | 1490.89 | 0.50% |
| multiscalar benches/Variable-time mixed-base/(size: 768) (50pct dyn) | 4413.2 | 4435.2 | 0.50% |
| multiscalar benches/Variable-time mixed-base/(size: 8) (50pct dyn) | 72.31 | 72.67 | 0.50% |
| multiscalar benches/Variable-time mixed-base/(size: 4) (0pct dyn) | 46.65 | 46.43 | -0.48% |
| multiscalar benches/Variable-time mixed-base/(size: 16) (0pct dyn) | 111.99 | 111.45 | -0.48% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/8 | 68.66 | 68.35 | -0.46% |
| multiscalar benches/Variable-time mixed-base/(size: 32) (50pct dyn) | 211.61 | 212.54 | 0.44% |
| multiscalar benches/Variable-time mixed-base/(size: 4) (20pct dyn) | 46.63 | 46.43 | -0.42% |
| multiscalar benches/Variable-time mixed-base/(size: 1024) (50pct dyn) | 5897.70 | 5921.59 | 0.41% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/1024 | 3239.1 | 3251.79 | 0.39% |
| multiscalar benches/Variable-time mixed-base/(size: 64) (0pct dyn) | 362.16 | 360.78 | -0.38% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/1 | 30.69 | 30.78 | 0.32% |
| ristretto benches/Batch Ristretto double-and-encode/1 | 4.66 | 4.64 | -0.30% |
| ristretto benches/Batch Ristretto double-and-encode/8 | 9.24 | 9.21 | -0.27% |
| edwards benches/EdwardsPoint compression | 3.96 | 3.95 | -0.26% |
| edwards benches/Variable-time aA+bB A variable B fixed | 33.44 | 33.53 | 0.26% |
| edwards benches/Constant-time variable-base scalar mul | 29.90 | 29.97 | 0.25% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/16 | 111.51 | 111.78 | 0.24% |
| multiscalar benches/Variable-time mixed-base/(size: 4) (50pct dyn) | 48.83 | 48.93 | 0.21% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/768 | 2525.1 | 2529.79 | 0.19% |
| scalar benches/Batch scalar inversion/1 | 13.33 | 13.31 | -0.16% |
| multiscalar benches/Variable-time mixed-base/(size: 768) (20pct dyn) | 4187.40 | 4180.5 | -0.16% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/2 | 35.46 | 35.52 | 0.16% |
| multiscalar benches/Variable-time mixed-base/(size: 32) (20pct dyn) | 202.18 | 201.88 | -0.15% |
| multiscalar benches/Variable-time mixed-base/(size: 1024) (20pct dyn) | 5607.59 | 5599.5 | -0.14% |
| scalar benches/Batch scalar inversion/4 | 13.97 | 13.95 | -0.14% |
| ristretto benches/RistrettoPoint compression | 4.69 | 4.69 | 0.14% |
| multiscalar benches/Variable-time variable-base multiscalar multiplication/512 | 1852.10 | 1854.60 | 0.13% |
| multiscalar benches/Variable-time mixed-base/(size: 8) (20pct dyn) | 69.53 | 69.44 | -0.13% |
| multiscalar benches/Variable-time mixed-base/(size: 384) (20pct dyn) | 2097.70 | 2095.1 | -0.12% |
| scalar benches/Batch scalar inversion/8 | 14.74 | 14.72 | -0.12% |
| ristretto benches/Batch Ristretto double-and-encode/16 | 14.27 | 14.25 | -0.11% |
| ristretto benches/RistrettoPoint decompression | 4.71 | 4.71 | 0.11% |
| multiscalar benches/Variable-time mixed-base/(size: 256) (20pct dyn) | 1408.4 | 1406.9 | -0.11% |
| edwards benches/EdwardsPoint decompression | 4.39 | 4.39 | -0.10% |
| multiscalar benches/Variable-time mixed-base/(size: 128) (20pct dyn) | 719.89 | 719.19 | -0.10% |
| montgomery benches/Montgomery pseudomultiplication | 54.28 | 54.34 | 0.10% |
| multiscalar benches/Variable-time mixed-base/(size: 64) (20pct dyn) | 374.8 | 374.48 | -0.09% |
| scalar benches/Batch scalar inversion/16 | 16.33 | 16.32 | -0.08% |
| ristretto benches/Batch Ristretto double-and-encode/4 | 6.60 | 6.60 | -0.07% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/4 | 46.39 | 46.42 | 0.07% |
| edwards benches/Constant-time fixed-base scalar mul | 12.41 | 12.40 | -0.06% |
| scalar benches/Batch scalar inversion/2 | 13.48 | 13.47 | -0.06% |
| multiscalar benches/Variable-time mixed-base/(size: 512) (20pct dyn) | 2786.1 | 2785.29 | -0.03% |
| ristretto benches/Batch Ristretto double-and-encode/2 | 5.31 | 5.31 | -0.02% |
| multiscalar benches/Variable-time fixed-base multiscalar multiplication/1024 | 5397.5 | 5398.5 | 0.02% |
| multiscalar benches/Variable-time mixed-base/(size: 16) (20pct dyn) | 114.78 | 114.76 | -0.02% |

tarcieri commented 1 year ago

@koute any chance you could push a commit that would show the difference of what this PR would look like without unsafe_target_feature?

I ask because I think we'd really like to keep 3rd party dependencies to a minimum. There are currently only two enabled by default: cfg-if (@rust-lang) and zeroize (@RustCrypto, i.e. me), with subtle being a @dalek-cryptography crate.

koute commented 1 year ago

> @koute any chance you could push a commit that would show the difference of what this PR would look like without unsafe_target_feature?

Done. Please take a look at the latest commit. I've only done it partially though, as doing it is really tedious and error-prone (it's easy to forget e.g. an #[inline(always)] annotation somewhere and completely crater the performance). Let me know if you want me to do this for the rest of the code too (which would need to be modified in a similar fashion).
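
To illustrate why the hand-written approach is easy to get wrong, here is a minimal sketch (not code from this PR; all function names are hypothetical). A shared helper has no `#[target_feature]` of its own, so as far as I understand it only gets AVX2 codegen when it is inlined into a caller that enables the feature. Dropping the `#[inline(always)]` would still compile and still give correct results, just potentially slower ones.

```rust
/// Hypothetical helper shared by several backends. It carries no
/// `#[target_feature]` itself, so it relies on being inlined into a caller
/// that enables the feature; that is what `#[inline(always)]` is for.
#[inline(always)]
fn add_lanes(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    let mut out = [0u32; 8];
    for i in 0..8 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}

/// Hand-written AVX2 entry point (what the macro would otherwise generate).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_lanes_avx2(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    add_lanes(a, b)
}

/// Safe wrapper that picks the AVX2 path only when the CPU reports support.
pub fn add_lanes_dispatch(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was checked at runtime just above.
            return unsafe { add_lanes_avx2(a, b) };
        }
    }
    // Portable fallback, also used on non-x86_64 targets.
    add_lanes(a, b)
}
```

Both paths return the same values, so a missing annotation shows up only in benchmarks, not in tests.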

tarcieri commented 1 year ago

@koute c67e430 looks good to me, thanks for attempting it!

@rozbb do you have an opinion on which way to proceed? I think it's worth removing the additional dependency.

pinkforest commented 1 year ago

Can I suggest making the features optional by way of them being negative?

This way, with negative "disallow" features, these can be explicitly ruled out, and if one rules them out it applies wholesale.

Also this would make them non-default, so managing the feature set becomes easier when people use a one-size-fits-all feature set.

This would make sense also given most people should not need to be disallowing the detection.

fwiw - I'm beginning to think that disallow / forbid should be cfg flags as well, given it's niche, e.g.

cfg(curve25519_dalek_forbid = "simd" | "simd_avx512" | "simd_avx2")

Having niche configuration scattered between featuresets and cfg() would be confusing otherwise and having them via cfg would make it better to document and educate while leaving the top-level binary in control as intended.
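
As a rough sketch of what such a cfg knob could look like in code (the `curve25519_dalek_forbid` key is the hypothetical name from the proposal above, and `SimdPolicy` / `simd_policy` are made up for illustration):

```rust
/// Hypothetical type describing which code path a build would take.
#[derive(Debug, PartialEq, Eq)]
pub enum SimdPolicy {
    Allowed,
    Forbidden,
}

// The cfg would be set by whoever builds the top-level binary, e.g.
//   RUSTFLAGS='--cfg curve25519_dalek_forbid="simd"' cargo build
// Intermediate crates cannot set it through Cargo features.
// The `allow` only silences the "unexpected cfg" warning on newer compilers.
#[allow(unknown_lints, unexpected_cfgs)]
pub fn simd_policy() -> SimdPolicy {
    if cfg!(curve25519_dalek_forbid = "simd") {
        SimdPolicy::Forbidden
    } else {
        SimdPolicy::Allowed
    }
}
```

Without the flag the function returns `SimdPolicy::Allowed`, so the default build keeps autodetection.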

Depending on which one is chosen I can send a documentation PR after.

koute commented 1 year ago

So do we have a consensus regarding removing the unsafe_target_feature dependency and doing everything by hand? (I'm happy to do it, but I don't really want to do all of this legwork and then end up having to revert it in the end.)

@pinkforest Usually cargo features should be only additive, but I guess in this case this doesn't really matter as those features don't actually change any functionality per se, so sure, we could make those negative.

However, I'd really prefer to keep them as cargo features; the issue with specifying anything through RUSTFLAGS is that it's really awkward to use, especially for bigger framework-like projects with hundreds/thousands of downstream users where telling everyone "hey, you now need to specify those RUSTFLAGS to compile your stuff because it depends on our stuff" is just not feasible. If we had to specify the flags we need through RUSTFLAGS we'd most likely be forced to fork the crate and change the configuration knob to be a cargo feature.

tarcieri commented 1 year ago

> So do we have a consensus regarding removing the unsafe_target_feature dependency and doing everything by hand?

@koute that would be my preference out of paranoia regarding adding third-party dependencies

@rozbb do you agree?

pinkforest commented 1 year ago

> However, I'd really prefer to keep them as cargo features though [ .. ]

Yeah, I hear you - we can have both via build.rs. Similar to how cfg(curve25519_dalek_bits) acts as an override in build.rs, we can put a feature gate there that sets the cfg flags, which means both can be used. I've often seen feature chains get broken, and then things like openssl might get stuck in the dependency tree 10 layers down :)
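
A minimal sketch of that build.rs bridge, assuming a hypothetical `forbid_simd` cargo feature and the hypothetical `curve25519_dalek_forbid` cfg key (neither exists in the crate as far as this thread shows). Cargo exposes enabled features to build scripts as `CARGO_FEATURE_<NAME>` environment variables, and a build script can turn a cfg on by printing a `cargo:rustc-cfg=...` line:

```rust
use std::env;

/// Builds the `cargo:rustc-cfg=...` lines a build script would print.
/// `get_env` is injected so the logic can be checked without touching the
/// real process environment.
pub fn cfg_lines(get_env: impl Fn(&str) -> Option<String>) -> Vec<String> {
    let mut lines = Vec::new();
    // `forbid_simd` is a hypothetical feature name; Cargo would expose it to
    // the build script as CARGO_FEATURE_FORBID_SIMD when enabled.
    if get_env("CARGO_FEATURE_FORBID_SIMD").is_some() {
        // Hypothetical cfg key/value, mirroring the proposal in this thread.
        lines.push(r#"cargo:rustc-cfg=curve25519_dalek_forbid="simd""#.to_string());
    }
    lines
}

/// What the `main` function of a build.rs would call.
pub fn emit_cfg_lines() {
    for line in cfg_lines(|key| env::var(key).ok()) {
        println!("{line}");
    }
}
```

With this shape the rest of the crate only ever checks the cfg, whether it came from RUSTFLAGS or from the feature.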

No need to change anything on this PR - I can send a PR straight after this to address this as negative feature that works both ways.

koute commented 1 year ago

> No need to change anything on this PR - I can send a PR straight after this to address this as negative feature that works both ways.

I'm happy to change it if you want, but that works for me too. Thank you!

jrose-signal commented 1 year ago

Just to offer an alternate perspective: I think the usual justification for "no negative features" is that they're not composable: if client crate A enables "no_apples" and client crate B enables "no_bananas", a project might not be able to use A and B together because A depends on bananas and B on apples. But in this case it's "avx2_backend" and "avx512_backend" that aren't composable; you cannot have more than one backend enabled at once. If you want to phrase these positively, you could call them "pre_avx2_compatibility" and "pre_avx512_compatibility" or something; then at least --all-features does something meaningful, if probably not desirable.

(The cfg vs. feature debate is not unrelated; I agree with @koute that cfg flags are much less discoverable and more awkward to work with, and I agree with @tarcieri that some controls really are "one choice for the entire build" and cannot be represented composably.)

rozbb commented 1 year ago

Sorry all, this thread flew under the radar for me. Getting up to date now.

pinkforest commented 1 year ago

Yeah, that's a valid point re: --all-features. I suspect most would not need these negative features in any case, and since that makes them niche it would veer towards using cfg(). Is there a use case where people would really need to disable this more often?

Should the feature be around to disable detection and force a backend? In any case I'll create a separate issue, as we've visited this issue before and it needs to be revisited properly.

koute commented 1 year ago

> But in this case it's "avx2_backend" and "avx512_backend" that aren't composable

FYI, actually in this case they are composable. (: That's because they don't force a given backend, they just make it available for selection at runtime if the host on which the program's running supports it.
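
A simplified sketch of why that composes (this is not the PR's actual selection code; the types and the priority order are assumptions for illustration). Each feature only adds a candidate, and the choice is made from what was compiled in and what the CPU reports, which real code would fill in via runtime detection such as `is_x86_feature_detected!`:

```rust
/// Hypothetical set of backends.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Serial,
    Avx2,
    Avx512,
}

/// Which backends were compiled in (i.e. which features were enabled).
#[derive(Debug, Clone, Copy, Default)]
pub struct Compiled {
    pub avx2: bool,
    pub avx512: bool,
}

/// What the host CPU supports, as found at runtime.
#[derive(Debug, Clone, Copy, Default)]
pub struct Cpu {
    pub avx2: bool,
    pub avx512: bool,
}

/// Picks the most capable backend that is both compiled in and supported.
/// Enabling more features never removes an option; it only adds candidates.
pub fn select_backend(compiled: Compiled, cpu: Cpu) -> Backend {
    if compiled.avx512 && cpu.avx512 {
        Backend::Avx512
    } else if compiled.avx2 && cpu.avx2 {
        Backend::Avx2
    } else {
        Backend::Serial
    }
}
```

For example, with both backends compiled in, a CPU that only has AVX2 gets `Backend::Avx2`, and a CPU with neither falls back to `Backend::Serial`.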

rozbb commented 1 year ago

> @rozbb do you have an opinion on which way to proceed? I think it's worth removing the additional dependency.

Just reviewed everything. My thought is: removing unsafe_target_feature consistently would 1) add a lot of noise to the code and 2) make it easier to mess up in the future if we make changes or additions. As an intermediate solution, I think we could vendor the dep, or (my preference) just pin to a specific version and call it a day. Thoughts?

tarcieri commented 1 year ago

If we had an in-tree dependency like curve25519-dalek-derive, that'd be fine with me.

Note that it will make the custom derive stack at least a default dependency where it isn't right now (i.e. syn has long compile times, although it is commonly found in most projects)

pinkforest commented 1 year ago

Would be a good reason to do the monorepo base here (now) for this + git combine (later) then.

May I suggest landing this PR pinned to a version first, and then I can just send another PR to rename some files and add it in-tree as a monorepo?

rozbb commented 1 year ago

I like that!

koute commented 1 year ago

That sounds good to me! So I'll revert the last commit and I'll pin the version of the unsafe_target_feature crate, and later we can just vendor it.

Or perhaps alternatively I could just transfer it to dalek-cryptography? Then it wouldn't be a third-party crate anymore. (:

My three cents regarding the syn dependency: I don't think it's that big of a deal, considering that most projects probably already have it in their dependency tree. I agree that we shouldn't add extra unnecessary dependencies, but all things considered I personally think the tradeoff here is worth it.

pinkforest commented 1 year ago

Yeah better just pin it for now - we were going to re-organise everything to monorepo anyways so this was a good reason to start it

koute commented 1 year ago

Sorry for the delay.

I have reverted the commit, pinned the version of the dependency, and fixed failing clippy. Should be ready to go!

koute commented 1 year ago

> Hmm, if fiat is the only backend we have, perhaps we should reconsider how gating works.

Indeed. Not sure if this is a good idea, and this would be a big change and require more refactoring, but maybe we could parametrize all of the public types by the backend? For example, instead of Scalar we could have Scalar<B>, and B would be a type through which the user could pick whatever backend they want, like e.g. Scalar<FiatU64> or Scalar<U32> (with an appropriate default). Then we wouldn't need any --cfg knobs. (Although the compile times would probably suffer.)
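
To make the idea concrete, here is a toy sketch of a backend-parametrized type with a default. Everything in it is an assumption for illustration: the `Backend` trait, the two backend types, and the tiny modulus have nothing to do with the crate's real field or scalar arithmetic.

```rust
use std::marker::PhantomData;

/// Toy modulus for the sketch; not the Curve25519 group order.
const MODULUS: u64 = 1_000_003;

/// Hypothetical backend trait: each backend supplies the arithmetic.
pub trait Backend {
    const NAME: &'static str;
    /// Adds two values that are already reduced below `MODULUS`.
    fn add_mod(a: u64, b: u64) -> u64;
}

pub struct U64Backend;
pub struct U32Backend;

impl Backend for U64Backend {
    const NAME: &'static str = "u64";
    fn add_mod(a: u64, b: u64) -> u64 {
        (a + b) % MODULUS
    }
}

impl Backend for U32Backend {
    const NAME: &'static str = "u32";
    fn add_mod(a: u64, b: u64) -> u64 {
        // Inputs are below MODULUS (about 2^20), so the sum fits in a u32.
        (((a as u32) + (b as u32)) % (MODULUS as u32)) as u64
    }
}

/// Scalar parametrized by backend, with a default so `Scalar` still works.
pub struct Scalar<B: Backend = U64Backend> {
    value: u64,
    _backend: PhantomData<B>,
}

impl<B: Backend> Scalar<B> {
    pub fn new(value: u64) -> Self {
        Scalar { value: value % MODULUS, _backend: PhantomData }
    }

    pub fn plus(&self, other: &Self) -> Self {
        Self::new(B::add_mod(self.value, other.value))
    }

    pub fn value(&self) -> u64 {
        self.value
    }

    pub fn backend_name() -> &'static str {
        B::NAME
    }
}
```

A caller who does not care writes `let s: Scalar = Scalar::new(5);` and gets the default, while `Scalar::<U32Backend>::new(5)` picks a backend explicitly. Each distinct `B` is monomorphized separately, which is where the compile-time cost would come from.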

> but I worry this might persist the problem where curve25519-dalek is a transitive dependency with these features being enabled via intermediate dependencies

Hm, this is a fair point, but I think doing what @pinkforest suggested would probably fix this -- make the features be negative and non-default, so that would make it very unlikely that an intermediate crate using curve25519-dalek would enable them (most likely it'd either use the default features, or no default features and pick a subset).

> Curious how this is all going to work with SIMD backends for non-x86 platforms

From what I can see it should work mostly fine, although it might still require some minor refactoring.

tarcieri commented 1 year ago

> make the features be negative and non-default, so that would make it very unlikely that an intermediate crate using curve25519-dalek would enable them

FWIW we did this with @RustCrypto and I still consider it a mistake. We have/had force-soft feature(s) to disable all hardware optimizations, and intermediate dependencies would turn it on. All it takes is one misbehaving crate in your dependency tree to force a performance downgrade.

That was the major impetus for using cfg attributes: it removes the ability of intermediate dependencies to do this, and relegates all control to the toplevel binary.

daira commented 1 year ago

> That was the major impetus for using cfg attributes: it removes the ability of intermediate dependencies to do this, and relegates all control to the toplevel binary.

Agreed. Using either positive or negative features for selection of crypto implementations is also a nightmare for auditing, because it greatly expands the number of crates in which a backdoor could potentially force use of an insecure implementation.

(Example: suppose that you don't trust RDRAND. Now try to audit that the getrandom crate is not relying on it, given that it might if the rdrand feature is enabled. Ugh.)