OpenMined / TenSEAL

A library for doing homomorphic encryption operations on tensors
Apache License 2.0

Intel HEXL Support #272

Open · fboemer opened this issue 3 years ago

fboemer commented 3 years ago

Hi, thanks for the library. This is a neat project. It's the only Python wrapper for SEAL I'm aware of that keeps up to date with the latest SEAL releases.

SEAL v3.6.3 adds support for Intel HEXL (https://github.com/intel/hexl), an AVX512 acceleration library. I'm wondering if you've had a chance to try SEAL's HEXL support? See the Intel HE Toolkit whitepaper at https://software.intel.com/content/www/us/en/develop/tools/homomorphic-encryption.html?wapkw=homomorphic%20encryption for an idea of the performance improvement. I'm happy to take any feedback on HEXL as well (I'm one of the developers).

bcebere commented 3 years ago

Hi @fboemer

Thank you for the wonderful contributions to Intel HEXL. I saw from SEAL's issue tracker that it adds an impressive speedup.

The latest TenSEAL release includes SEAL 3.6.3, as you mentioned, and we also activated the SEAL_USE_INTEL_HEXL flag in this release.

Unfortunately, we don't have hardware that supports AVX512 to measure the real improvement. We can see a speedup in the tests/benchmarks both locally and on the GitHub runners, but we want to test it in an isolated container and on proper hardware to be sure. We will run the benchmarks in the coming days and upload them here.

Thank you!

bcebere commented 3 years ago

Hello @fboemer

We ran the benchmarks. Since the SEAL-level benchmarks are already pretty clear, we mostly focused on ML operations.

We ran the tests on a couple of AWS instances, comparing the old version vs. the HEXL version, and older hardware vs. hardware with AVX512 support.

TenSEAL 0.3.0 is the older version, with SEAL 3.6.2. TenSEAL 0.3.1 is the newer version, with SEAL 3.6.3 and HEXL enabled.

Compiler: Clang 10. OS: Ubuntu 20.04. Benchmarks were done with pytest-benchmark; the median over 5 iterations is reported here. The code for the benchmarks is here.
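For context, here is a minimal sketch of what one of these micro-benchmarks might look like; the context parameters, fixture, and vector size are illustrative assumptions, not the actual benchmark code:

```python
# Minimal sketch of a pytest-benchmark CKKS micro-benchmark (illustrative;
# the parameters below are assumptions, not the exact benchmark setup).
import pytest
import tenseal as ts


@pytest.fixture
def context():
    # Standard TenSEAL CKKS context (parameter choices are illustrative).
    ctx = ts.context(
        ts.SCHEME_TYPE.CKKS,
        poly_modulus_degree=8192,
        coeff_mod_bit_sizes=[60, 40, 40, 60],
    )
    ctx.global_scale = 2**40
    ctx.generate_galois_keys()
    return ctx


def test_ckks_multiply(benchmark, context):
    vec = ts.ckks_vector(context, [0.1] * 4096)
    # pytest-benchmark runs the callable repeatedly and reports statistics,
    # including the median used in the tables below.
    benchmark(lambda: vec * vec)
```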

Bottom line

The results are amazing on AVX512-compatible hardware: the MNIST full evaluation is 26%-34% faster, depending on the parallelism. On older CPUs, however, there seems to be a regression: the MNIST full evaluation is roughly 20% slower there.

If you have any feedback on the benchmarks, please let us know. One thing I noticed that is not clear to me (sorry for the noob question): compiling the library on older hardware seems to affect performance on newer hardware. I only got the performance improvement after compiling the library on AVX512-compatible hardware. Is this expected?
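As a sanity check, the headline percentages can be recomputed from the "MNIST eval full" rows in the tables below:

```python
# Recomputing the headline numbers from the "MNIST eval full" rows below.
timings = {
    # instance: (TenSEAL 0.3.0 ms, TenSEAL 0.3.1 ms)
    "c4.2xlarge (no AVX512)": (1460.13, 1770.5),
    "c4.4xlarge (no AVX512)": (877.32, 1060.43),
    "c5.2xlarge (AVX512)": (1229.2, 799.0),
    "c5.4xlarge (AVX512)": (712.69, 526.74),
}

for instance, (old, new) in timings.items():
    change = (new - old) / old * 100  # positive = slower, negative = faster
    print(f"{instance}: {change:+.1f}%")

# Prints approximately:
#   c4.2xlarge (no AVX512): +21.3%
#   c4.4xlarge (no AVX512): +20.9%
#   c5.2xlarge (AVX512): -35.0%
#   c5.4xlarge (AVX512): -26.1%
```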

Results

I will break down the benchmarks per hardware configuration.

c4.2xlarge

Specs: Intel Xeon E5-2666 v3, 8 CPUs, no AVX512 support

We notice that on unsupported hardware there is a noticeable performance regression.

| Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- |
| CKKS convolution. Image shape 8x8 | 58.1 | 74.63 |
| CKKS convolution. Image shape 16x16 | 58.15 | 74.55 |
| CKKS convolution. Image shape 28x28 | 58.43 | 74.61 |
| Generate keys | 949.97 | 847.33 |
| mnist_prepare_input | 9.7 | 10.52 |
| MNIST eval conv | 236.32 | 291.69 |
| MNIST eval square1 | 8.24 | 10.63 |
| MNIST eval fc1 | 1094.3 | 1324.82 |
| MNIST eval square2 | 4.18 | 5.52 |
| MNIST eval fc2 | 116.83 | 138.31 |
| MNIST eval full | 1460.13 | 1770.5 |

| Tensor length | Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- | --- |
| 8192 | CKKS add | 0.15 | 0.15 |
| 8192 | CKKS multiply | 8.63 | 11.34 |
| 8192 | CKKS negate | 0.13 | 0.14 |
| 8192 | CKKS square | 8.36 | 10.94 |
| 8192 | CKKS sub | 0.15 | 0.15 |
| 8192 | CKKS dot | 54.79 | 71.03 |
| 8192 | CKKS polyval | 20.47 | 24.23 |
| 16384 | CKKS add | 0.29 | 0.29 |
| 16384 | CKKS multiply | 17.24 | 22.69 |
| 16384 | CKKS negate | 0.25 | 0.28 |
| 16384 | CKKS square | 16.69 | 21.87 |
| 16384 | CKKS sub | 0.28 | 0.3 |
| 16384 | CKKS dot | 110.28 | 142.59 |
| 16384 | CKKS polyval | 41.04 | 48.87 |

c4.4xlarge

Specs: Intel Xeon E5-2666 v3, 16 CPUs, no AVX512 support.

We redid the test with more CPUs to confirm the impact.

| Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- |
| Generate keys | 918.42 | 848.57 |
| mnist_prepare_input | 9.72 | 10.56 |
| MNIST eval conv | 234.87 | 292.02 |
| MNIST eval square1 | 8.28 | 10.66 |
| MNIST eval fc1 | 575.51 | 693.78 |
| MNIST eval square2 | 4.17 | 5.54 |
| MNIST eval fc2 | 68.58 | 80.9 |
| MNIST eval full | 877.32 | 1060.43 |
| CKKS convolution. Image shape 8x8 | 57.96 | 74.6 |
| CKKS convolution. Image shape 16x16 | 58.07 | 74.58 |
| CKKS convolution. Image shape 28x28 | 58.56 | 74.65 |

| Tensor length | Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- | --- |
| 8192 | CKKS add | 0.16 | 0.15 |
| 8192 | CKKS multiply | 8.64 | 11.28 |
| 8192 | CKKS negate | 0.13 | 0.14 |
| 8192 | CKKS square | 8.36 | 10.91 |
| 8192 | CKKS sub | 0.15 | 0.15 |
| 8192 | CKKS dot | 54.93 | 70.45 |
| 8192 | CKKS polyval | 20.49 | 24.16 |
| 16384 | CKKS add | 0.31 | 0.28 |
| 16384 | CKKS multiply | 17.28 | 22.56 |
| 16384 | CKKS negate | 0.26 | 0.27 |
| 16384 | CKKS square | 16.72 | 21.8 |
| 16384 | CKKS sub | 0.29 | 0.29 |
| 16384 | CKKS dot | 110.11 | 141.34 |
| 16384 | CKKS polyval | 41.07 | 48.34 |

However, when we switch to hardware that supports AVX512, we can see a major improvement.

c5.2xlarge

Specs: Intel Xeon Platinum 8275CL, 8 CPUs, with AVX512 support

| Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- |
| Generate keys | 819.0 | 633.85 |
| mnist_prepare_input | 8.73 | 5.21 |
| MNIST eval conv | 195.28 | 130.53 |
| MNIST eval square1 | 6.86 | 4.38 |
| MNIST eval fc1 | 923.04 | 587.95 |
| MNIST eval square2 | 3.46 | 2.25 |
| MNIST eval fc2 | 99.84 | 63.25 |
| MNIST eval full | 1229.2 | 799.0 |
| CKKS convolution. Image shape 8x8 | 48.77 | 34.04 |
| CKKS convolution. Image shape 16x16 | 48.8 | 33.69 |
| CKKS convolution. Image shape 28x28 | 49.2 | 33.21 |

| Tensor length | Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- | --- |
| 8192 | CKKS add | 0.13 | 0.11 |
| 8192 | CKKS multiply | 7.2 | 4.6 |
| 8192 | CKKS negate | 0.1 | 0.11 |
| 8192 | CKKS square | 6.94 | 4.6 |
| 8192 | CKKS sub | 0.12 | 0.13 |
| 8192 | CKKS dot | 45.85 | 31.15 |
| 8192 | CKKS polyval | 17.8 | 11.53 |
| 16384 | CKKS add | 0.25 | 0.23 |
| 16384 | CKKS multiply | 14.5 | 9.26 |
| 16384 | CKKS negate | 0.2 | 0.24 |
| 16384 | CKKS square | 13.98 | 9.09 |
| 16384 | CKKS sub | 0.25 | 0.26 |
| 16384 | CKKS dot | 92.0 | 62.68 |
| 16384 | CKKS polyval | 36.0 | 23.49 |

c5.4xlarge

Specs: Intel Xeon Platinum 8275CL, 16 CPUs, with AVX512 support

| Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- |
| Generate keys | 781.07 | 625.9 |
| mnist_prepare_input | 8.44 | 5.77 |
| MNIST eval conv | 186.44 | 143.01 |
| MNIST eval square1 | 6.47 | 4.79 |
| MNIST eval fc1 | 451.27 | 337.13 |
| MNIST eval square2 | 3.29 | 2.47 |
| MNIST eval fc2 | 55.79 | 40.46 |
| MNIST eval full | 712.69 | 526.74 |
| CKKS convolution. Image shape 8x8 | 46.24 | 36.43 |
| CKKS convolution. Image shape 16x16 | 46.25 | 37.11 |
| CKKS convolution. Image shape 28x28 | 46.91 | 37.05 |

| Tensor length | Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- | --- |
| 8192 | CKKS add | 0.12 | 0.11 |
| 8192 | CKKS multiply | 6.84 | 5.18 |
| 8192 | CKKS negate | 0.1 | 0.11 |
| 8192 | CKKS square | 6.62 | 5.04 |
| 8192 | CKKS sub | 0.12 | 0.13 |
| 8192 | CKKS dot | 43.8 | 34.91 |
| 8192 | CKKS polyval | 16.81 | 12.8 |
| 16384 | CKKS add | 0.26 | 0.24 |
| 16384 | CKKS multiply | 13.78 | 10.41 |
| 16384 | CKKS negate | 0.2 | 0.23 |
| 16384 | CKKS square | 13.4 | 10.18 |
| 16384 | CKKS sub | 0.25 | 0.26 |
| 16384 | CKKS dot | 87.5 | 69.48 |
| 16384 | CKKS polyval | 33.95 | 25.74 |
fboemer commented 3 years ago

@bcebere, thanks for the detailed report! Our current HEXL implementation/integration has focused on improving performance on AVX512-enabled machines. In particular, recent Intel processors with the AVX512-IFMA52 instruction set (IceLake server, IceLake client) should yield up to an additional ~2x speedup over the CascadeLake servers you tried (see the performance numbers in Tables 1-4 of https://arxiv.org/pdf/2103.16400.pdf). We'll investigate the performance regression on non-AVX512 processors; thanks for pointing this out.

Regarding the library compilation: we currently compile for AVX512 only on machines that support the AVX512 instruction set. We'd be happy to investigate enabling AVX512 compilation on non-AVX512 machines if that would be helpful. I imagine this would help enable distribution of an AVX512-enabled tenseal package?

bcebere commented 3 years ago

Thank you so much for the explanations!

Regarding the compilation: we build, package, and deploy the library to PyPI using GitHub runners, and we don't have much control over the hardware we're using. Furthermore, we cannot distinguish between supported architectures at pip install time. Having a single binary for both scenarios (AVX512 and non-AVX512), compiled on any hardware, would be fantastic!
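For reference, here is a rough, Linux-only sketch of the kind of CPU feature check involved, parsing /proc/cpuinfo for the relevant flags. A real single-binary wheel would do this dispatch inside the native library; the snippet is purely illustrative.

```python
# Rough Linux-only check for AVX512 support via /proc/cpuinfo.
# Purely illustrative: a single-binary wheel would dispatch in native code.
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


flags = cpu_flags()
print("AVX512F:", "avx512f" in flags)         # base AVX512
print("AVX512-IFMA:", "avx512ifma" in flags)  # IFMA52, e.g. IceLake
```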