OpenMined / TenSEAL

A library for doing homomorphic encryption operations on tensors
Apache License 2.0

Intel HEXL Support #272

Open · fboemer opened this issue 3 years ago

fboemer commented 3 years ago

Hi, thanks for the library. This is a neat project. It's the only Python wrapper for SEAL I'm aware of that keeps up to date with the latest SEAL releases.

SEAL v3.6.3 adds support for Intel HEXL (https://github.com/intel/hexl), an AVX512 acceleration library. I'm wondering if you've had a chance to try SEAL's HEXL support? See the Intel HE Toolkit whitepaper at https://software.intel.com/content/www/us/en/develop/tools/homomorphic-encryption.html?wapkw=homomorphic%20encryption for an idea of the performance improvement. I'm happy to take any feedback on HEXL as well (I'm one of the developers).

bcebere commented 3 years ago

Hi @fboemer

Thank you for the wonderful contributions to Intel HEXL. I saw from SEAL's issue tracker that it adds an impressive speedup.

The latest TenSEAL release includes SEAL 3.6.3, as you mentioned, and we also activated the SEAL_USE_INTEL_HEXL flag in this release.

Unfortunately, we don't have hardware that supports AVX512 to measure the real improvement. We can see a speedup in the tests/benchmarks both locally and on the GitHub runners, but we want to test it in an isolated container and on proper hardware to be sure. We will run the benchmarks in the coming days and upload them here.

Thank you!

bcebere commented 3 years ago

Hello @fboemer

We ran the benchmarks. Since the SEAL-level benchmarks are already pretty clear, we mostly focused on ML operations.

We ran the tests on a couple of AWS instances, comparing the old version vs. the HEXL version, and older hardware vs. hardware with AVX512 support.

TenSEAL 0.3.0 is the older version, with SEAL 3.6.2. TenSEAL 0.3.1 is the newer version, with SEAL 3.6.3 and HEXL enabled.

Compiler: Clang 10. OS: Ubuntu 20.04. Benchmarks were done with pytest-benchmark; the median over 5 iterations is reported here. The code for the benchmarks is here.
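For context, here is a minimal sketch of what one of these micro-benchmarks might look like; the context parameters, fixture, and vector size are illustrative assumptions, not the actual benchmark code:

```python
# Minimal sketch of a pytest-benchmark CKKS micro-benchmark (illustrative;
# the parameters below are assumptions, not the exact benchmark setup).
import pytest
import tenseal as ts


@pytest.fixture
def context():
    # Standard TenSEAL CKKS context (parameter choices are illustrative).
    ctx = ts.context(
        ts.SCHEME_TYPE.CKKS,
        poly_modulus_degree=8192,
        coeff_mod_bit_sizes=[60, 40, 40, 60],
    )
    ctx.global_scale = 2**40
    ctx.generate_galois_keys()
    return ctx


def test_ckks_multiply(benchmark, context):
    vec = ts.ckks_vector(context, [0.1] * 4096)
    # pytest-benchmark runs the callable repeatedly and reports statistics,
    # including the median used in the tables below.
    benchmark(lambda: vec * vec)
```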

Bottom line

The results are amazing on AVX512-compatible hardware: the MNIST full evaluation is 26%-34% faster, depending on the parallelism. On older CPUs, however, there seems to be a regression: the MNIST full evaluation is roughly 20% slower there.

If you have any feedback on the benchmarks, please let us know. One thing I noticed that is not clear to me (sorry for the noob question): compiling the library on older hardware seems to affect performance on newer hardware. I only got the performance improvement after compiling the library on AVX512-compatible hardware. Is this expected?
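As a sanity check, the headline percentages can be recomputed from the "MNIST eval full" rows in the tables below:

```python
# Recomputing the headline numbers from the "MNIST eval full" rows below.
timings = {
    # instance: (TenSEAL 0.3.0 ms, TenSEAL 0.3.1 ms)
    "c4.2xlarge (no AVX512)": (1460.13, 1770.5),
    "c4.4xlarge (no AVX512)": (877.32, 1060.43),
    "c5.2xlarge (AVX512)": (1229.2, 799.0),
    "c5.4xlarge (AVX512)": (712.69, 526.74),
}

for instance, (old, new) in timings.items():
    change = (new - old) / old * 100  # positive = slower, negative = faster
    print(f"{instance}: {change:+.1f}%")

# Prints approximately:
#   c4.2xlarge (no AVX512): +21.3%
#   c4.4xlarge (no AVX512): +20.9%
#   c5.2xlarge (AVX512): -35.0%
#   c5.4xlarge (AVX512): -26.1%
```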

Results

I will break down the benchmarks per hardware configuration.

c4.2xlarge

Specs: Intel Xeon E5-2666 v3, 8 CPUs, no AVX512 support

We notice that on unsupported hardware there is a noticeable performance regression.

| Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- |
| CKKS convolution. Image shape 8x8 | 58.1 | 74.63 |
| CKKS convolution. Image shape 16x16 | 58.15 | 74.55 |
| CKKS convolution. Image shape 28x28 | 58.43 | 74.61 |
| Generate keys | 949.97 | 847.33 |
| mnist_prepare_input | 9.7 | 10.52 |
| MNIST eval conv | 236.32 | 291.69 |
| MNIST eval square1 | 8.24 | 10.63 |
| MNIST eval fc1 | 1094.3 | 1324.82 |
| MNIST eval square2 | 4.18 | 5.52 |
| MNIST eval fc2 | 116.83 | 138.31 |
| MNIST eval full | 1460.13 | 1770.5 |

| Tensor length | Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- | --- |
| 8192 | CKKS add | 0.15 | 0.15 |
| 8192 | CKKS multiply | 8.63 | 11.34 |
| 8192 | CKKS negate | 0.13 | 0.14 |
| 8192 | CKKS square | 8.36 | 10.94 |
| 8192 | CKKS sub | 0.15 | 0.15 |
| 8192 | CKKS dot | 54.79 | 71.03 |
| 8192 | CKKS polyval | 20.47 | 24.23 |
| 16384 | CKKS add | 0.29 | 0.29 |
| 16384 | CKKS multiply | 17.24 | 22.69 |
| 16384 | CKKS negate | 0.25 | 0.28 |
| 16384 | CKKS square | 16.69 | 21.87 |
| 16384 | CKKS sub | 0.28 | 0.3 |
| 16384 | CKKS dot | 110.28 | 142.59 |
| 16384 | CKKS polyval | 41.04 | 48.87 |

c4.4xlarge

Specs: Intel Xeon E5-2666 v3, 16 CPUs, no AVX512 support.

We redid the test with more CPUs to confirm the impact.

| Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- |
| Generate keys | 918.42 | 848.57 |
| mnist_prepare_input | 9.72 | 10.56 |
| MNIST eval conv | 234.87 | 292.02 |
| MNIST eval square1 | 8.28 | 10.66 |
| MNIST eval fc1 | 575.51 | 693.78 |
| MNIST eval square2 | 4.17 | 5.54 |
| MNIST eval fc2 | 68.58 | 80.9 |
| MNIST eval full | 877.32 | 1060.43 |
| CKKS convolution. Image shape 8x8 | 57.96 | 74.6 |
| CKKS convolution. Image shape 16x16 | 58.07 | 74.58 |
| CKKS convolution. Image shape 28x28 | 58.56 | 74.65 |

| Tensor length | Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- | --- |
| 8192 | CKKS add | 0.16 | 0.15 |
| 8192 | CKKS multiply | 8.64 | 11.28 |
| 8192 | CKKS negate | 0.13 | 0.14 |
| 8192 | CKKS square | 8.36 | 10.91 |
| 8192 | CKKS sub | 0.15 | 0.15 |
| 8192 | CKKS dot | 54.93 | 70.45 |
| 8192 | CKKS polyval | 20.49 | 24.16 |
| 16384 | CKKS add | 0.31 | 0.28 |
| 16384 | CKKS multiply | 17.28 | 22.56 |
| 16384 | CKKS negate | 0.26 | 0.27 |
| 16384 | CKKS square | 16.72 | 21.8 |
| 16384 | CKKS sub | 0.29 | 0.29 |
| 16384 | CKKS dot | 110.11 | 141.34 |
| 16384 | CKKS polyval | 41.07 | 48.34 |

However, when we switch to hardware that supports AVX512, we can see a major improvement.

c5.2xlarge

Specs: Intel Xeon Platinum 8275CL, 8 CPUs, with AVX512 support

| Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- |
| Generate keys | 819.0 | 633.85 |
| mnist_prepare_input | 8.73 | 5.21 |
| MNIST eval conv | 195.28 | 130.53 |
| MNIST eval square1 | 6.86 | 4.38 |
| MNIST eval fc1 | 923.04 | 587.95 |
| MNIST eval square2 | 3.46 | 2.25 |
| MNIST eval fc2 | 99.84 | 63.25 |
| MNIST eval full | 1229.2 | 799.0 |
| CKKS convolution. Image shape 8x8 | 48.77 | 34.04 |
| CKKS convolution. Image shape 16x16 | 48.8 | 33.69 |
| CKKS convolution. Image shape 28x28 | 49.2 | 33.21 |

| Tensor length | Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- | --- |
| 8192 | CKKS add | 0.13 | 0.11 |
| 8192 | CKKS multiply | 7.2 | 4.6 |
| 8192 | CKKS negate | 0.1 | 0.11 |
| 8192 | CKKS square | 6.94 | 4.6 |
| 8192 | CKKS sub | 0.12 | 0.13 |
| 8192 | CKKS dot | 45.85 | 31.15 |
| 8192 | CKKS polyval | 17.8 | 11.53 |
| 16384 | CKKS add | 0.25 | 0.23 |
| 16384 | CKKS multiply | 14.5 | 9.26 |
| 16384 | CKKS negate | 0.2 | 0.24 |
| 16384 | CKKS square | 13.98 | 9.09 |
| 16384 | CKKS sub | 0.25 | 0.26 |
| 16384 | CKKS dot | 92.0 | 62.68 |
| 16384 | CKKS polyval | 36.0 | 23.49 |

c5.4xlarge

Specs: Intel Xeon Platinum 8275CL, 16 CPUs, with AVX512 support

| Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- |
| Generate keys | 781.07 | 625.9 |
| mnist_prepare_input | 8.44 | 5.77 |
| MNIST eval conv | 186.44 | 143.01 |
| MNIST eval square1 | 6.47 | 4.79 |
| MNIST eval fc1 | 451.27 | 337.13 |
| MNIST eval square2 | 3.29 | 2.47 |
| MNIST eval fc2 | 55.79 | 40.46 |
| MNIST eval full | 712.69 | 526.74 |
| CKKS convolution. Image shape 8x8 | 46.24 | 36.43 |
| CKKS convolution. Image shape 16x16 | 46.25 | 37.11 |
| CKKS convolution. Image shape 28x28 | 46.91 | 37.05 |

| Tensor length | Test case | TenSEAL 0.3.0 duration (ms) | TenSEAL 0.3.1 duration (ms) |
| --- | --- | --- | --- |
| 8192 | CKKS add | 0.12 | 0.11 |
| 8192 | CKKS multiply | 6.84 | 5.18 |
| 8192 | CKKS negate | 0.1 | 0.11 |
| 8192 | CKKS square | 6.62 | 5.04 |
| 8192 | CKKS sub | 0.12 | 0.13 |
| 8192 | CKKS dot | 43.8 | 34.91 |
| 8192 | CKKS polyval | 16.81 | 12.8 |
| 16384 | CKKS add | 0.26 | 0.24 |
| 16384 | CKKS multiply | 13.78 | 10.41 |
| 16384 | CKKS negate | 0.2 | 0.23 |
| 16384 | CKKS square | 13.4 | 10.18 |
| 16384 | CKKS sub | 0.25 | 0.26 |
| 16384 | CKKS dot | 87.5 | 69.48 |
| 16384 | CKKS polyval | 33.95 | 25.74 |
fboemer commented 3 years ago

@bcebere, thanks for the detailed report! Our current HEXL implementation/integration has focused on improving performance on AVX512-enabled machines. In particular, recent Intel processors with the AVX512-IFMA52 instruction set (IceLake server, IceLake client) should yield up to an additional ~2x speedup over the CascadeLake servers you tried (see the performance numbers in Tables 1-4 of https://arxiv.org/pdf/2103.16400.pdf). We'll investigate the performance regression on non-AVX512 processors; thanks for pointing this out.

Regarding the library compilation: we currently compile for AVX512 only on machines that support the AVX512 instruction set. We'd be happy to investigate enabling AVX512 compilation on non-AVX512 machines if that would be helpful. I imagine this would help enable distribution of an AVX512-enabled tenseal package?

bcebere commented 3 years ago

Thank you so much for the explanations!

Regarding the compilation: we build, package, and deploy the library to PyPI using GitHub runners, and we don't have much control over the hardware we're using. Furthermore, we cannot distinguish between supported architectures at pip install time. Having a single binary for both scenarios (AVX512 and non-AVX512), compiled on any hardware, would be fantastic!
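For reference, here is a rough, Linux-only sketch of the kind of CPU feature check involved, parsing /proc/cpuinfo for the relevant flags. A real single-binary wheel would do this dispatch inside the native library; the snippet is purely illustrative.

```python
# Rough Linux-only check for AVX512 support via /proc/cpuinfo.
# Purely illustrative: a single-binary wheel would dispatch in native code.
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


flags = cpu_flags()
print("AVX512F:", "avx512f" in flags)         # base AVX512
print("AVX512-IFMA:", "avx512ifma" in flags)  # IFMA52, e.g. IceLake
```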