cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

storage: encryption-at-rest performance regression in FIPS build #103336

Open nicktrav opened 1 year ago

nicktrav commented 1 year ago

Describe the problem

We've observed a ~20% performance regression in the FIPS build of CockroachDB, when encryption is enabled.

TPCC (FIPS vs. regular binary):

tpccbench_aws_ear_fips

tpccbench_aws_ear

KV95 (FIPS vs. regular binary):

kv95_aws_ear_fips

kv95_aws_ear

Environment:

FIPS build of CockroachDB.

Additional context

Discussed internally, here.

Jira issue: CRDB-27950

Epic CRDB-16419

bdarnell commented 1 year ago

Encryption at rest uses the CTR cipher mode, which is a stream cipher that happens to be seekable (and we rely on this seekability so that we don't have to read a whole SST file at a time). The Go stdlib implements CTR as a stream cipher but does not expose any way to seek within the stream (if you dig into the Go stdlib code, note the ctrAble interface; the real implementation is in crypto/aes, not the one in crypto/cipher). Therefore, we implemented CTR ourselves.

Our CTR implementation calls the AES encryption primitive once for every 16 bytes of data (line 194 of the previous link). In the standard build, this compiles down to roughly a single hardware-accelerated CPU instruction. In the FIPS build, each call incurs CGO overhead, making it orders of magnitude more expensive.

So what we need is an implementation of CTR that is seekable but also processes blocks much larger than 16 bytes (i.e., kilobytes, aligned with Pebble's preferred read sizes). It is also important that it call the standard library's crypto functions so they get rerouted through the FIPS OpenSSL module. If you google "golang seekable ctr" there are a few libraries out there, but the first two results are not suitable: https://github.com/uhthomas/seekctr also does one call per 16 bytes, and https://github.com/starius/aesctrat has its own assembly and would not be FIPS-compliant.

When I looked at this six months ago I concluded that patching the standard library to make the CTR object seekable would be the best/only solution. I can't remember why but the crypto interfaces didn't seem amenable to "just" refactoring our ctr_stream.go to use larger block sizes.

srosenberg commented 1 year ago

In the standard build, this compiles down to roughly a single hardware-accelerated CPU instruction.

Right. On x86, it's essentially a PXOR followed by multiple AESENC instructions, depending on the key size (see aes/asm_amd64.s).

In the FIPS build, each call incurs CGO overhead, making it orders of magnitude more expensive.

Right. The flamegraph below shows CGO's overhead is substantial.

When I looked at this six months ago I concluded that patching the standard library to make the CTR object seekable would be the best/only solution. I can't remember why but the crypto interfaces didn't seem amenable to "just" refactoring our ctr_stream.go to use larger block sizes.

I also reached the same conclusion after looking at the current stdlib implementation. Note that as of [1] (in Go 1.20) we can get a speedup by using a single SIMD PXOR instead of the 16 byte-wise XORs we do in [2]. (See subtle/xor_amd64.s.)

[1] https://github.com/golang/go/issues/53021 [2] https://github.com/cockroachdb/cockroach/blob/ced134707ca0b934da4a4997fc1357cfe7c42d07/pkg/ccl/storageccl/engineccl/ctr_stream.go#L195-L197

srosenberg commented 1 year ago

After ~2 weeks of nightly roachtests in FIPS-mode (cf. [1], [2]), a performance regression is evident across the benchmarks which use encryption-at-rest (i.e., enc=true). The regression is more pronounced on TPC-C, ~20–25%.

To investigate further, we ran tpccbench/nodes=3/cpu=16/enc=true both in FIPS-mode and without. The setup section details the steps for reproducibility. The findings are summarized below.

Setup

Build fips-enabled crdb and workload binaries,

./dev build --cross=linuxfips workload cockroach-short
mv artifacts/workload artifacts/workload_fips
mv artifacts/cockroach-short artifacts/cockroach_fips

Provision 3-node cluster with prom/grafana,

roachprod create stan-tpcc-fips -n4 --gce-machine-type n1-standard-16 --fips
roachprod grafana-start stan-tpcc-fips

Sanity check that FIPS is enabled in the kernel,

roachprod run stan-tpcc-fips "cat /proc/sys/crypto/fips_enabled"|grep -v nil
stan-tpcc-fips: cat /proc/sys/crypto/fips_e....   
   1: 1
   2: 1
   3: 1
   4: 1

Run tpccbench/nodes=3/cpu=16/enc=true,

roachtest run --port 9090 'tpccbench/nodes=3/cpu=16/enc=true$' --workload artifacts/workload_fips --cockroach artifacts/cockroach_fips --cluster stan-tpcc-fips --debug-always

Get external ip of n1,

roachprod ip --external stan-tpcc-fips:1

Profiles were collected using pprof-loop,

scripts/pprof-loop.sh "http://34.23.121.42:26258/debug/pprof/profile?seconds=3"
scripts/pprof-loop.sh "http://34.23.121.42:26258/debug/pprof/heap?seconds=3"

[1] https://roachperf.crdb.dev/release-23.1-fips/?filter=tpcc&view=tpccbench%2Fnodes%3D3%2Fcpu%3D16%2Fenc%3Dtrue&tab=gce [2] https://roachperf.crdb.dev/release-23.1/?filter=tpcc&view=tpccbench%2Fnodes%3D3%2Fcpu%3D16%2Fenc%3Dtrue&tab=gce

srosenberg commented 1 year ago

Recall that tpccbench uses a line search to find an optimal number of warehouses (with respect to efficiency). The same time window, from the last iteration, is used to compare the system metrics below.

tpmC

tpmc

CPU/Load

FIPS-mode,

fips_cpu_and_load

Without,

cpu_and_load

Log Commit Latency

FIPS-mode,

fips_log_commit_latency

Without,

log_commit_latency

CPU Profiles

FIPS-mode,

aes_encrypt_fips

Without,

aes_encrypt

srosenberg commented 1 year ago

In summary, the bottleneck is CGO, as clearly seen in the above flamegraphs. Note that the extra mallocs are a one-time cost [1] (first call to Encrypt); they account for <= 0.1%. Otherwise, the memory profiles of the two runs are very similar.

From the CPU utilization graphs, we can observe that the FIPS run pushes mean utilization to ~97% (cf. ~95% without). This is also reflected as a higher latency in the (raft) log commits.

The performance regression on TPC-C is indeed >= ~20%. In contrast to kv95 (~3–5%), we see a larger hit because tpccbench is CPU-bound.

[1] https://github.com/golang-fips/go/blob/5f6364d8c0a02e9578dfc50987b5611b75cf9264/patches/001-initial-openssl-for-fips.patch#L2470

sumeerbhola commented 1 year ago

Our CTR implementation calls the AES encryption primitive once for every 16 bytes of data (line 194 of the previous link). In the standard build, this compiles down to roughly a single hardware-accelerated CPU instruction. In the FIPS build, each call incurs CGO overhead, making it orders of magnitude more expensive. So what we need is an implementation of CTR that is seekable but also uses blocks much larger than 16 bytes (i.e. kilobytes. Align this with pebble's favored read sizes).

I am wary of changing the CTR block size, since it would mean supporting two different block sizes. I think there is a simpler solution: fileCipherStream.{Decrypt,Encrypt} is already called with a byte slice that is much bigger than 16 bytes (the sstable block size for reads, ~32KB). We can write a C implementation for these, so the CGO crossing happens once per call.

bdarnell commented 1 year ago

Yes. "Block" was the wrong word on my part; what we're really talking about is batching the same computations into one cgo crossing. There are two tricky things about using cgo here.

bdarnell commented 8 months ago

With the v2 implementation of encryption-at-rest, FIPS mode is actually faster than non-FIPS: https://github.com/cockroachdb/cockroach/pull/115549#issuecomment-1843108426