nicktrav opened 1 year ago
Encryption at rest uses the CTR cipher mode, which is a stream cipher that happens to be seekable (and we rely on this seekability so that we don't have to read a whole SST file at a time). The Go stdlib implements CTR as a stream cipher but does not expose any way to seek within the stream (if you dig into the Go stdlib code, note the ctrAble interface; the real implementation is in crypto/aes, not the one in crypto/cipher). Therefore, we implemented CTR ourselves.
Our CTR implementation calls the AES encryption primitive once for every 16 bytes of data (line 194 of the previous link). In the standard build, this compiles down to roughly a single hardware-accelerated CPU instruction. In the FIPS build, each call incurs CGO overhead, making it orders of magnitude more expensive.
So what we need is an implementation of CTR that is seekable but also uses blocks much larger than 16 bytes (i.e., kilobytes, aligned with Pebble's preferred read sizes). It is also important that it call the standard library's crypto functions so they can get rerouted through the FIPS OpenSSL module. If you google "golang seekable ctr" there are a few libraries out there, but the first two results are not suitable: https://github.com/uhthomas/seekctr also does one call per 16 bytes, and https://github.com/starius/aesctrat has its own assembly and would not be FIPS-compliant.
When I looked at this six months ago I concluded that patching the standard library to make the CTR object seekable would be the best/only solution. I can't remember why, but the crypto interfaces didn't seem amenable to "just" refactoring our ctr_stream.go to use larger block sizes.
In the standard build, this compiles down to roughly a single hardware-accelerated CPU instruction.
Right. On x86, it's essentially PXOR followed by multiple AESENC, depending on the size of the key (see aes/asm_amd64.s).
In the FIPS build, each call incurs CGO overhead, making it orders of magnitude more expensive.
Right. The flamegraph below shows CGO's overhead is substantial.
When I looked at this six months ago I concluded that patching the standard library to make the CTR object seekable would be the best/only solution. I can't remember why but the crypto interfaces didn't seem amenable to "just" refactoring our ctr_stream.go to use larger block sizes.
I also reached the same conclusion after looking at the current stdlib implementation. Note, as of [1] (in 1.20) we can get a speedup by using a single SIMD PXOR instead of the 16 XORs we do in [2] (see subtle/xor_amd64.s).
[1] https://github.com/golang/go/issues/53021 [2] https://github.com/cockroachdb/cockroach/blob/ced134707ca0b934da4a4997fc1357cfe7c42d07/pkg/ccl/storageccl/engineccl/ctr_stream.go#L195-L197
After ~2 weeks of nightly roachtests in FIPS-mode (cf. [1], [2]), a performance regression is evident across the benchmarks which use encryption-at-rest (i.e., enc=true). The regression is more pronounced on TPC-C, ~20–25%.
To investigate further, we ran tpccbench/nodes=3/cpu=16/enc=true both in FIPS-mode and without. The setup section details the steps for reproducibility. The findings are summarized below.
Build fips-enabled crdb and workload binaries,
./dev build --cross=linuxfips workload cockroach-short
mv artifacts/workload artifacts/workload_fips
mv artifacts/cockroach-short artifacts/cockroach_fips
Provision 3-node cluster with prom/grafana,
roachprod create stan-tpcc-fips -n4 --gce-machine-type n1-standard-16 --fips
roachprod grafana-start stan-tpcc-fips
Sanity check that FIPS is enabled in the kernel,
roachprod run stan-tpcc-fips "cat /proc/sys/crypto/fips_enabled"|grep -v nil
stan-tpcc-fips: cat /proc/sys/crypto/fips_e....
1: 1
2: 1
3: 1
4: 1
Run tpccbench/nodes=3/cpu=16/enc=true,
roachtest run --port 9090 'tpccbench/nodes=3/cpu=16/enc=true$' --workload artifacts/workload_fips --cockroach artifacts/cockroach_fips --cluster stan-tpcc-fips --debug-always
Get external ip of n1,
roachprod ip --external stan-tpcc-fips:1
Profiles were collected using pprof-loop,
scripts/pprof-loop.sh "http://34.23.121.42:26258/debug/pprof/profile?seconds=3"
scripts/pprof-loop.sh "http://34.23.121.42:26258/debug/pprof/heap?seconds=3"
[1] https://roachperf.crdb.dev/release-23.1-fips/?filter=tpcc&view=tpccbench%2Fnodes%3D3%2Fcpu%3D16%2Fenc%3Dtrue&tab=gce [2] https://roachperf.crdb.dev/release-23.1/?filter=tpcc&view=tpccbench%2Fnodes%3D3%2Fcpu%3D16%2Fenc%3Dtrue&tab=gce
Recall that tpccbench uses a line search to find an optimal number of warehouses (with respect to efficiency). The same time window, for the last iteration, is used to compare system metrics below.
[Figures: flamegraphs, memory profiles, and CPU utilization graphs, each pair comparing FIPS-mode vs. without]
In summary, the bottleneck is CGO, as clearly seen in the above flamegraphs. Note, the extra mallocs are a one-time thing [1] (first call to Encrypt); they account for <= 0.1%. Otherwise, the memory profiles of the two runs are very similar.
From the CPU utilization graphs, we can observe that the FIPS run pushes mean utilization to ~97% (cf. ~95% without). This is also reflected as a higher latency in the (raft) log commits.
The performance regression on TPC-C is indeed >= ~20%. In contrast to kv95 (~3–5%), we see a larger hit owing to the fact that tpccbench is CPU-bound.
Our CTR implementation calls the AES encryption primitive once for every 16 bytes of data (line 194 of the previous link). In the standard build, this compiles down to roughly a single hardware-accelerated CPU instruction. In the FIPS build, each call incurs CGO overhead, making it orders of magnitude more expensive. So what we need is an implementation of CTR that is seekable but also uses blocks much larger than 16 bytes (i.e., kilobytes, aligned with Pebble's preferred read sizes).
I am wary of changing the CTR block size since it would mean supporting two different block sizes. I think there is a simpler solution: fileCipherStream.{Decrypt,Encrypt} is being called with a byte slice that is much bigger than 16 bytes (the sstable block size for reads, ~32KB). We can write a C implementation of these and do the CGO crossing once.
Yes. "Block" was the wrong word on my part; what we're really talking about is batching the same computations into one cgo crossing. There are two tricky things about using cgo here:
With the v2 implementation of encryption-at-rest, FIPS mode is actually faster than non-FIPS: https://github.com/cockroachdb/cockroach/pull/115549#issuecomment-1843108426
Describe the problem
We've observed a ~20% performance regression in the FIPS build of CockroachDB when encryption is enabled.
TPCC (FIPS vs. regular binary):
KV95 (FIPS vs. regular binary):
Environment:
FIPS build of CockroachDB.
Additional context
Discussed internally, here.
Jira issue: CRDB-27950
Epic CRDB-16419