Open dt opened 7 months ago
cc @cockroachdb/disaster-recovery
@dt do you know how long (roughly) the non-preemptable intervals are? < 10ms?
> how long (roughly) the non-preemptable intervals are? < 10ms?
Typically around or under 10ms, but some traces indicated as much as 40ms.
@dt @sumeerbhola what is our current plan with this? Who is owning its resolution? We have at least one customer who is actively waiting on improvements here, now that we've identified what those improvements should be.
I don't think this should have the AC label, since this is about work that is not preemptible by the goroutine scheduler.
There is a secondary issue that preemptible work should request admission, via the elastic CPU tokens mechanism -- that is already tracked in https://github.com/cockroachdb/cockroach/issues/107770
> what is our current plan with this
On the DR side, we've determined that we don't have the option to simply avoid hashing -- turning off md5'ing in s3 uploads breaks when bucket versioning/object locking is enabled -- and the minimum buffer size we can pass to s3 to hash and upload is 5MB (the default is 8MB in 23.1, and will be 5MB in 23.2).
> work that is not preemptible by the goroutine scheduler.
I think this issue is tracking making some classes of non-preemptible work actually preemptible, i.e. patching our fork of the stdlib so that the two mentioned hashing functions become preemptible by the scheduler. I believe AC is already maintaining our patched fork of go (for the grunning patches?).
> preemptible work should request admission, via the elastic CPU tokens mechanism
This is on DR's backlog: integrating AC's existing libraries for requesting tokens/pacing into the backup process as it constructs SSTs and writes them into the various cloud SDK uploaders. But that won't help us here: we can write as slowly as we want, but the uploaders buffer what we write into a chunk before hashing and uploading it, and the minimum chunk size is 5MB, so that's the unit that'll get passed to hashing at once.
Shall we add a T- label here, so this doesn't fall through the cracks?
cc @cockroachdb/disaster-recovery
Here are some results comparing BenchmarkLatencyWhileHashing between builds on master @ cc4fdffa853. The results look promising.
https://gist.github.com/nicktrav/e1544fb1dc2d1bc6d04a1dd9db52e670
The RHS updates our runtime patch to include the following:
diff --git a/src/crypto/md5/md5.go b/src/crypto/md5/md5.go
index ccee4ea3a9..e15bc6d6d6 100644
--- a/src/crypto/md5/md5.go
+++ b/src/crypto/md5/md5.go
@@ -27,6 +27,10 @@ const Size = 16
// The blocksize of MD5 in bytes.
const BlockSize = 64
+// The maximum number of bytes that can be passed to block.
+const maxAsmIters = 1024
+const maxAsmSize = BlockSize * maxAsmIters // 64KiB
+
const (
init0 = 0x67452301
init1 = 0xEFCDAB89
@@ -130,6 +134,11 @@ func (d *digest) Write(p []byte) (nn int, err error) {
if len(p) >= BlockSize {
n := len(p) &^ (BlockSize - 1)
if haveAsm {
+ for n > maxAsmSize {
+ block(d, p[:maxAsmSize])
+ p = p[maxAsmSize:]
+ n -= maxAsmSize
+ }
block(d, p[:n])
} else {
blockGeneric(d, p[:n])
diff --git a/src/crypto/md5/md5_test.go b/src/crypto/md5/md5_test.go
index 851e7fb10d..e120be3718 100644
--- a/src/crypto/md5/md5_test.go
+++ b/src/crypto/md5/md5_test.go
@@ -120,10 +120,11 @@ func TestGoldenMarshal(t *testing.T) {
func TestLarge(t *testing.T) {
const N = 10000
+ const offsets = 4
ok := "2bb571599a4180e1d542f76904adc3df" // md5sum of "0123456789" * 1000
- block := make([]byte, 10004)
+ block := make([]byte, N+offsets)
c := New()
- for offset := 0; offset < 4; offset++ {
+ for offset := 0; offset < offsets; offset++ {
for i := 0; i < N; i++ {
block[offset+i] = '0' + byte(i%10)
}
@@ -142,6 +143,31 @@ func TestLarge(t *testing.T) {
}
}
+func TestExtraLarge(t *testing.T) {
+ const N = 100000
+ const offsets = 4
+ ok := "13572e9e296cff52b79c52148313c3a5" // md5sum of "0123456789" * 10000
+ block := make([]byte, N+offsets)
+ c := New()
+ for offset := 0; offset < offsets; offset++ {
+ for i := 0; i < N; i++ {
+ block[offset+i] = '0' + byte(i%10)
+ }
+ for blockSize := 10; blockSize <= N; blockSize *= 10 {
+ blocks := N / blockSize
+ b := block[offset : offset+blockSize]
+ c.Reset()
+ for i := 0; i < blocks; i++ {
+ c.Write(b)
+ }
+ s := fmt.Sprintf("%x", c.Sum(nil))
+ if s != ok {
+ t.Fatalf("md5 TestExtraLarge offset=%d, blockSize=%d = %s want %s", offset, blockSize, s, ok)
+ }
+ }
+ }
+}
+
// Tests that blockGeneric (pure Go) and block (in assembly for amd64, 386, arm) match.
func TestBlockGeneric(t *testing.T) {
gen, asm := New().(*digest), New().(*digest)
diff --git a/src/crypto/sha256/sha256.go b/src/crypto/sha256/sha256.go
index 2deafbc9fc..567a3c81f9 100644
--- a/src/crypto/sha256/sha256.go
+++ b/src/crypto/sha256/sha256.go
@@ -28,6 +28,10 @@ const Size224 = 28
// The blocksize of SHA256 and SHA224 in bytes.
const BlockSize = 64
+// The maximum number of bytes that can be passed to block.
+const maxAsmIters = 1024
+const maxAsmSize = BlockSize * maxAsmIters // 64KiB
+
const (
chunk = 64
init0 = 0x6A09E667
@@ -191,6 +195,11 @@ func (d *digest) Write(p []byte) (nn int, err error) {
}
if len(p) >= chunk {
n := len(p) &^ (chunk - 1)
+ for n > maxAsmSize {
+ block(d, p[:maxAsmSize])
+ p = p[maxAsmSize:]
+ n -= maxAsmSize
+ }
block(d, p[:n])
p = p[n:]
}
cc @cockroachdb/disaster-recovery
Nick is OOO, so assigning to @sumeerbhola so it doesn't get lost.
It appears that due to https://github.com/golang/go/issues/64417, we sometimes see long GC pause times, and traces show STW pauses overlapping with block hashing. This has been observed, for example, during backups to s3, wherein the s3 sdk does both md5 and sha256 hashes of the 8MB chunks it buffers before uploading, but it could affect other users of the hashing libraries as well.
We may want to consider patching the hashing functions on our go fork until the issue is resolved upstream.
Jira issue: CRDB-33925