cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

perf: quantify benefit of larger (32KB) goroutine initial stack size #17242

Closed petermattis closed 3 years ago

petermattis commented 7 years ago

A new goroutine is created with a 2KB stack. Unfortunately, cockroach often uses more stack space than that, requiring a relatively expensive operation to allocate a larger stack and copy the existing stack contents to it. This issue has been filed upstream as https://github.com/golang/go/issues/18138. One of the Go maintainers suggested a possible solution: make the initial stack size larger. This isn't problematic for processes with large numbers of goroutines because the stacks would be shrunk by GC as necessary, and it would help performance for short-lived goroutines such as those created by gRPC for request processing. Micro-benchmarks suggest a modest improvement above and beyond our growStack hack. See https://github.com/golang/go/issues/18138#issuecomment-308294968.
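
For context, the growStack hack mentioned above pre-grows a goroutine's stack with a single large, non-inlinable frame so that later deep call paths don't pay for repeated runtime.morestack copies. A minimal sketch of that pattern (the package name, the 16KB size, and the global sink are assumptions for illustration, not the exact cockroach code):

package util

// growStackGlobal is never set at runtime; referencing it keeps the compiler
// from proving buf is unused and eliding the large frame.
var growStackGlobal = false

// growStack pre-grows the calling goroutine's stack to at least ~16KB in a
// single step by declaring one large, non-inlinable frame. Call it at the top
// of a goroutine that is known to need a deep stack.
//go:noinline
func growStack() {
	var buf [16 << 10]byte
	if growStackGlobal {
		for i := range buf {
			buf[i] = byte(i)
		}
	}
}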

There is evidence that the microbenchmark numbers do translate to a modest improvement in real cluster throughput. We should quantify this improvement by running the scalability tests both with and without the larger stack size.
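
A hedged sketch of the kind of microbenchmark referenced above (not the exact upstream benchmark; all names are assumptions): spawn short-lived goroutines that immediately need well over 2KB of stack, so each one pays for stack growth unless the initial stack is already large enough.

package stackgrowth

import "testing"

// sink keeps the benchmark work from being optimized away.
var sink byte

// useStack consumes roughly depth KB of stack, one non-inlinable ~1KB frame
// per level, forcing a freshly created 2KB goroutine through several
// runtime.morestack growths.
//go:noinline
func useStack(depth int) byte {
	if depth == 0 {
		return 0
	}
	var buf [1 << 10]byte
	buf[0] = useStack(depth - 1)
	return buf[0] + buf[len(buf)-1]
}

// BenchmarkShortLivedGoroutines mimics the gRPC request-handling pattern the
// issue describes: each iteration starts a goroutine that immediately needs
// ~16KB of stack and then exits. Compare ns/op between a stock toolchain and
// one patched for a 32KB initial stack.
func BenchmarkShortLivedGoroutines(b *testing.B) {
	done := make(chan byte)
	for i := 0; i < b.N; i++ {
		go func() {
			done <- useStack(16)
		}()
		sink = <-done
	}
}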

petermattis commented 7 years ago

kv --read-percent=95 workload for normal vs 32KB Go stack (using cockroach-824b975):

[screenshot, 2017-08-07: throughput graph comparing the normal and 32KB initial stack builds]

Note that the cockroach binary for the blue line also contained a patch removing our growStack hack, which is no longer necessary with the larger initial stack. The short summary is that using a 32KB initial stack size improves our throughput by ~5%. The patch to the Go runtime is:

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index f41672d..ea68a9b 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -2846,6 +2846,10 @@ func newproc(siz int32, fn *funcval) {
    })
 }

+const stackInit = 32 << 10
+
+// const stackInit = _StackMin
+
 // Create a new g running fn with narg bytes of arguments starting
 // at argp and returning nret bytes of results.  callerpc is the
 // address of the go statement that created this. The new g is put
@@ -2872,7 +2876,7 @@ func newproc1(fn *funcval, argp *uint8, narg int32, nret int32, callerpc uintptr
    _p_ := _g_.m.p.ptr()
    newg := gfget(_p_)
    if newg == nil {
-       newg = malg(_StackMin)
+       newg = malg(stackInit)
        casgstatus(newg, _Gidle, _Gdead)
        newg.gcRescan = -1
        allgadd(newg) // publishes with a g->status of Gdead so GC scanner doesn't look at uninitialized stack.
@@ -3041,10 +3045,10 @@ retry:
        if gp.stack.lo == 0 {
            // Stack was deallocated in gfput. Allocate a new one.
            systemstack(func() {
-               gp.stack, gp.stkbar = stackalloc(_FixedStack)
+               gp.stack, gp.stkbar = stackalloc(stackInit)
            })
            gp.stackguard0 = gp.stack.lo + _StackGuard
-           gp.stackAlloc = _FixedStack
+           gp.stackAlloc = stackInit
        } else {
            if raceenabled {
                racemalloc(unsafe.Pointer(gp.stack.lo), gp.stackAlloc)
diff --git a/src/runtime/stack.go b/src/runtime/stack.go
index 0f1a5c1..d055d6c 100644
--- a/src/runtime/stack.go
+++ b/src/runtime/stack.go
@@ -949,6 +949,12 @@ func round2(x int32) int32 {
    return 1 << s
 }

+var debugStack = gogetenv("GOSTACK") == "1"
+
+func DebugStack() {
+   debugStack = true
+}
+
 // Called from runtime·morestack when more stack is needed.
 // Allocate larger stack and relocate to new stack.
 // Stack growth is multiplicative, for constant amortized cost.
@@ -976,6 +982,14 @@ func newstack(ctxt unsafe.Pointer) {
    // morestack so it has the necessary write barrier.
    gp.sched.ctxt = ctxt

+   if debugStack {
+       oldsize := int(gp.stackAlloc)
+       newsize := oldsize * 2
+       print("runtime: newstack: ", oldsize, " -> ", newsize, "\n")
+       morebuf := thisg.m.morebuf
+       traceback(morebuf.pc, morebuf.sp, morebuf.lr, morebuf.g.ptr())
+   }
+
    if thisg.m.curg.throwsplit {
        // Update syscallsp, syscallpc in case traceback uses them.
        morebuf := thisg.m.morebuf

The changes to runtime/stack.go are unnecessary, except that they make it easy to verify that a binary contains the patch (our builder.sh tooling requires some manual cache invalidation to ensure changes to the runtime are picked up).
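
One way to exercise that GOSTACK hook from the stack.go hunk above, sketched as a tiny program (names assumed): build it with the patched toolchain and run it with GOSTACK=1; each stack growth should print "runtime: newstack: <old> -> <new>" plus a traceback, confirming the patched runtime is in the binary.

package main

// burn consumes roughly depth KB of stack, one non-inlinable ~1KB frame per
// level, so the goroutine below triggers several stack growths.
//go:noinline
func burn(depth int) byte {
	if depth == 0 {
		return 0
	}
	var buf [1 << 10]byte
	buf[0] = burn(depth - 1)
	return buf[0] + buf[len(buf)-1]
}

func main() {
	done := make(chan byte)
	go func() {
		done <- burn(64) // ~64KB of stack: several doublings even from 32KB
	}()
	<-done
}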

Note that we already distribute binaries built with a custom Go toolchain. Changing the initial stack size to 32KB would be an additional step where we'd distribute binaries with a custom Go runtime. Somewhat scary. The conservative approach would be to use the above numbers as evidence to push for a change upstream in go1.10.

Cc @cockroachdb/core

nvanbenschoten commented 5 years ago

I've seen this more and more often lately, and runtime·morestack looks pretty expensive on CPU profiles (~10%). There's certainly still something to do here.
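
For anyone reproducing that observation outside of a running cluster, a minimal sketch of capturing a CPU profile with runtime/pprof (the file name and stand-in workload are assumptions); loading the output with go tool pprof and looking at cumulative time in runtime.morestack / runtime.newstack shows the fraction of CPU spent growing stacks.

package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

// main writes a CPU profile to cpu.pprof while the process runs its workload
// (a sleep here as a stand-in). Inspect it with `go tool pprof cpu.pprof`
// and check the cumulative time attributed to runtime.morestack.
func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	time.Sleep(30 * time.Second) // stand-in for the workload under test
}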

petermattis commented 5 years ago

@nvanbenschoten Mind poking that upstream issue? I believe the diff in https://github.com/cockroachdb/cockroach/issues/17242#issuecomment-320769444 still provides a benefit. You could apply it to go1.10 or go1.11 and double-check that.