felixge / fgprof

🚀 fgprof is a sampling Go profiler that allows you to analyze On-CPU as well as Off-CPU (e.g. I/O) time together.
MIT License
2.93k stars 99 forks source link

overhead negligible goroutine sampling #31

Open zdyj3170101136 opened 5 months ago

zdyj3170101136 commented 5 months ago

The main cost of goroutine performance analysis comes from stop the world. And each goroutine needs to traverse the stack through an expensive unwinder function.

We can optimize it by using fp traceback.

The main cost of goroutine performance analysis comes from stop the world. And each goroutine needs to traverse the stack through an expensive unwinder function.

We can optimize it by using fp traceback. Even if we traverse thousands of goroutines, the cost is negligible, so the only overhead is to stop the world and restart the world.

see https://github.com/golang/go/issues/66915

felixge commented 5 months ago

This will eventually be implemented upstream in Go. Once this is done, fgprof will automatically support it. So I think this issue can be closed?

zdyj3170101136 commented 5 months ago

这最终将在 Go 上游实现。一旦完成,fgprof 将自动支持它。所以我认为这个问题可以关闭了?

Stop the world and start the world are so expensive that goroutine sampling cannot be enabled continuously online.

If we can use ebpf to uprobe goroutine status change and then extract the trace id from the request (the offset is fixed),

we can analyze the time spent on offcpu (block or syscall) for each request.

felixge commented 5 months ago

Stop the world and start the world are so expensive that goroutine sampling cannot be enabled continuously online.

Are you sure? It should not be too expensive.

If we can use ebpf to uprobe goroutine status change and then extract the trace id from the request (the offset is fixed),

uprobe's are pretty expensive (2-3usec).

Do you have benchmarks?

zdyj3170101136 commented 5 months ago

This will eventually be implemented upstream in Go

This will eventually be implemented upstream in Go

this would only supported in golang 1.23+。

but if we support it as a go module which could be imported.

fgprof could be enabled in lower go version.

zdyj3170101136 commented 5 months ago

benchmark.

run benchmark on src/encoding/json:

go version go1.20.12 linux/amd64

bench code

record all goroutine trace back with 100hz.

type stack struct {
    lo uintptr
    hi uintptr
}

type gobuf struct {
    sp   uintptr
    pc   uintptr
    g    uintptr
    ctxt unsafe.Pointer
    ret  uintptr
    lr   uintptr
    bp   uintptr
}

type g struct {
    // Stack parameters.
    // stack describes the actual stack memory: [stack.lo, stack.hi).
    // stackguard0 is the stack pointer compared in the Go stack growth prologue.
    // It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption.
    // stackguard1 is the stack pointer compared in the //go:systemstack stack growth prologue.
    // It is stack.lo+StackGuard on g0 and gsignal stacks.
    // It is ~0 on other goroutine stacks, to trigger a call to morestackc (and crash).
    stack       stack   // offset known to runtime/cgo
    stackguard0 uintptr // offset known to liblink
    stackguard1 uintptr // offset known to liblink

    _panic    unsafe.Pointer // innermost panic - offset known to liblink
    _defer    unsafe.Pointer // innermost defer
    m         unsafe.Pointer // current m; offset known to arm liblink
    sched     gobuf
    syscallsp uintptr // if status==Gsyscall, syscallsp = sched.sp to use during gc
    syscallpc uintptr // if status==Gsyscall, syscallpc = sched.pc to use during gc
    stktopsp  uintptr // expected sp at top of stack, to check in traceback
    // param is a generic pointer parameter field used to pass
    // values in particular contexts where other storage for the
    // parameter would be difficult to find. It is currently used
    // in four ways:
    // 1. When a channel operation wakes up a blocked goroutine, it sets param to
    //    point to the sudog of the completed blocking operation.
    // 2. By gcAssistAlloc1 to signal back to its caller that the goroutine completed
    //    the GC cycle. It is unsafe to do so in any other way, because the goroutine's
    //    stack may have moved in the meantime.
    // 3. By debugCallWrap to pass parameters to a new goroutine because allocating a
    //    closure in the runtime is forbidden.
    // 4. When a panic is recovered and control returns to the respective frame,
    //    param may point to a savedOpenDeferState.
    param        unsafe.Pointer
    atomicstatus atomic.Uint32
}

//go:linkname forEachGRace runtime.forEachGRace
func forEachGRace(fn func(gp *g)) bool

type stwReason uint

type worldStop struct {
    reason stwReason
    start  int64
}

//go:linkname stopTheWorld runtime.stopTheWorld
func stopTheWorld(reason stwReason) worldStop

//go:linkname startTheWorld runtime.startTheWorld
func startTheWorld(w worldStop)

func init() {
    go func() {
        t := time.NewTicker(10 * time.Millisecond)
        for range t.C {
            stw := stopTheWorld(4)
            forEachGRace(func(gp *g) {
                if gp.atomicstatus.Load() == 6 {
                    // gdead.
                    return
                }
                myfpTracebackPCs(unsafe.Pointer(gp.sched.bp), make([]uintptr, 32)[:])
            })
            startTheWorld(stw)
        }
    }()
}

func myfpTracebackPCs(fp unsafe.Pointer, pcBuf []uintptr) (i int) {
    for i = 0; i < len(pcBuf) && fp != nil; i++ {
        // return addr sits one word above the frame pointer
        pcBuf[i] = *(*uintptr)(unsafe.Pointer(uintptr(fp) + PtrSize))
        prevfp := fp
        // follow the frame pointer to the next one
        fp = unsafe.Pointer(*(*uintptr)(fp))
        if uintptr(fp) <= uintptr(prevfp) {
            break
        }
    }
    return i
}

const PtrSize = 4 << (^uintptr(0) >> 63)

bench result

benchstat:

goos: linux
goarch: amd64
pkg: encoding/json
cpu: Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
                                    │     old      │                 new                 │
                                    │    sec/op    │    sec/op     vs base               │
CodeEncoder-32                        508.1µ ±  3%   508.0µ ±  2%       ~ (p=0.853 n=10)
CodeEncoderError-32                   547.4µ ±  3%   550.1µ ±  3%       ~ (p=0.853 n=10)
CodeMarshal-32                        587.8µ ±  5%   590.4µ ±  4%       ~ (p=0.853 n=10)
CodeMarshalError-32                   686.8µ ±  3%   689.1µ ±  2%       ~ (p=0.739 n=10)
MarshalBytes/32-32                    249.7n ±  3%   245.1n ±  1%  -1.80% (p=0.011 n=10)
MarshalBytes/256-32                   761.4n ±  2%   754.6n ±  2%       ~ (p=0.247 n=10)
MarshalBytes/4096-32                  7.335µ ±  2%   7.331µ ±  1%       ~ (p=0.869 n=10)
MarshalBytesError/32-32               526.4µ ±  4%   533.2µ ±  3%       ~ (p=0.436 n=10)
MarshalBytesError/256-32              523.1µ ±  8%   521.2µ ±  3%       ~ (p=0.529 n=10)
MarshalBytesError/4096-32             538.1µ ±  3%   536.2µ ±  2%       ~ (p=1.000 n=10)
CodeDecoder-32                        2.491m ±  2%   2.441m ±  3%       ~ (p=0.143 n=10)
UnicodeDecoder-32                     289.5n ±  2%   291.7n ±  2%       ~ (p=0.393 n=10)
DecoderStream-32                      213.8n ±  2%   216.0n ±  2%       ~ (p=0.143 n=10)
CodeUnmarshal-32                      3.051m ±  5%   3.092m ±  3%       ~ (p=0.315 n=10)
CodeUnmarshalReuse-32                 2.690m ±  3%   2.677m ±  3%       ~ (p=0.912 n=10)
UnmarshalString-32                    61.87n ±  1%   62.16n ±  1%       ~ (p=0.342 n=10)
UnmarshalFloat64-32                   57.03n ±  1%   56.88n ±  1%       ~ (p=0.363 n=10)
UnmarshalInt64-32                     53.05n ±  7%   51.38n ±  3%  -3.16% (p=0.043 n=10)
Issue10335-32                         71.35n ±  1%   71.36n ±  1%       ~ (p=0.781 n=10)
Issue34127-32                         31.81n ±  2%   31.17n ±  2%       ~ (p=0.055 n=10)
Unmapped-32                           135.4n ±  2%   135.9n ±  3%       ~ (p=0.753 n=10)
TypeFieldsCache/MissTypes1-32         23.97µ ±  2%   23.81µ ±  2%       ~ (p=0.218 n=10)
TypeFieldsCache/MissTypes10-32        51.72µ ±  1%   51.61µ ±  2%       ~ (p=0.971 n=10)
TypeFieldsCache/MissTypes100-32       262.2µ ±  3%   261.4µ ±  2%       ~ (p=0.912 n=10)
TypeFieldsCache/MissTypes1000-32      1.975m ±  5%   1.968m ±  4%       ~ (p=0.971 n=10)
TypeFieldsCache/MissTypes10000-32     17.94m ±  2%   18.02m ±  1%       ~ (p=0.143 n=10)
TypeFieldsCache/MissTypes100000-32    176.6m ±  3%   173.7m ±  2%       ~ (p=0.280 n=10)
TypeFieldsCache/MissTypes1000000-32    1.935 ± 10%    1.992 ±  9%       ~ (p=0.481 n=10)
TypeFieldsCache/HitTypes1-32          1.704n ±  8%   1.650n ± 14%       ~ (p=0.325 n=10)
TypeFieldsCache/HitTypes10-32         1.639n ±  7%   1.627n ±  2%       ~ (p=0.541 n=10)
TypeFieldsCache/HitTypes100-32        1.658n ±  5%   1.659n ±  9%       ~ (p=0.837 n=10)
TypeFieldsCache/HitTypes1000-32       1.666n ±  5%   1.655n ±  5%       ~ (p=0.592 n=10)
TypeFieldsCache/HitTypes10000-32      1.622n ±  4%   1.631n ±  3%       ~ (p=0.986 n=10)
TypeFieldsCache/HitTypes100000-32     1.649n ±  8%   1.667n ±  8%       ~ (p=0.403 n=10)
TypeFieldsCache/HitTypes1000000-32                   1.627n ±  2%
geomean                               7.194µ         5.647µ        -0.22%

from benchmark, seems the stop and start the world have small overhead for program.

But a running goroutine needs to be suspended before it can run again, which can be expensive in rpc services.

felixge commented 5 months ago

I think you are mistakenly assuming that the goroutine profile does a STW while taking the stack trace of all goroutines.

That has not been the case for a while now. See https://go-review.googlesource.com/c/go/+/387415

zdyj3170101136 commented 5 months ago

https://go-review.googlesource.com/c/go/+/387415

it always need stop the world, if we run goroutine profile per one minute, it is ok.

but run it with 100hz is unacceptable.

felixge commented 5 months ago

but run it with 100hz is unacceptable.

Can you provide an execution trace go test -trace go.trace -bench . that shows the unacceptable STW overhead?

I'm also not sure how to read your benchmark. Is the "old" version without goroutine profiling and the "new" version with goroutine profiling? If yes, then it seems like the version with goroutine profiling is actually faster (!).