zdyj3170101136 opened 5 months ago
This will eventually be implemented upstream in Go. Once that is done, fgprof will support it automatically, so I think this issue can be closed?
Stopping and restarting the world is so expensive that goroutine sampling cannot be enabled continuously in production.
If we could use eBPF to attach uprobes to goroutine status changes and then extract the trace ID from the request (the offset is fixed),
we could analyze the off-CPU time (blocking or syscalls) spent on each request.
> Stopping and restarting the world is so expensive that goroutine sampling cannot be enabled continuously in production.

Are you sure? It should not be that expensive.
> If we could use eBPF to attach uprobes to goroutine status changes and then extract the trace ID from the request (the offset is fixed),

uprobes are pretty expensive (2-3 µs each).
Do you have benchmarks?
> This will eventually be implemented upstream in Go

This would only be supported in Go 1.23+. But if we implement it as a Go module that can be imported, fgprof could be enabled on lower Go versions.
Benchmark: I ran the benchmarks in src/encoding/json with go version go1.20.12 linux/amd64, recording a traceback of every goroutine at 100 Hz.
```go
package main

import (
	"sync/atomic"
	"time"
	"unsafe"
)

// Mirrors of the runtime's internal types. Only the field offsets
// matter, so the layout must match the runtime's definitions.
type stack struct {
	lo uintptr
	hi uintptr
}

type gobuf struct {
	sp   uintptr
	pc   uintptr
	g    uintptr
	ctxt unsafe.Pointer
	ret  uintptr
	lr   uintptr
	bp   uintptr
}

type g struct {
	// Stack parameters.
	// stack describes the actual stack memory: [stack.lo, stack.hi).
	// stackguard0 is the stack pointer compared in the Go stack growth prologue.
	// It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption.
	// stackguard1 is the stack pointer compared in the //go:systemstack stack growth prologue.
	// It is stack.lo+StackGuard on g0 and gsignal stacks.
	// It is ~0 on other goroutine stacks, to trigger a call to morestackc (and crash).
	stack       stack   // offset known to runtime/cgo
	stackguard0 uintptr // offset known to liblink
	stackguard1 uintptr // offset known to liblink

	_panic    unsafe.Pointer // innermost panic - offset known to liblink
	_defer    unsafe.Pointer // innermost defer
	m         unsafe.Pointer // current m; offset known to arm liblink
	sched     gobuf
	syscallsp uintptr // if status==Gsyscall, syscallsp = sched.sp to use during gc
	syscallpc uintptr // if status==Gsyscall, syscallpc = sched.pc to use during gc
	stktopsp  uintptr // expected sp at top of stack, to check in traceback

	// param is a generic pointer parameter field used to pass
	// values in particular contexts where other storage for the
	// parameter would be difficult to find. It is currently used
	// in four ways:
	// 1. When a channel operation wakes up a blocked goroutine, it sets param to
	//    point to the sudog of the completed blocking operation.
	// 2. By gcAssistAlloc1 to signal back to its caller that the goroutine completed
	//    the GC cycle. It is unsafe to do so in any other way, because the goroutine's
	//    stack may have moved in the meantime.
	// 3. By debugCallWrap to pass parameters to a new goroutine because allocating a
	//    closure in the runtime is forbidden.
	// 4. When a panic is recovered and control returns to the respective frame,
	//    param may point to a savedOpenDeferState.
	param        unsafe.Pointer
	atomicstatus atomic.Uint32
}

// NOTE: a package that linknames body-less functions needs an empty .s
// file alongside it so the compiler accepts the missing bodies.

//go:linkname forEachGRace runtime.forEachGRace
func forEachGRace(fn func(gp *g))

type stwReason uint

type worldStop struct {
	reason stwReason
	start  int64
}

//go:linkname stopTheWorld runtime.stopTheWorld
func stopTheWorld(reason stwReason) worldStop

//go:linkname startTheWorld runtime.startTheWorld
func startTheWorld(w worldStop)

func init() {
	go func() {
		t := time.NewTicker(10 * time.Millisecond) // 100 Hz
		for range t.C {
			stw := stopTheWorld(4) // 4 = stwGoroutineProfile
			forEachGRace(func(gp *g) {
				if gp.atomicstatus.Load() == 6 {
					// 6 = _Gdead; skip dead goroutines.
					return
				}
				myfpTracebackPCs(unsafe.Pointer(gp.sched.bp), make([]uintptr, 32))
			})
			startTheWorld(stw)
		}
	}()
}

func myfpTracebackPCs(fp unsafe.Pointer, pcBuf []uintptr) (i int) {
	for i = 0; i < len(pcBuf) && fp != nil; i++ {
		// The return address sits one word above the frame pointer.
		pcBuf[i] = *(*uintptr)(unsafe.Pointer(uintptr(fp) + PtrSize))
		prevfp := fp
		// Follow the saved frame pointer to the caller's frame.
		fp = unsafe.Pointer(*(*uintptr)(fp))
		// Stacks grow down, so the next frame pointer must be higher.
		if uintptr(fp) <= uintptr(prevfp) {
			break
		}
	}
	return i
}

const PtrSize = 4 << (^uintptr(0) >> 63)
```
benchstat:

```
goos: linux
goarch: amd64
pkg: encoding/json
cpu: Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
                                      │     old      │          new           │
                                      │    sec/op    │    sec/op    vs base   │
CodeEncoder-32                         508.1µ ±  3%   508.0µ ±  2%  ~ (p=0.853 n=10)
CodeEncoderError-32                    547.4µ ±  3%   550.1µ ±  3%  ~ (p=0.853 n=10)
CodeMarshal-32                         587.8µ ±  5%   590.4µ ±  4%  ~ (p=0.853 n=10)
CodeMarshalError-32                    686.8µ ±  3%   689.1µ ±  2%  ~ (p=0.739 n=10)
MarshalBytes/32-32                     249.7n ±  3%   245.1n ±  1%  -1.80% (p=0.011 n=10)
MarshalBytes/256-32                    761.4n ±  2%   754.6n ±  2%  ~ (p=0.247 n=10)
MarshalBytes/4096-32                   7.335µ ±  2%   7.331µ ±  1%  ~ (p=0.869 n=10)
MarshalBytesError/32-32                526.4µ ±  4%   533.2µ ±  3%  ~ (p=0.436 n=10)
MarshalBytesError/256-32               523.1µ ±  8%   521.2µ ±  3%  ~ (p=0.529 n=10)
MarshalBytesError/4096-32              538.1µ ±  3%   536.2µ ±  2%  ~ (p=1.000 n=10)
CodeDecoder-32                         2.491m ±  2%   2.441m ±  3%  ~ (p=0.143 n=10)
UnicodeDecoder-32                      289.5n ±  2%   291.7n ±  2%  ~ (p=0.393 n=10)
DecoderStream-32                       213.8n ±  2%   216.0n ±  2%  ~ (p=0.143 n=10)
CodeUnmarshal-32                       3.051m ±  5%   3.092m ±  3%  ~ (p=0.315 n=10)
CodeUnmarshalReuse-32                  2.690m ±  3%   2.677m ±  3%  ~ (p=0.912 n=10)
UnmarshalString-32                     61.87n ±  1%   62.16n ±  1%  ~ (p=0.342 n=10)
UnmarshalFloat64-32                    57.03n ±  1%   56.88n ±  1%  ~ (p=0.363 n=10)
UnmarshalInt64-32                      53.05n ±  7%   51.38n ±  3%  -3.16% (p=0.043 n=10)
Issue10335-32                          71.35n ±  1%   71.36n ±  1%  ~ (p=0.781 n=10)
Issue34127-32                          31.81n ±  2%   31.17n ±  2%  ~ (p=0.055 n=10)
Unmapped-32                            135.4n ±  2%   135.9n ±  3%  ~ (p=0.753 n=10)
TypeFieldsCache/MissTypes1-32          23.97µ ±  2%   23.81µ ±  2%  ~ (p=0.218 n=10)
TypeFieldsCache/MissTypes10-32         51.72µ ±  1%   51.61µ ±  2%  ~ (p=0.971 n=10)
TypeFieldsCache/MissTypes100-32        262.2µ ±  3%   261.4µ ±  2%  ~ (p=0.912 n=10)
TypeFieldsCache/MissTypes1000-32       1.975m ±  5%   1.968m ±  4%  ~ (p=0.971 n=10)
TypeFieldsCache/MissTypes10000-32      17.94m ±  2%   18.02m ±  1%  ~ (p=0.143 n=10)
TypeFieldsCache/MissTypes100000-32     176.6m ±  3%   173.7m ±  2%  ~ (p=0.280 n=10)
TypeFieldsCache/MissTypes1000000-32     1.935 ± 10%    1.992 ±  9%  ~ (p=0.481 n=10)
TypeFieldsCache/HitTypes1-32           1.704n ±  8%   1.650n ± 14%  ~ (p=0.325 n=10)
TypeFieldsCache/HitTypes10-32          1.639n ±  7%   1.627n ±  2%  ~ (p=0.541 n=10)
TypeFieldsCache/HitTypes100-32         1.658n ±  5%   1.659n ±  9%  ~ (p=0.837 n=10)
TypeFieldsCache/HitTypes1000-32        1.666n ±  5%   1.655n ±  5%  ~ (p=0.592 n=10)
TypeFieldsCache/HitTypes10000-32       1.622n ±  4%   1.631n ±  3%  ~ (p=0.986 n=10)
TypeFieldsCache/HitTypes100000-32      1.649n ±  8%   1.667n ±  8%  ~ (p=0.403 n=10)
TypeFieldsCache/HitTypes1000000-32     1.627n ±  2%
geomean                                7.194µ         5.647µ        -0.22%
```
From the benchmark, stopping and starting the world seems to add little overhead for the program.
But a running goroutine must be suspended and later resumed before it can run again, which can add latency in RPC services.
I think you are mistakenly assuming that the goroutine profile does a STW while taking the stack traces of all goroutines.
That has not been the case for a while now. See https://go-review.googlesource.com/c/go/+/387415
It always needs to stop the world. If we run the goroutine profile once per minute, that is OK,
but running it at 100 Hz is unacceptable.
> but running it at 100 Hz is unacceptable.

Can you provide an execution trace (`go test -trace go.trace -bench .`) that shows the unacceptable STW overhead?
I'm also not sure how to read your benchmark. Is the "old" version without goroutine profiling and the "new" version with goroutine profiling? If yes, then it seems like the version with goroutine profiling is actually faster (!).
The main cost of the goroutine profile comes from stopping the world, and each goroutine's stack must be traversed by an expensive unwinder function.
We can optimize this by using fp traceback. Even if we traverse thousands of goroutines, that cost is negligible, so the only overhead is stopping and restarting the world.
See https://github.com/golang/go/issues/66915