KindlingProject / kindling

eBPF-based Cloud Native Monitoring Tool
http://kindling.harmonycloud.cn
Apache License 2.0

Memory leak: was it caused by my change to inspector->set_snaplen(20000);, which hard-codes this parameter to 20000? #480

Closed xiaodaiit closed 1 year ago

xiaodaiit commented 1 year ago

fatal error: unexpected signal during runtime execution [signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x7fc1cc22c298]

runtime stack: runtime.throw({0x1990a16, 0x64e}) /usr/local/go/src/runtime/panic.go:1198 +0x71 runtime.sigpanic() /usr/local/go/src/runtime/signal_unix.go:719 +0x396

goroutine 115 [syscall]: runtime.cgocall(0x154cc70, 0xc000738f40) /usr/local/go/src/runtime/cgocall.go:156 +0x5c fp=0xc000738f18 sp=0xc000738ee0 pc=0x4058bc github.com/Kindling-project/kindling/collector/pkg/component/receiver/cgoreceiver._Cfunc_getKindlingEvent(0xc00060e038) _cgo_gotypes.go:142 +0x48 fp=0xc000738f40 sp=0xc000738f18 pc=0x1548f68 github.com/Kindling-project/kindling/collector/pkg/component/receiver/cgoreceiver.(CgoReceiver).startGetEvent.func1(0xc00011a010) /home/kindling/collector/pkg/component/receiver/cgoreceiver/cgoreceiver.go:78 +0x45 fp=0xc000738f78 sp=0xc000738f40 pc=0x1549765 github.com/Kindling-project/kindling/collector/pkg/component/receiver/cgoreceiver.(CgoReceiver).startGetEvent(0xc00012e320) /home/kindling/collector/pkg/component/receiver/cgoreceiver/cgoreceiver.go:78 +0x69 fp=0xc000738fc8 sp=0xc000738f78 pc=0x1549669 github.com/Kindling-project/kindling/collector/pkg/component/receiver/cgoreceiver.(CgoReceiver).Start·dwrap·2() /home/kindling/collector/pkg/component/receiver/cgoreceiver/cgoreceiver.go:65 +0x26 fp=0xc000738fe0 sp=0xc000738fc8 pc=0x1549566 runtime.goexit() /usr/local/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc000738fe8 sp=0xc000738fe0 pc=0x467641 created by github.com/Kindling-project/kindling/collector/pkg/component/receiver/cgoreceiver.(CgoReceiver).Start /home/kindling/collector/pkg/component/receiver/cgoreceiver/cgoreceiver.go:65 +0xf4

goroutine 1 [chan receive]: main.main() /home/kindling/collector/cmd/kindling-collector/main.go:66 +0x165

goroutine 56 [chan receive]: gopkg.in/natefinch/lumberjack%2ev2.(Logger).millRun(0xc00012c540) /home/gopath/pkg/mod/gopkg.in/natefinch/lumberjack.v2@v2.0.0/lumberjack.go:379 +0x45 created by gopkg.in/natefinch/lumberjack%2ev2.(Logger).mill.func1 /home/gopath/pkg/mod/gopkg.in/natefinch/lumberjack.v2@v2.0.0/lumberjack.go:390 +0x93

goroutine 66 [select]: github.com/robfig/cron.(Cron).run(0xc0000dc0a0) /home/gopath/pkg/mod/github.com/robfig/cron@v1.2.0/cron.go:191 +0x59a created by github.com/robfig/cron.(Cron).Start /home/gopath/pkg/mod/github.com/robfig/cron@v1.2.0/cron.go:144 +0x65

goroutine 30 [select]: github.com/robfig/cron.(Cron).run(0xc0005b4000) /home/gopath/pkg/mod/github.com/robfig/cron@v1.2.0/cron.go:191 +0x59a created by github.com/robfig/cron.(Cron).Start /home/gopath/pkg/mod/github.com/robfig/cron@v1.2.0/cron.go:144 +0x65

goroutine 59 [chan receive]: k8s.io/klog/v2.(*loggingT).flushDaemon(0xc0005b4000) /home/gopath/pkg/mod/k8s.io/klog/v2@v2.8.0/klog.go:1164 +0x6a created by k8s.io/klog/v2.init.0 /home/gopath/pkg/mod/k8s.io/klog/v2@v2.8.0/klog.go:418 +0xfb

goroutine 85 [select]: go.opentelemetry.io/otel/sdk/metric/controller/basic.(Controller).runTicker(0xc000428140, {0x1bad288, 0xc0000460a0}, 0xc000506060) /home/gopath/pkg/mod/go.opentelemetry.io/otel/sdk/metric@v0.25.0/controller/basic/controller.go:226 +0xd2 created by go.opentelemetry.io/otel/sdk/metric/controller/basic.(Controller).Start /home/gopath/pkg/mod/go.opentelemetry.io/otel/sdk/metric@v0.25.0/controller/basic/controller.go:191 +0x19d

goroutine 84 [select]: go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue(0xc00037a000) /home/gopath/pkg/mod/go.opentelemetry.io/otel/sdk@v1.2.0/trace/batch_span_processor.go:263 +0x138 go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor.func1() /home/gopath/pkg/mod/go.opentelemetry.io/otel/sdk@v1.2.0/trace/batch_span_processor.go:112 +0x65 created by go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor /home/gopath/pkg/mod/go.opentelemetry.io/otel/sdk@v1.2.0/trace/batch_span_processor.go:110 +0x247

goroutine 83 [chan receive]: gopkg.in/natefinch/lumberjack%2ev2.(Logger).millRun(0xc000854000) /home/gopath/pkg/mod/gopkg.in/natefinch/lumberjack.v2@v2.0.0/lumberjack.go:379 +0x45 created by gopkg.in/natefinch/lumberjack%2ev2.(Logger).mill.func1 /home/gopath/pkg/mod/gopkg.in/natefinch/lumberjack.v2@v2.0.0/lumberjack.go:390 +0x93

goroutine 86 [select]: github.com/Kindling-project/kindling/collector/pkg/component/consumer/processor/aggregateprocessor.(*AggregateProcessor).runTicker(0xc0005e0960) /home/kindling/collector/pkg/component/consumer/processor/aggregateprocessor/processor.go:95 +0x7d created by github.com/Kindling-project/kindling/collector/pkg/component/consumer/processor/aggregateprocessor.New /home/kindling/collector/pkg/component/consumer/processor/aggregateprocessor/processor.go:50 +0x235

goroutine 43 [select]: github.com/Kindling-project/kindling/collector/pkg/component/consumer/processor/aggregateprocessor.(*AggregateProcessor).runTicker(0xc00069e4b0) /home/kindling/collector/pkg/component/consumer/processor/aggregateprocessor/processor.go:95 +0x7d created by github.com/Kindling-project/kindling/collector/pkg/component/consumer/processor/aggregateprocessor.New /home/kindling/collector/pkg/component/consumer/processor/aggregateprocessor/processor.go:50 +0x235

goroutine 88 [select]: github.com/Kindling-project/kindling/collector/pkg/metadata/conntracker/internal.NewCircuitBreaker.func1() /home/kindling/collector/pkg/metadata/conntracker/internal/circuit_breaker.go:70 +0xbf created by github.com/Kindling-project/kindling/collector/pkg/metadata/conntracker/internal.NewCircuitBreaker /home/kindling/collector/pkg/metadata/conntracker/internal/circuit_breaker.go:67 +0xc7

goroutine 41 [IO wait]: internal/poll.runtime_pollWait(0x7fc1a2311788, 0x72) /usr/local/go/src/runtime/netpoll.go:303 +0x85 internal/poll.(pollDesc).wait(0xc0005b7140, 0x7fc1a0147338, 0x1) /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32 internal/poll.(pollDesc).waitRead(...) /usr/local/go/src/internal/poll/fd_poll_runtime.go:89 internal/poll.(FD).RawRead(0xc0005b7140, 0xc0008542a0) /usr/local/go/src/internal/poll/fd_unix.go:554 +0x145 os.(rawConn).Read(0xc00060e0f0, 0x44d572) /usr/local/go/src/os/rawconn.go:32 +0x56 github.com/Kindling-project/kindling/collector/pkg/metadata/conntracker/internal.(Socket).recvmsg(0xc000584c40, {0xc000620000, 0x8000, 0x8000}, {0xc0003d4060, 0x28, 0x28}, 0x0) /home/kindling/collector/pkg/metadata/conntracker/internal/socket.go:269 +0x16d github.com/Kindling-project/kindling/collector/pkg/metadata/conntracker/internal.(Socket).ReceiveInto(0xc000584c40, {0xc00062e000, 0x1000, 0x1000}) /home/kindling/collector/pkg/metadata/conntracker/internal/socket.go:123 +0x9f github.com/Kindling-project/kindling/collector/pkg/metadata/conntracker/internal.(Consumer).receive(0xc000858180, 0xc000144200) /home/kindling/collector/pkg/metadata/conntracker/internal/consumer.go:407 +0xcd github.com/Kindling-project/kindling/collector/pkg/metadata/conntracker/internal.(Consumer).Events.func1() /home/kindling/collector/pkg/metadata/conntracker/internal/consumer.go:158 +0x85 created by github.com/Kindling-project/kindling/collector/pkg/metadata/conntracker/internal.(*Consumer).Events /home/kindling/collector/pkg/metadata/conntracker/internal/consumer.go:150 +0xe8

goroutine 42 [select]: github.com/Kindling-project/kindling/collector/pkg/metadata/conntracker/internal.(realConntracker).run.func1() /home/kindling/collector/pkg/metadata/conntracker/internal/conntracker.go:262 +0xd4 created by github.com/Kindling-project/kindling/collector/pkg/metadata/conntracker/internal.(realConntracker).run /home/kindling/collector/pkg/metadata/conntracker/internal/conntracker.go:260 +0x8f

goroutine 98 [chan receive]: github.com/Kindling-project/kindling/collector/pkg/component/analyzer/network.(NetworkAnalyzer).consumerFdNoReusingTrace(0xc00034b6b0) /home/kindling/collector/pkg/component/analyzer/network/network_analyzer.go:186 +0x37 created by github.com/Kindling-project/kindling/collector/pkg/component/analyzer/network.(NetworkAnalyzer).Start /home/kindling/collector/pkg/component/analyzer/network/network_analyzer.go:90 +0x93

goroutine 99 [select]: github.com/Kindling-project/kindling/collector/pkg/component/analyzer/tcpconnectanalyzer.(TcpConnectAnalyzer).Start.func1() /home/kindling/collector/pkg/component/analyzer/tcpconnectanalyzer/analyzer.go:70 +0xd9 created by github.com/Kindling-project/kindling/collector/pkg/component/analyzer/tcpconnectanalyzer.(TcpConnectAnalyzer).Start /home/kindling/collector/pkg/component/analyzer/tcpconnectanalyzer/analyzer.go:67 +0x5b

goroutine 114 [select]: github.com/Kindling-project/kindling/collector/pkg/component/receiver/cgoreceiver.(CgoReceiver).consumeEvents(0xc00012e320) /home/kindling/collector/pkg/component/receiver/cgoreceiver/cgoreceiver.go:91 +0xa5 created by github.com/Kindling-project/kindling/collector/pkg/component/receiver/cgoreceiver.(CgoReceiver).Start /home/kindling/collector/pkg/component/receiver/cgoreceiver/cgoreceiver.go:64 +0xb2

goroutine 117 [syscall]: os/signal.signal_recv() /usr/local/go/src/runtime/sigqueue.go:169 +0x98 os/signal.loop() /usr/local/go/src/os/signal/signal_unix.go:24 +0x19 created by os/signal.Notify.func1.1 /usr/local/go/src/os/signal/signal.go:151 +0x2c

dxsup commented 1 year ago

This is a segmentation fault in the kindling-probe part. There should be more log output below this that marks exactly which line of code failed.

xiaodaiit commented 1 year ago

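No error output shows the exact line... One more question: SingleNetRequestMetricGroup = "single_net_request_metric_group" // AggregatedNetRequestMetricGroup stands for the dataGroup after aggregation. AggregatedNetRequestMetricGroup = "aggregated_net_request_metric_group" Does the data this aggregated metric counts per unit of time correspond to the results obtained from the single-request data? In other words, is the aggregated total for a unit of time equal to the total of single requests in that same unit of time?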

dxsup commented 1 year ago

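Did you change the code without building the image from the Dockerfile in the repository? If you had built the image, the stack trace of the probe code should be printed to the console automatically.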

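The aggregated metric simply adds up the single-request data for the period. For example, if there are 100 requests within a 15-second window, the aggregated metric is 100.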
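The relationship described above can be sketched in Go as follows. This is only an illustration under stated assumptions: `singleRequest`, `aggregateRequestCount`, and `sampleRequests` are hypothetical names and are not part of the Kindling code base. It shows that the per-window aggregated count equals the number of single requests that fall in that window.

```go
package main

import "fmt"

// nsPerSecond converts seconds to nanoseconds.
const nsPerSecond uint64 = 1e9

// singleRequest is a hypothetical stand-in for one single-request record.
type singleRequest struct {
	TimestampNs uint64
}

// aggregateRequestCount counts how many requests fall into each fixed-size
// window. The map key is the window index (timestamp / window length).
func aggregateRequestCount(reqs []singleRequest, windowNs uint64) map[uint64]int64 {
	counts := make(map[uint64]int64)
	for _, r := range reqs {
		counts[r.TimestampNs/windowNs]++
	}
	return counts
}

// sampleRequests builds n requests spaced spacingNs apart, starting at 0.
func sampleRequests(n int, spacingNs uint64) []singleRequest {
	reqs := make([]singleRequest, 0, n)
	for i := 0; i < n; i++ {
		reqs = append(reqs, singleRequest{TimestampNs: uint64(i) * spacingNs})
	}
	return reqs
}

func main() {
	// 100 requests, one every 100 ms, so all of them land in the first 15 s.
	reqs := sampleRequests(100, nsPerSecond/10)
	counts := aggregateRequestCount(reqs, 15*nsPerSecond)
	fmt.Println("requests in first 15s window:", counts[0])
}
```

If the aggregated value for a window differs from the number of single requests in it, the difference has to come from code outside this basic summation.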

xiaodaiit commented 1 year ago

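Right, I didn't build the image; I started it directly for testing, so no line numbers are shown. Judging from this, though, could it be that with the larger capture length, cgo and Go directly use unsafe-like pointers, so the memory is no longer aligned and ends up offset?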

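The aggregated data also doesn't seem to add up. I'm not sure whether that is because we added a breakdown of the aggregated statistics by time range.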

dxsup commented 1 year ago

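That could be the reason. If a coredump file was generated, you can use gdb to find the exact location of the error. We recommend using the environment variable we provide to change the SNAPLEN parameter.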

Did you also modify the aggregation part? Do the final metrics meet expectations? What protocol are the requests using?

xiaodaiit commented 1 year ago

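Right, for the aggregated data we added statistics by time range, for example intervals such as 0.1 to 0.5, 0.5 to 1, and 1 to 2 seconds. The protocol is one supported by the configuration file. The end result is that the aggregated data counted over one minute differs from the accumulated single reported data by a factor of several tens.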
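A minimal sketch of this kind of time-range bucketing, assuming latencies in milliseconds and the boundaries 0.1 s, 0.5 s, 1 s and 2 s mentioned above. The names `boundsMs`, `bucketIndex`, `latencyHistogram`, and `total` are hypothetical and do not come from the modified collector. The point is the invariant: however requests are split into ranges, the bucket counts must sum back to the number of single requests.

```go
package main

import (
	"fmt"
	"sort"
)

// boundsMs holds the bucket boundaries in milliseconds (assumed values).
// They define five buckets: <100, [100,500), [500,1000), [1000,2000), >=2000.
var boundsMs = []int64{100, 500, 1000, 2000}

// bucketIndex returns the index of the bucket that latencyMs falls into.
func bucketIndex(latencyMs int64) int {
	return sort.Search(len(boundsMs), func(i int) bool {
		return boundsMs[i] > latencyMs
	})
}

// latencyHistogram counts how many latencies fall into each bucket.
func latencyHistogram(latenciesMs []int64) []int64 {
	counts := make([]int64, len(boundsMs)+1)
	for _, l := range latenciesMs {
		counts[bucketIndex(l)]++
	}
	return counts
}

// total sums all bucket counts.
func total(counts []int64) int64 {
	var sum int64
	for _, c := range counts {
		sum += c
	}
	return sum
}

func main() {
	latencies := []int64{50, 100, 300, 700, 999, 1500, 2500}
	hist := latencyHistogram(latencies)
	fmt.Println("per-bucket counts:", hist)
	fmt.Println("sum of buckets:", total(hist), "single requests:", len(latencies))
}
```

Comparing the sum of all buckets with the single-request total for the same minute is a quick way to tell whether the bucketing itself loses data.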

dxsup commented 1 year ago

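Which protocol and which communication pattern exactly? You can check whether the count of the default kindling_topology_request_total matches. If it matches, look at whether the newly added code has a problem; if it doesn't, share the scenario and the community can look into it together.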

xiaodaiit commented 1 year ago

We are doing a technical validation. I suspect the problem is in the counting: func (i Metric) DataType() MetricType { switch i.GetData().(type) { case Int: return IntMetricType case *Histogram: return HistogramMetricType default: return NoneMetricType } } We only counted IntMetricType and not the other two types, right?

xiaodaiit commented 1 year ago

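When collecting from k8s containers, is the single_net_request_metric_group (single mode) data not collected? Or is a different mechanism used? I see that outside k8s, requests such as HTTP are collected normally.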

dxsup commented 1 year ago

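We only counted IntMetricType and not the other two types, right?

That doesn't matter here. Aggregated statistics can perfectly well cover just one type.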
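To illustrate why counting only one type is harmless, here is a self-contained sketch of the same type-switch pattern. All types below (`MetricType`, `Int`, `Histogram`, `Metric`) are simplified, hypothetical stand-ins written for this example; they only mirror the shape of the snippet quoted in the thread and are not the actual Kindling model definitions.

```go
package main

import "fmt"

// MetricType enumerates the kinds of metric payloads (stand-in definition).
type MetricType int

const (
	IntMetricType MetricType = iota
	HistogramMetricType
	NoneMetricType
)

// Int and Histogram are simplified payload types for this sketch.
type Int struct{ Value int64 }
type Histogram struct{ Count uint64 }

// Metric pairs a name with an arbitrary payload.
type Metric struct {
	Name string
	Data interface{}
}

// DataType reports the payload kind using a type switch.
func (m Metric) DataType() MetricType {
	switch m.Data.(type) {
	case *Int:
		return IntMetricType
	case *Histogram:
		return HistogramMetricType
	default:
		return NoneMetricType
	}
}

// intValues collects only the integer metrics and skips everything else.
func intValues(metrics []Metric) map[string]int64 {
	out := make(map[string]int64)
	for _, m := range metrics {
		if m.DataType() != IntMetricType {
			continue // histograms and unknown payloads are ignored
		}
		out[m.Name] = m.Data.(*Int).Value
	}
	return out
}

func main() {
	metrics := []Metric{
		{Name: "request_count", Data: &Int{Value: 100}},
		{Name: "request_duration", Data: &Histogram{Count: 100}},
	}
	fmt.Println(intValues(metrics)["request_count"])
}
```

Skipping the histogram entry does not change `request_count`, so filtering by type alone would not explain a mismatch in request totals.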

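When collecting from k8s containers, is the single_net_request_metric_group (single mode) data not collected? Or is a different mechanism used? I see that outside k8s, requests such as HTTP are collected normally.

No. Collecting this data has nothing to do with whether the workload runs in a container. You can enable the networkanalyzer debug log to see whether requests are being captured.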

xiaodaiit commented 1 year ago
image

Judging by the statistics from the data model we got, the total should be much larger than what the single requests report.

xiaodaiit commented 1 year ago
if data.Name == "aggregated_net_request_metric_group" {
    metricMap := make(map[string]int64)
    for _, metric := range metrics {
        if metric.DataType() != 0 {
            e.telemetry.Logger.Sugar().Debugf("name : %s, type is not IntMetricType, is %+v", metric.Name, metric.DataType())
            continue
        }
        metricMap[metric.Name] = metric.GetInt().Value
    }
    requestCount := metricMap["request_count"]
    e.telemetry.Logger.Sugar().Warnf("=====>>>>>>60 array total : %+v, time: %s  <<===========", requestCount, time.Now().Format("2006-01-02 15:04:05"))
} else if data.Name == "single_net_request_metric_group" {
    e.telemetry.Logger.Sugar().Warnf("=====>>>>>>60 single total : %+v, time: %s  <<===========", 1, time.Now().Format("2006-01-02 15:04:05"))
}
xiaodaiit commented 1 year ago

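We only counted IntMetricType and not the other two types, right?

That doesn't matter here. Aggregated statistics can perfectly well cover just one type.

When collecting from k8s containers, is the single_net_request_metric_group (single mode) data not collected? Or is a different mechanism used? I see that outside k8s, requests such as HTTP are collected normally.

No. Collecting this data has nothing to do with whether the workload runs in a container. You can enable the networkanalyzer debug log to see whether requests are being captured.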

Problem solved. The data was being aggregated a second time, which made the final result too low.
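The thread does not say how the second aggregation was implemented. The sketch below shows one hypothetical way such a bug can undercount: once records are already aggregated, each one stands for many requests, so a later stage that counts records instead of summing their `RequestCount` reports far fewer requests. The type `aggregatedRecord` and both functions are assumptions made for illustration only.

```go
package main

import "fmt"

// aggregatedRecord is a hypothetical first-stage output: one record per key
// and window, carrying the number of requests it summarizes.
type aggregatedRecord struct {
	Key          string
	RequestCount int64
}

// sumRequestCounts recovers the true request total by adding up the counts.
func sumRequestCounts(records []aggregatedRecord) int64 {
	var total int64
	for _, r := range records {
		total += r.RequestCount
	}
	return total
}

// countRecords treats every aggregated record as a single request. Applied to
// already aggregated data, this is the kind of mistake that shrinks the total.
func countRecords(records []aggregatedRecord) int64 {
	return int64(len(records))
}

func main() {
	records := []aggregatedRecord{
		{Key: "svc-a", RequestCount: 40},
		{Key: "svc-b", RequestCount: 35},
		{Key: "svc-c", RequestCount: 25},
	}
	fmt.Println("correct total:", sumRequestCounts(records))
	fmt.Println("undercounted total:", countRecords(records))
}
```

In this made-up example the undercounted total is off by a factor of about 33, which is the same order of magnitude as the "several tens" gap reported earlier in the thread.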