deepflowio / deepflow

eBPF Observability - Distributed Tracing and Profiling
https://deepflow.io
Apache License 2.0
2.84k stars 313 forks source link

[BUG] deepflow server 运行一段时间后出现异常Error服务重启 #5738

Closed monokoo closed 6 months ago

monokoo commented 6 months ago

Search before asking

DeepFlow Component

Server

What you expected to happen

deepflow server 运行一段时间后出现异常Error服务重启,pod 状态及异常日志输出如下: image image

2024-03-08 10:55:29.515 [INFO] [clickhouse.client] client.go:203 query_uuid: 7abe6f53-e3fd-4004-8b12-e6391fd130b6. query api statistics: 0 rows, 2 columns, 0 bytes, cost 5.000000 ms 2024-03-08 10:55:29.519 [INFO] [clickhouse.client] client.go:203 query_uuid: 7abe6f53-e3fd-4004-8b12-e6391fd130b6. query api statistics: 0 rows, 2 columns, 0 bytes, cost 4.000000 ms 2024-03-08 10:55:29.525 [INFO] [clickhouse.client] client.go:203 query_uuid: 7abe6f53-e3fd-4004-8b12-e6391fd130b6. query api statistics: 0 rows, 3 columns, 0 bytes, cost 5.000000 ms 2024-03-08 10:55:29.537 [INFO] [receiver] receiver.go:758 TCP client(100.122.193.1:51524) connect success, set read buffer 4194304. 2024-03-08 10:55:29.537 [WARN] [receiver] receiver.go:830 TCP client(100.122.193.1:51524) connection read error.read tcp 10.183.62.38:20033->100.122.193.1:51524: read: connection reset by peer 2024-03-08 10:55:31.017 [INFO] [receiver] receiver.go:758 TCP client(100.122.186.129:48752) connect success, set read buffer 4194304. 2024-03-08 10:55:31.017 [WARN] [receiver] receiver.go:830 TCP client(100.122.186.129:48752) connection read error.read tcp 10.183.62.38:20033->100.122.186.129:48752: read: connection reset by peer 2024-03-08 10:55:31.025 [INFO] [trisolaris/synchronize] tsdb.go:64 ctrl_ip is 127.0.0.1, (platform data version 1709806472 -> 0), (acl version 3709805594 -> 0), (groups version 1709805594 -> 0), NAME:server-pod-names-watcher 2024-03-08 10:55:31.568 [INFO] [receiver] receiver.go:758 TCP client(100.122.193.1:60118) connect success, set read buffer 4194304. 2024-03-08 10:55:31.569 [WARN] [receiver] receiver.go:830 TCP client(100.122.193.1:60118) connection read error.read tcp 10.183.62.38:20033->100.122.193.1:60118: read: connection reset by peer 2024-03-08 10:55:33.062 [INFO] [receiver] receiver.go:758 TCP client(100.122.186.129:37232) connect success, set read buffer 4194304. 2024-03-08 10:55:33.062 [WARN] [receiver] receiver.go:830 TCP client(100.122.186.129:37232) connection read error.read tcp 10.183.62.38:20033->100.122.186.129:37232: read: connection reset by peer 2024-03-08 10:55:33.573 [INFO] [receiver] receiver.go:758 TCP client(100.122.193.1:30074) connect success, set read buffer 4194304. 2024-03-08 10:55:33.574 [WARN] [receiver] receiver.go:830 TCP client(100.122.193.1:30074) connection read error.read tcp 10.183.62.38:20033->100.122.193.1:30074: read: connection reset by peer 2024-03-08 10:55:33.781 [INFO] [trisolaris/vtap] process_info.go:1067 start generate gpid data from timed 2024-03-08 10:55:33.793 [INFO] [trisolaris/vtap] process_info.go:1069 end generate gpid data from timed 2024-03-08 10:55:34.995 [INFO] [receiver] receiver.go:758 TCP client(100.122.186.129:39878) connect success, set read buffer 4194304. 2024-03-08 10:55:34.995 [WARN] [receiver] receiver.go:830 TCP client(100.122.186.129:39878) connection read error.read tcp 10.183.62.38:20033->100.122.186.129:39878: read: connection reset by peer 2024-03-08 10:55:35.665 [INFO] [receiver] receiver.go:758 TCP client(100.122.193.1:35380) connect success, set read buffer 4194304. 2024-03-08 10:55:35.666 [WARN] [receiver] receiver.go:830 TCP client(100.122.193.1:35380) connection read error.read tcp 10.183.62.38:20033->100.122.193.1:35380: read: connection reset by peer 2024-03-08 10:55:36.655 [INFO] [genesis] grpc_server.go:174 genesis sync received version 1709793031 -> 1709793032 from ip 10.183.34.210 vtap_id 92 2024-03-08 10:55:36.914 [INFO] [receiver] receiver.go:758 TCP client(100.122.186.129:26390) connect success, set read buffer 4194304. 2024-03-08 10:55:36.914 [WARN] [receiver] receiver.go:830 TCP client(100.122.186.129:26390) connection read error.read tcp 10.183.62.38:20033->100.122.186.129:26390: read: connection reset by peer 2024-03-08 10:55:37.157 [INFO] [trisolaris/synchronize] tsdb.go:64 ctrl_ip is 10.183.35.79, (platform data version 1709806472 -> 1709806472), (acl version 3709805594 -> 0), (groups version 1709805594 -> 1709805594), NAME:ingester 2024-03-08 10:55:37.157 [INFO] [trisolaris/synchronize] tsdb.go:75 ctrl_ip:10.183.35.79, cpu_num:24, memory_size:99454423040, arch:x86_64, os:alpine 3.15.0, kernel_version:4.19.91, pcap_data_mount_path: fatal error: concurrent map read and map write fatal error: concurrent map read and map write

goroutine 5062 [running]: github.com/deepflowio/deepflow/server/libs/grpc.(PlatformInfoTable).QueryPodIdInfo(...) /root/niuqilei/deepflow-saic/server/libs/grpc/grpc_platformdata.go:318 github.com/deepflowio/deepflow/server/ingester/flow_log/log_data.(KnowledgeGraph).fill(0xc3c804ce10, 0xc0019ec180, 0x1?, 0x0?, 0x0?, 0x2, 0x2, 0x0?, 0xd598bb98?, {0xc4ff9da550, ...}, ...) /root/niuqilei/deepflow-saic/server/ingester/flow_log/log_data/l4_flow_log.go:755 +0x386 github.com/deepflowio/deepflow/server/ingester/flow_log/log_data.(KnowledgeGraph).FillL7(0x4902d8?, 0xc0d598bc30?, 0x4902d8?, 0xc5?) /root/niuqilei/deepflow-saic/server/ingester/flow_log/log_data/l7_flow_log.go:589 +0x1a5 github.com/deepflowio/deepflow/server/ingester/flow_log/log_data.(L7Base).Fill(0x4a518a0?, 0xc4ece11de8?, 0x541e740?) /root/niuqilei/deepflow-saic/server/ingester/flow_log/log_data/l7_flow_log.go:585 +0x4a6 github.com/deepflowio/deepflow/server/ingester/flow_log/log_data.(L7FlowLog).Fill(0xc3c804ce00, 0xc447200fc0, 0x4902d8?, 0xc52a0072f0?) /root/niuqilei/deepflow-saic/server/ingester/flow_log/log_data/l7_flow_log.go:371 +0x2f github.com/deepflowio/deepflow/server/ingester/flow_log/log_data.ProtoLogToL7FlowLog(0xc447200fc0, 0xc0019ec180, 0x178?) /root/niuqilei/deepflow-saic/server/ingester/flow_log/log_data/l7_flow_log.go:630 +0xca github.com/deepflowio/deepflow/server/ingester/flow_log/decoder.(Decoder).sendProto(0xc003a2aea0, 0xc447200fc0) /root/niuqilei/deepflow-saic/server/ingester/flow_log/decoder/decoder.go:310 +0x125 github.com/deepflowio/deepflow/server/ingester/flow_log/decoder.(Decoder).handleProtoLog(0xc003a2aea0, 0xc0d598be50) /root/niuqilei/deepflow-saic/server/ingester/flow_log/decoder/decoder.go:205 +0x2b github.com/deepflowio/deepflow/server/ingester/flow_log/decoder.(Decoder).Run(0xc003a2aea0) /root/niuqilei/deepflow-saic/server/ingester/flow_log/decoder/decoder.go:157 +0x627 created by github.com/deepflowio/deepflow/server/ingester/flow_log/flow_log.(*Logger).Start in goroutine 1 /root/niuqilei/deepflow-saic/server/ingester/flow_log/flow_log/flow_log.go:307 +0x85

goroutine 1 [chan receive, 1013 minutes]: main.main() /root/niuqilei/deepflow-saic/server/cmd/server/main.go:120 +0x825

goroutine 11 [select, 1016 minutes]: github.com/baidubce/bce-sdk-go/util/log.NewLogger.func1() /root/niuqilei/deepflow-saic/server/vendor/github.com/baidubce/bce-sdk-go/util/log/logger.go:375 +0xa8 created by github.com/baidubce/bce-sdk-go/util/log.NewLogger in goroutine 1 /root/niuqilei/deepflow-saic/server/vendor/github.com/baidubce/bce-sdk-go/util/log/logger.go:368 +0x125

goroutine 12 [select]: github.com/golang/glog.(*fileSink).flushDaemon(0x9fde7b8) /root/niuqilei/deepflow-saic/server/vendor/github.com/golang/glog/glog_file.go:351 +0xb9 created by github.com/golang/glog.init.1 in goroutine 1 /root/niuqilei/deepflow-saic/server/vendor/github.com/golang/glog/glog_file.go:166 +0x135

goroutine 161 [chan receive]: github.com/deepflowio/deepflow/server/querier/engine/clickhouse/trans_prometheus.GeneratePrometheusMap() /root/niuqilei/deepflow-saic/server/querier/engine/clickhouse/trans_prometheus/prometheus_cache.go:121 +0x77 created by github.com/deepflowio/deepflow/server/querier/querier.Start in goroutine 80 /root/niuqilei/deepflow-saic/server/querier/querier/querier.go:67 +0x3c6

goroutine 97 [chan receive]: github.com/deepflowio/deepflow/server/libs/stats.run() /root/niuqilei/deepflow-saic/server/libs/stats/stats.go:336 +0x53 created by github.com/deepflowio/deepflow/server/libs/stats.init.0 in goroutine 1 /root/niuqilei/deepflow-saic/server/libs/stats/stats.go:428 +0x131

goroutine 77 [IO wait, 1016 minutes]: internal/poll.runtime_pollWait(0x7f5303ebdeb8, 0x72) /usr/lib/golang/src/runtime/netpoll.go:343 +0x85 internal/poll.(pollDesc).wait(0xc001ab2600?, 0xc0009a6000?, 0x0) /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:84 +0x27 internal/poll.(pollDesc).waitRead(...) /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:89 internal/poll.(FD).ReadFromInet6(0xc001ab2600, {0xc0009a6000, 0x25ee, 0x25ee}, 0x419953?) /usr/lib/golang/src/internal/poll/fd_unix.go:274 +0x22b net.(netFD).readFromInet6(0xc001ab2600, {0xc0009a6000?, 0x7f5303ec2e70?, 0x0?}, 0xc00017dd08?) /usr/lib/golang/src/net/fd_posix.go:72 +0x25 net.(UDPConn).readFrom(0x30?, {0xc0009a6000?, 0xc0018f68a0?, 0x0?}, 0xc0018f68a0) /usr/lib/golang/src/net/udpsock_posix.go:59 +0x79 net.(UDPConn).readFromUDP(0xc000a28010, {0xc0009a6000?, 0xc0016e6a40?, 0xc00017de38?}, 0x588efb?) /usr/lib/golang/src/net/udpsock.go:149 +0x30 net.(*UDPConn).ReadFromUDP(...) /usr/lib/golang/src/net/udpsock.go:141 github.com/deepflowio/deepflow/server/libs/debug.process(0xc001458ed0?) /root/niuqilei/deepflow-saic/server/libs/debug/udp.go:143 +0x8a github.com/deepflowio/deepflow/server/libs/debug.debugListener.func1() /root/niuqilei/deepflow-saic/server/libs/debug/udp.go:191 +0x385 created by github.com/deepflowio/deepflow/server/libs/debug.debugListener in goroutine 1 /root/niuqilei/deepflow-saic/server/libs/debug/udp.go:177 +0x1a

How to reproduce

No response

DeepFlow version

v6.4 分支最新版本

DeepFlow agent list

No response

Kubernetes CNI

No response

Operation-System/Kernel version

No response

Anything else

No response

Are you willing to submit a PR?

Code of Conduct

monokoo commented 6 months ago

修改后的版本运行了几个小时候,再次出现异常:@lzf575 image image

monokoo commented 6 months ago

是 podIDInfos这个map类型并发存在异常吧 image

lzf575 commented 6 months ago

@monokoo 解决了,使用最新的6.4即可

monokoo commented 6 months ago

@monokoo 解决了,使用最新的6.4即可

刚验证了最新版本,还是存在map异常的情况 image

monokoo commented 6 months ago

@lzf575 是否可以考虑使用sync.Map来规避多线程并发存在的线程安全问题

lzf575 commented 6 months ago

@monokoo 这回是改好了,不好意思改错好几回。

monokoo commented 6 months ago

@monokoo 这回是改好了,不好意思改错好几回。

你好,目前最新修复版本稳定运行24h,未再出现异常重启的情况。 image