deepflowio / deepflow

eBPF Observability - Distributed Tracing and Profiling
https://deepflow.io
Apache License 2.0

[BUG] agent mem pprof shows no symbol table #6068

Open qyzhaoxun opened 5 months ago

qyzhaoxun commented 5 months ago

DeepFlow Component

Agent

What you expected to happen

I want to use the heap pprof file to see exactly where memory is being used.

How to reproduce

  1. Starting from branch v6.4, cherry-pick https://github.com/deepflowio/deepflow/pull/5280/commits/2036076c81f61e6087bc79b7eb1ffcffc2dd0180 (final changes: https://github.com/qyzhaoxun/deepflow/tree/v6.4)
  2. Build inside a container
    docker run --privileged --rm -it -v $(pwd):/deepflow hub.deepflow.yunshan.net/public/rust-build bash -c "cd /deepflow/agent && cargo build"
  3. Build the container image
    FROM registry.cn-hongkong.aliyuncs.com/deepflow-ce/deepflow-agent:v6.4
    RUN rm /usr/bin/deepflow-agent
    ADD ./deepflow-agent.tgz /usr/bin/
  4. Collect the heap pprof file and generate an SVG profile

DeepFlow version

No response

DeepFlow agent list

v6.4 agent, started in standalone mode

Kubernetes CNI

Not applicable

Operation-System/Kernel version

"Ubuntu 22.04 LTS" Linux VM-11-12-ubuntu 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Anything else

No response

qyzhaoxun commented 5 months ago

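One more request: could heap pprof be made a configuration option and merged into main and v6.4? It can be off by default and enabled through configuration when needed.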

yuanchaoa commented 5 months ago

@qyzhaoxun Are there any errors? Please also post the command you used to generate the SVG.

qyzhaoxun commented 5 months ago

jeprof --svg ./deepflow-agent ./agent.profile >profile.svg

No errors.

yuanchaoa commented 5 months ago

[Screenshot: screenshot-20240411-105000]

yuanchaoa commented 5 months ago

[Screenshot]

yuanchaoa commented 5 months ago

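Cargo.toml sets debug=true, so every agent built from it carries a symbol table; please check yours. The heap profiles above were generated from your branch and resolve symbols correctly. Running inside a container may also be the problem, so try running directly on the host. [Screenshot]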

qyzhaoxun commented 5 months ago

I am running in a container environment. Is any additional configuration needed? @yuanchaoa

qyzhaoxun commented 5 months ago

Also, are there any requirements on the build command? Should it be cargo build --release, i.e. do I need to add --release?

yuanchaoa commented 5 months ago

Yes, use cargo build --release.

qyzhaoxun commented 5 months ago

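How should I run pprof when the agent runs in a container? I collected the heap file from the container and then ran jeprof on the node. @yuanchaoa I built with --release and the symbols still cannot be found.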

yuanchaoa commented 5 months ago

The agent's memory pprof is implemented with this library: https://crates.io/crates/jemalloc_pprof One passage in its documentation should help you:

[Screenshot of the passage]

sharang commented 5 months ago

@qyzhaoxun The following agent settings can be used to reduce memory usage: https://github.com/deepflowio/deepflow/blob/main/server/agent_config/example.yaml

Which NICs cBPF captures from

## Regular Expression for TAP (Traffic Access Point)
## Length: [0, 65535]
## Default:
##   Localhost:   lo
##   Common NIC:  eth.*|en[osipx].*
##   QEMU VM NIC: tap.*
##   Flannel:     veth.*
##   Calico:      cali.*
##   Cilium:      lxc.*
##   Kube-OVN:    [0-9a-f]+_h$
## Note: Regular expression of NIC name for collecting traffic
#tap_interface_regex: ^(tap.*|cali.*|veth.*|eth.*|en[osipx].*|lxc.*|lo|[0-9a-f]+_h)$

The lo interface is also captured by default. If you do not need it, removing it from the regex reduces memory consumption.
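To see what dropping lo changes, the default pattern can be checked against a few interface names. This is a minimal sketch: the sample names and the variant without lo are made up for illustration, and Python's re module stands in for the agent's own regex engine, which is assumed to behave the same for a pattern this simple.

```python
import re

# Default pattern, copied from the commented-out tap_interface_regex line.
DEFAULT_TAP_REGEX = r"^(tap.*|cali.*|veth.*|eth.*|en[osipx].*|lxc.*|lo|[0-9a-f]+_h)$"

# Hypothetical variant with the `lo` alternative removed.
NO_LO_TAP_REGEX = r"^(tap.*|cali.*|veth.*|eth.*|en[osipx].*|lxc.*|[0-9a-f]+_h)$"


def matched_interfaces(pattern, names):
    """Return the interface names that the pattern selects for capture."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.match(name)]


# Illustrative interface names, not taken from a real host.
SAMPLE = ["lo", "eth0", "ens33", "cali1234abcd", "docker0", "veth9f2c"]

if __name__ == "__main__":
    print("default:   ", matched_interfaces(DEFAULT_TAP_REGEX, SAMPLE))
    print("without lo:", matched_interfaces(NO_LO_TAP_REGEX, SAMPLE))
```

Checking a candidate pattern this way before deploying it helps avoid accidentally excluding a NIC you do want to capture.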

Which traffic cBPF ignores

## Traffic Capture Filter
## Length: [1, 512]
## Note: If not configured, all traffic will be collected. Please
##   refer to BPF syntax: https://biot.com/capstats/bpf.html
#capture_bpf:

If you know that certain traffic is of no interest, you can configure a BPF expression to filter it out.

cBPF capture truncation and application protocol parsing truncation ⭐️

## Maximum Packet Capture Length
## Unit: bytes. Default: 65535. Range: [128, 65535]
## Note: DPDK environment does not support this configuration.
#capture_packet_size: 65535

## Protocol Identification Maximum Packet Length
## Default: 1024. Bpf Range: [256, 65535], Ebpf Range: [256, 8192]
## Note: The maximum data length used for application protocol identification,
##   note that the effective value is less than or equal to the value of
##   capture_packet_size.
#l7_log_packet_size: 1024

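Our application protocol parsing currently handles at most 8192 bytes, so both settings can be set to the same value somewhere between 1024 and 8192. Lowering capture_packet_size helps reduce memory.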
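The interaction between the two settings can be sketched as follows. This is only my reading of the config comments (the effective parsing length never exceeds capture_packet_size), not the agent's actual code; the function name and constant are hypothetical.

```python
# Range for capture_packet_size, taken from the config comment.
CAPTURE_RANGE = (128, 65535)


def effective_l7_length(capture_packet_size, l7_log_packet_size):
    """Bytes available to protocol identification, assuming the effective
    value is capped by capture_packet_size as the config note states."""
    low, high = CAPTURE_RANGE
    if not low <= capture_packet_size <= high:
        raise ValueError(f"capture_packet_size must be in [{low}, {high}]")
    return min(capture_packet_size, l7_log_packet_size)


if __name__ == "__main__":
    # Defaults: packets are captured up to 65535 bytes, but only 1024 are parsed.
    print(effective_l7_length(65535, 1024))
    # Unified setting: capture and parse the same 2048 bytes.
    print(effective_l7_length(2048, 2048))
```

With the defaults, most of each captured packet is never parsed, which is why aligning the two values is suggested.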

Disable tunnel decapsulation attempts

## Decapsulation Tunnel Protocols
## Default: [1, 2], means VXLAN and IPIP. Options: 1 (VXLAN), 2 (IPIP), 3 (GRE), 4 (Geneve)
#decap_type:
#- 1
#- 2

This helps reduce CPU consumption.

Disable parsing of X-Forwarded-For, X-Request-ID, TraceID, and SpanID

## HTTP Real Client Key
## Default: X-Forwarded-For.
## Note: It is used to extract the real client IP field in the HTTP header,
##   such as X-Forwarded-For, etc. Leave it empty to disable this feature.
#http_log_proxy_client: X-Forwarded-For

## HTTP X-Request-ID Key
## Default: X-Request-ID
## Note: It is used to extract the fields in the HTTP header that are used
##   to uniquely identify the same request before and after the gateway,
##   such as X-Request-ID, etc. This feature can be turned off by setting
##   it to empty.
#http_log_x_request_id: X-Request-ID

## TraceID Keys
## Default: traceparent, sw8.
## Note: Used to extract the TraceID field in HTTP and RPC headers, supports filling
##   in multiple values separated by commas. This feature can be turned off by
##   setting it to empty.
#http_log_trace_id: traceparent, sw8

## SpanID Keys
## Default: traceparent, sw8.
## Note: Used to extract the SpanID field in HTTP and RPC headers, supports filling
##   in multiple values separated by commas. This feature can be turned off by
##   setting it to empty.
#http_log_span_id: traceparent, sw8

If you do not care about these fields in l7_flow_log, you can turn them off.

Reduce the cBPF buffer size ⭐️

  ###############
  ## AF_PACKET ##
  ###############
  ## AF_PACKET Blocks Switch
  ## Note: When tap_mode != 2, you need to explicitly turn on this switch to
  ##   configure 'afpacket-blocks'.
  #afpacket-blocks-enabled: false

  ## AF_PACKET Blocks
  ## Default: 128, Range: [8, +oo)
  ## Note: deepflow-agent will automatically calculate the number of blocks
  ##   used by AF_PACKET according to max_memory, which can also be specified
  ##   using this configuration item. The size of each block is fixed at 1MB.
  #afpacket-blocks: 128

By default the agent computes a suitable afpacket-blocks value from max-memory (the value appears in the agent log). To reduce memory further, set it explicitly. One block = 1 MB.
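The arithmetic is simple enough to sketch. The helper below is hypothetical and assumes only what the config comment says: each block is fixed at 1 MB and the minimum is 8 blocks. It does not model how many capture sockets the agent opens.

```python
BLOCK_SIZE_BYTES = 1 << 20  # each AF_PACKET block is fixed at 1 MB
MIN_BLOCKS = 8              # lower bound from the config comment


def afpacket_buffer_bytes(blocks):
    """Memory taken by one AF_PACKET ring with the given number of blocks."""
    if blocks < MIN_BLOCKS:
        raise ValueError(f"afpacket-blocks must be at least {MIN_BLOCKS}")
    return blocks * BLOCK_SIZE_BYTES


if __name__ == "__main__":
    for blocks in (128, 64, 16):
        print(f"{blocks} blocks -> {afpacket_buffer_bytes(blocks) >> 20} MB")
```

Note that, per the config comment, afpacket-blocks-enabled must be switched on explicitly when tap_mode != 2 for an explicit value to take effect.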

Reduce the eBPF buffer size ⭐️

    ## eBPF dispatch ring size
    ## Default: 65536. Range: [8192, 131072]
    ## Note: The size of the ring cache queue. The value is 2^n (n in range [13, 17]).
    ##   If the value is between 2^n and 2^(n+1), the eBPF configurator automatically adjusts it down to 2^n.
    #ring-size: 65536

Each unit here is a pointer, and the storage it references can be assumed to be at most l7_log_packet_size bytes (1 KB by default). So by default this can consume up to 64K * 1KB = 64 MB of memory.
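The estimate can be reproduced for other ring sizes with a small sketch. The function names are hypothetical; the rounding rule (down to the nearest power of two within [8192, 131072]) follows the config comment, and the one-slot-per-l7_log_packet_size assumption is the upper bound described above, not a measured figure.

```python
RING_MIN, RING_MAX = 8192, 131072  # 2^13 .. 2^17, from the config comment


def normalize_ring_size(requested):
    """Round the requested size down to the nearest power of two."""
    if not RING_MIN <= requested <= RING_MAX:
        raise ValueError(f"ring-size must be in [{RING_MIN}, {RING_MAX}]")
    return 1 << (requested.bit_length() - 1)


def worst_case_ring_bytes(ring_size, l7_log_packet_size=1024):
    """Upper bound: every slot points at a full l7_log_packet_size payload."""
    return normalize_ring_size(ring_size) * l7_log_packet_size


if __name__ == "__main__":
    for size in (65536, 100000, 8192):
        mb = worst_case_ring_bytes(size) >> 20
        print(f"ring-size {size} -> {normalize_ring_size(size)} slots, up to {mb} MB")
```

Raising l7_log_packet_size scales this bound linearly, so the two settings are best tuned together.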

Other settings that reduce data volume

https://deepflow.io/docs/zh/best-practice/reduce-storage-overhead/