deepflowio / deepflow

eBPF Observability - Distributed Tracing and Profiling
https://deepflow.io
Apache License 2.0
2.92k stars · 330 forks

DeepFlow installed offline from the chart package: some pods occasionally fail to start #6951

Closed LemonSqi closed 4 months ago

LemonSqi commented 4 months ago

Search before asking

DeepFlow Component

Helm Chart

What you expected to happen

I tried to install the community edition of DeepFlow v6.5.8 via Helm chart to monitor a single K8s cluster, but the installation failed. I tried to analyze and pinpoint the problem, but since DeepFlow's internal pods are minimal deployments, the tools available for troubleshooting inside them are limited. I made many installation attempts hoping to narrow the problem down; in practice each install showed a lot of randomness and uncertainty, and I do not know the cause. The pod startup status is shown in the screenshot below: [screenshot]

With a cluster of 1 master node + 3 worker nodes, some pods behave differently on every DeepFlow startup: some may restart several times before succeeding, while others never manage to start at all, e.g. grafana and server: [screenshot] [screenshot]

With a cluster of 1 master node + 1 worker node, things are somewhat different: on a single worker node the pods do eventually start successfully, but each pod may still restart several times before being pulled up.

In summary, it is strange that pods need multiple restarts to start successfully, and DNS resolution is also unstable, which is equally strange. If there is anything that needs to be configured or watched during the startup phase, please let me know. Looking forward to your answers, thanks!

How to reproduce

No response

DeepFlow version

v6.5.8

DeepFlow agent list

No response

Kubernetes CNI

calico

Operation-System/Kernel version

Linux 5.10.0-136.12.0.87

Anything else

No response

Are you willing to submit a PR?

Code of Conduct

1473371932 commented 4 months ago

Hello, you can temporarily disregard the status of Grafana. Based on the screenshot, MySQL and Clickhouse from the chart package are being used. The main issue currently is the unsuccessful connection between Grafana and deepflow-server to MySQL. If using the DeepFlow chart package's databases, there's no need for additional configurations. Could you please attempt to resolve the MySQL resource in the network space of the deepflow-agent pod using nslookup?

LemonSqi commented 4 months ago
  1. I entered the container with kubectl exec -it deepflow-agent -n deepflow -- bash, but there is no nslookup command available inside the container.

  2. I then located the node where deepflow-grafana runs and started a busybox container on that node:

    kubectl run -i --tty --rm busybox --image=cargo.unionpay.com/library/busybox:1.30.0 -n deepflow --overrides='
    {
      "apiVersion": "v1",
      "spec": {
        "nodeSelector": {
          "kubernetes.io/hostname": "node2"
        }
      }
    }' --restart=Never -- sh

    I tried to reach the deepflow-mysql service from this busybox container in the same network space, but it failed: [screenshot]

  3. I can see that nameserver 10.96.0.10 has been written into resolv.conf, so resolution goes through kube-dns, but in practice it does not work: [screenshot]

  4. I checked the DNS service and found nothing wrong: [screenshot]

  5. Looking back at the error I hit: Error: ✗ failed to connect to database: dial tcp 10.107.174.9:30130: connect: connection timed out

This error line shows that the deepflow-mysql service IP was resolved successfully (10.107.174.9:30130), so in theory DNS is fine; but the following server logs still puzzle me:

2024/06/18 09:37:50 /home/runnerx/actions-runner/_work/deepflow/deepflow/server/controller/db/mysql/common/gorm.go:59 [error] failed to initialize database, got error dial tcp: lookup deepflow-mysql: i/o timeout
2024-06-18 09:37:50.750 [ERRO] [db.mysql.common] gorm.go:78 failed to initialize session: dial tcp: lookup deepflow-mysql: i/o timeout, dsn: root:deepflow@tcp(deepflow-mysql:30130)/?charset=utf8mb4&parseTime=True&loc=Local&timeout=30s
2024-06-18 09:37:50.750 [ERRO] [controller] master.go:69 migrate mysql failed: dial tcp: lookup deepflow-mysql: i/o timeout

[lookup mysql i/o timeout]

Everything seems to be a connection timeout. Looking forward to your reply again, ideally explaining what causes this and how to fix it, thanks!

1473371932 commented 4 months ago
> (quoting LemonSqi's comment above)

Based on the current execution results, there are also issues resolving the address deepflow-agent.deepflow.svc.cluster.local. First, confirm whether there are any DNS issues within the Kubernetes cluster itself. If that checks out, DNS resolution can be verified using the following method.

The process of injecting a busybox container, as described above, seems overly cumbersome. Instead, you can use nsenter -t <pod pid> -n to enter the container's network namespace for testing. You may attempt the following steps:

  1. Enter a pod in another namespace and use nslookup to resolve another namespace's pod.
  2. Enter a pod in another namespace and use nslookup to resolve a pod in the deepflow namespace.
  3. Use nslookup within a deepflow namespace pod to resolve another namespace's pod.
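The nsenter workflow above can be sketched as follows. This is a sketch assuming a containerd runtime with crictl available on the node; the pod/container name `deepflow-server` is a placeholder for whichever pod you want to test from:

```shell
# Find the container ID of the target pod's container (name is a placeholder).
CID=$(crictl ps --name deepflow-server -q | head -n1)

# Extract the container's host PID from the runtime's inspect output.
PID=$(crictl inspect "$CID" | grep -m1 '"pid"' | grep -o '[0-9]\+')

# Enter ONLY the pod's network namespace (-n) and run DNS checks there,
# using the node's own binaries, so the minimal image needs no nslookup.
nsenter -t "$PID" -n cat /etc/resolv.conf
nsenter -t "$PID" -n nslookup deepflow-mysql.deepflow.svc.cluster.local
```

Because only the network namespace is entered, the tools executed are the node's, which avoids the "no nslookup in the container" problem entirely.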
LemonSqi commented 4 months ago

Following your suggestion, I entered the container network namespace with nsenter and noticed something strange: a service IP can only be obtained via the coredns instance on the local node. I had previously changed the coredns replica count so that every node carries one coredns pod (master + 3 workers, one each). So when I ran nslookup deepflow-mysql.deepflow.svc.cluster.local 10.96.0.10, resolution only occasionally succeeded. Noticing this detail, I suspected the cause is that kube-dns load-balances each query across the coredns instances: a query only succeeds when kube-dns happens to forward it to the coredns on the node where deepflow-mysql runs; forwarded to any other coredns, traffic goes in but never comes back out. ipvsadm -Ln --stats confirmed this guess: for resolution to succeed, kube-dns must route the query to that particular coredns. In theory these coredns instances should be fully equivalent, each caching the complete DNS records for the namespace.

I then ran another experiment: I restarted the K8s cluster without touching coredns at all. By default kubeadm starts two coredns pods, both landing on the master node. On the master node, nslookup deepflow-mysql.deepflow.svc.cluster.local 10.96.0.10 returned the service IP directly, but running the same command on the other worker nodes got no answer (all pods in the deepflow namespace run on worker nodes). This again supports my guess: the inbound and outbound DNS traffic is tied to the node where the serving coredns runs, and resolution only succeeds when the query is routed to the coredns on that node.
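One way to isolate whether this is a coredns problem or a cross-node pod networking problem (a sketch; the pod IPs shown are placeholders to be read from the first command's output) is to bypass the kube-dns service VIP and query each coredns pod IP directly from a worker node. If the coredns pod on the local node answers while the ones on other nodes time out, the fault is in cross-node pod traffic, not in coredns itself:

```shell
# List the coredns pods with their pod IPs and the nodes they run on.
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# From a worker node, query each coredns POD IP directly,
# bypassing the 10.96.0.10 service VIP and its IPVS load balancing.
nslookup deepflow-mysql.deepflow.svc.cluster.local 10.244.0.2   # placeholder: coredns on node A
nslookup deepflow-mysql.deepflow.svc.cluster.local 10.244.1.2   # placeholder: coredns on node B
```

A pattern of "local pod IP answers, remote pod IPs time out" matches the symptom described above and points at the CNI rather than DNS.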

I cannot explain this behavior. Based on my description, can you spot any clues? The CNI plugin I use is calico, with no configuration or modification at all, started simply via kubectl apply -f calico.yaml. If this is not caused by DNS, what could the cause be, a CNI network anomaly? Hoping for your help, thanks!!!

LemonSqi commented 4 months ago

I looked into it: calico uses the BGP border gateway protocol by default and, without configuration changes, could not communicate across nodes in my environment. After replacing it with the flannel CNI plugin, DeepFlow started successfully. Thanks for the support!!!
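For reference, an alternative to switching CNI is to keep calico but enable encapsulation, so cross-node pod traffic is tunneled even when the underlying network does not route BGP-advertised pod subnets. This is a sketch of the relevant fragment of the calico-node DaemonSet env in calico.yaml (CALICO_IPV4POOL_IPIP and CALICO_IPV4POOL_VXLAN are standard variables in the upstream manifest; verify against the calico version in use):

```yaml
# calico-node DaemonSet container env (fragment):
# "Always" wraps cross-node pod traffic in IP-in-IP, so it works even when
# the underlying network drops unencapsulated pod-to-pod packets.
- name: CALICO_IPV4POOL_IPIP
  value: "Always"        # "Off" requires the network to route pod CIDRs via BGP
# Or use VXLAN encapsulation instead of IPIP:
# - name: CALICO_IPV4POOL_VXLAN
#   value: "Always"
```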
