alibaba / kubeskoop

Network monitoring & diagnosis suite for Kubernetes
https://kubeskoop.io
Apache License 2.0

Some problems with Diagnosis #313

Open Robin3Chan opened 1 month ago

Robin3Chan commented 1 month ago
  1. Capture: When capturing on a pod such as kubeskoop's controller pod, I get the error "failed to run command: pod not found on nettop cache". When capturing on a pod such as kubeskoop's agent pod, I get the error "failed to upload task result", and the controller pod seems to crash and restart.

  2. Connectivity Diagnosis: How can I pass arguments like "--network-plugin" when auto-detection fails on the web console?

Lyt99 commented 1 month ago

Hi, thanks for reporting this issue!

  1. It seems like the controller is not working properly. Could you please provide the controller's error log?

  2. You can use diagnose.extraArgs in the controller config to pass custom arguments to the diagnose command. The controller config lives in a ConfigMap named controller-config or kubeskoop-controller-config.

For example:

logLevel: debug
server:
  httpPort: 10264
  agentPort: 10263
controller:
  namespace: kubeskoop
  prometheus: ""
  loki: ""
  database:
    type: sqlite3
  diagnose:
    # Set network plugin to flannel.
    extraArgs:
      - --network-plugin=flannel
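
To apply this, edit the ConfigMap and, if the change is not picked up automatically, restart the controller. A rough sketch, assuming the ConfigMap and Deployment names of a default install (adjust to the names in your cluster):

kubectl -n kubeskoop edit configmap kubeskoop-controller-config
kubectl -n kubeskoop rollout restart deployment controller
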
Lyt99 commented 1 month ago

Sorry for the late reply. Could you please provide the detailed logs as text? For the pod-crashing scenario, you can fetch the logs of the crashed pod by adding -p to kubectl logs:

kubectl logs -n kubeskoop <controller pod name> -p

Robin3Chan commented 1 month ago
  1. Capture targeting the controller pod. Controller pod logs:

    [GIN] 2024/08/02 - 02:44:20 | 200 |     284.332µs |       10.42.0.1 | GET      "/pods"
    [GIN] 2024/08/02 - 02:44:37 | 200 |    3.723619ms |       10.42.0.1 | POST     "/capture"
    I0802 02:44:37.710384       1 agent.go:37] send task to agent node_name:"sqaobxdn011161204144.sa128" type:Capture type:Ping: server:{} task:{id:"2" capture:{pod:{name:"controller-78c945f8cc-7w4mt" namespace:"kubeskoop"} node:{name:"sqaobxdn011161204144.sa128"} capture_type:"Pod" capture_duration_seconds:10}}
    I0802 02:44:37.712404       1 capture.go:252] store capture result for 2, failed to run command: pod not found on nettop cache
    [GIN] 2024/08/02 - 02:44:37 | 200 |      52.019µs |       10.42.0.1 | GET      "/captures"

    agent-pod logs:

    time="2024-08-02T02:44:30Z" level=info msg="cache process finished, latency: 0.000571s"
    time="2024-08-02T02:44:30Z" level=debug msg="loki ingester send of maxwait"
    time="2024-08-02T02:44:35Z" level=debug msg="loki ingester send of maxwait"
    time="2024-08-02T02:44:37Z" level=info msg="process task: server:{}  task:{id:\"2\"  capture:{pod:{name:\"controller-78c945f8cc-7w4mt\"  namespace:\"kubeskoop\"}  node:{name:\"sqaobxdn011161204144.sa128\"}  capture_type:\"Pod\"  capture_duration_seconds:10}}"
    time="2024-08-02T02:44:37Z" level=error msg="failed to run command: pod not found on nettop cache"
    time="2024-08-02T02:44:40Z" level=debug msg="finished cache process"
    time="2024-08-02T02:44:40Z" level=info msg="cache process finished, latency: 0.000601s"
    time="2024-08-02T02:44:40Z" level=debug msg="loki ingester send of maxwait"
  2. Capture targeting the agent pod. Controller pod logs:

    
    $kubectl logs -n kubeskoop controller-78c945f8cc-7w4mt
    ...
    [GIN-debug] PUT    /config                   --> github.com/alibaba/kubeskoop/pkg/controller/cmd.(*Server).UpdateExporterConfig-fm (3 handlers)
    [GIN-debug] [WARNING] You trusted all proxies, this is NOT safe. We recommend you to set a value.
    Please check https://pkg.go.dev/github.com/gin-gonic/gin#readme-don-t-trust-all-proxies for details.
    [GIN-debug] Listening and serving HTTP on 0.0.0.0:10264
    [GIN] 2024/08/02 - 02:49:08 | 200 |     172.503µs |       10.42.0.1 | GET      "/captures"
    I0802 02:49:08.641722       1 agent.go:31] watch tasks for filter: node_name:"sqaobxdn011161204144.sa128" type:Capture type:Ping

    $kubectl logs -n kubeskoop controller-78c945f8cc-7w4mt -p
    ...
    [GIN] 2024/08/02 - 02:44:37 | 200 |      52.019µs |       10.42.0.1 | GET      "/captures"
    [GIN] 2024/08/02 - 02:48:48 | 200 |     454.246µs |       10.42.0.1 | GET      "/pods"
    [GIN] 2024/08/02 - 02:48:53 | 200 |    3.920268ms |       10.42.0.1 | POST     "/capture"
    I0802 02:48:53.129156       1 agent.go:37] send task to agent node_name:"sqaobxdn011161204144.sa128" type:Capture type:Ping: server:{} task:{id:"3" capture:{pod:{name:"agent-dm5qr" namespace:"kubeskoop" hostNetwork:true} node:{name:"sqaobxdn011161204144.sa128"} capture_type:"Pod" capture_duration_seconds:10}}
    [GIN] 2024/08/02 - 02:48:53 | 200 |      79.636µs |       10.42.0.1 | GET      "/captures"
    [GIN] 2024/08/02 - 02:48:56 | 200 |     170.669µs |       10.42.0.1 | GET      "/captures"
    [GIN] 2024/08/02 - 02:48:59 | 200 |      42.782µs |       10.42.0.1 | GET      "/captures"
    [GIN] 2024/08/02 - 02:49:02 | 200 |      46.882µs |       10.42.0.1 | GET      "/captures"
    [GIN] 2024/08/02 - 02:49:05 | 200 |      51.859µs |       10.42.0.1 | GET      "/captures"

    agent-pod logs:

    ...
    time="2024-08-02T02:48:53Z" level=info msg="process task: server:{} task:{id:\"3\" capture:{pod:{name:\"agent-dm5qr\" namespace:\"kubeskoop\" hostNetwork:true} node:{name:\"sqaobxdn011161204144.sa128\"} capture_type:\"Pod\" capture_duration_seconds:10}}"
    time="2024-08-02T02:48:53Z" level=info msg="start capture: [{ tcpdump -i any -C 100 -w /tmp/3_sqaobxdn011161204144.sa128_host.pcap /tmp/3_sqaobxdn011161204144.sa128_host.pcap 10000000000}]"
    time="2024-08-02T02:48:55Z" level=debug msg="loki ingester send of maxwait"
    time="2024-08-02T02:49:00Z" level=debug msg="finished cache process"
    time="2024-08-02T02:49:00Z" level=info msg="cache process finished, latency: 0.000588s"
    time="2024-08-02T02:49:00Z" level=debug msg="loki ingester send of maxwait"
    time="2024-08-02T02:49:05Z" level=debug msg="loki ingester send of maxwait"
    time="2024-08-02T02:49:06Z" level=error msg="ipcache failed to receive task: rpc error: code = Unavailable desc = error reading from server: EOF"
    time="2024-08-02T02:49:06Z" level=info msg="connecting to controller."
    time="2024-08-02T02:49:06Z" level=error msg="failed to receive task: rpc error: code = Unavailable desc = error reading from server: EOF"
    time="2024-08-02T02:49:06Z" level=info msg="connecting to controller."
    time="2024-08-02T02:49:06Z" level=error msg="failed to upload task result: rpc error: code = Unavailable desc = error reading from server: EOF"
    time="2024-08-02T02:49:06Z" level=error msg="failed to watch task: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.43.134.4:10263: connect: connection refused\""
    time="2024-08-02T02:49:06Z" level=warning msg="retry watching task after 2 seconds"
    time="2024-08-02T02:49:06Z" level=error msg="failed to watch ipcache: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.43.134.4:10263: connect: connection refused\""
    time="2024-08-02T02:49:06Z" level=warning msg="retry watching ipcache after 2 seconds"
    time="2024-08-02T02:49:08Z" level=info msg="connecting to controller."
    time="2024-08-02T02:49:08Z" level=info msg="connecting to controller."
    time="2024-08-02T02:49:08Z" level=info msg="controller connected."
    time="2024-08-02T02:49:08Z" level=error msg="ipcache failed to receive task: rpc error: code = InvalidArgument desc = invalid period"
    time="2024-08-02T02:49:08Z" level=info msg="connecting to controller."
    time="2024-08-02T02:49:08Z" level=warning msg="list ipcache"
    time="2024-08-02T02:49:08Z" level=info msg="current period:88f6bca0-46d0-4653-9338-8d3a49aaf77d, revision: 0"
    time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=11.161.204.144,type=node,node=sqaobxdn011161204144.sa128,pod=,namespace=]"
    time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=10.42.0.105,type=pod,node=,pod=coredns-6799fbcd5-pl8mn,namespace=kube-system]"
    time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=10.42.0.120,type=pod,node=,pod=local-path-provisioner-6c86858495-h2v9c,namespace=kube-system]"
    time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=10.42.0.108,type=pod,node=,pod=metrics-server-54fd9b65b-9kgrj,namespace=kube-system]"
    time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=10.42.0.116,type=pod,node=,pod=svclb-traefik-71ce8185-vq7jn,namespace=kube-system]"
    time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=10.42.0.112,type=pod,node=,pod=traefik-7d5f6474df-52wv5,namespace=kube-system]"
    time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=10.42.0.113,type=pod,node=,pod=controller-78c945f8cc-7w4mt,namespace=kubeskoop]"
    time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=10.42.0.110,type=pod,node=,pod=webconsole-6d64c54cc4-m8n8x,namespace=kubeskoop]"
    ...

Lyt99 commented 1 month ago

There doesn't seem to be an obvious error in these logs. Could you please check whether there is a Golang panic message in the controller pod? I'll also try to reproduce this on my local machine.

Thank you!

Robin3Chan commented 1 month ago

I'm wondering where to find these panic messages in the controller pod. Could you please give me some suggestions?

Lyt99 commented 1 month ago

Usually, when the controller pod crashes and restarts, the cause is a Go program panic. The detailed information (reason and call stack) can be found at the end of the log output from kubectl logs -p.
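
For example, a quick way to look for a panic in the previous container's log (pod name as in your earlier output):

kubectl logs -n kubeskoop <controller pod name> -p | grep -n -A 20 "panic:"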

Robin3Chan commented 1 month ago

I found that the nettop cache always stores nothing, because this call in exporter/nettop/cache.go always returns an empty list:

    // List only sandboxes in the READY state from the CRI runtime service.
    sandboxList, err := criClient.ListPodSandbox(&v1.PodSandboxFilter{
        State: &v1.PodSandboxStateValue{
            State: v1.PodSandboxState_SANDBOX_READY,
        },
    })

Do you know how I can figure this out? Or is there maybe something wrong with my environment? Thank you very much!

Lyt99 commented 1 month ago

Hi, thanks for the information!

Maybe the cause is that the agent can't list pods on nodes properly. What container runtime are you using, and which version is it?
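
If you want to check what the CRI endpoint itself returns, you could also run crictl directly on the node. The runtime endpoint below is only an example and depends on your runtime; if the list of ready sandboxes is empty there as well, the problem is on the CRI side rather than in kubeskoop:

crictl --runtime-endpoint unix:///run/containerd/containerd.sock pods --state ready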

Robin3Chan commented 1 month ago

I'm using Docker Engine and cri-dockerd, as follows:

$ tr \\0 ' ' < /proc/"$(pgrep kubelet)"/cmdline
/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=unix:///run/cri-dockerd.sock --cgroup-driver=systemd --container-runtime-endpoint=/run/cri-dockerd.sock 

My k8s version is

$kubectl version
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-08-23T17:44:59Z", GoVersion:"go1.19", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-08-23T17:38:15Z", GoVersion:"go1.19", Compiler:"gc", Platform:"linux/amd64"}

My docker version is

Server: Docker Engine - Community
 Engine:
  Version:          26.1.3
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.10
  Git commit:       8e96db1
  Built:            Thu May 16 08:33:34 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.32
  GitCommit:        8b3b7ca2e5ce38e8f31a34f35b2b68ceb8470d89
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

My cri-dockerd version is

cri-dockerd 0.3.6 (877dc6a4)