chaosblade-io / chaosblade-box

chaos-platform

ERROR,com.alibaba.chaosblade.box.service.collect.CollectorTimer - collect container fail! #48

Open searchgithub opened 3 years ago

searchgithub commented 3 years ago

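Access: the box log reports this error, even though the kubeconfig should already have the highest privileges. Scenario: chaosblade-box is deployed and fetches pod and related information through k8s. The information can be retrieved at first, but after a short while, apparently during polling, this error is reported.

The error occurs both through the Rancher API and through the k8s API directly: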

Log detail:(rancher) 2021-07-01 11:41:58.356, ERROR, [OkHttp https://rancher.cn.nonprod/...] com.alibaba.chaosblade.box.service.collect.CollectorTimer - collect container fail!

Log detail:(k8s api) 2021-07-01 14:47:03.474,ERROR,com.alibaba.chaosblade.box.service.collect.CollectorTimer - collect container fail! io.kubernetes.client.openapi.ApiException: java.net.SocketTimeoutException: timeout

Rancher kubeconfig:

apiVersion: v1
kind: Config
clusters:
- name: "performance"
  cluster:
    server: "https://rancher.cn.nonprod/k8s/clusters/c-65jpj"
    certificate-authority-data: "LS0t...RFLS0tLS0="

users:
- name: "performance"
  user:
    token: "kubeconfig-user-9vn68:g6vc5sqjdlltsvf4ml9k6gznrc6xv4xprwmk4vmms4dxqq8rq5b9gb"

contexts:
- name: "performance"
  context:
    user: "performance"
    cluster: "performance"

current-context: "performance"

Native k8s kubeconfig:

apiVersion: v1
kind: Config
clusters:
- cluster:
    api-version: v1
    certificate-authority-data: LS0...0K
    server: "https://whdrcsrv220.cn.nonprod:6443"
  name: "cluster.perf"
contexts:
- context:
    cluster: "cluster.perf"
    user: "kube-admin-cluster.perf"
  name: "cluster.perf"
current-context: "cluster.perf"
users:
- name: "kube-admin-cluster.perf"
  user:
    client-certificate-data: LS0...LQo=
    client-key-data: LS0...LS0tCg==

The current workaround is to "disable collection" whenever no experiment is running.

tiny-x commented 3 years ago

Do you have the detailed error stack trace?

searchgithub commented 3 years ago

[root@xxxx_chaos /var/log/chaosblade-box/logs]# tar -czvf 21_7_6_log.tar.gz default_2021-07-06.0.log output_log.txt sh_jstack.sh
default_2021-07-06.0.log
output_log.txt
sh_jstack.sh
[root@aldi_chaos /var/log/chaosblade-box/logs]# du -sm 21_7_6_log.tar.gz
6 21_7_6_log.tar.gz

21_7_6_log.tar.gz

@tiny-x

searchgithub commented 3 years ago

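After an initial investigation by our developers, the cause of the logged error has largely been pinned down: when the probe collects data, it sends requests so aggressively that the k8s API server's CPU is maxed out.

Could the collection frequency be exposed as a parameter that can be adjusted when running the jar?
Currently, after fetching the node list, the collector polls the pods on every node every 30 seconds, and then polls the containers in every pod every 30 seconds.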
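
A minimal sketch of what such a startup parameter could look like. The system property name `collector.interval.seconds`, the class name, and the `start` helper are all hypothetical; this is not an existing chaosblade-box option, only an illustration of the request:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CollectorIntervalSketch {
    // Hypothetical property name; not an existing chaosblade-box setting.
    static final String INTERVAL_PROPERTY = "collector.interval.seconds";
    // The 30-second interval described in this comment.
    static final long DEFAULT_INTERVAL_SECONDS = 30;

    // Reads the interval from a JVM system property, falling back to the
    // default when the property is missing, malformed, or not positive.
    static long resolveIntervalSeconds() {
        String raw = System.getProperty(INTERVAL_PROPERTY);
        if (raw == null) {
            return DEFAULT_INTERVAL_SECONDS;
        }
        try {
            long value = Long.parseLong(raw.trim());
            return value > 0 ? value : DEFAULT_INTERVAL_SECONDS;
        } catch (NumberFormatException e) {
            return DEFAULT_INTERVAL_SECONDS;
        }
    }

    // Schedules the given collection pass with a fixed delay, so the next
    // pass starts only after the previous one has finished and slow API
    // responses do not cause passes to pile up.
    static ScheduledExecutorService start(Runnable collectOnce) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(
                collectOnce, 0, resolveIntervalSeconds(), TimeUnit.SECONDS);
        return scheduler;
    }

    public static void main(String[] args) {
        // Only prints the resolved value; call start(...) to actually schedule.
        System.out.println("collection interval (s): " + resolveIntervalSeconds());
    }
}
```

With something like this in place, the interval could be raised at startup, for example `java -Dcollector.interval.seconds=120 CollectorIntervalSketch`.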

douxiaoniu77 commented 2 years ago

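After an initial investigation by our developers, the cause of the logged error has largely been pinned down: when the probe collects data, it sends requests so aggressively that the k8s API server's CPU is maxed out.

Could the collection frequency be exposed as a parameter that can be adjusted when running the jar?

 Currently, after fetching the node list, the collector polls the pods on every node every 30 seconds,
 and then polls the containers in every pod every 30 seconds.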

I ran into a similar problem. With this behavior I did indeed see the apiserver CPU usage spike. Doesn't this put far too much pressure on the apiserver?
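
A rough way to gauge that pressure, under the assumption (not confirmed from the chaosblade-box source) that each pass issues one call to list nodes, one call per node to list its pods, and one call per pod to list its containers. The class name and the cluster sizes in `main` are illustrative and not taken from this issue:

```java
public class PollingLoadEstimate {
    // Estimated API requests per minute for one collection pass every
    // intervalSeconds, assuming 1 node-list call + 1 call per node
    // + 1 call per pod (an assumption about the collector, see above).
    static double requestsPerMinute(int nodes, int pods, int intervalSeconds) {
        if (intervalSeconds <= 0) {
            throw new IllegalArgumentException("intervalSeconds must be positive");
        }
        long requestsPerPass = 1L + nodes + pods;
        return requestsPerPass * 60.0 / intervalSeconds;
    }

    public static void main(String[] args) {
        // Illustrative cluster size only.
        int nodes = 50;
        int pods = 2000;
        System.out.println("30s interval:  " + requestsPerMinute(nodes, pods, 30));
        System.out.println("300s interval: " + requestsPerMinute(nodes, pods, 300));
    }
}
```

Under these assumptions the request rate grows linearly with the number of pods and is inversely proportional to the interval, which is why a configurable interval would directly reduce the load.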