astefanutti / kubebox

⎈❏ Terminal and Web console for Kubernetes
http://astefanutti.github.io/kubebox
MIT License

"Resource usage metrics unavailable" after deploying cAdvisor daemonset #117

Closed. s3asfour closed this issue 1 year ago

s3asfour commented 3 years ago

I am trying to use Kubebox to access the resource usage of the pods in my cluster. I installed Kubebox v0.9.0 and deployed the cAdvisor daemonset, as mentioned in the README, but now I get the error message "Resource usage metrics unavailable". I don't see any logs in the cAdvisor pod, which I was hoping would point me to the issue, so I have no idea what the problem is.

Any help is appreciated!

astefanutti commented 3 years ago

Could you double-check that you have the expected permissions by running the commands from the FAQ:

https://github.com/astefanutti/kubebox#faq
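
For reference, the checks are along these lines (written from memory, so treat the linked FAQ as authoritative):

$ kubectl auth can-i get pods/log
$ kubectl auth can-i get pods/proxy

Both should print "yes".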

Could you also check the cAdvisor pod logs for any particular issue?
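
For example (the namespace and pod name are placeholders for your deployment):

$ kubectl logs -n cadvisor <cadvisor_pod>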

s3asfour commented 3 years ago

Both commands return "yes".

The cAdvisor pod logs are empty, which is very weird :/ that's the first thing I looked at. But maybe that in itself hints at the issue?

astefanutti commented 3 years ago

Yes, it's surprising that the cAdvisor logs are empty. I've checked on one of my setups and can see lots of log statements.

Could you check the events for the cAdvisor pods, or look directly into the pod manifests?
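
For example (the pod name is a placeholder):

$ kubectl describe pod -n cadvisor <cadvisor_pod>
$ kubectl get pod -n cadvisor <cadvisor_pod> -o yaml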

s3asfour commented 3 years ago

Here are the events:

Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  5m57s  default-scheduler  Successfully assigned cadvisor/cadvisor-jlwb9 to gke-staging-cloud-cl-staging-cloud-no-ce55b49b-wk4r
  Normal  Pulling    5m55s  kubelet            Pulling image "k8s.gcr.io/cadvisor:v0.36.0"
  Normal  Pulled     5m37s  kubelet            Successfully pulled image "k8s.gcr.io/cadvisor:v0.36.0"
  Normal  Created    5m36s  kubelet            Created container cadvisor
  Normal  Started    5m36s  kubelet            Started container cadvisor

And here's the manifest of my running pod:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2021-01-28T14:51:12Z"
  generateName: cadvisor-
  labels:
    app: cadvisor
    controller-revision-hash: 795f564df9
    name: cadvisor
    pod-template-generation: "1"
  name: cadvisor-jlwb9
  namespace: cadvisor
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: cadvisor
    uid: 1cc42669-42d7-4a07-acbb-62e33cd02eed
  resourceVersion: "776256"
  selfLink: /api/v1/namespaces/cadvisor/pods/cadvisor-jlwb9
  uid: 95dff31a-748d-42c5-9710-718efcac52af
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - gke-staging-cloud-cl-staging-cloud-no-ce55b49b-wk4r
  automountServiceAccountToken: false
  containers:
  - args:
    - --storage_duration=5m0s
    - --housekeeping_interval=10s
    image: k8s.gcr.io/cadvisor:v0.36.0
    imagePullPolicy: IfNotPresent
    name: cadvisor
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    resources:
      limits:
        cpu: 300m
        memory: 2000Mi
      requests:
        cpu: 150m
        memory: 200Mi
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /rootfs
      name: rootfs
      readOnly: true
    - mountPath: /var/log
      name: var-log
      readOnly: true
    - mountPath: /var/run
      name: var-run
      readOnly: true
    - mountPath: /sys
      name: sys
      readOnly: true
    - mountPath: /var/lib/containers
      name: containers
      readOnly: true
    - mountPath: /var/lib/docker
      name: docker
      readOnly: true
    - mountPath: /dev/disk
      name: disk
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: gke-staging-cloud-cl-staging-cloud-no-ce55b49b-wk4r
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: cadvisor
  serviceAccountName: cadvisor
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - hostPath:
      path: /
      type: ""
    name: rootfs
  - hostPath:
      path: /var/log
      type: ""
    name: var-log
  - hostPath:
      path: /var/run
      type: ""
    name: var-run
  - hostPath:
      path: /sys
      type: ""
    name: sys
  - hostPath:
      path: /var/lib/containers
      type: ""
    name: containers
  - hostPath:
      path: /var/lib/docker
      type: ""
    name: docker
  - hostPath:
      path: /dev/disk
      type: ""
    name: disk
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-01-28T14:51:12Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-01-28T14:52:06Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-01-28T14:52:06Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-01-28T14:51:12Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://21b1453a6b71dd31f4f654dc633ba3dd3a97dfb8c7a4f5a55293a54cdf0437a7
    image: k8s.gcr.io/cadvisor:v0.36.0
    imageID: docker-pullable://k8s.gcr.io/cadvisor@sha256:16bc6858dc5b7063c7d89153ad6544370eb79cb27a1b8d571f31b98673f7a324
    lastState: {}
    name: cadvisor
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-01-28T14:51:33Z"
  hostIP: 10.0.0.6
  phase: Running
  podIP: 10.0.4.2
  podIPs:
  - ip: 10.0.4.2
  qosClass: Burstable
  startTime: "2021-01-28T14:51:12Z"

astefanutti commented 3 years ago

This looks normal.

Could you run:

$ kubectl get --raw "/api/v1/namespaces/cadvisor/pods/cadvisor-jlwb9/proxy/api/v2.0/spec?recursive=true"

s3asfour commented 3 years ago

I get the error message:

Error from server (ServiceUnavailable): the server is currently unable to handle the request

astefanutti commented 3 years ago

Thanks, this is clearly the issue.

Here is the output that I have:

$ kubectl get --raw "/api/v1/namespaces/cadvisor/pods/cadvisor-rd24x/proxy/api/v2.0/spec?recursive=true"
{"/":{"creation_time":"2021-01-28T11:27:47.669999981Z","has_cpu":true,"cpu":{"limit":1024,"max_limit":0,"mask":"0-3"},"has_memory":true,"memory":{"limit":3959975936,"reservation":8796093018112,"swap_limit":104853504},"has_custom_metrics":false,"has_processes":false,"processes":{},"has_network":true,"has_filesystem":true,"has_diskio":true}...

Could you run the following, to determine whether the issue is with the API server proxy or the cAdvisor pod:

$ kubectl get --raw "/api/v1/namespaces/cadvisor/pods/cadvisor-jlwb9/proxy"
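
If the proxy endpoint keeps failing, another cross-check (a sketch, not something Kubebox itself does) is to bypass the proxy subresource with a port-forward and query cAdvisor directly:

$ kubectl port-forward -n cadvisor cadvisor-jlwb9 8080:8080
$ curl "http://localhost:8080/api/v2.0/spec?recursive=true"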

s3asfour commented 3 years ago

I get the same error message:

Error from server (ServiceUnavailable): the server is currently unable to handle the request

astefanutti commented 3 years ago

Thanks. If your cluster supports it, it'd be useful, as a final check, to run:

$ kubectl debug -it -n cadvisor cadvisor-jlwb9 --image=busybox
# curl http://localhost:8080/

For some reason, it seems cAdvisor does not start correctly, yet the pod reports a healthy condition!

One approach could be to try deploying cAdvisor using the manifests from its own repository:

https://github.com/google/cadvisor/tree/master/deploy/kubernetes
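
If that directory still provides the kustomize layout, the deployment would look roughly like this (a sketch, not verified against the current state of that repository):

$ git clone https://github.com/google/cadvisor.git
$ kubectl apply -k cadvisor/deploy/kubernetes/base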

There may be a compatibility issue between the version of the cAdvisor template that Kubebox provides and your cluster.

s3asfour commented 3 years ago

I deployed cAdvisor from its repo and now I see some logs in the pod, but the Kubebox resource metrics window still shows the same error:

Resource usage metrics unavailable

I0128 20:58:39.323099       1 storagedriver.go:50] Caching stats in memory for 2m0s
I0128 20:58:39.323729       1 manager.go:154] cAdvisor running in container: "/sys/fs/cgroup/cpu,cpuacct"
I0128 20:58:39.417720       1 fs.go:142] Filesystem UUIDs: map[1089-6870:/dev/sda12 33ee302f-5e82-4695-b3ff-6e803d26508b:/dev/sda1 e286b489-3849-4a10-b7be-42e853faaa8d:/dev/sda8]
I0128 20:58:39.417758       1 fs.go:143] Filesystem partitions: map[tmpfs:{mountpoint:/dev major:0 minor:268 fsType:tmpfs blockSize:0} /dev/root:{mountpoint:/rootfs major:253 minor:0 fsType:ext2 blockSize:0} /dev/sda8:{mountpoint:/rootfs/usr/share/oem major:8 minor:8 fsType:ext4 blockSize:0} /dev/sda1:{mountpoint:/var/lib/docker major:8 minor:1 fsType:ext4 blockSize:0} shm:{mountpoint:/rootfs/var/lib/docker/containers/86f6702134b6c286fb185c69d9a414430bb0fa6e94c012c585bde84c5182159f/mounts/shm major:0 minor:59 fsType:tmpfs blockSize:0}]
I0128 20:58:39.425589       1 manager.go:227] Machine: {NumCores:2 CpuFrequency:2299998 MemoryCapacity:4140908544 HugePages:[{PageSize:2048 NumPages:0}] MachineID:f113b713760c17bb1c10725e60cceac4 SystemUUID:f113b713-760c-17bb-1c10-725e60cceac4 BootID:dc3a5eb4-472f-48dc-b294-8930a56a0440 Filesystems:[{Device:/dev/sda8 DeviceMajor:8 DeviceMinor:8 Capacity:12042240 Type:vfs Inodes:4096 HasInodes:true} {Device:/dev/sda1 DeviceMajor:8 DeviceMinor:1 Capacity:101241290752 Type:vfs Inodes:6258720 HasInodes:true} {Device:shm DeviceMajor:0 DeviceMinor:59 Capacity:67108864 Type:vfs Inodes:505482 HasInodes:true} {Device:overlay DeviceMajor:0 DeviceMinor:252 Capacity:101241290752 Type:vfs Inodes:6258720 HasInodes:true} {Device:tmpfs DeviceMajor:0 DeviceMinor:268 Capacity:67108864 Type:vfs Inodes:505482 HasInodes:true} {Device:/dev/root DeviceMajor:253 DeviceMinor:0 Capacity:1279787008 Type:vfs Inodes:79360 HasInodes:true}] DiskMap:map[253:0:{Name:dm-0 Major:253 Minor:0 Size:1300234240 Scheduler:none} 9:0:{Name:md0 Major:9 Minor:0 Size:0 Scheduler:none} 8:0:{Name:sda Major:8 Minor:0 Size:107374182400 Scheduler:mq-deadline}] NetworkDevices:[{Name:cbr0 MacAddress:36:e3:3f:b0:97:d5 Speed:0 Mtu:1460} {Name:eth0 MacAddress:42:01:0a:00:00:07 Speed:-1 Mtu:1460}] Topology:[{Id:0 Memory:4140908544 Cores:[{Id:0 Threads:[0 1] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:47185920 Type:Unified Level:3}]}] CloudProvider:GCE InstanceType:e2-medium InstanceID:2096945664381373919}
I0128 20:58:39.449820       1 manager.go:233] Version: {KernelVersion:4.19.112+ ContainerOsVersion:Alpine Linux v3.7 DockerVersion:19.03.1 DockerAPIVersion:1.40 CadvisorVersion:v0.30.2 CadvisorRevision:de723a09}
I0128 20:58:39.482021       1 factory.go:356] Registering Docker factory
I0128 20:58:39.506078       1 factory.go:136] Registering containerd factory
I0128 20:58:39.506313       1 factory.go:54] Registering systemd factory
I0128 20:58:39.509619       1 factory.go:86] Registering Raw factory
I0128 20:58:39.513067       1 manager.go:1205] Started watching for new ooms in manager
W0128 20:58:39.513129       1 manager.go:340] Could not configure a source for OOM detection, disabling OOM events: open /dev/kmsg: no such file or directory
I0128 20:58:39.514451       1 manager.go:356] Starting recovery of all containers
I0128 20:58:41.516086       1 manager.go:361] Recovery completed
I0128 20:58:42.705861       1 cadvisor.go:165] Starting cAdvisor version: v0.30.2-de723a09 on port 8080
astefanutti commented 3 years ago

Kubebox expects cAdvisor to be deployed in the cadvisor namespace. There are also a couple of things to be done to make sure cAdvisor is configured for the container runtime used in the cluster:

https://github.com/astefanutti/kubebox/blob/4ae0a2929a17c132a1ea61144e17b51f93eb602f/cadvisor.yaml#L23-L39
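
A quick way to see which runtime your nodes use, so you know which part of that template applies (the CONTAINER-RUNTIME column shows e.g. docker:// or containerd://):

$ kubectl get nodes -o wide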

It also looks like the version deployed from the cAdvisor repository is quite old (v0.30.2-de723a09), so it may not expose the latest version of the API.

A quick check is to run:

$ kubectl get --raw "/api/v1/namespaces/cadvisor/pods/<cadvisor_pod>/proxy/api/v2.0/spec?recursive=true"

which is the first request Kubebox makes.

widnyana commented 3 years ago

$ debug -it -n cadvisor cadvisor-jlwb9 --image=busybox
# curl http://localhost:8080/

Hi @astefanutti, how can I run that debug command? kubectl exec doesn't have the --image flag :(

astefanutti commented 3 years ago

@widnyana it's kubectl debug: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-running-pod/#ephemeral-container-example. I forgot to add kubectl in my earlier comment.
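
For reference, the full command from above would be:

$ kubectl debug -it -n cadvisor cadvisor-jlwb9 --image=busybox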

4m3ndy commented 3 years ago

I had the same issue and managed to resolve it on my private GKE cluster: I added a firewall rule to allow connections from the Kube API server (master nodes) to the worker nodes on port 8080.
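
For anyone hitting the same thing, such a rule can be created roughly like this (a sketch; the network, source range, and target tag are placeholders for your cluster's values):

$ gcloud compute firewall-rules create allow-master-to-cadvisor \
    --network=<cluster-network> \
    --source-ranges=<master-ipv4-cidr> \
    --target-tags=<node-tag> \
    --allow=tcp:8080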

astefanutti commented 1 year ago

Let me close this. The cAdvisor deployment example has been updated with the latest version. Let me know if you still face the issue.