getanteon / alaz

Alaz: Advanced eBPF Agent for Kubernetes Observability – Effortlessly monitor K8s service interactions and performance metrics in your K8s environment. Gain in-depth insights with service maps, metrics, and more, while staying alert to crucial system anomalies 🐝
https://getanteon.com
GNU Affero General Public License v3.0

Crashes on self-hosted with panic: runtime error: integer divide by zero #182

Open orhun opened 2 months ago

orhun commented 2 months ago

My setup is the following:

```
$ kubectl logs -n anteon alaz-daemonset-sskqx

{"level":"info","tag":"v0.11.3","time":1723187890,"message":"alaz tag"}
{"level":"info","time":1723187890,"message":"k8sCollector initializing..."}
{"level":"info","time":1723187890,"message":"Connected successfully to CRI using endpoint unix:///proc/1/root/run/containerd/containerd.sock"}
panic: runtime error: integer divide by zero

goroutine 47 [running]:
github.com/ddosify/alaz/aggregator.(*ClusterInfo).handleSocketMapCreation(0xc0002dc5b0)
	/app/aggregator/cluster.go:89 +0x33d
created by github.com/ddosify/alaz/aggregator.newClusterInfo in goroutine 1
	/app/aggregator/cluster.go:59 +0x1a9
```
kubectl describe pod -n anteon alaz-daemonset-sskqx

```
Name:             alaz-daemonset-sskqx
Namespace:        anteon
Priority:         0
Service Account:  alaz-serviceaccount
Node:             thinkpad/192.168.1.38
Start Time:       Fri, 09 Aug 2024 10:01:44 +0300
Labels:           app=alaz
                  controller-revision-hash=6f9d87bfc4
                  pod-template-generation=1
Annotations:      cni.projectcalico.org/containerID: 003a6554ea84ff581daee5b353ccf9b6619a8febdb6302ce34a566764f0e45f3
                  cni.projectcalico.org/podIP: 10.1.19.183/32
                  cni.projectcalico.org/podIPs: 10.1.19.183/32
Status:           Running
IP:               10.1.19.183
IPs:
  IP:           10.1.19.183
Controlled By:  DaemonSet/alaz-daemonset
Containers:
  alaz-pod:
    Container ID:  containerd://c6c904add2264b0016798d11550f2ff05e683fe713c681c3f3a415e31de9f07c
    Image:         ddosify/alaz:v0.11.3
    Image ID:      docker.io/ddosify/alaz@sha256:08dbbb8ba337ce340a8ba8800e710ff5a2df9612ea258cdc472867ea0bb97224
    Port:          8181/TCP
    Host Port:     0/TCP
    Args:
      --no-collector.wifi
      --no-collector.hwmon
      --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
      --collector.netclass.ignored-devices=^(veth.*)$
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 09 Aug 2024 10:18:10 +0300
      Finished:     Fri, 09 Aug 2024 10:18:11 +0300
    Ready:          False
    Restart Count:  8
    Limits:
      memory:  1Gi
    Requests:
      cpu:     1
      memory:  400Mi
    Environment:
      TRACING_ENABLED:             true
      METRICS_ENABLED:             true
      LOGS_ENABLED:                false
      BACKEND_HOST:                http://bore.pub:39548/api-alaz
      LOG_LEVEL:                   1
      MONITORING_ID:               7c6a484a-ec47-46a6-946d-4071ff6cf883
      SEND_ALIVE_TCP_CONNECTIONS:  false
      NODE_NAME:                    (v1:spec.nodeName)
    Mounts:
      /sys/kernel/debug from debugfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df6xh (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  debugfs:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/kernel/debug
    HostPathType:
  kube-api-access-df6xh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Warning  BackOff  3m54s (x68 over 18m)  kubelet  Back-off restarting failed container alaz-pod in pod alaz-daemonset-sskqx_anteon(a3d74951-574e-4149-8db3-9749a627f5fd)
```
alaz.yaml

```yml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: alaz-serviceaccount
  namespace: anteon
---
# For alaz to keep track of changes in cluster
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: alaz-role
  namespace: anteon
rules:
- apiGroups:
  - "*"
  resources:
  - pods
  - services
  - endpoints
  - replicasets
  - deployments
  - daemonsets
  - statefulsets
  verbs:
  - "get"
  - "list"
  - "watch"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: alaz-role-binding
  namespace: anteon
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: alaz-role
subjects:
- kind: ServiceAccount
  name: alaz-serviceaccount
  namespace: anteon
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: alaz-daemonset
  namespace: anteon
spec:
  selector:
    matchLabels:
      app: alaz
  template:
    metadata:
      labels:
        app: alaz
    spec:
      hostPID: true
      containers:
      - env:
        - name: TRACING_ENABLED
          value: "true"
        - name: METRICS_ENABLED
          value: "true"
        - name: LOGS_ENABLED
          value: "false"
        - name: BACKEND_HOST
          value: http://bore.pub:39548/api-alaz
        - name: LOG_LEVEL
          value: "1"
        # - name: EXCLUDE_NAMESPACES
        #   value: "^anteon.*"
        - name: MONITORING_ID
          value: 7c6a484a-ec47-46a6-946d-4071ff6cf883
        - name: SEND_ALIVE_TCP_CONNECTIONS # Send undetected protocol connections (unknown connections)
          value: "false"
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        args:
        - --no-collector.wifi
        - --no-collector.hwmon
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.netclass.ignored-devices=^(veth.*)$
        image: ddosify/alaz:v0.11.3
        imagePullPolicy: IfNotPresent
        name: alaz-pod
        ports:
        - containerPort: 8181
          protocol: TCP
        resources:
          limits:
            memory: 1Gi
          requests:
            cpu: "1"
            memory: 400Mi
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        # needed for linking ebpf trace programs
        volumeMounts:
        - mountPath: /sys/kernel/debug
          name: debugfs
          readOnly: false
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: alaz-serviceaccount
      serviceAccountName: alaz-serviceaccount
      terminationGracePeriodSeconds: 30
      # needed for linking ebpf trace programs
      volumes:
      - name: debugfs
        hostPath:
          path: /sys/kernel/debug
```

The only thing I did differently compared to the documentation was using bore.pub instead of ngrok, which I don't think should be a problem.

I'm running Arch Linux with kernel 6.10.1-arch1-1.

orhun commented 2 months ago

I'm getting the same issue when I deploy via the Helm chart as well:

{"level":"info","tag":"v0.12.0","time":1723477886,"message":"alaz tag"}
{"level":"info","time":1723477886,"message":"k8sCollector initializing..."}
{"level":"info","time":1723477886,"message":"Connected successfully to CRI using endpoint unix:///proc/1/root/run/containerd/containerd.sock"}
{"level":"error","time":1723477887,"message":"error creating gpu collector: failed to load nvidia driver: <nil>"}
{"level":"error","time":1723477887,"message":"error exporting gpu metrics: failed to load nvidia driver: <nil>"}
panic: runtime error: integer divide by zero

goroutine 85 [running]:
github.com/ddosify/alaz/aggregator.(*ClusterInfo).handleSocketMapCreation(0xc0002fcd90)
    /app/aggregator/cluster.go:89 +0x33d
created by github.com/ddosify/alaz/aggregator.newClusterInfo in goroutine 1
    /app/aggregator/cluster.go:59 +0x1a9
orhun commented 2 months ago

I guess there is a race condition on this line:

https://github.com/getanteon/alaz/blob/2f383f1da32b5a7173fff44104696166a1f16d45/aggregator/cluster.go#L89
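For context on why that line can panic: in Go, the `/` and `%` operators panic with exactly this message when the right-hand operand is 0 at runtime. So if cluster.go:89 distributes work with something like `pid % uint32(len(shards))`, and the shard slice is sized from a value that can be zero, or is read before another goroutine finishes initializing it, you get this crash on every start. A minimal sketch of that failure mode (hypothetical names and structure, not the actual alaz code):

```go
package main

import "fmt"

// clusterInfo is a hypothetical reduction of the suspected bug, not the
// actual alaz type: a shard slice whose length is derived from a value
// that can be zero (or not yet set) when workers start using it.
type clusterInfo struct {
	shards []chan uint32
}

func (c *clusterInfo) handleSocketMapCreation(pid uint32) {
	// `x % y` panics with "runtime error: integer divide by zero" when
	// y is 0 at runtime, matching the stack traces above.
	idx := pid % uint32(len(c.shards)) // panics if len(c.shards) == 0
	c.shards[idx] <- pid
}

func main() {
	c := &clusterInfo{} // shards never populated, so len(c.shards) == 0
	defer func() { fmt.Println("recovered:", recover()) }()
	c.handleSocketMapCreation(1234)
}
```

If that's the pattern at cluster.go:89, either guarding against a zero shard count or making the initialization happen-before the goroutine spawned in newClusterInfo would avoid the panic. A build with `-race` should show whether the zero length comes from an ordering problem between goroutines or from the count simply being computed as 0 on this machine.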