orhun opened 2 months ago
I'm getting the same issue when I deploy via Helm chart as well:
```
{"level":"info","tag":"v0.12.0","time":1723477886,"message":"alaz tag"}
{"level":"info","time":1723477886,"message":"k8sCollector initializing..."}
{"level":"info","time":1723477886,"message":"Connected successfully to CRI using endpoint unix:///proc/1/root/run/containerd/containerd.sock"}
{"level":"error","time":1723477887,"message":"error creating gpu collector: failed to load nvidia driver: <nil>"}
{"level":"error","time":1723477887,"message":"error exporting gpu metrics: failed to load nvidia driver: <nil>"}
panic: runtime error: integer divide by zero

goroutine 85 [running]:
github.com/ddosify/alaz/aggregator.(*ClusterInfo).handleSocketMapCreation(0xc0002fcd90)
	/app/aggregator/cluster.go:89 +0x33d
created by github.com/ddosify/alaz/aggregator.newClusterInfo in goroutine 1
	/app/aggregator/cluster.go:59 +0x1a9
```
I guess there is a race condition on `aggregator/cluster.go:89`, the line the panic points at.
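For what it's worth, Go reports `runtime error: integer divide by zero` for the modulo operator as well, so one plausible shape of the bug (a hypothetical sketch with made-up names, not Alaz's actual code) is an `i % len(workers)`-style sharding expression running against a slice that a racing initializer hasn't populated yet:

```go
package main

import "fmt"

// pickWorker sketches the hypothesized failure mode: indexing into a
// per-worker structure with `i % len(workers)` panics with
// "runtime error: integer divide by zero" when the slice is still empty,
// e.g. because a concurrent initializer hasn't filled it in yet.
func pickWorker(i int, workers []chan int) (idx int, panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	idx = i % len(workers) // len == 0 -> integer divide by zero
	return idx, false
}

func main() {
	var empty []chan int // not yet filled by the (racing) initializer
	_, panicked := pickWorker(42, empty)
	fmt.Println("panicked:", panicked) // prints "panicked: true"
}
```

If that is indeed the mechanism, a `len(workers) == 0` guard, or making the initialization happen-before the goroutine starts, would avoid the panic.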
My setup is the following:
`kubectl describe pod -n anteon alaz-daemonset-sskqx`:
```
Name:             alaz-daemonset-sskqx
Namespace:        anteon
Priority:         0
Service Account:  alaz-serviceaccount
Node:             thinkpad/192.168.1.38
Start Time:       Fri, 09 Aug 2024 10:01:44 +0300
Labels:           app=alaz
                  controller-revision-hash=6f9d87bfc4
                  pod-template-generation=1
Annotations:      cni.projectcalico.org/containerID: 003a6554ea84ff581daee5b353ccf9b6619a8febdb6302ce34a566764f0e45f3
                  cni.projectcalico.org/podIP: 10.1.19.183/32
                  cni.projectcalico.org/podIPs: 10.1.19.183/32
Status:           Running
IP:               10.1.19.183
IPs:
  IP:           10.1.19.183
Controlled By:  DaemonSet/alaz-daemonset
Containers:
  alaz-pod:
    Container ID:  containerd://c6c904add2264b0016798d11550f2ff05e683fe713c681c3f3a415e31de9f07c
    Image:         ddosify/alaz:v0.11.3
    Image ID:      docker.io/ddosify/alaz@sha256:08dbbb8ba337ce340a8ba8800e710ff5a2df9612ea258cdc472867ea0bb97224
    Port:          8181/TCP
    Host Port:     0/TCP
    Args:
      --no-collector.wifi
      --no-collector.hwmon
      --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
      --collector.netclass.ignored-devices=^(veth.*)$
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 09 Aug 2024 10:18:10 +0300
      Finished:     Fri, 09 Aug 2024 10:18:11 +0300
    Ready:          False
    Restart Count:  8
    Limits:
      memory:  1Gi
    Requests:
      cpu:     1
      memory:  400Mi
    Environment:
      TRACING_ENABLED:             true
      METRICS_ENABLED:             true
      LOGS_ENABLED:                false
      BACKEND_HOST:                http://bore.pub:39548/api-alaz
      LOG_LEVEL:                   1
      MONITORING_ID:               7c6a484a-ec47-46a6-946d-4071ff6cf883
      SEND_ALIVE_TCP_CONNECTIONS:  false
      NODE_NAME:                    (v1:spec.nodeName)
    Mounts:
      /sys/kernel/debug from debugfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df6xh (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  debugfs:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/kernel/debug
    HostPathType:
  kube-api-access-df6xh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
```

alaz.yaml
```yml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: alaz-serviceaccount
  namespace: anteon
---
# For alaz to keep track of changes in cluster
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: alaz-role
  namespace: anteon
rules:
  - apiGroups:
      - "*"
    resources:
      - pods
      - services
      - endpoints
      - replicasets
      - deployments
      - daemonsets
      - statefulsets
    verbs:
      - "get"
      - "list"
      - "watch"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: alaz-role-binding
  namespace: anteon
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: alaz-role
subjects:
  - kind: ServiceAccount
    name: alaz-serviceaccount
    namespace: anteon
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: alaz-daemonset
  namespace: anteon
spec:
  selector:
    matchLabels:
      app: alaz
  template:
    metadata:
      labels:
        app: alaz
    spec:
      hostPID: true
      containers:
        - env:
            - name: TRACING_ENABLED
              value: "true"
            - name: METRICS_ENABLED
              value: "true"
            - name: LOGS_ENABLED
              value: "false"
            - name: BACKEND_HOST
              value: http://bore.pub:39548/api-alaz
            - name: LOG_LEVEL
              value: "1"
            # - name: EXCLUDE_NAMESPACES
            #   value: "^anteon.*"
            - name: MONITORING_ID
              value: 7c6a484a-ec47-46a6-946d-4071ff6cf883
            - name: SEND_ALIVE_TCP_CONNECTIONS # Send undetected protocol connections (unknown connections)
              value: "false"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
          args:
            - --no-collector.wifi
            - --no-collector.hwmon
            - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
            - --collector.netclass.ignored-devices=^(veth.*)$
          image: ddosify/alaz:v0.11.3
          imagePullPolicy: IfNotPresent
          name: alaz-pod
          ports:
            - containerPort: 8181
              protocol: TCP
          resources:
            limits:
              memory: 1Gi
            requests:
              cpu: "1"
              memory: 400Mi
          securityContext:
            privileged: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          # needed for linking ebpf trace programs
          volumeMounts:
            - mountPath: /sys/kernel/debug
              name: debugfs
              readOnly: false
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: alaz-serviceaccount
      serviceAccountName: alaz-serviceaccount
      terminationGracePeriodSeconds: 30
      # needed for linking ebpf trace programs
      volumes:
        - name: debugfs
          hostPath:
            path: /sys/kernel/debug
```

The only thing I did differently compared to the documentation was using bore.pub instead of `ngrok`, which shouldn't be a problem, I think. I'm running Arch Linux with kernel `6.10.1-arch1-1`.