Alaz crashes on self-hosted Anteon/Alaz with connection error: desc = "transport: Error while dialing: dial unix /proc/1/root/var/run/cri-dockerd.sock: connect: no such file or directory"

getanteon / alaz

Alaz: Advanced eBPF Agent for Kubernetes Observability – Effortlessly monitor K8s service interactions and performance metrics in your K8s environment. Gain in-depth insights with service maps, metrics, and more, while staying alert to crucial system anomalies 🐝

https://getanteon.com

GNU Affero General Public License v3.0

Alaz crashes on self-hosted Anteon/Alaz with connection error: desc = "transport: Error while dialing: dial unix /proc/1/root/var/run/cri-dockerd.sock: connect: no such file or directory" #163

Closed degola closed 2 months ago

degola commented 2 months ago

I have a self-hosted Kubernetes via Rancher (RKE2), version: v1.27.12 +rke2r1.

Anteon itself works fine after deployment, but Alaz crashes with the following output on each individual pod:

{"level":"info","tag":"v0.10.0","time":1721739005,"message":"alaz tag"}
{"level":"info","time":1721739005,"message":"k8sCollector initializing..."}
{"level":"error","error":"validate service connection: validate CRI v1 runtime API for endpoint \"unix:///proc/1/root/var/run/cri-dockerd.sock\": rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /proc/1/root/var/run/cri-dockerd.sock: connect: no such file or directory\"","time":1721739005,"message":"failed to create cri tool"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x181059e]

goroutine 1 [running]:
github.com/ddosify/alaz/cri.(*CRITool).GetPidsRunningOnContainers(0x0)
    /app/cri/cri.go:129 +0xbe
github.com/ddosify/alaz/aggregator.NewAggregator({0x24a5d70?, 0xc0003dc280?}, 0x0, 0xc00006ad80, 0xc0001266c0, 0xc000126720, 0xc000126600, 0xc000326000, {0x24b37d8, 0xc000176960})
    /app/aggregator/data.go:159 +0x21f
main.main()
    /app/main.go:105 +0xafd

I used the following Helm chart to install Alaz:

helm upgrade --install --namespace anteon alaz anteon/alaz --set monitoringID=$(MONITORING_ID) --set backendHost=$(BACKEND_HOST)

$(MONITORING_ID) and $(BACKEND_HOST) are set properly.

Any suggestions/hints for this?

kenanfarukcakir commented 2 months ago

Hi, seems like Alaz could not find a CRI socket to connect to on your nodes.

What is the underlying OS on your nodes? Linux or Windows? We only support Linux machines. If Linux, what container runtime do you use? If you could specify the socket path of the CRI, it'd be great.

This link can help with CRIs (container runtime interfaces).

degola commented 2 months ago

I'm using Ubuntu 22.04.4 LTS, but rke2 runs with k3s and containerd, so the CRI socket path is /run/k3s/containerd/containerd.sock.

The Helm chart I'm using (https://github.com/getanteon/anteon-helm-charts/blob/master/charts/alaz/templates/daemonset.yaml) does not seem to support specifying the CRI socket path.

Also, looking further into https://github.com/getanteon/alaz/blob/master/cri/cri.go#L24C5-L24C28, it actually seems to be hard-coded there?

I guess I can open a PR to extend the list as a quick fix, but it is probably a good idea to make it configurable via environment variables as well. Or do you have a better solution?

kenanfarukcakir commented 2 months ago

As you said, making the path configurable through environment variables would be more flexible when the hard-coded paths do not match the CRI socket path on the node. If you could send a PR, we can quickly review it and release a new version. By the way, you can check out a new branch from the develop branch.
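As a rough illustration of the idea discussed here, the sketch below tries an endpoint from an environment variable first and falls back to a built-in list. It is not the code from the PR: the variable name `CRI_RUNTIME_ENDPOINT` is the one that shows up later in this thread, the function name `candidateEndpoints` is hypothetical, and the default list is illustrative (only the `cri-dockerd.sock` path is taken from the log above).

```go
package main

import (
	"fmt"
	"os"
)

// defaultEndpoints is an illustrative list; the real list in cri/cri.go may differ.
var defaultEndpoints = []string{
	"unix:///proc/1/root/run/containerd/containerd.sock", // assumed example
	"unix:///proc/1/root/var/run/cri-dockerd.sock",       // seen in the error log
}

// candidateEndpoints returns the endpoints to try, in order. A non-empty
// CRI_RUNTIME_ENDPOINT is tried first, followed by the built-in defaults.
// getenv is injected so the function can be exercised without touching
// the process environment.
func candidateEndpoints(getenv func(string) string) []string {
	if ep := getenv("CRI_RUNTIME_ENDPOINT"); ep != "" {
		return append([]string{ep}, defaultEndpoints...)
	}
	return defaultEndpoints
}

func main() {
	for _, ep := range candidateEndpoints(os.Getenv) {
		fmt.Println(ep)
	}
}
```

Putting the override at the front of the list, rather than replacing the list, keeps the existing defaults as a fallback when the configured socket is missing on a node.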

degola commented 2 months ago

@kenanfarukcakir PR is in: https://github.com/getanteon/alaz/pull/164

Once it is merged and released, please also merge the PR for the Helm chart: https://github.com/getanteon/anteon-helm-charts/pull/10

kastl-ars commented 1 month ago

Hi all,

unfortunately this does not seem to be working for me with alaz 0.12.0 (installed via the chart).

The daemonset contains the CRI_RUNTIME_ENDPOINT environment variable:

$ k get ds alaz-daemonset -o yaml|grep -A1 CRI
        - name: CRI_RUNTIME_ENDPOINT
          value: unix:///run/k3s/containerd/containerd.sock
$

But the pod nevertheless crashes:

$ k logs alaz-daemonset-khmwf
{"level":"info","tag":"v0.12.0","time":1726058274,"message":"alaz tag"}
{"level":"info","time":1726058274,"message":"k8sCollector initializing..."}
{"level":"error","error":"validate service connection: validate CRI v1 runtime API for endpoint \"unix:///proc/1/root/var/run/cri-dockerd.sock\": rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /proc/1/root/var/run/cri-dockerd.sock: connect: no such file or directory\"","time":1726058274,"message":"failed to create cri tool"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x18106fe]

The endpoint still seems to be the default one, not the one set via the environment variable.

Is there just not yet a release of the chart that contains this fix?

Kind Regards, Johannes

yasin-herken commented 4 weeks ago

I am getting the same error.

kenanfarukcakir commented 4 weeks ago

Try unix:///proc/1/root/run/k3s/containerd/containerd.sock instead of unix:///run/k3s/containerd/containerd.sock. Alaz needs the /proc/1/root prefix to access the CRI endpoint on the host. A PR that automates this would be very welcome.
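The automation mentioned here could look roughly like the sketch below, which prepends `/proc/1/root` to a `unix://` endpoint unless the prefix is already present. The function name `withHostRoot` is hypothetical, and this is only an assumption about how such a helper might behave, not code from Alaz.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	unixScheme     = "unix://"
	hostRootPrefix = "/proc/1/root"
)

// withHostRoot rewrites a unix:// endpoint so that its socket path is
// resolved through /proc/1/root. Endpoints that are not absolute unix
// socket paths, or that already carry the prefix, are returned unchanged.
func withHostRoot(endpoint string) string {
	if !strings.HasPrefix(endpoint, unixScheme) {
		return endpoint
	}
	path := strings.TrimPrefix(endpoint, unixScheme)
	if !strings.HasPrefix(path, "/") {
		return endpoint // relative path: leave it alone rather than guess
	}
	if path == hostRootPrefix || strings.HasPrefix(path, hostRootPrefix+"/") {
		return endpoint // already prefixed
	}
	return unixScheme + hostRootPrefix + path
}

func main() {
	fmt.Println(withHostRoot("unix:///run/k3s/containerd/containerd.sock"))
	// prints: unix:///proc/1/root/run/k3s/containerd/containerd.sock
	fmt.Println(withHostRoot("unix:///proc/1/root/var/run/cri-dockerd.sock"))
	// prints: unix:///proc/1/root/var/run/cri-dockerd.sock
}
```

Making the helper idempotent means a user can supply either form of the path in the Helm values without ending up with a doubled prefix.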

yasin-herken commented 3 weeks ago

I tried that as well, but the result is the same.

yasin-herken commented 3 weeks ago

I removed the k3s installation and tried with Kubespray (a production-ready cluster) on my local machine. It runs with the default configuration.

kastl-ars commented 3 weeks ago

In my comment I stated that there was a different endpoint specified in the daemonset. But the error only mentions the default one.

To me it looks like whatever I set is not actually being respected.