Hey folks,
Any thoughts on this issue? We are in a weird situation where we are hitting around 20 segfaults a day on the version of Falco that we have to use. We want to upgrade to the latest Falco in case that helps with the segfaults, but it fails when gathering Kubernetes metadata. So we either need to suck up having no metadata from Kubernetes, or we need to suck up repeated segfaults.
We are happy to experiment with new versions and settings, or to provide any sort of additional information that would help debug this.
Hi, I'm going to look at this issue in more detail next week. I'll probably have follow up questions, so stay tuned.
I looked at the error in more detail and the changes around that time and I think the problem is this change: https://github.com/draios/sysdig/commit/f41b877815e654bdaa0e35b75866d69c12072f4f#diff-ef0695fa60df23b4d886b6376cc935a9L365. It looks like in order to speed up responsiveness, we reduced the number of reads we would perform when fetching k8s response data. 100 reads * 1k packets, if you were unlucky enough to only read one packet for each read, is only about 100 KB, well below the 2 MB response size you had. Even if reading multiple packets at once, I could imagine not reading all of the response within 100 reads.
I'll ask to see why we reduced the number of reads from 1000 to 100 and see if we can find a better way to handle these big responses.
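To illustrate the failure mode described above, here is a minimal Python sketch of a read loop with a fixed attempt budget. It is not the sysdig socket handler (which is C++); the function name `read_with_attempt_limit`, the 1 KB chunk size, and the use of an in-memory stream in place of a socket are all assumptions made for the illustration.

```python
import io


def read_with_attempt_limit(stream, max_attempts, chunk_size=1024):
    """Read from `stream` until EOF or until `max_attempts` reads were made.

    Returns (data, complete); `complete` is True only if EOF was reached.
    Hypothetical helper, not part of sysdig.
    """
    chunks = []
    for _ in range(max_attempts):
        chunk = stream.read(chunk_size)
        if not chunk:  # EOF: the whole response has been consumed
            return b"".join(chunks), True
        chunks.append(chunk)
    # Attempt budget exhausted before EOF was seen: the data is truncated.
    return b"".join(chunks), False


response = bytes(2 * 1024 * 1024)  # stand-in for a 2 MB API response

# Worst case from the comment above: one 1 KB chunk per read, 100 reads.
data, complete = read_with_attempt_limit(io.BytesIO(response), max_attempts=100)
print(len(data), complete)

# A larger budget lets the same loop reach the end of the response.
data, complete = read_with_attempt_limit(io.BytesIO(response), max_attempts=3000)
print(len(data), complete)
```

With the small budget the loop gives up long before the end of the response, which is the behaviour the error message reports. A real socket reader would also have to deal with partial reads and timeouts, which this sketch leaves out.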
If you'd like to try our latest dev build (sysdig/sysdig:dev for docker), we added additional logging that should show how many bytes were actually read if it gives up after 100 reads. That will help confirm that this is the problem you're seeing.
Yep, it looks like it is only reading roughly the first 1.3 of 13 megs:
root@sysdig:/# sysdig -k https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT -K /var/run/secrets/kubernetes.io/serviceaccount/token -pk
Socket handler (k8s_node_handler_state): unable to retrieve data from https://10.10.10.1:443/api/v1/nodes?pretty=false (101 attempts, read 1316727 bytes)
root@sysdig:/# curl -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" "https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/api/v1/pods?fieldSelector=status.phase%3DRunning&pretty=false" | wc -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.3M    0 13.3M    0     0  20.4M      0 --:--:-- --:--:-- --:--:-- 20.4M
13988926
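For a quick sanity check of those two figures, a short Python sketch can pull the attempt count and byte count out of the error line and compare them with the size reported by `wc -c`. The regular expression is written against the one log line quoted above and is an assumption about its format, not a documented sysdig output format; `parse_read_stats` is a hypothetical helper. Note that the error line refers to the nodes endpoint while the curl command queries pods, so the ratio is only indicative.

```python
import re

# Error line and byte count copied from the output above.
LOG_LINE = (
    "Socket handler (k8s_node_handler_state): unable to retrieve data from "
    "https://10.10.10.1:443/api/v1/nodes?pretty=false "
    "(101 attempts, read 1316727 bytes)"
)
FULL_SIZE = 13988926  # bytes, as reported by `wc -c`

# Assumed format of the trailing "(N attempts, read M bytes)" part.
PATTERN = re.compile(r"\((\d+) attempts, read (\d+) bytes\)")


def parse_read_stats(line):
    """Return (attempts, bytes_read) from an error line, or None if absent."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


attempts, bytes_read = parse_read_stats(LOG_LINE)
print(f"average bytes per attempt: {bytes_read / attempts:.0f}")
print(f"fraction of response read: {bytes_read / FULL_SIZE:.1%}")
```

Taken at face value, the numbers work out to about 13 KB per read and less than a tenth of the full response, consistent with the read limit being hit well before the response was complete.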
Ok, we'll see how we can make that limit configurable or something. It's a bit embedded in the generic socket reading code so we'll have to find a good knob to expose.
Hi, we plan on doing a new sysdig release today that should address this problem. Could you try it out and let us know what you think? (I'm going to reopen this in the meantime)
Yep, I'd be happy to try it once a Docker image is available.
The 0.17.0 image seems to resolve things for me. Thanks!
Awesome!
We are seeing sysdig and falco both fail if Kubernetes integration is enabled, but only in a small number of clusters.
This happens with the sysdig Docker image versions 0.16.0, 0.15.1, and 0.15.0. It works as expected in 0.14.0.
I run the following command from inside the sysdig container:
and it results in the following error:
However, if I take the exact URL that it output in the error and curl it, I get a ton of data back:
Interestingly, this same container and sysdig command works on some other, identically-configured clusters. The only difference we can come up with is that these clusters have less data:
A little more info: