draios / sysdig

Linux system exploration and troubleshooting tool with first class support for containers
http://www.sysdig.com/

Kubernetes integration breaks for some clusters with the latest sysdig #856

Closed · jer closed this 7 years ago

jer commented 7 years ago

We are seeing sysdig and falco both fail if kubernetes integration is enabled, but only in a small number of clusters.

This happens with sysdig Docker image versions 0.16.0, 0.15.1, and 0.15.0. It works as expected in 0.14.0.

I run the following command from inside the sysdig container:

root@sysdig:/# sysdig --version
sysdig version 0.16.0

root@sysdig:/# sysdig -k https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT -K /var/run/secrets/kubernetes.io/serviceaccount/token -pk

and it results in the following error:

Socket handler (k8s_pod_handler_state): unable to retrieve data from https://10.10.10.1:443/api/v1/pods?fieldSelector=status.phase%3DRunning&pretty=false (101 attempts)

However, if I take the exact URL from the error message and curl it, I get a ton of data back:

root@sysdig:/# env | grep KUBERNETES
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_ADDR=10.10.10.1
KUBERNETES_PORT=tcp://10.10.10.1:443
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT_443_TCP=tcp://10.10.10.1:443
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_HOST=10.10.10.1

root@sysdig:/# curl -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" "https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/api/v1/pods?fieldSelector=status.phase%3DRunning&pretty=false" | wc -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2296k    0 2296k    0     0  11.2M      0 --:--:-- --:--:-- --:--:-- 11.3M
2351987

Interestingly, this same container and sysdig command work on some other, identically configured clusters. The only difference we can find is that those clusters have less data:

root@sysdig:/# curl -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" "https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/api/v1/pods?fieldSelector=status.phase%3DRunning&pretty=false" | wc -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  567k    0  567k    0     0  8030k      0 --:--:-- --:--:-- --:--:-- 8111k
581449

A little more info:

➜ kubectl version
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1", GitCommit:"b0b7a323cc5a4a2019b2e9520c21c7830b7f708e", GitTreeState:"clean", BuildDate:"2017-04-03T20:44:38Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1+b0b7a32", GitCommit:"63c55b72820d6a95978e3aec6cbbf9c7b757da34", GitTreeState:"dirty", BuildDate:"2017-04-18T22:51:56Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}

RHEL 7.1
Kernel 3.10.0-514.6.1.el7
jer commented 7 years ago

Hey folks,

Any thoughts on this issue? We are in a weird situation: we're hitting around 20 segfaults a day on the version of Falco we have to use. We want to upgrade to the latest Falco in case that helps with the segfaults, but then it fails to gather Kubernetes metadata. So we either need to suck up having no Kubernetes metadata, or suck up repeated segfaults.

We are happy to experiment with new versions and settings, or to provide any sort of additional information that would help debug this.

mstemm commented 7 years ago

Hi, I'm going to look at this issue in more detail next week. I'll probably have follow-up questions, so stay tuned.

mstemm commented 7 years ago

I looked at the error in more detail, along with the changes from around that time, and I think the problem is this change: https://github.com/draios/sysdig/commit/f41b877815e654bdaa0e35b75866d69c12072f4f#diff-ef0695fa60df23b4d886b6376cc935a9L365. It looks like, in order to improve responsiveness, we reduced the number of reads we perform when fetching k8s response data. At 100 reads, if you were unlucky enough to read only one ~1 KB packet per read, that's just ~100 KB, well below the ~2.3 MB response you had. Even reading multiple packets at once, I could imagine not getting through the whole response within 100 reads.

I'll ask around to find out why we reduced the number of reads from 1000 to 100, and see if we can find a better way to handle these big responses.
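
To make the failure mode concrete, here's a minimal sketch of a capped non-blocking read loop, assuming behavior like the commit above; the function and names (read_http_response, max_attempts) are hypothetical, not sysdig's actual socket handler code:

// Illustrative sketch only -- hypothetical names, not sysdig's actual
// socket handler. A non-blocking read loop that gives up after a fixed
// number of attempts: with max_attempts = 100 and ~1 KB arriving per
// read in the worst case, it can bail out after ~100 KB, far short of
// a multi-megabyte /api/v1/pods response.
#include <sys/types.h>
#include <sys/socket.h>
#include <string>

bool read_http_response(int sockfd, std::string& out, size_t content_length) {
    const int max_attempts = 100;   // the limit reduced from 1000
    char buf[1024];
    // <= makes this 101 iterations, matching the "(101 attempts)" error
    for (int attempt = 0; attempt <= max_attempts; ++attempt) {
        ssize_t n = recv(sockfd, buf, sizeof(buf), MSG_DONTWAIT);
        if (n > 0) {
            out.append(buf, static_cast<size_t>(n));
            if (out.size() >= content_length) {
                return true;        // whole response received
            }
        }
        // n <= 0 (would-block, etc.) still burns an attempt
    }
    return false;                   // "unable to retrieve data" error path
}

Under that assumption, any response too large for the attempt cap to consume makes the loop give up partway through, which matches the size difference between the failing and working clusters above.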

mstemm commented 7 years ago

If you'd like to try our latest dev build (sysdig/sysdig:dev for Docker), we added additional logging that shows how many bytes were actually read when it gives up after 100 reads. That will help confirm this is the problem you're seeing.

jer commented 7 years ago

Yep, it looks like it's only reading the first ~1.3 MB of a ~13 MB response:

root@sysdig:/# sysdig -k https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT -K /var/run/secrets/kubernetes.io/serviceaccount/token -pk
Socket handler (k8s_node_handler_state): unable to retrieve data from https://10.10.10.1:443/api/v1/nodes?pretty=false (101 attempts, read 1316727 bytes)
root@sysdig:/# curl -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" "https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/api/v1/pods?fieldSelector=status.phase%3DRunning&pretty=false" | wc -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.3M    0 13.3M    0     0  20.4M      0 --:--:-- --:--:-- --:--:-- 20.4M
13988926

mstemm commented 7 years ago

OK, we'll see how we can make that limit configurable or something. It's somewhat embedded in the generic socket-reading code, so we'll have to find a good knob to expose.
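
For illustration, one plausible shape for such a knob (a sketch only; set_max_read_attempts and the class layout here are hypothetical, not sysdig's actual interface) would be to lift the hard-coded cap into a configurable member of the socket handler:

// Hypothetical sketch: replacing a hard-coded attempt cap with a
// runtime-configurable member, so callers that expect large k8s
// responses can raise it. Not sysdig's actual interface.
class socket_handler {
public:
    void set_max_read_attempts(int n) { m_max_read_attempts = n; }
    int get_max_read_attempts() const { return m_max_read_attempts; }
private:
    int m_max_read_attempts = 1000;  // default back to the older, larger limit
};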

mstemm commented 7 years ago

Hi, we plan on doing a new sysdig release today that should address this problem. Could you try it out and let us know what you think? (I'm going to reopen this in the meantime)

jer commented 7 years ago

Yep, I'd be happy to try it once a Docker image is available.

jer commented 7 years ago

The 0.17.0 image seems to resolve things for me. Thanks!

mstemm commented 7 years ago

Awesome!