falcosecurity / libs

libsinsp, libscap, the kernel module driver, and the eBPF driver sources
https://falcosecurity.github.io/libs/
Apache License 2.0

Filter by node name in `k8s_pod_handler` #43

Closed · leogr closed this issue 3 years ago

leogr commented 3 years ago

Motivation

This proposed feature aims to address the issue https://github.com/falcosecurity/falco/issues/778.

It is intended to be a further improvement alongside the two other fixes/mitigations already proposed, and does not supersede them.

I also want to thank @leodido, @ldegio, and @zemek, since my analysis is in some ways a derivative of their early findings and proposed solutions.

Rationale

libsinsp implements a mechanism to gather the K8s state, which is then used to enrich syscall events with metadata. During the initial phase, the whole K8s state is fetched from the K8s api-server; later updates are received through the watch API.

In a huge cluster, the problem arises when the K8s api-server response is too big for libsinsp to handle. It usually occurs at startup time, when the response to the following call exceeds the hard-coded 30MB limit:

```
/api/v1/pods?fieldSelector=status.phase!=Failed,status.phase!=Unknown,status.phase!=Succeeded&pretty=false
```

Note: this might happen with other API calls too, but it is unlikely.
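
For reference, one quick way to see how close a given cluster gets to that limit is to replay the same call with kubectl and count the bytes returned. This is an illustrative one-liner, not something libsinsp does; it assumes a kubeconfig that is allowed to list pods cluster-wide:

```sh
# Replay the call libsinsp makes at startup and print the response size in bytes
# (illustrative; requires permission to list pods across all namespaces).
kubectl get --raw \
  "/api/v1/pods?fieldSelector=status.phase!=Failed,status.phase!=Unknown,status.phase!=Succeeded&pretty=false" \
  | wc -c
```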

Although https://github.com/falcosecurity/libs/pull/40 allows making the limit configurable (thus, users can overcome the 30MB limitation), the performance issue remains unsolved.

That being said, there are a few considerations to take into account when tackling the problem.

Luckily, the `/api/v1/pods` API allows filtering by node (e.g., using `fieldSelector=spec.nodeName=<node>`), which gives us room for a substantial improvement.
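
For illustration, scoping that same query to a single node only takes one extra field selector; the `NODE_NAME` variable below is just a placeholder for the node under test:

```sh
# Node-scoped variant of the startup call; only pods scheduled on $NODE_NAME are returned
# (illustrative; set NODE_NAME to the node you want to inspect).
NODE_NAME=worker-1
kubectl get --raw \
  "/api/v1/pods?fieldSelector=status.phase!=Failed,status.phase!=Unknown,status.phase!=Succeeded,spec.nodeName=${NODE_NAME}&pretty=false" \
  | wc -c
```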

Testing hypothesis

To test my hypothesis, I created a test cluster sized to hit the 30MB limit, with the following parameters:

The results were:

It is now clear that the response size is reduced significantly. This would not only avoid hitting the 30MB limit but would also substantially improve the overall performance. Moreover, the benefit would also apply during the watching phase.

Finally, considering that these are the default control-plane limits specified by Kubernetes for large clusters:

We could also assume that:

Feature

Introduce the ability to filter by the current node when fetching pod metadata from the K8s api-server.

Also, libsinsp should expose an API to set the node name to be used in the filter, rather than trying to autodetect it, since libs implementers could use deployment strategies that we cannot know in advance (I am open to feedback; see also the Additional context section below).

Alternatives

As suggested privately by @zuc, an alternative could be to query the local kubelet directly to gather that metadata. Although it would certainly be a further performance improvement (since we would not go through the api-server), we concluded that it is not feasible because:

Additional context

Detecting the node name in an automated fashion can break in some scenarios; therefore, we should implement this change in a way that allows consumers (e.g., Falco) to set it explicitly and pass it down to the libsinsp API mentioned above.

Setting the node name should also be optional, since filtering by a single node might not apply to all use cases (e.g., with kind, multiple nodes can live on the same host).

For example, in a Falco K8s deployment, we could make use of the downward API to pass the current node name to Falco:

```yaml
...
    containers:
        - name: falco
          ...
          args:
            - /usr/bin/falco
            - --cri
            - /run/containerd/containerd.sock
            - -K
            - /var/run/secrets/kubernetes.io/serviceaccount/token
            - -k
            - "https://$(KUBERNETES_SERVICE_HOST)"
            - -pk
            - --k8s-node
            - "$(MY_NODE_NAME)"
          env:
            - name: MY_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
...
```
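
As a sanity check (purely illustrative, and assuming the Falco pods run in a `falco` namespace with an `app: falco` label, which depends on the actual manifest), the injected value can be read back from a running pod:

```sh
# Pick one Falco pod and print the node name injected via the downward API
# (illustrative; adjust the namespace and label selector to your deployment).
POD=$(kubectl -n falco get pods -l app=falco -o jsonpath='{.items[0].metadata.name}')
kubectl -n falco exec "$POD" -- printenv MY_NODE_NAME
```
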
leodido commented 3 years ago

As said, I believe this approach is really beneficial and needed.

The really important point here is the last one about node name detection.

What if the user forgets to pass it down? What behavior will the libs have in such a situation?

leogr commented 3 years ago

> What if the user forgets to pass it down? What behavior will the libs have in such a situation?

I believe node autodetection is not feasible since there are too many ways to use the libs (and too many ways to deploy Falco), so it would be hard to cover all use cases reliably. On the other hand, explicitly passing that value via command-line args (for example, using the downward API as in the code snippet above) seems to be a solid solution (at least, I can't see any issue with it).

Finally, to answer your question, when the node name is not specified, the behavior should be simply no filtering (as it is now). IMHO, documenting the performance penalty of not passing that value should be enough to make users aware of the implications.

@leodido wdyt?

cc @shane-lawrence @cashwilliams

shane-lawrence commented 3 years ago

It should be straightforward for us to pass the node name in an argument to Falco instead of autodetecting it in the libs, just like we currently pass the k8s API address and auth info. Defaulting to no filtering makes sense to me: doing nothing preserves the status quo, which is fine in smaller clusters; if you have a larger cluster, you could either pass a larger maximum size or specify a node filter (or both).

leodido commented 3 years ago

Defaulting to the current behavior and documenting the performance improvement gained by using the future flag resonates a lot with me!