falcosecurity / libs

libsinsp, libscap, the kernel module driver, and the eBPF driver sources
https://falcosecurity.github.io/libs/
Apache License 2.0

Filter by node name in `k8s_pod_handler` #43

Closed · leogr closed this issue 3 years ago

leogr commented 3 years ago

Motivation

This proposed feature aims to address the issue https://github.com/falcosecurity/falco/issues/778.

It is intended to be a further improvement alongside the two other fixes/mitigations already proposed, and does not supersede them.

I also want to thank @leodido, @ldegio, and @zemek, since my analysis is in some ways a derivative of their early findings and proposed solutions.

Rationale

libsinsp implements a mechanism to gather the K8s state, which is then used to enrich syscall events with metadata. During the initial phase, the whole K8s state is fetched from the K8s api-server; later updates are received through the watch API.

In a huge cluster, the problem arises when the K8s api-server response is too big for libsinsp to handle. It usually occurs at startup time, when the response to the following call exceeds the hard-coded 30MB limit:

```
/api/v1/pods?fieldSelector=status.phase!=Failed,status.phase!=Unknown,status.phase!=Succeeded&pretty=false
```

Note: this might happen with other API calls too, but it is unlikely.
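
For reference, one quick way to see how close a given cluster gets to that limit is to replay the same call with kubectl and count the bytes returned. This is an illustrative one-liner, not something libsinsp does; it assumes a kubeconfig that is allowed to list pods cluster-wide:

```sh
# Replay the call libsinsp makes at startup and print the response size in bytes
# (illustrative; requires permission to list pods across all namespaces).
kubectl get --raw \
  "/api/v1/pods?fieldSelector=status.phase!=Failed,status.phase!=Unknown,status.phase!=Succeeded&pretty=false" \
  | wc -c
```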

Although https://github.com/falcosecurity/libs/pull/40 allows making the limit configurable (thus, users can overcome the 30MB limitation), the performance issue remains unsolved.

That being said, there are a few considerations to take into account when tackling the problem.

Luckily, the `/api/v1/pods` API allows filtering by node (e.g., using `fieldSelector=spec.nodeName=<node>`), which gives us room for a substantial improvement.
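
For illustration, scoping that same query to a single node only takes one extra field selector; the `NODE_NAME` variable below is just a placeholder for the node under test:

```sh
# Node-scoped variant of the startup call; only pods scheduled on $NODE_NAME are returned
# (illustrative; set NODE_NAME to the node you want to inspect).
NODE_NAME=worker-1
kubectl get --raw \
  "/api/v1/pods?fieldSelector=status.phase!=Failed,status.phase!=Unknown,status.phase!=Succeeded,spec.nodeName=${NODE_NAME}&pretty=false" \
  | wc -c
```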

Testing hypothesis

To test my hypothesis, I created a test cluster sized to hit the 30MB limit, with the following parameters:

The results were:

It is now clear that the response size is reduced significantly. This would not only avoid hitting the 30MB limit but would also substantially improve the overall performance. Moreover, the benefit would also apply during the watching phase.

Finally, considering that these are the default control-plane limits specified by Kubernetes for large clusters:

We could also assume that:

Feature

Introduce the ability to filter by the current node when fetching pod metadata from the K8s api-server.

Also, libsinsp should expose an API to set the node name to be used in the filter, rather than trying to autodetect it, since libs implementers could use deployment strategies that we cannot know in advance (I am open to feedback; see also the Additional context section below).

Alternatives

As suggested privately by @zuc, an alternative could be to query the local kubelet directly to gather that metadata. Although it would certainly be a further performance improvement (since we would not go through the api-server), we concluded that it is not feasible because:

Additional context

Detecting the node name in an automated fashion can break in some scenarios; therefore, we should implement this change in a way that allows consumers (e.g., Falco) to set it explicitly and pass it down to the libsinsp API mentioned above.

Setting the node name should also be optional, since filtering by a single node might not apply to all use cases (e.g., with kind, multiple nodes can live on the same host).

For example, in a Falco K8s deployment, we could make use of the downward API to pass the current node name to Falco:

```yaml
...
    containers:
        - name: falco
          ...
          args:
            - /usr/bin/falco
            - --cri
            - /run/containerd/containerd.sock
            - -K
            - /var/run/secrets/kubernetes.io/serviceaccount/token
            - -k
            - "https://$(KUBERNETES_SERVICE_HOST)"
            - -pk
            - --k8s-node
            - "$(MY_NODE_NAME)"
          env:
            - name: MY_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
...
```
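
As a sanity check (purely illustrative, and assuming the Falco pods run in a `falco` namespace with an `app: falco` label, which depends on the actual manifest), the injected value can be read back from a running pod:

```sh
# Pick one Falco pod and print the node name injected via the downward API
# (illustrative; adjust the namespace and label selector to your deployment).
POD=$(kubectl -n falco get pods -l app=falco -o jsonpath='{.items[0].metadata.name}')
kubectl -n falco exec "$POD" -- printenv MY_NODE_NAME
```
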
leodido commented 3 years ago

As said, I believe this approach is really beneficial and needed.

The really important point here is the last one about node name detection.

What if the user forgets to pass it down? What behavior will the libs have in such a situation?

leogr commented 3 years ago

> What if the user forgets to pass it down? What behavior will the libs have in such a situation?

I believe node autodetection is not feasible since there are too many ways to use the libs (and too many ways to deploy Falco), so it would be hard to cover all use cases reliably. On the other hand, explicitly passing that value via command-line args (for example, using the downward API as in the code snippet above) seems to be a solid solution (at least, I can't see any issue with it).

Finally, to answer your question, when the node name is not specified, the behavior should be simply no filtering (as it is now). IMHO, documenting the performance penalty of not passing that value should be enough to make users aware of the implications.

@leodido wdyt?

cc @shane-lawrence @cashwilliams

shane-lawrence commented 3 years ago

It should be straightforward for us to pass the node name in an argument to Falco instead of autodetecting it in the libs, just like we currently pass the k8s API address and auth info. Defaulting to no filtering makes sense to me: doing nothing preserves the status quo, which is fine in smaller clusters; if you have a larger cluster, you could either pass a larger maximum size or specify a node filter (or both).

leodido commented 3 years ago

Defaulting to the current behavior and documenting the performance improvement gained by using the future flag resonates a lot with me!