falcosecurity / falco

Cloud Native Runtime Security
https://falco.org
Apache License 2.0

Runtime error: Socket handler (k8s_pod_handler_state): read more than 30MB of data #778

Closed: axot closed this issue 2 years ago

axot commented 5 years ago

What happened: We observed the following error under a large-scale load test.

Wed Aug 21 02:21:40 2019: Runtime error: Socket handler (k8s_pod_handler_state): read more than 30MB of data from https://10.40.48.1/api/v1/pods?fieldSelector=status.phase!=Failed,status.phase!=Unknown,status.phase!=Succeeded&pretty=false (31463287 bytes, 2162 reads). Giving up. Exiting.

What you expected to happen: Falco should work in a large-scale Kubernetes cluster.

How to reproduce it (as minimally and precisely as possible): Use a large Kubernetes cluster with > 1k nodes and > 10k pods.
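
To gauge whether a cluster is in the affected range, a rough check is to measure the size of the same unfiltered pods query Falco issues. A minimal sketch, assuming it runs from a pod whose service account can list pods cluster-wide, using the standard in-cluster credential paths:

```sh
# Sketch: approximate the size of the pod list Falco downloads.
# Assumes the pod's service account may list pods cluster-wide; paths are the K8s defaults.
TOKEN="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
curl -s --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  -H "Authorization: Bearer ${TOKEN}" \
  "https://kubernetes.default.svc/api/v1/pods?fieldSelector=status.phase!=Failed,status.phase!=Unknown,status.phase!=Succeeded&pretty=false" \
  | wc -c   # anything approaching 31457280 bytes (30MB) will trip the error above
```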

Anything else we need to know?:

Environment:

leodido commented 5 years ago

Hello @axot, would you please use the issue template? It helps us gather the information we need to approach the issues users report. Thanks!

/triage needs-information

axot commented 5 years ago

@leodido Hello, I have updated the description to follow the issue template format.

fntlnz commented 5 years ago

This is very interesting, @axot. Can I ask how many pods your load test created when you experienced this? I'm only interested in the order of magnitude.

axot commented 5 years ago

I don't remember the specific number, but I'm sure there were more than 5k pods in a single cluster.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

fntlnz commented 4 years ago

We still want to fix this.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

a0145 commented 4 years ago

I ran into this today with a large cluster - is it possible to make this limit configurable? @fntlnz

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Issues labeled "cncf", "roadmap" and "help wanted" will not be automatically closed. Please refer to a maintainer to get such label added if you think this should be kept open.

leogr commented 4 years ago

/help

poiana commented 4 years ago

@leogr: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/falcosecurity/falco/issues/778):

> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

djsly commented 3 years ago

We got hit by that as well.

5000 pods in a 500-node cluster. Right now this is causing Falco to go into a crash loop forever, saturating the apiserver with new watch requests.

It's unfortunate to say, but Falco might not be ready to be used on big clusters.

This needs attention if we want falco to become the standard.

lorenzo-biava commented 3 years ago

@fntlnz @leodido @leogr I see this issue has been around for quite some time now... What are your plans for raising this limit? We are using Falco to help with PCI and we are now reaching 5K pods in a pretty big cluster, where Falco just stopped working. That's kind of a deal breaker at the moment...

fntlnz commented 3 years ago

I'll bring this up for discussion in the community call. This needs a proposal to find a solution. @lorenzo-biava @djsly you're more than welcome to join to share your findings.

aleksandr-morozov commented 3 years ago

Hello, are there any updates or ETA for getting this fixed?

oownus commented 3 years ago

Hello Falco team, has this been discussed in the community call yet? Any update on this issue? We are experiencing it too and are therefore missing visibility into our containers entirely. We would appreciate any updates or an ETA, thanks!

lorenzo-biava commented 3 years ago

Looks like this is a "soft limit" that comes from Sysdig itself, and it was already raised in the past (https://github.com/draios/sysdig/pull/892/files).

@fntlnz Should we open an issue there? Do you think it's something that could be changed to be set dynamically? Or are "gigantic k8s environments" just out of scope (PS: I would not consider a 4K-pod cluster gigantic these days)? 😀 Can/should we just disable the K8s integration as a temporary workaround in the meantime (see the sketch at the end of this comment)?

PS: sorry for missing the community call. Though I'm not sure how we could have helped in proposing a solution...

Please give us some hints/indications 😉
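
On the temporary-workaround question: until the limit is configurable, one option is dropping the Kubernetes API options from Falco's command line so it never performs the pods query at all. A minimal sketch, assuming the stock DaemonSet arguments; the trade-off is that k8s.* fields in rule outputs stay empty:

```sh
# Workaround sketch: start Falco without the K8s API options so it never queries
# /api/v1/pods and cannot hit the 30MB socket-handler cap.
# Trade-off: k8s.* fields in rule outputs will be empty.
#
# Typical DaemonSet args today:
#   falco -k "https://$KUBERNETES_SERVICE_HOST" \
#         -K /var/run/secrets/kubernetes.io/serviceaccount/token
#
# Workaround: drop -k/-K and keep the rest of your options unchanged:
falco
```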

leodido commented 3 years ago

Another user (Robbie Bytheway on Slack) reported this issue.

Thread here.

brianirish commented 3 years ago

I run a decently sized k8s cluster, around 2600 pods. When I installed Falco and its DaemonSet pods came online for the first time, they all began reporting this socket error due to the fixed 30MB limit. I'm unable to proceed any further with Falco and am forced to consider alternatives such as Wazuh.

I feel like I'm in Bizarro World when I see a project like Falco incubated by the CNCF and marketed as the best choice for k8s security analysis, and yet there's a hard-coded 30MB socket limit that gets breached by a relatively small cluster of 2600 pods?!

This tells me Falco hasn't been proven in any sort of decently-sized cluster. Unless there's some other variable here that I'm missing, in which case I would love for a maintainer or community contributor to politely whack me upside the head and tell me why I'm wrong.

IanRobertson-wpe commented 3 years ago

I'm now running into this issue as well. Are there any current workarounds?

djsly commented 3 years ago

@fntlnz @leodido @leogr do you think this issue could regain some momentum?

gyoza commented 3 years ago
 └─┴> kc get nodes |wc -l;kc get pods -A |wc -l
     187
     13294

Unable to start Falco in this environment; it fails with the same error.

Kubelet Version: v1.19.6 Kube-Proxy Version: v1.19.6

Any way we can work around this? The project is dead for us since we can't spin up Falco in our production environment.

gyoza commented 3 years ago

Any chance this could get looked at?

IanRobertson-wpe commented 3 years ago

Same, I'm still blocked here as well. I'm happy to provide working pizza, if that helps. :)

CashWilliams commented 3 years ago

Hitting this as well; just posting an updated link to the offending file: https://github.com/falcosecurity/libs/blob/master/userspace/libsinsp/socket_handler.h#L393

pogao commented 3 years ago

This is also affecting us. Can we expect this to be patched at some point? What is the best workaround here?

leodido commented 3 years ago

A fix is in the works. See https://github.com/falcosecurity/libs/pull/40

In the meantime, I'm going to prepare a PR here on Falco to let you configure the limits.

leogr commented 3 years ago

Another update: I'm working on https://github.com/falcosecurity/libs/pull/49, which should definitively address this issue (at least, I hope :smile_cat: ) in the way described by https://github.com/falcosecurity/libs/issues/43.

asvasyanin commented 3 years ago

Hi! Any updates?

IndraWiradinataK commented 3 years ago

I've hit this issue too, with the same error, when trying to deploy Falco 0.29.1 on OCP 4.6.xx. Is there any workaround or suggestion in the meantime, while waiting for this issue to be fixed?

leogr commented 3 years ago

Hey,

there are two patches under review:

IndraWiradinataK commented 3 years ago

> Hey,
>
> there are two patches under review:

I've been trying node filtering, but I still hit the read limit, now on the replication controller handler. The log looks like this: Tue Jul 13 12:24:22 2021: Runtime error: Socket handler (k8s_replicationcontroller_handler_state): read more than 30MB of data from https://172.23.0.1/api/v1/replicationcontrollers?pretty=false (31463287 bytes, 2530 reads). Giving up. Exiting.

leogr commented 3 years ago

> I've been trying node filtering, but I still hit the read limit, now on the replication controller handler. The log looks like this: Tue Jul 13 12:24:22 2021: Runtime error: Socket handler (k8s_replicationcontroller_handler_state): read more than 30MB of data from https://172.23.0.1/api/v1/replicationcontrollers?pretty=false (31463287 bytes, 2530 reads). Giving up. Exiting.

Thank you for reporting!

The node filtering patch applies to the /api/v1/pods resource only, so it does not solve the /api/v1/replicationcontrollers case. On the other hand, the "metadata download" patch should address your case.

In general, both patches are needed to solve the whole set of cases that can cause the problem described by this issue.
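
To illustrate the effect of the node-filtering patch, here is a sketch of the query shape it implies, assuming the filter is expressed as a fieldSelector on spec.nodeName (illustrative only, not the exact code path in libs):

```sh
# Instead of the cluster-wide /api/v1/pods list that trips the cap on big clusters,
# each Falco instance would only fetch the pods scheduled on its own node:
TOKEN="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
NODE="my-node-1"   # hypothetical node name; in a pod you would inject spec.nodeName via the downward API
curl -s --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  -H "Authorization: Bearer ${TOKEN}" \
  "https://kubernetes.default.svc/api/v1/pods?fieldSelector=spec.nodeName=${NODE}&pretty=false" \
  | wc -c   # stays far below 30MB even on large clusters
```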

IndraWiradinataK commented 3 years ago

> > I've been trying node filtering, but I still hit the read limit, now on the replication controller handler. The log looks like this: Tue Jul 13 12:24:22 2021: Runtime error: Socket handler (k8s_replicationcontroller_handler_state): read more than 30MB of data from https://172.23.0.1/api/v1/replicationcontrollers?pretty=false (31463287 bytes, 2530 reads). Giving up. Exiting.
>
> Thank you for reporting!
>
> The node filtering patch applies to the /api/v1/pods resource only, so it does not solve the /api/v1/replicationcontrollers case. On the other hand, the "metadata download" patch should address your case.
>
> In general, both patches are needed to solve the whole set of cases that can cause the problem described by this issue.

So are these two patches already merged?

leogr commented 3 years ago

> So are these two patches already merged?

They have both been included in libs since commit https://github.com/falcosecurity/libs/commit/f7029e2522cc4c81841817abeeeaa515ed944b6c

On the Falco side, there are two PRs, not yet merged, that integrate these patches:

The first (#1667) only includes https://github.com/falcosecurity/libs/pull/40 and allows the user to configure the limit. The second (#1671) includes both patches (since it uses the latest libs commit). So, if you try #1671, you will get both the limit raised to 100MB (the new default value) and the node filtering (see the sketch at the end of this comment).

PS: I apologize for all these complications and delays, but it was necessary to attack the problem from different angles, which required more PRs. If you need any support testing these patches, feel free to contact me on Slack; I will be happy to help.
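
For reference, a sketch of what the DaemonSet command line is expected to look like once these patches are available. The --k8s-node flag name is taken from the patches under review, and FALCO_K8S_NODE_NAME is a hypothetical environment variable you would populate from spec.nodeName via the downward API:

```sh
# Sketch of the expected invocation with the node-filtering patch (not released yet
# at the time of this thread): --k8s-node restricts the pods query to this node only,
# and the read cap default is raised to 100MB (configurable, per #1667).
falco -k "https://$KUBERNETES_SERVICE_HOST" \
      -K /var/run/secrets/kubernetes.io/serviceaccount/token \
      --k8s-node "$FALCO_K8S_NODE_NAME"
```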

peleduri commented 3 years ago

└─┴> kubectl get nodes |wc -l; kubectl get pods -A |wc -l
     51
     3135

Unable to start Falco in this environment with the same error: Runtime error: Socket handler (k8s_pod_handler_state): read more than 30MB of data from blablaba (31463287 bytes, 2032 reads). Giving up. Exiting.

K8s Rev: v1.18.16-eks-7737de

Any ETA for merging the above PRs into master?

MattUebel commented 3 years ago

👋 I am experiencing this issue as well and am interested in seeing the fix merged! 🙇

s7an-it commented 1 year ago

There is some disconnect between code and config: for me, the metadata_download mb setting does nothing, and after reading the code it makes sense that it doesn't.