Closed · axot closed this issue 2 years ago
Hello @axot would you please use the issues template? It helps us to have more info to approach the issue users are reporting. Thanks!
/triage needs-information
@leodido Hello, I updated the information with issues template format.
This is very interesting @axot - can I ask how many pods your load testing created when you experienced this? I'm only interested in the order of magnitude.
I don't remember the specific number, but I'm sure the number of Pods was greater than 5k in a single cluster.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
We still want to fix this.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I ran into this today with a large cluster - is it possible to make this limit configurable? @fntlnz
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Issues labeled "cncf", "roadmap" and "help wanted" will not be automatically closed. Please refer to a maintainer to get such label added if you think this should be kept open.
/help
@leogr: This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
We got hit by that as well.
5000 pods in a 500-node cluster. Right now this is causing Falco to go into a crash loop forever, saturating the apiserver with new watch requests.
It's unfortunate to say, but Falco might not be ready to be used in big clusters.
This needs attention if we want Falco to become the standard.
@fntlnz @leodido @leogr I see this issue has been around for quite some time now... What are your plans for raising this limit? We are using Falco to help with PCI and we are now reaching 5K pods in a pretty big cluster, where Falco just stopped working. That's kind of a deal breaker at the moment...
I'll bring this up for discussion in the community call. This needs a proposal to find a solution. @lorenzo-biava @djsly you're more than welcome to join to share your findings.
Hello, are there any updates or ETA for getting this fixed?
Hello Falco team, has this been discussed in the community call yet? Any update on this issue? We are experiencing it, and are therefore missing visibility into our containers entirely. Would appreciate any updates or an ETA, thanks!
Looks like this is a "soft limit" that comes from Sysdig itself, and it was already raised in the past (https://github.com/draios/sysdig/pull/892/files).
@fntlnz Should we open an issue there? Do you think it's something that could be changed to be set dynamically? Or are "gigantic k8s environments" just out of scope (PS: I would not consider a 4K-pod cluster gigantic these days)? 😀 Can/should we just disable the K8s integration as a temporary workaround in the meantime?
PS: sorry for missing the community call. Though I'm not sure how we could have helped in proposing a solution...
Please give us some hints/indications 😉
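For readers landing here: the temporary workaround mentioned above (disabling the K8s integration) boils down to the CLI flags. A minimal sketch, assuming the standard `-k`/`-K` Falco flags; the API URL and token path are the usual in-cluster defaults, shown for illustration:

```shell
# K8s metadata fetching is only active when -k/-K are passed to Falco, so
# dropping them from the DaemonSet args skips the oversized /api/v1/pods
# download, at the cost of losing K8s metadata enrichment in rule outputs.
#
#   # With K8s integration (the mode that hits the 30MB cap):
#   falco -k https://kubernetes.default \
#         -K /var/run/secrets/kubernetes.io/serviceaccount/token
#
#   # Without it (temporary workaround):
#   falco
WORKAROUND="remove -k/-K from the Falco args to disable the K8s client"
echo "$WORKAROUND"
```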
Another user (Robbie Bytheway on Slack) reported this issue.
Thread here.
I run a decently sized k8s cluster, around 2600 pods. When I installed Falco and its DaemonSet pods came online for the first time, they all began reporting this socket error due to the 30MB fixed size limit. I'm unable to proceed any further with Falco and I'm forced to consider alternatives such as Wazuh.
I feel like I'm in Bizarro World, when I see a project like Falco incubated by the CNCF, marketed as the best choice for k8s security analysis, and yet there's a hard-coded 30MB socket limit which gets breached by a relatively small cluster of 2600 pods?!
This tells me Falco hasn't been proven in any sort of decently-sized cluster. Unless there's some other variable here that I'm missing, in which case I would love for a maintainer or community contributor to politely whack me upside the head and tell me why I'm wrong.
I'm now running into this issue as well. Are there any current workarounds?
@fntlnz @leodido @leogr do you think this issue could get some momentum back?
└─┴> kc get nodes |wc -l;kc get pods -A |wc -l
187
13294
Unable to start Falco in this environment with the same error.
Kubelet Version: v1.19.6 Kube-Proxy Version: v1.19.6
Any way we can work around this? The project is dead for us now, since we can't spin up Falco on our production environment.
Any chance this could get looked at?
Same, I'm still blocked here as well. I'm happy to provide working pizza, if that helps. :)
Hitting this as well, just commenting updated link to offending file https://github.com/falcosecurity/libs/blob/master/userspace/libsinsp/socket_handler.h#L393
This is also affecting us. Can we expect this to be patched at some point? What is the best workaround here?
A fix is in the works. See https://github.com/falcosecurity/libs/pull/40
In the meantime, I'm going to prepare a PR here on Falco to let you configure the limits.
Another update: I'm working on https://github.com/falcosecurity/libs/pull/49, which should definitively address this issue (at least, I hope :smile_cat: ) in the way described by https://github.com/falcosecurity/libs/issues/43.
Hi! Any updates?
I've hit this issue too, with the same error, when trying to deploy Falco version 0.29.1 on OCP 4.6.xx. Is there any workaround or suggestion for this issue in the meantime, while waiting for it to be fixed?
Hey,
there are two patches under review:
I've been trying to use node filtering, but I still hit the "read more than 30MB" limit, this time on the replication controller endpoint. Log like this:
Tue Jul 13 12:24:22 2021: Runtime error: Socket handler (k8s_replicationcontroller_handler_state): read more than 30MB of data from https://172.23.0.1/api/v1/replicationcontrollers?pretty=false (31463287 bytes, 2530 reads). Giving up. Exiting.
Thank you for reporting!
The node filtering patch applies to the /api/v1/pods resource only, so it does not solve the /api/v1/replicationcontrollers case. On the other hand, the "metadata download" patch should address your case.
In general, both patches are needed to solve the whole set of cases that can cause the problem described by this issue.
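For anyone trying to gauge how close their cluster is to the cap, a quick check may help. The kubectl queries below are illustrative and require a kubeconfig with cluster access; the per-node `fieldSelector` is the same mechanism the node filtering patch relies on:

```shell
# The hard-coded cap, in bytes (30 MiB):
cap=$((30 * 1024 * 1024))
echo "cap: ${cap} bytes"    # 31457280

# With cluster access, measure the list payloads Falco downloads:
#   kubectl get --raw "/api/v1/pods?pretty=false" | wc -c
#   kubectl get --raw "/api/v1/replicationcontrollers?pretty=false" | wc -c
#
# The node filtering patch shrinks only the pod list, via a fieldSelector
# scoped to the local node (spec.nodeName is a pods-only field selector):
#   kubectl get --raw "/api/v1/pods?fieldSelector=spec.nodeName=<node>&pretty=false" | wc -c
```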
So, have these two patches already been merged?
They are both included in libs since commit https://github.com/falcosecurity/libs/commit/f7029e2522cc4c81841817abeeeaa515ed944b6c
On the Falco side, there are two PRs - not yet merged - that integrate these patches:
The first (#1667) only includes https://github.com/falcosecurity/libs/pull/40 and allows the user to configure the limit.
The latter (#1671) includes both patches (since it uses the latest libs commit). So, if you try #1671, you will get both the limit raised to 100MB (the new default value) and the node filtering.
PS: I apologize for all these complications and delays, but it was necessary to attack the problem from different angles, which required multiple PRs. If you need any support testing these patches, feel free to contact me on Slack. I'll be happy to help.
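For reference, once the configurable limit lands, it is exposed through falco.yaml. A sketch of the relevant fragment, assuming the `metadata_download` key names that appear in recent falco.yaml defaults (values illustrative, not authoritative):

```yaml
# Hypothetical falco.yaml fragment; check your shipped falco.yaml for the
# exact keys and defaults in your version.
metadata_download:
  max_mb: 100          # per-request K8s metadata download cap, in MB
  chunk_wait_us: 1000  # wait time between download chunks
  watch_freq_sec: 1    # frequency of the K8s watch requests
```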
└─┴> kubectl get nodes |wc -l; kubectl get pods -A |wc -l
51
3135
Unable to start Falco in this environment with the same error:
Runtime error: Socket handler (k8s_pod_handler_state): read more than 30MB of data from blablaba (31463287 bytes, 2032 reads). Giving up. Exiting.
K8s Rev: v1.18.16-eks-7737de
Any ETA for merging the above PRs into master?
👋 I am experiencing this issue as well and am interested in seeing the fix merged! 🙇
There is some disconnect between the code and the config: for me, the metadata_download max-MB setting does nothing, and after reading the code it makes sense that it doesn't.
What happened: We observed an error log under a large scale load test.
Wed Aug 21 02:21:40 2019: Runtime error: Socket handler (k8s_pod_handler_state): read more than 30MB of data from https://10.40.48.1/api/v1/pods?fieldSelector=status.phase!=Failed,status.phase!=Unknown,status.phase!=Succeeded&pretty=false (31463287 bytes, 2162 reads). Giving up. Exiting.
What you expected to happen: Falco should work in a large scale Kubernetes cluster.
How to reproduce it (as minimally and precisely as possible): Use a large Kubernetes cluster with > 1k nodes and > 10k pods.
Anything else we need to know?:
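The byte count in the error log is only just past the cap. A back-of-the-envelope check (the per-pod payload size below is a rough assumption, not a measured value):

```shell
# 30 MiB cap vs. the bytes reported in the error above:
cap=$((30 * 1024 * 1024))          # 31457280
read_bytes=31463287                # from the error message
echo "over the cap by $((read_bytes - cap)) bytes"   # only ~6 KB over

# Rough feel for why >10k pods blows the cap, assuming ~3 KB of JSON
# per pod (an assumption):
pods=10000; per_pod_kb=3
echo "estimated /api/v1/pods payload: $((pods * per_pod_kb / 1024)) MB"
```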
Environment:
- Falco version (use `falco --version`): falco version 0.1.2736dev
- System info (from `falco --support`): not provided
- Cloud provider or hardware configuration: GKE
- OS (e.g. `cat /etc/os-release`): PRETTY_NAME="Debian GNU/Linux buster/sid" NAME="Debian GNU/Linux" ID=debian HOME_URL="https://www.debian.org/" SUPPORT_URL="https://www.debian.org/support" BUG_REPORT_URL="https://bugs.debian.org/"
- Kernel (e.g. `uname -a`): not provided
- Install tools (e.g. in kubernetes, rpm, deb, from source): Kubernetes
- Others: