Thanks for opening @caquino - I need some time to look at this because I need to get to a GCP account to do tests.
/assign @fntlnz /assign @leodido
FWIW, getting the same on a 3-node (1 CPU / 2 GB RAM) cluster on DigitalOcean Kubernetes. There are a bunch of additional things already deployed:
Do these errors usually mean the cluster's capacity is reached?
In my case the cluster has 3 n1-standard-2 (2 vCPUs, 7.5 GB memory) nodes; other than what comes pre-configured on the cluster by Google, it is running only a Deployment with nginx and PHP.
This is not a production cluster and it has no traffic yet, so I would not expect this to be a capacity issue.
I am also having the same issue ("syscall event drop" logs several times an hour) in my almost completely unused GKE cluster with 3 nodes.
Re: the fluentd log tampering alerts, I created #684 to address it, and I describe the fix that worked to stop those from flowing in.
Continuously seeing this after installing k8s-with-rbac/falco-daemonset-configmap.yaml in a 3 node gke environment:
Falco internal: syscall event drop. 7 system calls dropped in last second. Falco internal: syscall event drop. 15 system calls dropped in last second.
Occurs in: falcosecurity/falco:dev falcosecurity/falco:latest
Not yet checked: falcosecurity/falco:15.0.x (container won't start with config from dev branch)
Just to be sure: what syscall_event_drops.rate and syscall_event_drops.max_burst config are you folks using? The defaults?
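For reference, the default syscall_event_drops block in falco.yaml at the time looked roughly like this (the same values appear in the ConfigMap dump later in this thread); rate and max_burst feed the token bucket that throttles how often the drop actions fire, not how many events get dropped:
syscall_event_drops:
  actions:
    - log        # log a message for each detected drop burst
    - alert      # emit the "Falco internal: syscall event drop" alert
  rate: .03333   # roughly one action every 30 seconds
  max_burst: 10  # allow up to 10 actions back to back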
@leodido I've just installed Falco on GKE using:
helm install --name falco stable/falco --set ebpf.enabled=true
Having exactly the same errors.
@leodido
1) I have used the defaults, 2) then tried the increased limits:
syscall_event_drops:
  actions:
Same result. Maybe I should have decreased?
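For context, the "increased limits" attempt presumably looked something like this (values are purely illustrative, not a recommendation):
syscall_event_drops:
  actions:
    - log
    - alert
  rate: 1         # hypothetical: allow about one action per second
  max_burst: 100  # hypothetical: allow a much larger burst of actions
Note that these settings only throttle how often Falco reacts to detected drops; they don't change how many events are dropped, which would explain why raising (or lowering) them gives the same result.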
Using falco-daemonset-configmap.yaml with the following changes:
env:
- name: SYSDIG_BPF_PROBE
value: ""
- name: KBUILD_EXTRA_CPPFLAGS
value: -DCOS_73_WORKAROUND
Same here, I'm using the default configuration.
Has anyone found a workaround for this issue?
I see the same behaviour with Falco 0.17 on a five-node GKE cluster. CPU load on the nodes is between 18% and 35%. Falco is using 1-5%.
Seeing this in Azure, with Falco using abysmal amounts of CPU and the default settings.
I'm also seeing 1 n_drops per node, every 60 to 61 mins, on my COS 4.14.127+ / v1.12.8-gke.10 GKE cluster (8 CPUs / 30+ GB RAM nodes). Running falcosecurity/falco:latest.
I have the same issue. As a workaround I have just set the action to "ignore" until this feature works with GKE.
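For anyone wanting the same workaround, the corresponding falco.yaml snippet should look along these lines:
syscall_event_drops:
  actions:
    - ignore   # do nothing when a drop is detected (no log line, no alert)
  rate: .03333
  max_burst: 10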
I'm seeing this when installed as a binary on the host OS in AWS EC2 running CentOS 7 w/ kernel 5.3 mainline.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Could we keep this open until it has been resolved?
@shane-lawrence i agree we want to keep this open, I was coming here to un-stale this but you already did :)
I abandoned falco due to this issue.
Guys, any progress/updates? Thanks!
For everyone looking at this: the reason why Falco drops events is that right now it does not have a way to "offload" events coming from kernel space while receiving them in userspace.
In other words, once inputs are consumed by the Falco driver they are sent immediately to the engine to be processed.
For that reason, we had to implement a mechanism called a "token bucket", which is basically a rate limiter for the engine and is responsible for the drops.
This kind of system is limited because on machines that have a lot of activity in kernel space (and that fill the ring buffer fast), Falco drops events.
After discussions in many many office hours and repo planning calls, we decided to redesign the Inputs to achieve two goals:
You can find a diagram with an explanation here.
Long story short: it's likely that this issue will persist until we implement the new input API and then release 1.0. We don't have an official release date yet, but IIRC many community members wanted to set that for March 2020.
We are grateful for everyone's help and we are trying to do our best to make Falco more and more reliable for all kinds of workloads. To stay updated and have a voice in the process please join our weekly calls.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Still open; same issue with Falco master or 0.21.0 on GKE with ebpf.enabled=true.
I see it as well: Falco master on GKE COS. It shows up every hour:
Falco internal: syscall event drop. 1 system calls dropped in last second.
It shows up very often if there is traffic on the cluster.
Anyone experiencing this issue, please test again with Falco master and report your findings. Your help is very needed here!
Can we use image: falcosecurity/falco:latest, or should it be image: falcosecurity/falco:master for pulling the images?
@fntlnz I just tried the master tag. The dropped events still pretty much start immediately:
{"output":"Falco internal: syscall event drop. 11 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"11","n_drops_buffer":"11","n_drops_bug":"0","n_drops_pf":"0","n_evts":"9452"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-05-13T04:45:41.803714581Z"}
{"output":"Falco internal: syscall event drop. 20 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"20","n_drops_buffer":"20","n_drops_bug":"0","n_drops_pf":"0","n_evts":"20671"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-05-13T04:46:12.741877678Z"}
{"output":"Falco internal: syscall event drop. 1 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"1","n_drops_buffer":"1","n_drops_bug":"0","n_drops_pf":"0","n_evts":"13463"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-05-13T04:46:17.902355730Z"}
{"output":"Falco internal: syscall event drop. 1 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"1","n_drops_buffer":"1","n_drops_bug":"0","n_drops_pf":"0","n_evts":"12563"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-05-13T04:46:22.106036046Z"}
{"output":"Falco internal: syscall event drop. 3 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"3","n_drops_buffer":"3","n_drops_bug":"0","n_drops_pf":"0","n_evts":"9336"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-05-13T04:46:35.545513952Z"}
{"output":"Falco internal: syscall event drop. 11 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"11","n_drops_buffer":"11","n_drops_bug":"0","n_drops_pf":"0","n_evts":"11962"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-05-13T04:46:41.596002790Z"}
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
gke-dev-gke-dev-gke-pool-0-2111bc99-gc2x Ready <none> 8h v1.15.9-gke.26
gke-dev-gke-dev-gke-pool-0-284a14fb-4qrr Ready <none> 9h v1.15.9-gke.26
gke-dev-gke-dev-gke-pool-0-80ef08eb-069h Ready <none> 2d6h v1.15.9-gke.26
$ kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
gke-dev-gke-dev-gke-pool-0-2111bc99-gc2x 182m 9% 1332Mi 23%
gke-dev-gke-dev-gke-pool-0-284a14fb-4qrr 391m 20% 2290Mi 40%
gke-dev-gke-dev-gke-pool-0-80ef08eb-069h 256m 13% 1870Mi 33%
@jaychan-q it's falcosecurity/falco:master to run from the latest git commit, and it's falcosecurity/falco:latest to run the latest tagged version.
@victortrac Thanks for testing - I just opened an issue regarding this and how we are working on tooling to help you help us tackle this problem: https://github.com/falcosecurity/falco/issues/1204 - @shane-lawrence and @natalysheinin might be interested too.
I tried this with master today. It's using this version:
/usr/bin/falco --version
Falco version: 0.22.1-52+d0f4f7c
Driver version: 96bd9bc560f67742738eb7255aeb4d03046b8045
We're still dropping syscalls:
output: Falco internal: syscall event drop. 23 system calls dropped in last second.
output_fields: {
ebpf_enabled: 1
n_drops: 23
n_drops_buffer: 23
n_drops_bug: 0
n_drops_pf: 0
n_evts: 27379
}
Does this output indicate that 27379 syscalls were detected, and of those 23 were dropped?
I am seeing the drops as well from the master image.
{"output":"Falco internal: syscall event drop. 1 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"1","n_drops_buffer":"0","n_drops_bug":"0","n_drops_pf":"1","n_evts":"16858"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-05-15T00:24:23.033865841Z"}
Hello guys.
I am using Falco on kops-provisioned AWS EC2 instances (3 workers, 1 master), without the k8saudit integration.
I have exactly the same issue; here is a Kibana screenshot to show the volume:
I'm writing this to point out that it is probably not GKE-specific, as I have the same issue on a completely different setup (I can share setup details if needed).
Thanks for the report John.
Would you mind sharing info also about the other setup?
In case it needs to stay private you can share such info on slack with us.
Hello @leodido!
Sure, it's a 3-worker / 1-master Kubernetes cluster on EC2s provisioned through kops.
Kubernetes version:
$ k version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T23:34:25Z", GoVersion:"go1.14.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.11", GitCommit:"d94a81c724ea8e1ccc9002d89b7fe81d58f89ede", GitTreeState:"clean", BuildDate:"2020-03-12T21:00:06Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
Falco version:
$ k -n kube-system exec -ti falco-9zlmn -- falco --version
Falco version: 0.21.0
Default Helm chart syscallEventDrops options. Same for resources.
Falco pod usage is:
$ k top pods -n kube-system | grep falco
falco-9zlmn 11m 39Mi
falco-hsdkj 9m 18Mi
falco-nr4cx 13m 22Mi
If you need specific information, I can provide!
So the helm chart being used threw me for a loop.
I opened https://github.com/helm/charts/pull/22481 to remove the falco stable chart in favor of https://github.com/falcosecurity/charts/pull/12, which updates the container image from 0.21.0 to 0.23.0.
This update was required for me to deploy to GKE.
I was getting the following error with 0.21.0:
[nova@nova ~]$ k logs falco-xgg9l -f
* Setting up /usr/src links from host
* Mounting debugfs
Found kernel config at /proc/config.gz
* COS detected (build 12371.208.0), using cos kernel headers...
* Downloading https://storage.googleapis.com/cos-tools/12371.208.0/kernel-headers.tgz
* Extracting kernel sources
* Configuring kernel
* Trying to compile BPF probe falco-probe-bpf (falco-probe-bpf-latest-x86_64-4.19.109+-c49f5fbc35a5b76e9ca8d47acc2b1913.o)
make: *** /usr/src/falco-latest/bpf: No such file or directory. Stop.
mv: cannot stat '/usr/src/falco-latest/bpf/probe.o': No such file or directory
* Trying to download precompiled BPF probe from https://s3.amazonaws.com/download.draios.com/stable/sysdig-probe-binaries/falco-probe-bpf-latest-x86_64-4.19.109%2B-c49f5fbc35a5b76e9ca8d47acc2b1913.o
curl: (22) The requested URL returned error: 404 Not Found
* Failure to find a BPF probe
Fri May 22 11:22:15 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:22:15 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:22:16 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:22:16 2020: Unable to load the driver. Exiting.
Fri May 22 11:22:16 2020: Runtime error: can't open BPF probe '/root/.falco/falco-probe-bpf.o': Errno 2. Exiting.
After making the change to the DaemonSet for 0.23.0, I have had Falco running with BPF in GKE without any dropped system calls in the logs.
[nova@nova stable]$ k logs -l app=falco -f
* Extracting kernel sources
* Configuring kernel
* Trying to compile the eBPF probe (falco_cos_4.19.109+_1.o)
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_4.19.109+_1.o
* eBPF probe located in /root/.falco/falco_cos_4.19.109+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Fri May 22 11:26:53 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:26:53 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:26:54 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:26:56 2020: Starting internal webserver, listening on port 8765
* Extracting kernel sources
* Configuring kernel
* Trying to compile the eBPF probe (falco_cos_4.19.109+_1.o)
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_4.19.109+_1.o
* eBPF probe located in /root/.falco/falco_cos_4.19.109+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Fri May 22 11:26:59 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:26:59 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:27:00 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:27:02 2020: Starting internal webserver, listening on port 8765
* Extracting kernel sources
* Configuring kernel
* Trying to compile the eBPF probe (falco_cos_4.19.109+_1.o)
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_4.19.109+_1.o
* eBPF probe located in /root/.falco/falco_cos_4.19.109+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Fri May 22 11:26:53 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:26:53 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:26:54 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:26:56 2020: Starting internal webserver, listening on port 8765
^C
[nova@nova stable]$ k get po
NAME READY STATUS RESTARTS AGE
falco-7kkpk 1/1 Running 0 18m
falco-9ndjk 1/1 Running 0 18m
falco-dkx2c 1/1 Running 0 19m
[nova@nova stable]$ k get no
NAME STATUS ROLES AGE VERSION
gke-nova-default-pool-4935e89d-cc6p Ready <none> 24m v1.16.8-gke.15
gke-nova-default-pool-4935e89d-g8td Ready <none> 24m v1.16.8-gke.15
gke-nova-default-pool-4935e89d-k6jq Ready <none> 24m v1.16.8-gke.15
[nova@nova stable]$ k get ds -oyaml | grep "image:"
image: docker.io/falcosecurity/falco:0.23.0
[nova@nova charts]$ k get cm -oyaml | grep -A 5 drops
syscall_event_drops:
  actions:
    - log
    - alert
  rate: 0.03333
  max_burst: 10
Granted, my pods have only been up for ~19 minutes. I will keep an eye on them over the next few weeks and will report any occurrences of Falco internal: syscall event drop, in the hopes that sharing replication steps can help us fix this. This is a top priority.
After the pods were successfully running without any drops in the logs for ~60 minutes I decided to try to run a small workload.
[nova@nova event-generator]$ k run events --image krisnova/falco-event-generator:latest
pod/events created
[nova@nova event-generator]$ k get po
NAME READY STATUS RESTARTS AGE
events 0/1 ContainerCreating 0 2s
falco-7kkpk 1/1 Running 0 57m
falco-9ndjk 1/1 Running 0 57m
falco-dkx2c 1/1 Running 0 58m
[nova@nova event-generator]$ k logs -l app=falco
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_4.19.109+_1.o
* eBPF probe located in /root/.falco/falco_cos_4.19.109+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Fri May 22 11:26:59 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:26:59 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:27:00 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:27:02 2020: Starting internal webserver, listening on port 8765
11:27:02.275043000: Notice Privileged container started (user=<NA> command=container:8b95e4277435 k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-cc6p container=8b95e4277435 image=gke.gcr.io/kube-proxy-amd64:v1.16.8-gke.15) k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-cc6p container=8b95e4277435
11:27:02.563498000: Notice Privileged container started (user=root command=container:9cc49632aa15 k8s.ns=default k8s.pod=falco-7kkpk container=9cc49632aa15 image=falcosecurity/falco:0.23.0) k8s.ns=default k8s.pod=falco-7kkpk container=9cc49632aa15
12:18:48.157470217: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=5876)
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_4.19.109+_1.o
* eBPF probe located in /root/.falco/falco_cos_4.19.109+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Fri May 22 11:26:53 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:26:53 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:26:54 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:26:56 2020: Starting internal webserver, listening on port 8765
11:26:56.478515000: Notice Privileged container started (user=<NA> command=container:65258398bcf1 k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-k6jq container=65258398bcf1 image=gke.gcr.io/kube-proxy-amd64:v1.16.8-gke.15) k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-k6jq container=65258398bcf1
11:26:56.798178000: Notice Privileged container started (user=root command=container:10a38778b1ee k8s.ns=default k8s.pod=falco-9ndjk container=10a38778b1ee image=falcosecurity/falco:0.23.0) k8s.ns=default k8s.pod=falco-9ndjk container=10a38778b1ee
12:19:00.057431524: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=4902)
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_4.19.109+_1.o
* eBPF probe located in /root/.falco/falco_cos_4.19.109+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Fri May 22 11:26:53 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:26:53 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:26:54 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:26:56 2020: Starting internal webserver, listening on port 8765
11:26:56.121409000: Notice Privileged container started (user=<NA> command=container:4c6e998a6835 k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-g8td container=4c6e998a6835 image=gke.gcr.io/kube-proxy-amd64:v1.16.8-gke.15) k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-g8td container=4c6e998a6835
11:26:56.403484000: Notice Privileged container started (user=<NA> command=container:422045a95533 k8s.ns=default k8s.pod=falco-dkx2c container=422045a95533 image=falcosecurity/falco:0.23.0) k8s.ns=default k8s.pod=falco-dkx2c container=422045a95533
12:18:58.369831574: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=6256)
I was now able to replicate the bug.
It looks like the work in https://github.com/falcosecurity/falco/issues/1204 will be able to help us track down the concerns here to mitigate this problem. I am definitely following that thread closely as this seems to be the biggest issue folks are hitting right now.
Hope this helps
Right now the course of action, to help us identify where in the stack this issue is coming from, is to use the work in https://github.com/falcosecurity/falco/issues/1204 to capture metrics from a stack experiencing the issue and share them with us.
We (@fntlnz, @leogr, and I) just spent some time performing tests while perturbing some parameters (keeping the CPU load conditions and the frequency of rules triggered as steady as possible):
Then we ran Falco with -v to get detailed statistics at the end and -A to include all the syscalls (that Falco usually ignores).
We observed that increasing the size of the ring buffer decreases the number of events processed per second by a factor of 10, thus making the ingestion phase slower.
We suspect that a bigger ring buffer gives 0 drops because the slower ingestion throughput lets Falco (userspace) keep up with the events, since the producer side is slower.
Could you all do a simple test running Falco with stdout output disabled in the config, please?
We also want to verify that I/O does not impact (or how it impacts) these drops.
I'm running into this as well - I was showing off Falco to my cybersecurity team, but I have more syscall drops than the events I was trying to show... I have another demo on Wednesday - is there any potential fix or workaround for the short term?
I'm running on GKE On-Prem, which for now is running K8s 1.13.7 (one GKE version behind on this cluster).
Hey @robotica72 can you try disabling the stdout and see if it continues to drop or not?
@leodido I have stdout set to false and I still see the drops.
stdout_output:
  enabled: false
I have some cycles to investigate this.
On GKE 1.16.8-gke.15.
This is what happens on a freshly created cluster.
After doing some activity in a container to generate some events, I noticed the first drop after ~40 min of no activity besides the default services.
This is exactly what I see in the comment from @kris-nova and others here.
16:33:31.992532000: Notice Privileged container started (user=root command=container:1daef9d81a30 k8s.ns=kube-system k8s.pod=kube-proxy-gke-falco-669-repro-default-pool-0076378a-nsk6 container=1daef9d81a30 image=gke.gcr.io/kube-proxy-amd64:v1.16.8-gke.15) k8s.ns=kube-system k8s.pod=kube-proxy-gke-falco-669-repro-default-pool-0076378a-nsk6 container=1daef9d81a30 k8s.ns=kube-system k8s.pod=kube-proxy-gke-falco-669-repro-default-pool-0076378a-nsk6 container=1daef9d81a30
16:33:32.214615000: Notice Privileged container started (user=<NA> command=container:a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa image=falcosecurity/falco:latest) k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa
16:34:09.682277235: Notice A shell was spawned in a container with an attached terminal (user=root k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa shell=bash parent=runc cmdline=bash terminal=34816 container_id=a44a2b9eecfa image=falcosecurity/falco) k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa
16:34:15.800127343: Error Package management process launched in container (user=root command=apt container_id=a44a2b9eecfa container_name=k8s_falco_falco-daemonset-wkx4x_default_84db0fa9-ee12-497a-b113-94aeaddc9159_0 image=falcosecurity/falco:latest) k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa
16:34:18.417231435: Error Package management process launched in container (user=root command=apt install procps container_id=a44a2b9eecfa container_name=k8s_falco_falco-daemonset-wkx4x_default_84db0fa9-ee12-497a-b113-94aeaddc9159_0 image=falcosecurity/falco:latest) k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa
16:34:21.681962517: Error Package management process launched in container (user=root command=apt update -y container_id=a44a2b9eecfa container_name=k8s_falco_falco-daemonset-wkx4x_default_84db0fa9-ee12-497a-b113-94aeaddc9159_0 image=falcosecurity/falco:latest) k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa
17:11:06.670570790: Warning Shell history had been deleted or renamed (user=root type=openat command=bash fd.name=/root/.bash_history name=/root/.bash_history path=<NA> oldpath=<NA> k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa) k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa
17:17:19.135840930: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=6250)
It does not seem related to CPU/memory usage on the nodes; I have very idle nodes.
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
gke-falco-669-repro-default-pool-0076378a-3f7r 81m 8% 916Mi 34%
gke-falco-669-repro-default-pool-0076378a-bcr0 95m 10% 836Mi 31%
gke-falco-669-repro-default-pool-0076378a-nsk6 78m 8% 854Mi 32%
Considering that my nodes have 2.77 GB of allocatable memory and 940 mCPU, I'll dig deeper to identify the conditions that are generating the drop and answer why we are seeing this. I suspect a bug in the token bucket code.
I just noticed that the drop in my comment is related to invalid memory access (n_drops_pf=1).
Some reports in this thread show buffer drops:
n_drops_buffer=4
"n_drops_buffer":"11",
n_drops_buffer: 23
While others show invalid-memory-access drops:
"n_drops_pf":"1",
n_drops_pf=1
So we have two different kinds of reports here; I think the ones we want to investigate first are the n_drops_buffer ones.
After reading this comment from @victortrac, I believe the issue still exists, even though the original report from @caquino is for 0.15.3.
@victortrac is there anything non-default in your Falco config? Are you sending the outputs anywhere, or is there anything else we need to be aware of? What kind of load do you have on your machines?
I suspect this is not load related; I'm trying to gather as much info as possible to create a reproducer for the buffer drops myself.
@fntlnz I've got JSON output and HTTP output enabled, sending to falcosidekick, deployed from the standard falco helm chart, which includes a lot of rules. I do have some additional rules like:
- macro: sysctl_reload
  condition: (proc.cmdline = "sudo sysctl -a")
- rule: Attempted su/sudo
  desc: sudo or su attempted
  condition: >
    proc.name in (userexec_binaries)
    and not sysctl_reload
  output: >
    A process ran sudo or su (user=%user.name %container.info
    process=%proc.name parent=%proc.pname cmdline=%proc.cmdline)
  priority: WARNING
  tags: [container, shell]
This triggers a warning on falco start about performance.
Otherwise, I've tried giving the falco pod more resources (2 CPUs and 2 GB RAM), but that didn't seem to have an effect. The clusters are test clusters with zero load, but my testing has been on small GKE nodes (2 cores, 7.5 GB RAM).
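For reference, bumping the Falco container resources in the DaemonSet is just the usual Kubernetes resources stanza; a sketch using the limits mentioned above (the request values are illustrative, not from this thread):
resources:
  requests:
    cpu: "1"      # illustrative request values
    memory: 1Gi
  limits:
    cpu: "2"      # the 2 CPUs tried above
    memory: 2Gi   # the 2 GB RAM tried above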
Hey @robotica72 can you try disabling the stdout and see if it continues to drop or not?
Hi - this may be a dumb question, but if I'm only using fluentd in a default configuration and I turn off stdout, I won't get any logs? Or should I disable stdout and log to a file?
My GKE cluster is still up and, interestingly, it's reporting a drop of type invalid memory access every hour.
17:17:25.928730193: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=4376)
18:18:05.880108709: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=5504)
19:18:26.564329835: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=5289)
20:19:26.617244218: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=5406)
21:20:26.365003699: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=4060)
22:21:25.756071706: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=3385)
23:22:26.459459508: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=6112)
00:23:26.045564944: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=3072)
01:24:25.983780603: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=3375)
02:25:25.771708452: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=3083)
03:15:49.975021136: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=2914)
03:26:26.294336741: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=8638)
04:27:25.675853752: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=4241)
05:27:48.695712386: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=5449)
06:27:49.904779656: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=4368)
07:27:51.162401931: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=7196)
08:27:52.386337742: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=6606)
09:27:53.481688011: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=5707)
10:27:54.907118987: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=6220)
I wasn't able to reproduce the buffer drop yet; I will try what @victortrac reported as a reproducer.
@victortrac what is the version of Falco you are currently using? What is the helm chart you are using? We recently migrated to https://github.com/falcosecurity/charts
@robotica72 if you turn off stdout you can still use file_output, syslog_output, http_output, or grpc_output.
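A minimal falco.yaml sketch of that suggestion (stdout off, file output on), assuming the standard output sections:
stdout_output:
  enabled: false
file_output:
  enabled: true
  keep_alive: false
  filename: /var/log/falco/events.log   # hypothetical path; point it at whatever fluentd/Splunk tails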
Thanks @fntlnz - I thought so, was just sanity checking :) I'll output to a logfile after my presentation today - I need to have some values in stdout just to show a group of cyber people how it works with Splunk.
Thanks @robotica72 :) - please share any findings with us, and if anything doesn't work, open an issue!
@victortrac as you were pointing out, with your additional rules I do see a warning from Falco about the event type restriction:
proc.name in (userexec_binaries) and not sysctl_reload
did not contain any evt.type restriction, meaning it will run for all event types.
This has a significant performance penalty. Consider adding an evt.type restriction if possible.
Do you still experience the drops if you disable that rule?
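For what it's worth, a sketch of the rule above with an evt.type restriction added, assuming an exec event is what matters for su/sudo launches (the default rules use a similar spawned-process condition):
- rule: Attempted su/sudo
  desc: sudo or su attempted
  condition: >
    evt.type = execve and evt.dir = <
    and proc.name in (userexec_binaries)
    and not sysctl_reload
  output: >
    A process ran sudo or su (user=%user.name %container.info
    process=%proc.name parent=%proc.pname cmdline=%proc.cmdline)
  priority: WARNING
  tags: [container, shell]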
What happened: Falco is constantly dropping events on a really small GKE cluster (3 nodes) with 9 containers (excluding Google services containers).
What you expected to happen: I would expect that, for a cluster of this size, Falco would have no issues handling its events.
How to reproduce it (as minimally and precisely as possible): Based on my tests, just deploying a simple cluster on GKE with the monitoring and metrics extensions enabled is enough to cause drops.
Anything else we need to know?:
Environment:
- Falco version (falco --version): 0.15.3
- OS (cat /etc/os-release): COS
- Kernel (uname -a): 4.14.119+
Which I assume is generating enough events to cause the syscall drops - is there anything I can do? Filter out these events/processes?
I've checked the other similar issue, and I'm using KUBERNETES_SERVICE_HOST as described in it.
Log snippet: