Falco on GKE - dropped syscall events

caquino commented 5 years ago

What happened: Falco is constantly dropping events on a really small GKE cluster (3 nodes) and 9 containers (excluding Google services containers)

What you expected to happen: I would expect that for a cluster of such size that falco would have no issues handling the events for it.

How to reproduce it (as minimally and precisely as possible): Based on my tests, just by deploying a simple cluster on GKE with monitoring and metrics extensions enabled, it is enough to cause drops.

Anything else we need to know?:

Environment:

Falco version (use falco --version): 0.15.3

System info

{
"machine": "x86_64",
"nodename": "gke-ai-cloud-us-central1-default-pool-b3942f02-7mjv",
"release": "4.14.119+",
"sysname": "Linux",
"version": "#1 SMP Wed May 15 17:44:01 PDT 2019"
}

Cloud provider or hardware configuration: Google Cloud
OS (e.g: cat /etc/os-release): COS
Kernel (e.g. uname -a): 4.14.119+
Install tools (e.g. in kubernetes, rpm, deb, from source): kubernetes

Others: The logs are flooded with syscall event drop, the only other event showing up is the following:

12:50:28.988592573: Warning Log files were tampered (user=root command=pos_writer.rb:* -Eascii-8bit:ascii-8bit /usr/sbin/google-fluentd --under-supervisor file=/var/log/gcp-journald-kubelet.pos) k8s.ns=kube-system k8s.pod=fluentd-gcp-v3.1.1-6xbdr container=e84a9479a965 k8s.ns=kube-system k8s.pod=fluentd-gcp-v3.1.1-6xbdr container=e84a9479a965

Which I assume is generating enough events to cause the syscall drops, is there anything I can do? Filter out this events/process?

I've checked the other similar issue, and I'm using the KUBERNETES_SERVICE_HOST like described on it.

Log snippet:

13:03:18.688024362: Warning Log files were tampered (user=root command=pos_writer.rb:* -Eascii-8bit:ascii-8bit /usr/sbin/google-fluentd --under-supervisor file=/var/log/gcp-journald-kubelet.pos) k8s.ns=kube-system k8s.pod=fluentd-gcp-v3.1.1-6fmpm container=9b27b6fc8460 k8s.ns=kube-system k8s.pod=fluentd-gcp-v3.1.1-6fmpm container=9b27b6fc8460
13:03:34.485317730: Critical Falco internal: syscall event drop. 11 system calls dropped in last second.(ebpf_enabled=1 n_drops=11 n_drops_buffer=11 n_drops_bug=0 n_drops_pf=0 n_evts=92933)
13:04:20.686678200: Critical Falco internal: syscall event drop. 4 system calls dropped in last second.(ebpf_enabled=1 n_drops=4 n_drops_buffer=4 n_drops_bug=0 n_drops_pf=0 n_evts=8090)
13:04:24.654208479: Critical Falco internal: syscall event drop. 4 system calls dropped in last second.(ebpf_enabled=1 n_drops=4 n_drops_buffer=4 n_drops_bug=0 n_drops_pf=0 n_evts=11527)

fntlnz commented 5 years ago

Thanks for opening @caquino - need some time to look at this because I need to get to a gcp account to do tests.

fntlnz commented 5 years ago

/assign @fntlnz /assign @leodido

michiels commented 5 years ago

FWIW, getting the same on a 3-server (1cpu2Gbram) cluster at DigitalOcean Kubernetes. There are a bunch of additional things already deployed:

Traefik ingress controller
A rails app with a web and background process
Logspout logging pipeline.

Do these errors usually mean the cluster's capacity is reached?

caquino commented 5 years ago

On my case the cluster has 3 n1-standard-2 (2 vCPUs, 7.5 GB memory) nodes, other than what comes pre-configured on the cluster from Google is running only a Deployment with nginx and php.

This is not a production cluster, so it has no traffic on it yet, so I would expect to not be a capacity issue.

nuala33 commented 5 years ago

I am also having the same issue ("syscall event drop" logs several times an hour) in my almost completely unused GKE cluster with 3 nodes.

Re: the fluentd log tampering alerts, I created #684 to address it, and I describe the fix that worked to stop those from flowing in.

kbrown commented 5 years ago

Continuously seeing this after installing k8s-with-rbac/falco-daemonset-configmap.yaml in a 3 node gke environment:

Falco internal: syscall event drop. 7 system calls dropped in last second. Falco internal: syscall event drop. 15 system calls dropped in last second.

Occurs in: falcosecurity/falco:dev falcosecurity/falco:latest

Not yet checked: falcosecurity/falco:15.0.x (container won't start with config from dev branch)

leodido commented 5 years ago

Just to be sure: what syscall_event_drops.rate and syscall_event_drops.max_burst config are you folks using?

The default one?

metalsong commented 5 years ago

@leodido I've just install falco to the GKE using:

helm install --name falco stable/falco --set ebpf.enabled=true

Having exactly the same errors.

kbrown commented 5 years ago

@leodido

1) I have used the defaults, 2) then tried the increased limits:

syscall_event_drops: actions:

log
alert rate: 1 max_burst: 1000

Same result. Maybe I should have decreased?

Using falco-daemonset-configmap.yaml with the following changes:

      env:
      - name: SYSDIG_BPF_PROBE
        value: ""
      - name: KBUILD_EXTRA_CPPFLAGS
        value: -DCOS_73_WORKAROUND

caquino commented 5 years ago

Same here, I'm using the default configuration.

DannyPat44 commented 5 years ago

Has anyone found a workaround for this issue?

qbast commented 5 years ago

I see the same behaviour with Falco 0.17 on five node GKE cluster. CPU load on the nodes is between 18% and 35%. Falco is using 1-5% .

Aaron-ML commented 5 years ago

Seeing this in Azure with falco using abysmal amounts of CPU and using the set defaults

bgeesaman commented 5 years ago

I'm also seeing 1 n_drops per node, every 60 to 61 mins on my COS 4.14.127+ v1.12.8-gke.10 GKE cluster (8CPUs/30+GB RAM nodes). Running falcosecurity/falco:latest.

arthurk commented 5 years ago

I have the same issue. As a workaround i have just set the action to "ignore" until this feature works with GKE

cvernooy23 commented 5 years ago

I'm seeing this when installed as a binary on the host OS in AWS EC2 running CentOS 7 w/ kernel 5.3 mainline.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

shane-lawrence commented 5 years ago

Could we keep this open until it has been resolved?

fntlnz commented 5 years ago

@shane-lawrence i agree we want to keep this open, I was coming here to un-stale this but you already did :)

kbrown commented 5 years ago

I abandoned falco due to this issue.

On Nov 25, 2019, at 5:26 PM, Lorenzo Fontana notifications@github.com wrote:

@shane-lawrence i agree we want to keep this open, I was coming here to I’m imi un-stale this but you already did :)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

popaaaandrei commented 4 years ago

Guys, any progress/updates? Thanks!

fntlnz commented 4 years ago

For everyone looking at this: The reason why Falco drops is that right now it does not have a way to "offload" events coming from kernel space while receiving them in userspace.

In other words, once inputs are consumed by the Falco driver they are sent immediately by the engine to processes them.

For that reason, we had to implement an artifact called "Token Bucket" that it's basically a rate limiter for the engine and is responsible for the drops.

Now, this kind of system is limited because on machines that have lot of activity in kernel space (and fill the ring buffer fast) Falco drops.

After discussions in many many office hours and repo planning calls, we decided to redesign the Inputs to achieve two goals:

Have an input streaming interface (that offloads messages to a queue)
Implement inputs as a gRPC client - means that the inputs are not part of the Falco engine itself but a separate service

You can find a Diagram with an explanation here.

Long story short: It's likely that this issue will persist until we implement the new input API and then release 1.0 - We don't have an official release date yet but IIRC many community members wanted to set that for March 2020.

We are grateful for everyone's help and we are trying to do our best to make Falco more and more reliable for all kinds of workloads. To stay updated and have a voice in the process please join our weekly calls.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

eihwaz-kob commented 4 years ago

open, same issue with falco master or 0.21.0 on gke with ebpf true

jaychan-q commented 4 years ago

I see it as well. falco master on GKE COS. It shows up every hour Falco internal: syscall event drop. 1 system calls dropped in last second. shows up very often if there is traffic on the cluster

fntlnz commented 4 years ago

Anyone experiencing this issue, please test again with Falco master and report your findings. Your help is very needed here!

jaychan-q commented 4 years ago

can we use image: falcosecurity/falco:latest or should it be image: falcosecurity/falco:master for pulling the images

victortrac commented 4 years ago

@fntlnz I just tried the master tag. The dropped events still pretty much start immediately:

{"output":"Falco internal: syscall event drop. 11 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"11","n_drops_buffer":"11","n_drops_bug":"0","n_drops_pf":"0","n_evts":"9452"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-05-13T04:45:41.803714581Z"}
{"output":"Falco internal: syscall event drop. 20 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"20","n_drops_buffer":"20","n_drops_bug":"0","n_drops_pf":"0","n_evts":"20671"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-05-13T04:46:12.741877678Z"}
{"output":"Falco internal: syscall event drop. 1 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"1","n_drops_buffer":"1","n_drops_bug":"0","n_drops_pf":"0","n_evts":"13463"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-05-13T04:46:17.902355730Z"}
{"output":"Falco internal: syscall event drop. 1 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"1","n_drops_buffer":"1","n_drops_bug":"0","n_drops_pf":"0","n_evts":"12563"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-05-13T04:46:22.106036046Z"}
{"output":"Falco internal: syscall event drop. 3 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"3","n_drops_buffer":"3","n_drops_bug":"0","n_drops_pf":"0","n_evts":"9336"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-05-13T04:46:35.545513952Z"}
{"output":"Falco internal: syscall event drop. 11 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"11","n_drops_buffer":"11","n_drops_bug":"0","n_drops_pf":"0","n_evts":"11962"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-05-13T04:46:41.596002790Z"}

$ kubectl get nodes
NAME                                       STATUS   ROLES    AGE    VERSION
gke-dev-gke-dev-gke-pool-0-2111bc99-gc2x   Ready    <none>   8h     v1.15.9-gke.26
gke-dev-gke-dev-gke-pool-0-284a14fb-4qrr   Ready    <none>   9h     v1.15.9-gke.26
gke-dev-gke-dev-gke-pool-0-80ef08eb-069h   Ready    <none>   2d6h   v1.15.9-gke.26

$ kubectl top nodes
NAME                                       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gke-dev-gke-dev-gke-pool-0-2111bc99-gc2x   182m         9%     1332Mi          23%
gke-dev-gke-dev-gke-pool-0-284a14fb-4qrr   391m         20%    2290Mi          40%
gke-dev-gke-dev-gke-pool-0-80ef08eb-069h   256m         13%    1870Mi          33%

fntlnz commented 4 years ago

@jaychan-q it's falcosecurity/falco:master to run from the latest git commit, and it's falcosecurity/falco:latest to run the latest tagged version.

@victortrac Thanks for testing - I just opened an issue regarding this and how we are working on tooling to help you help us tackling this problem https://github.com/falcosecurity/falco/issues/1204 - @shane-lawrence and @natalysheinin might be interested too.

shane-lawrence commented 4 years ago

I tried this with master today. It's using this version:

/usr/bin/falco --version
Falco version: 0.22.1-52+d0f4f7c
Driver version: 96bd9bc560f67742738eb7255aeb4d03046b8045

We're still dropping syscalls:

   output: Falco internal: syscall event drop. 23 system calls dropped in last second.
   output_fields: { [-]
     ebpf_enabled: 1
     n_drops: 23
     n_drops_buffer: 23
     n_drops_bug: 0
     n_drops_pf: 0
     n_evts: 27379
   }

Does this output indicate that 27379 syscalls were detected, and of those 23 were dropped?

jaychan-q commented 4 years ago

i am seeing the drops as well from the master image. {"output":"Falco internal: syscall event drop. 1 system calls dropped in last second.","output_fields":{"ebpf_enabled":"1","n_drops":"1","n_drops_buffer":"0","n_drops_bug":"0","n_drops_pf":"1","n_evts":"16858"},"priority":"Critical","rule":"Falco internal: syscall event drop","time":"2020-05-15T00:24:23.033865841Z"}

operatorequals commented 4 years ago

Hello guys. I am using falco in kops provisioned AWS EC2 instances, of 3 workers - 1 master, without k8saudit integration. I have exactly the same issue, here is a Kibana screenshot to show the volume:

I'm writing this to raise the issue that it probably is not GKE specific, as I have the same issue for a whole different setup (I can share setup details in case).

leodido commented 4 years ago

Thanks for the report Jhon.

Would you mind sharing info also about the other setup?

In case it needs to stay private you can share such info on slack with us.

On Wed, 20 May 2020 at 12:18, John Torakis notifications@github.com wrote:

Hello guys. I am using falco in kops provisioned AWS EC2 instances, of 3 workers - 1 master, without k8saudit integration. I have exactly the same issue, here is a Kibana screenshot to show the volume: [image: εικόνα] https://user-images.githubusercontent.com/6146328/82434884-0ac07600-9a83-11ea-9c81-a3abe388a9a9.png

I'm writing this to raise the issue that it probably is not GKE specific, as I have the same issue for a whole different setup (I can share setup details in case).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/falcosecurity/falco/issues/669#issuecomment-631384482, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA5J4YVV7RJP6LYSXQ23SLRSOU7PANCNFSM4HXZI6DA .

-- L.

operatorequals commented 4 years ago

Hello @leodido!

Sure, it's a 3 worker - 1 master cluster of k8s on EC2s provisioned through kops. Kubernetes version:

$ k version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T23:34:25Z", GoVersion:"go1.14.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.11", GitCommit:"d94a81c724ea8e1ccc9002d89b7fe81d58f89ede", GitTreeState:"clean", BuildDate:"2020-03-12T21:00:06Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}

Falco version:

$ k -n kube-system exec -ti falco-9zlmn -- falco --version
Falco version: 0.21.0

Default helm chart syscallEventDrops options. Same for resources. Falco pod usage is:

$ k top pods -n kube-system | grep falco
falco-9zlmn                                           11m          39Mi            
falco-hsdkj                                           9m           18Mi            
falco-nr4cx                                           13m          22Mi

If you need specific information, I can provide!

krisnova commented 4 years ago

So the helm chart being used threw me for a loop.

I opened https://github.com/helm/charts/pull/22481 to remove the falco stable chart in favor of https://github.com/falcosecurity/charts/pull/12 which updates the container image from 0.21.0 to 0.23.0.

This update was required for me to deploy to GKE.

I was getting the following error with 0.21.0

[nova@nova ~]$ k logs falco-xgg9l -f
* Setting up /usr/src links from host
* Mounting debugfs
Found kernel config at /proc/config.gz
* COS detected (build 12371.208.0), using cos kernel headers...
* Downloading https://storage.googleapis.com/cos-tools/12371.208.0/kernel-headers.tgz
* Extracting kernel sources
* Configuring kernel
* Trying to compile BPF probe falco-probe-bpf (falco-probe-bpf-latest-x86_64-4.19.109+-c49f5fbc35a5b76e9ca8d47acc2b1913.o)
make: *** /usr/src/falco-latest/bpf: No such file or directory.  Stop.
mv: cannot stat '/usr/src/falco-latest/bpf/probe.o': No such file or directory
* Trying to download precompiled BPF probe from https://s3.amazonaws.com/download.draios.com/stable/sysdig-probe-binaries/falco-probe-bpf-latest-x86_64-4.19.109%2B-c49f5fbc35a5b76e9ca8d47acc2b1913.o
curl: (22) The requested URL returned error: 404 Not Found
* Failure to find a BPF probe
Fri May 22 11:22:15 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:22:15 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:22:16 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:22:16 2020: Unable to load the driver. Exiting.
Fri May 22 11:22:16 2020: Runtime error: can't open BPF probe '/root/.falco/falco-probe-bpf.o': Errno 2. Exiting.

After making the change to the DaemonSet for 0.23.0 I have had Falco running with BPF in GKE without any dropped system calls in the logs.

[nova@nova stable]$ k logs -l app=falco -f
* Extracting kernel sources
* Configuring kernel
* Trying to compile the eBPF probe (falco_cos_4.19.109+_1.o)
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_4.19.109+_1.o
* eBPF probe located in /root/.falco/falco_cos_4.19.109+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Fri May 22 11:26:53 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:26:53 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:26:54 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:26:56 2020: Starting internal webserver, listening on port 8765
* Extracting kernel sources
* Configuring kernel
* Trying to compile the eBPF probe (falco_cos_4.19.109+_1.o)
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_4.19.109+_1.o
* eBPF probe located in /root/.falco/falco_cos_4.19.109+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Fri May 22 11:26:59 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:26:59 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:27:00 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:27:02 2020: Starting internal webserver, listening on port 8765
* Extracting kernel sources
* Configuring kernel
* Trying to compile the eBPF probe (falco_cos_4.19.109+_1.o)
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_4.19.109+_1.o
* eBPF probe located in /root/.falco/falco_cos_4.19.109+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Fri May 22 11:26:53 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:26:53 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:26:54 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:26:56 2020: Starting internal webserver, listening on port 8765
^C
[nova@nova stable]$ k get po
NAME          READY   STATUS    RESTARTS   AGE
falco-7kkpk   1/1     Running   0          18m
falco-9ndjk   1/1     Running   0          18m
falco-dkx2c   1/1     Running   0          19m
[nova@nova stable]$ k get no
NAME                                  STATUS   ROLES    AGE   VERSION
gke-nova-default-pool-4935e89d-cc6p   Ready    <none>   24m   v1.16.8-gke.15
gke-nova-default-pool-4935e89d-g8td   Ready    <none>   24m   v1.16.8-gke.15
gke-nova-default-pool-4935e89d-k6jq   Ready    <none>   24m   v1.16.8-gke.15
[nova@nova stable]$ k get ds -oyaml | grep "image:"
          image: docker.io/falcosecurity/falco:0.23.0
[nova@nova charts]$ k get cm -oyaml | grep -A 5 drops
      syscall_event_drops:
        actions:
          - log
          - alert
        rate: 0.03333
        max_burst: 10

Granted my pods have only been up for ~19 minutes. I will keep an eye on them over the next few weeks and will report any occurrences of

Falco internal: syscall event drop

in the hopes that sharing replication steps can help us fix this. This is a top priority.

krisnova commented 4 years ago

After the pods were successfully running without any drops in the logs for ~60 minutes I decided to try to run a small workload.

[nova@nova event-generator]$ k run events --image  krisnova/falco-event-generator:latest
pod/events created
[nova@nova event-generator]$ k get po
NAME          READY   STATUS              RESTARTS   AGE
events        0/1     ContainerCreating   0          2s
falco-7kkpk   1/1     Running             0          57m
falco-9ndjk   1/1     Running             0          57m
falco-dkx2c   1/1     Running             0          58m
[nova@nova event-generator]$ k logs -l app=falco
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_4.19.109+_1.o
* eBPF probe located in /root/.falco/falco_cos_4.19.109+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Fri May 22 11:26:59 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:26:59 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:27:00 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:27:02 2020: Starting internal webserver, listening on port 8765
11:27:02.275043000: Notice Privileged container started (user=<NA> command=container:8b95e4277435 k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-cc6p container=8b95e4277435 image=gke.gcr.io/kube-proxy-amd64:v1.16.8-gke.15) k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-cc6p container=8b95e4277435
11:27:02.563498000: Notice Privileged container started (user=root command=container:9cc49632aa15 k8s.ns=default k8s.pod=falco-7kkpk container=9cc49632aa15 image=falcosecurity/falco:0.23.0) k8s.ns=default k8s.pod=falco-7kkpk container=9cc49632aa15
12:18:48.157470217: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=5876)
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_4.19.109+_1.o
* eBPF probe located in /root/.falco/falco_cos_4.19.109+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Fri May 22 11:26:53 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:26:53 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:26:54 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:26:56 2020: Starting internal webserver, listening on port 8765
11:26:56.478515000: Notice Privileged container started (user=<NA> command=container:65258398bcf1 k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-k6jq container=65258398bcf1 image=gke.gcr.io/kube-proxy-amd64:v1.16.8-gke.15) k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-k6jq container=65258398bcf1
11:26:56.798178000: Notice Privileged container started (user=root command=container:10a38778b1ee k8s.ns=default k8s.pod=falco-9ndjk container=10a38778b1ee image=falcosecurity/falco:0.23.0) k8s.ns=default k8s.pod=falco-9ndjk container=10a38778b1ee
12:19:00.057431524: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=4902)
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_4.19.109+_1.o
* eBPF probe located in /root/.falco/falco_cos_4.19.109+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Fri May 22 11:26:53 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:26:53 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:26:54 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:26:56 2020: Starting internal webserver, listening on port 8765
11:26:56.121409000: Notice Privileged container started (user=<NA> command=container:4c6e998a6835 k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-g8td container=4c6e998a6835 image=gke.gcr.io/kube-proxy-amd64:v1.16.8-gke.15) k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-g8td container=4c6e998a6835
11:26:56.403484000: Notice Privileged container started (user=<NA> command=container:422045a95533 k8s.ns=default k8s.pod=falco-dkx2c container=422045a95533 image=falcosecurity/falco:0.23.0) k8s.ns=default k8s.pod=falco-dkx2c container=422045a95533
12:18:58.369831574: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=6256)

I was now able to replicate the bug.

[nova@nova event-generator]$ k run events --image  krisnova/falco-event-generator:latest
pod/events created
[nova@nova event-generator]$ k get po
NAME          READY   STATUS              RESTARTS   AGE
events        0/1     ContainerCreating   0          2s
falco-7kkpk   1/1     Running             0          57m
falco-9ndjk   1/1     Running             0          57m
falco-dkx2c   1/1     Running             0          58m
[nova@nova event-generator]$ k logs -l app=falco
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_4.19.109+_1.o
* eBPF probe located in /root/.falco/falco_cos_4.19.109+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Fri May 22 11:26:59 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:26:59 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:27:00 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:27:02 2020: Starting internal webserver, listening on port 8765
11:27:02.275043000: Notice Privileged container started (user=<NA> command=container:8b95e4277435 k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-cc6p container=8b95e4277435 image=gke.gcr.io/kube-proxy-amd64:v1.16.8-gke.15) k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-cc6p container=8b95e4277435
11:27:02.563498000: Notice Privileged container started (user=root command=container:9cc49632aa15 k8s.ns=default k8s.pod=falco-7kkpk container=9cc49632aa15 image=falcosecurity/falco:0.23.0) k8s.ns=default k8s.pod=falco-7kkpk container=9cc49632aa15
12:18:48.157470217: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=5876)
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_4.19.109+_1.o
* eBPF probe located in /root/.falco/falco_cos_4.19.109+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Fri May 22 11:26:53 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:26:53 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:26:54 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:26:56 2020: Starting internal webserver, listening on port 8765
11:26:56.478515000: Notice Privileged container started (user=<NA> command=container:65258398bcf1 k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-k6jq container=65258398bcf1 image=gke.gcr.io/kube-proxy-amd64:v1.16.8-gke.15) k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-k6jq container=65258398bcf1
11:26:56.798178000: Notice Privileged container started (user=root command=container:10a38778b1ee k8s.ns=default k8s.pod=falco-9ndjk container=10a38778b1ee image=falcosecurity/falco:0.23.0) k8s.ns=default k8s.pod=falco-9ndjk container=10a38778b1ee
12:19:00.057431524: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=4902)
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_4.19.109+_1.o
* eBPF probe located in /root/.falco/falco_cos_4.19.109+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Fri May 22 11:26:53 2020: Falco initialized with configuration file /etc/falco/falco.yaml
Fri May 22 11:26:53 2020: Loading rules from file /etc/falco/falco_rules.yaml:
Fri May 22 11:26:54 2020: Loading rules from file /etc/falco/falco_rules.local.yaml:
Fri May 22 11:26:56 2020: Starting internal webserver, listening on port 8765
11:26:56.121409000: Notice Privileged container started (user=<NA> command=container:4c6e998a6835 k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-g8td container=4c6e998a6835 image=gke.gcr.io/kube-proxy-amd64:v1.16.8-gke.15) k8s.ns=kube-system k8s.pod=kube-proxy-gke-nova-default-pool-4935e89d-g8td container=4c6e998a6835
11:26:56.403484000: Notice Privileged container started (user=<NA> command=container:422045a95533 k8s.ns=default k8s.pod=falco-dkx2c container=422045a95533 image=falcosecurity/falco:0.23.0) k8s.ns=default k8s.pod=falco-dkx2c container=422045a95533
12:18:58.369831574: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=6256)

It looks like the work in https://github.com/falcosecurity/falco/issues/1204 will be able to help us track down the concerns here to mitigate this problem. I am definitely following that thread closely as this seems to be the biggest issue folks are hitting right now.

Hope this helps

krisnova commented 4 years ago

Right now the course of action to help us identify where in the stack this issue is coming from if to use the work in https://github.com/falcosecurity/falco/issues/1204 to capture metrics about the stack experience the issue and share with us.

leodido commented 4 years ago

We (me @fntlnz and @leogr) just spent some time to perform tests while perturbing some parameters (keeping the CPU load conditions and the frequency of rules triggered as steady as possible):

ring buffer dimension (8MiB vs 16MiB)
eBPF buf size dimension (16MiB vs 32MiB)
stdout output plugin (true vs false)

Then running Falco with -v to have detailed statistics at the end and -A to include all the syscalls (that Falco usually ignores).

We observed that increasing the size of the ring buffer, the number of events processed per second decreases by a factor of 10. Thus making the ingestion phase slower.

We suspect that having 0 drops with a bigger ring buffer (but an ingestion throughput slower) makes Falco (userspace) able to keep up with the events since the ingestion/producer is slower.

Could you all do a simple try running Falco with stdout: false config, please?

We want also to verify that I/O does not impact (or how it impacts) this "drops" topic.

robotica72 commented 4 years ago

Im running into this as well - Was showing off Falco to my CyberSecurity team but I have more syscall drops than events that I was trying to show... I have another Demo on Wed - is there any potential fix or workaround for the short-term?

Im running on GKE on-prem, which for now is running on K8s 1.13.7 (One GKE version behind on this cluster).

leodido commented 4 years ago

Hey @robotica72 can you try disabling the stdout and see if it continues to drop or not?

jaychan-q commented 4 years ago

@leodido I have stdout set to false and i still see the drops.

stdout_output:
  enabled: false

fntlnz commented 4 years ago

I have some cycles to investigate this.

On GKE 1.16.8-gke.15.

This is what happens freshly created cluster.

After doing some activity in a container to generate some events, I notice the first drop after ~40min of no activity besides the default services.

This is exactly what I see in the comment from @kris-nova and others here.

16:33:31.992532000: Notice Privileged container started (user=root command=container:1daef9d81a30 k8s.ns=kube-system k8s.pod=kube-proxy-gke-falco-669-repro-default-pool-0076378a-nsk6 container=1daef9d81a30 image=gke.gcr.io/kube-proxy-amd64:v1.16.8-gke.15) k8s.ns=kube-system k8s.pod=kube-proxy-gke-falco-669-repro-default-pool-0076378a-nsk6 container=1daef9d81a30 k8s.ns=kube-system k8s.pod=kube-proxy-gke-falco-669-repro-default-pool-0076378a-nsk6 container=1daef9d81a30
16:33:32.214615000: Notice Privileged container started (user=<NA> command=container:a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa image=falcosecurity/falco:latest) k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa
16:34:09.682277235: Notice A shell was spawned in a container with an attached terminal (user=root k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa shell=bash parent=runc cmdline=bash terminal=34816 container_id=a44a2b9eecfa image=falcosecurity/falco) k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa
16:34:15.800127343: Error Package management process launched in container (user=root command=apt container_id=a44a2b9eecfa container_name=k8s_falco_falco-daemonset-wkx4x_default_84db0fa9-ee12-497a-b113-94aeaddc9159_0 image=falcosecurity/falco:latest) k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa
16:34:18.417231435: Error Package management process launched in container (user=root command=apt install procps container_id=a44a2b9eecfa container_name=k8s_falco_falco-daemonset-wkx4x_default_84db0fa9-ee12-497a-b113-94aeaddc9159_0 image=falcosecurity/falco:latest) k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa
16:34:21.681962517: Error Package management process launched in container (user=root command=apt update -y container_id=a44a2b9eecfa container_name=k8s_falco_falco-daemonset-wkx4x_default_84db0fa9-ee12-497a-b113-94aeaddc9159_0 image=falcosecurity/falco:latest) k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa
17:11:06.670570790: Warning Shell history had been deleted or renamed (user=root type=openat command=bash fd.name=/root/.bash_history name=/root/.bash_history path=<NA> oldpath=<NA> k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa) k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa k8s.ns=default k8s.pod=falco-daemonset-wkx4x container=a44a2b9eecfa
17:17:19.135840930: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=6250)

It does not not seem related to CPU/memory usage in the node, I have very idle nodes.

NAME                                             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
gke-falco-669-repro-default-pool-0076378a-3f7r   81m          8%     916Mi           34%       
gke-falco-669-repro-default-pool-0076378a-bcr0   95m          10%    836Mi           31%       
gke-falco-669-repro-default-pool-0076378a-nsk6   78m          8%     854Mi           32%

Considering that my nodes have 2.77 GB of allocatable memory and 940 mCPU

I'll dig deeper to identify the conditions that are generating the drop to give an answer to the why we are seeing this. I suspect a bug in the token bucket code.

fntlnz commented 4 years ago

I just noticed that the drop in my comment is related to invalid memory access n_drops_pf=1.

The report from @caquino has full buffer drops n_drops_buffer=4
The one from @victortrac is also in the buffer "n_drops_buffer":"11",
The one from @shane-lawrence is also in the buffer n_drops_buffer: 23

While,

the one by @jaychan-q is for memory corruption "n_drops_pf":"1",
same goes for @kris-nova n_drops_pf=1

So we have two different kinds of reports here, I think that the ones we want to investigate is the n_drops_buffer here first.

I believe the issue still exists even if the original comment from @caquino is for 0.15.3 after reading this comment from @victortrac

@victortrac there's anything non-default in your Falco config? Are you sending the outputs anywhere or have any other thing we need to be aware? What kind of load you have in your machine?

I suspect this not to be load related, I'm trying to gather as much info as possible to create a reproducer for the buffer drops myself.

victortrac commented 4 years ago

@fntlnz I've got JSON output and HTTP output enabled, sending to falcosidekick, being deployed from the the standard falco helm chart, which includes a lot of rules. I do have some additional rules like:

    - macro: sysctl_reload
      condition: (proc.cmdline = "sudo sysctl -a")

    - rule: Attempted su/sudo
      desc: sudo or su attempted
      condition: >
        proc.name in (userexec_binaries)
        and not sysctl_reload
      output: >
        A process ran sudo or su (user=%user.name %container.info
        process=%proc.name parent=%proc.pname cmdline=%proc.cmdline)
      priority: WARNING
      tags: [container, shell]

This triggers a warning on falco start about performance.

Otherwise, I've tried giving the falco pod more resources (2 CPUs and 2 GB RAM), but that didn't seem to have an effect. The clusters are test clusters with zero load, but my testing has been on small GKE nodes (2 cores, 7.5 GB RAM).

robotica72 commented 4 years ago

Hey @robotica72 can you try disabling the stdout and see if it continues to drop or not?

Hi - This may be a dumb question, but if Im only using FluentD in a default configuration, if I turn off stdout, I wont get any logs? Or should I disable stdout and log to a file?

fntlnz commented 4 years ago

My GKE cluster is still up and interestingly it's reporting a drop every hour of type invalid memory access.

17:17:25.928730193: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=4376)
18:18:05.880108709: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=5504)
19:18:26.564329835: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=5289)
20:19:26.617244218: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=5406)
21:20:26.365003699: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=4060)
22:21:25.756071706: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=3385)
23:22:26.459459508: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=6112)
00:23:26.045564944: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=3072)
01:24:25.983780603: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=3375)
02:25:25.771708452: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=3083)
03:15:49.975021136: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=2914)
03:26:26.294336741: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=8638)
04:27:25.675853752: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=4241)
05:27:48.695712386: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=5449)
06:27:49.904779656: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=4368)
07:27:51.162401931: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=7196)
08:27:52.386337742: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=6606)
09:27:53.481688011: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=5707)
10:27:54.907118987: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=1 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=6220)

Wasn't able to reproduce the buffer drop yet, will try what @victortrac reported as reproducer.

fntlnz commented 4 years ago

@victortrac what are the version of Falco you are currently using? What is the helm chart you are using? We recently migrated to https://github.com/falcosecurity/charts

@robotica72 if you turn off stdout you can still use file_output or syslog_output or http_output or grpc_output

robotica72 commented 4 years ago

@victortrac what are the version of Falco you are currently using? What is the helm chart you are using? We recently migrated to https://github.com/falcosecurity/charts

@robotica72 if you turn off stdout you can still use file_output or syslog_output or http_output or grpc_output

Thanks @fntlnz - I thought so, was just sanity checking :) Ill output to a logfile after my presentation today - I need to have some values in stdout just to show a group a Cyber people how it works with Splunk.

fntlnz commented 4 years ago

Thanks @robotica72 :) - please share with us any findings and if anything doesn't work open an issue !

fntlnz commented 4 years ago

@victortrac as you were pointing out, with your additional rules I do see a warning from Falco talking about the event type restriction.

proc.name in (userexec_binaries) and not sysctl_reload

         did not contain any evt.type restriction, meaning it will run for all event types.
         This has a significant performance penalty. Consider adding an evt.type restriction if possible.

Do you still experience the drops if you disable that rule?

falcosecurity / falco

Falco on GKE - dropped syscall events #669