falcosecurity / charts

Community managed Helm charts for running Falco with Kubernetes

Falco exporter gRPC error invalid UTF-8 #619

Closed: nc-pnan closed this issue 2 months ago

nc-pnan commented 8 months ago

Describe the bug

We are deploying Falco with Falcosidekick and falco-exporter using these Helm charts as DaemonSets, creating Falco instances on 3 nodes. On 2 nodes everything runs without issues, but on the 3rd node the falco-exporter pod keeps failing into a CrashLoopBackOff state. Inspecting the log of the falco-exporter container, this is the output containing the error message:

2024/02/08 09:27:08 connecting to gRPC server at unix:///run/falco/falco.sock (timeout 2m0s)                                                        
2024/02/08 09:27:08 listening on http://0.0.0.0:9376/metrics                                                                                        
2024/02/08 09:27:08 connected to gRPC server, subscribing events stream                                                                             
2024/02/08 09:27:08 ready                                                                                                                           
2024/02/08 09:27:09 gRPC: rpc error: code = Internal desc = grpc: failed to unmarshal the received message string field contains invalid UTF-8 

We get the following error from the Falco pod itself:

[libprotobuf ERROR google/protobuf/wire_format_lite.cc:577] String field 'falco.outputs.response.OutputFieldsEntry.value' contains invalid UTF-8 data when serializing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes.
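
For context (an editorial note, not from the report): protobuf string fields must carry valid UTF-8, so if Falco writes raw non-UTF-8 bytes into an output field, serialization fails exactly like this. A minimal shell illustration of such bytes, using iconv as a stand-in validator:

# 0xff 0xfe is not a valid UTF-8 sequence; iconv rejects it, just as
# protobuf's string serializer does (exact error wording varies by platform).
printf '\xff\xfe' | iconv -f UTF-8 -t UTF-8

This is also why the error message suggests the bytes type, which accepts arbitrary octets.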

We have tried redeploying the charts several times, and it is always the instance on one specific node that fails. We have not been able to figure out the issue on our end, since all nodes should be configured identically.
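
A hedged way to pinpoint the failing instance and its node (the namespace is taken from the report below; the label selector is an assumption based on the chart's usual labeling conventions):

kubectl get pods -n monitoring -o wide -l app.kubernetes.io/name=falco-exporter
kubectl logs -n monitoring <failing-exporter-pod> --previous

The -o wide output shows which node each pod landed on, and --previous retrieves the log of the last crashed container.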

How to reproduce it

Deploy the Falco, Falcosidekick, and falco-exporter charts with this umbrella chart and values.yaml configuration to an AKS cluster running 3 nodes:

Chart.yaml:

annotations:
  category: Analytics
apiVersion: v2
appVersion: 0.37.0
name: falco
description: Falco is a Cloud Native Runtime Security tool designed to detect anomalous activity
dependencies:
  - name: falco
    version: 4.1.0
    repository: "https://falcosecurity.github.io/charts/"
  - name: falcosidekick
    version: 0.7.11
    condition: falcosidekick.enabled
    repository: "https://falcosecurity.github.io/charts/"
  - name: falco-exporter
    version: 0.9.9
    repository: "https://falcosecurity.github.io/charts/"
keywords:
  - monitoring
  - security
  - alerting
sources:
  - https://github.com/falcosecurity/falco
  - https://github.com/falcosecurity/charts
  - https://github.com/falcosecurity/charts/tree/master/falco
  - https://github.com/falcosecurity/charts/tree/master/falcosidekick
  - https://github.com/falcosecurity/charts/tree/master/falco-exporter
version: 0.2.0

values.yaml:

falcosidekick:
  enabled: true
  config:
    alertmanager:
      hostport: "http://alertmanager-operated.monitoring.svc.cluster.local:9093"
      endpoint: "/api/v1/alerts"
      minimumpriority: "error"
      expireafter: ""
      mutualtls: false
      checkcert: false
      extralabels: "alertname:Falco"

falco:
  driver:
    kind: modern-bpf
    modernEbpf:
      leastPrivileged: true
  podSecurityContext:
    securityContext:
      privileged: true
  podPriorityClassName: priority-class-daemonsets
  resources:
    requests:
      cpu: 100m
      memory: 254Mi
    limits:
      memory: 1024Mi
  falco:
    json_output: true
    http_output:
      enabled: true
      url: "http://falco-falcosidekick:2801/"
    grpc:
      enabled: true
    grpc_output:
      enabled: true

falco-exporter:
  podPriorityClassName: priority-class-daemonsets
  prometheusRules:
    enabled: true
  serviceMonitor:
    enabled: true
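
Assuming the umbrella chart layout above, deploying it would look roughly like this (the release name is an assumption; the monitoring namespace is taken from the observations below):

helm dependency update .
helm install falco . --namespace monitoring -f values.yaml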

Expected behaviour

We expect falco-exporter to be running on all three nodes.

Screenshots

None.

Environment

- System info:

Falco version: 0.37.0 (x86_64)
Falco initialized with configuration file: /etc/falco/falco.yaml
System info: Linux version 5.15.138.1-4.cm2 (root@CBL-Mariner) (gcc (GCC) 11.2.0, GNU ld (GNU Binutils) 2.37) #1 SMP Thu Nov 30 21:48:10 UTC 2023
Loading rules from file /etc/falco/falco_rules.yaml
{
  "machine": "x86_64",
  "nodename": "falco-dnncw",
  "release": "5.15.138.1-4.cm2",
  "sysname": "Linux",
  "version": "#1 SMP Thu Nov 30 21:48:10 UTC 2023"
}


- Cloud provider or hardware configuration:
Azure, AKS
Kubernetes version 1.28.3

- OS:

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian



- Kernel:
Linux falco-dnncw 5.15.138.1-4.cm2 #1 SMP Thu Nov 30 21:48:10 UTC 2023 x86_64 GNU/Linux
- Installation method:
Helm chart install (Kubernetes)

Additional context

Some additional observations:

- If we spin up another (4th) node, the same issue appears on that node as well.
- The 2 nodes where the exporter is working happen to be the nodes hosting instances of Prometheus (either the Thanos Prometheus pod or the Prometheus pod), since we are running the Thanos Prometheus operator chart.
- All pods are running in the "monitoring" namespace.

alacuku commented 8 months ago

@nc-pnan, does Falco log which rule is triggered when the failure occurs? Any info on how to reproduce would be helpful.

nc-pnan commented 8 months ago

@alacuku I unfortunately don't have much more information on how to reproduce it beyond this, since the cluster it was deployed to is fairly extensive. But if there are any specifics you are interested in, please let me know.

The only triggered rule I can find currently is this one:

{"hostname":"falco-wm9ks","output":"13:09:22.412554500: Notice Unexpected connection to K8s API Server from container (connection=10.244.1.126:45008->10.16.0.1:443 lport=443 rport=45008 fd_type=ipv4 fd_proto=fd.l4proto evt_type=connect user= user_uid=4294967295 user_loginuid=-1 process=<NA> proc_exepath= parent=<NA> command=<NA> terminal=0 container_id= container_image=<NA> container_image_tag=<NA> container_name=<NA> k8s_ns=<NA> k8s_pod_name=<NA>)","priority":"Notice","rule":"Contact K8S API Server From Container","source":"syscall","tags":["T1565","container","k8s","maturity_stable","mitre_discovery","network"],"time":"2024-02-08T13:09:22.412554500Z", "output_fields": {"container.id":"","container.image.repository":null,"container.image.tag":null,"container.name":null,"evt.time":1707397762412554500,"evt.type":"connect","fd.lport":443,"fd.name":"10.244.1.126:45008->10.16.0.1:443","fd.rport":45008,"fd.type":"ipv4","k8s.ns.name":null,"k8s.pod.name":null,"proc.cmdline":"<NA>","proc.exepath":"","proc.name":"<NA>","proc.pname":null,"proc.tty":0,"user.loginuid":-1,"user.name":"","user.uid":4294967295}}
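
A hedged way to pull such events out of the Falco container log for closer inspection (the pod name is the one from the event above; assumes Falco's stdout JSON output and that jq is available):

kubectl logs -n monitoring falco-wm9ks -c falco | grep '^{' | jq '.output_fields'

This filters the log down to JSON event lines and shows only their output fields, which is the part of the message (falco.outputs.response.OutputFieldsEntry.value) the protobuf error points at.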

However, we did also get alerts on FalcoExporterAbsent, but it is currently not being triggered for some reason, even though the exporter is in a CrashLoopBackOff state.

name: FalcoExporterAbsent
expr: absent(up{job="falco-falco-exporter"})
for: 10m
labels:
  prometheus: monitoring/prometheus-default-prometheus
  prometheus_replica: prometheus-prometheus-default-prometheus-0
  severity: critical
annotations:
  description: No metrics are being scraped from falco. No events will trigger any alerts.
  summary: Falco Exporter has disappeared from Prometheus service discovery.
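
A plausible explanation for the silent alert (an inference, not confirmed in the thread): absent(up{job="falco-falco-exporter"}) only returns a value when no up series exists for the job at all, and with two of the three exporters still healthy the series does exist, so the alert never fires. A hedged way to check the series directly against the Prometheus HTTP API (the address is hypothetical):

curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="falco-falco-exporter"}'

Depending on how the ServiceMonitor discovers targets, a per-instance expression such as up{job="falco-falco-exporter"} == 0 may catch a single crashing exporter, at the cost of firing once per instance.
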
poiana commented 5 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana commented 4 months ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

Andreagit97 commented 4 months ago

/remove-lifecycle rotten

Hey! This should be fixed in the latest Falco release, 0.38.0! This should be the fix: https://github.com/falcosecurity/libs/pull/1800
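
For anyone affected, a sketch of picking up the fix with the umbrella chart from the report (the exact chart version is deliberately left open; any falco dependency whose appVersion is 0.38.0 or later should do):

# bump the falco dependency version in Chart.yaml, then:
helm dependency update .
helm upgrade falco . --namespace monitoring -f values.yaml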

leogr commented 2 months ago

> Hey! This should be fixed in the latest Falco release 0.38.0! this should be the fix falcosecurity/libs#1800

This has been fixed by 0.38 AFAIK. So,

/close

poiana commented 2 months ago

@leogr: Closing this issue.

In response to [this](https://github.com/falcosecurity/charts/issues/619#issuecomment-2315279538):

> > Hey! This should be fixed in the latest Falco release 0.38.0!
> > this should be the fix [falcosecurity/libs#1800](https://github.com/falcosecurity/libs/pull/1800)
>
> This has been fixed by 0.38 AFAIK. So,
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.