falcosecurity / falco

Cloud Native Runtime Security
https://falco.org
Apache License 2.0

Falco is blocked when a large list of domain names is used in a rule #2680

Open emilgelman opened 1 year ago

emilgelman commented 1 year ago

Describe the bug

On startup, Falco appears to completely hang when using a rule that includes a large list of domains as a condition. Throughout this period, it appears as though Falco is entirely frozen, with no validation of other rules occurring and no logging of stats.

How to reproduce it

Overwrite the default rule Outbound Connection to C2 Servers with the following rule configuration:

- rule: Outbound Connection to C2 Servers
  enabled: true

- list: c2_server_fqdn_list
  append: true
  items:
    <insert a list of domains>

A list of 1000 domains results in a hang of about 5 minutes in my setup.
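To make the reproduction easier to repeat with different list sizes, the override file can be generated by a small script. This is only a sketch: the helper names `synthetic_domains` and `build_override` and the `example.invalid` suffix are made up for illustration, and synthetic names will not necessarily fail the same way real C2 domains do (the hang reported here involved lookups that timed out).

```python
def synthetic_domains(count, suffix="example.invalid"):
    """Return `count` placeholder FQDNs (hypothetical names, not real IOCs)."""
    return [f"host{i:04d}.{suffix}" for i in range(count)]


def build_override(domains,
                   list_name="c2_server_fqdn_list",
                   rule_name="Outbound Connection to C2 Servers"):
    """Render the rule override from this issue with one list item per line."""
    lines = [
        f"- rule: {rule_name}",
        "  enabled: true",
        "",
        f"- list: {list_name}",
        "  append: true",
        "  items:",
    ]
    lines.extend(f"    - {domain}" for domain in domains)
    return "\n".join(lines) + "\n"
```

Writing the output of `build_override(synthetic_domains(1000))` to a local rules file (for example the `/etc/falco/falco_rules.local.yaml` path listed in the config below) and varying the count should show how the startup delay scales with the number of entries.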

falco.yaml

```yaml
base_syscalls:
  custom_set: []
  repair: false
buffered_outputs: false
file_output:
  enabled: true
  filename: ./events.txt
  keep_alive: true
grpc:
  bind_address: unix:///run/falco/falco.sock
  enabled: true
  threadiness: 0
grpc_output:
  enabled: true
http_output:
  ca_bundle: ""
  ca_cert: ""
  ca_path: /etc/ssl/certs
  enabled: false
  insecure: false
  url: ""
  user_agent: falcosecurity/falco
json_include_output_property: true
json_include_tags_property: true
json_output: true
libs_logger:
  enabled: false
  severity: debug
load_plugins: []
log_level: info
log_stderr: true
log_syslog: true
metadata_download:
  chunk_wait_us: 1000
  max_mb: 100
  watch_freq_sec: 1
metrics:
  convert_memory_to_mb: true
  enabled: false
  include_empty_values: false
  interval: 1h
  kernel_event_counters_enabled: true
  libbpf_stats_enabled: true
  output_rule: true
  resource_utilization_enabled: true
modern_bpf:
  cpus_for_each_syscall_buffer: 2
output_timeout: 2000
plugins:
- init_config: null
  library_path: libk8saudit.so
  name: k8saudit
  open_params: http://:9765/k8s-audit
- library_path: libcloudtrail.so
  name: cloudtrail
- init_config: ""
  library_path: libjson.so
  name: json
priority: debug
program_output:
  enabled: false
  keep_alive: false
  program: 'jq ''{text: .output}'' | curl -d @- -X POST https://hooks.slack.com/services/XXX'
rules_file:
- /etc/falco/falco_rules.yaml
- /etc/falco/falco_rules.local.yaml
- /etc/falco/rules.d
stdout_output:
  enabled: true
syscall_buf_size_preset: 4
syscall_drop_failed_exit: false
syscall_event_drops:
  actions:
  - log
  - alert
  max_burst: 1000000000
  rate: 1000000000
  simulate_drops: false
  threshold: 0.1
syscall_event_timeouts:
  max_consecutives: 1000
syslog_output:
  enabled: true
time_format_iso_8601: false
watch_config_files: true
webserver:
  enabled: true
  k8s_healthz_endpoint: /healthz
  listen_port: 8765
  ssl_certificate: /etc/falco/falco.pem
  ssl_enabled: false
  threadiness: 0
```

Expected behaviour

According to this, domain name resolution should happen in a background thread and should not affect other rules.

Environment

LucaGuerra commented 1 year ago

cc @incertum @loresuso. As discussed in the last maintainer call, this is one of the problems we see in our DNS parsing. Would either of you be able to summarize your thoughts in a tracking issue to better shape our DNS improvement work?

incertum commented 1 year ago

@loresuso can summarize some of his exploration re a possible refactor of DNS lookups.

In addition, I would love to separate DNS parsing issues from the cost of matching against a very long list (1000 items), which is probably more items than we typically handle.

@emilgelman would you be up for trying a toy example where you match against an equally long list of IPs, using the exact same rule expression, to check whether there are any delays? If there are none, that would be a clearer hint that DNS is the sole source of trouble, rather than matching so many items in one rule.

The reason I am asking is my experience with Big Data queries, where you cannot naively match against a very long list of IOCs; you have to use other tricks / algorithms (Aho-Corasick is one example) to make that work. That is something we do not yet support.
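For readers unfamiliar with the Aho-Corasick algorithm mentioned above, here is a minimal, self-contained sketch. It is a generic textbook-style version for illustration, not Falco code (the comment above notes Falco does not support this), and the class name `AhoCorasick` is just an illustrative choice. It finds every occurrence of any pattern in a text in a single pass; note that it does substring matching, so exact or suffix-only domain matching would need an extra check on top.

```python
from collections import deque


class AhoCorasick:
    """Multi-pattern substring matcher built on a trie with failure links."""

    def __init__(self, patterns):
        self.goto = [{}]   # goto[node][char] -> child node
        self.fail = [0]    # failure link of each node
        self.out = [[]]    # patterns recognised when reaching each node
        for pattern in patterns:
            self._add(pattern)
        self._build_links()

    def _add(self, pattern):
        node = 0
        for ch in pattern:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[node][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            node = nxt
        self.out[node].append(pattern)

    def _build_links(self):
        # Breadth-first walk; depth-1 nodes keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return a list of (start_index, pattern) for every match in text."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pattern in self.out[node]:
                hits.append((i - len(pattern) + 1, pattern))
        return hits
```

The automaton is built once from the IOC list, after which each scanned string costs time proportional to its length plus the number of matches, independent of how many patterns are loaded. For exact matches against a list of IPs, a plain hash set lookup would already give the same independence from list size.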

emilgelman commented 1 year ago

Hi @incertum. I've tried just that, with a list of 8k IP addresses to match against.

My rule configuration is:

- rule: Outbound Connection to C2 Servers
  enabled: true

- list: c2_server_ip_list
  append: true
  items: [
  <list of 8k IP addresses>
  ]

There were no delays during initialization. Falco boots up and works immediately as expected, both for the Outbound Connection to C2 Servers rule and other rules.

Let me provide some additional context. In the same rule mentioned above, when I use the c2_server_fqdn_list list, Falco fails to resolve some of the names, which causes timeouts. As soon as Falco starts, I see errors in coredns. For instance: [ERROR] plugin/errors: 2 <domain>. A: read udp 10.244.x.x:40446->168.63.x.x:53: i/o timeout

I assume that the combination of having a list with 1k records and some of the records failing to resolve is causing the delay.

The part I don't completely understand is why it blocks Falco's main thread. I assume that the main thread is blocked because other rules aren't being validated while DNS is being resolved. For example, trying to kubectl exec into a pod doesn't raise an alert.
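To illustrate the distinction being asked about, here is a small sketch contrasting lookups done on the calling thread with lookups done on a worker thread. It is not how Falco is implemented; `resolve_blocking` and `BackgroundResolver` are hypothetical names, and the resolver is injectable so the behaviour can be tried without real DNS. With the blocking variant, every name that times out adds up to one full resolver timeout to the delay before the caller can do anything else.

```python
import socket
import threading


def resolve_blocking(names, resolver=socket.gethostbyname):
    """Resolve names one by one on the calling thread (caller waits for all)."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror and timeouts are OSError subclasses
            results[name] = None
    return results


class BackgroundResolver:
    """Resolve names on a worker thread; the caller polls for partial results."""

    def __init__(self, names, resolver=socket.gethostbyname):
        self._names = list(names)
        self._resolver = resolver
        self._results = {}
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()  # returns immediately; lookups continue in background

    def _run(self):
        for name in self._names:
            try:
                address = self._resolver(name)
            except OSError:
                address = None
            with self._lock:
                self._results[name] = address

    def snapshot(self):
        """Return a copy of whatever has been resolved so far."""
        with self._lock:
            return dict(self._results)

    def join(self, timeout=None):
        self._thread.join(timeout)
```

In the background variant the caller can keep processing events and consult `snapshot()` for names resolved so far, so slow or failing lookups delay only the entries that depend on them.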

incertum commented 1 year ago

Thank you for confirming so quickly. The current implementation is inefficient and as mentioned above @loresuso has done amazing research to come up with better ways. I hope we can schedule starting work in that direction soon. My personal recommendation would be to invest in a major refactor / the new approach. One downside is that it will take some time. Let's wait to hear from Lorenzo.

poiana commented 9 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

Andreagit97 commented 9 months ago

/remove-lifecycle stale

poiana commented 6 months ago

/lifecycle stale

Andreagit97 commented 6 months ago

/remove-lifecycle stale

poiana commented 3 months ago

/lifecycle stale

Andreagit97 commented 3 months ago

/remove-lifecycle stale

poiana commented 2 weeks ago

/lifecycle stale

Andreagit97 commented 2 weeks ago

/remove-lifecycle stale