falcosecurity / falco

Cloud Native Runtime Security
https://falco.org
Apache License 2.0

High CPU usage with slightly customized ruleset + enabled network services #208

Closed mstemm closed 7 years ago

mstemm commented 7 years ago

@pgray reported high falco cpu usage with the attached falco rules file: falco_rules.yaml.zip

Compared to 0.5.0, a few rules have been disabled and a few programs have been added to the lists of programs that are expected to do things like spawn shells. The big change is that all the network-related rules (XXX unexpected network inbound/outbound traffic) have been uncommented.

We should double-check the network-related rules to make sure they're efficient.

mstemm commented 7 years ago

More details on the deployment--the falco containers are not exactly the ones we create, they are ones based on https://github.com/phusion/baseimage-docker, which swaps out debian:unstable for the phusion base image. Probably doesn't change the cpu usage, though.

mstemm commented 7 years ago

I tried out the attached ruleset with the workloads we use internally for performance testing, which do include cassandra, and I'm able to see a significant difference in CPU usage between this ruleset and the ruleset that comes with falco 0.5.0. I think the most likely culprit is the additional network rules. I'll do some more investigation to identify the specific rules.

mstemm commented 7 years ago

It seems that it's not directly related to workload, and may not be related to the network rules, either. I generated a flame graph for a falco instance that was running this ruleset:

[Screenshot: flame graph of the falco instance with high CPU usage (2017-02-10, 1:08:18 pm)]

Note that all the time is spent in compare_full_aname(), which is related to this rule:

- macro: interactive
  condition: ((proc.aname=sshd and proc.name != sshd) or proc.name=systemd-logind or proc.name=login)

- rule: root running
  desc: log all root actions
  condition: evt.type = setuid and user.name = root and proc.name != sshd and interactive
  output: "Interactive root (%user.name %proc.name %evt.dir %evt.type %evt.args %fd.name)"
  priority: WARNING

This rule is disabled in the 0.5.0 ruleset. I can see that proc.aname could potentially be expensive, as it has to walk the entire process hierarchy. However, if I start a second falco instance with the same ruleset, the cpu usage is much lower and the flame graph looks very different:

[Screenshot: flame graph of the second falco instance (2017-02-10, 1:08:35 pm)]

I checked the size of both instances' thread tables and they were about the same (~1900 threads), so it's not a matter of dropped procexit() events causing one instance to have a much larger set of threads to work through than the other.

I'll try to look at the contents of both hash tables to see if one of them is becoming degenerate with unbalanced buckets.
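For reference, a rough sketch of what such a bucket check could look like, assuming the thread table behaves like a std::unordered_map (an assumption for illustration; BucketStats and bucket_stats are made-up names, not sysdig code). It only uses the standard bucket_count() and bucket_size() members:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical summary of how entries are spread across hash buckets.
struct BucketStats {
    std::size_t buckets;  // total number of buckets
    std::size_t empty;    // buckets holding no entries
    std::size_t longest;  // size of the fullest bucket
};

// Walks every bucket of an unordered container and records the distribution.
// A "degenerate" table would show a very large `longest` relative to
// size() / bucket_count().
template <typename Map>
BucketStats bucket_stats(const Map& m) {
    BucketStats s{m.bucket_count(), 0, 0};
    for (std::size_t i = 0; i < m.bucket_count(); ++i) {
        const std::size_t n = m.bucket_size(i);
        if (n == 0) {
            ++s.empty;
        }
        if (n > s.longest) {
            s.longest = n;
        }
    }
    return s;
}
```

If `longest` stays small for both instances, unbalanced buckets are unlikely to explain the difference in cpu usage.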

mstemm commented 7 years ago

It's actually an infinite loop in sinsp_filter_check_thread::compare_full_aname:

        // No id specified, search in all of the ancestors
        //
        for(j = 0; mt != NULL; mt = mt->get_parent_thread(), j++)
        {
                if(j > 0)
                {
                        res = flt_compare(m_cmpop,
                                PT_CHARBUF,
                                (void*)mt->m_comm.c_str());

                        if(res == true)
                        {
                                return true;
                        }
                }
        }

The process state is malformed with a cycle between 4 processes: 21394 (cut) -> 21389 (mongostats2stat) -> 21421 (mongostats2stat) -> 21400 (sh) -> 21394 (cut, beginning of list)

None of those processes actually exist any longer, which explains why a second falco instance doesn't have the same cpu usage. I suspect this is related to dropped events + stale thread state + pid recycling.
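To illustrate the failure mode, here is a minimal, self-contained sketch of an ancestor walk that cannot spin forever on a cyclic parent chain. It is not the sysdig code and not necessarily how the eventual fix works: ThreadInfo, ThreadTable and ancestor_has_name are hypothetical stand-ins, and the walk simply gives up when it sees a pid for the second time.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical, simplified stand-in for a thread table entry.
struct ThreadInfo {
    int64_t pid;
    int64_t ppid;      // parent pid
    std::string comm;  // command name
};

using ThreadTable = std::unordered_map<int64_t, ThreadInfo>;

// Returns true if any strict ancestor of `pid` has command name `name`.
// The walk stops when a parent is missing from the table or when a pid
// repeats, which indicates a cycle in the (malformed) parent state.
bool ancestor_has_name(const ThreadTable& table, int64_t pid,
                       const std::string& name) {
    auto it = table.find(pid);
    if (it == table.end()) {
        return false;
    }
    std::unordered_set<int64_t> visited{pid};
    int64_t cur = it->second.ppid;
    while (true) {
        if (!visited.insert(cur).second) {
            return false;  // already seen this pid: cycle, stop walking
        }
        auto parent = table.find(cur);
        if (parent == table.end()) {
            return false;  // end of the known ancestor chain
        }
        if (parent->second.comm == name) {
            return true;
        }
        cur = parent->second.ppid;
    }
}
```

With the four-process cycle above loaded into such a table, a search for an ancestor named sshd returns false after visiting each pid once instead of looping. The visited set costs extra memory per lookup; a fixed cap on the number of hops would be a cheaper alternative with the same termination guarantee.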

mstemm commented 7 years ago

FYI, here's a trace file that can be used to reproduce the problem: parent_state_loop.zip

It was created by changing the scap file writer to modify the parent process of a given process to one of its children.

mstemm commented 7 years ago

This was fixed in https://github.com/draios/sysdig/pull/753.

pgray commented 7 years ago

@mstemm awesome! Great to hear. Sorry I've been MIA. Will pull down the new version and try it out.

mstemm commented 7 years ago

We don't have a new release yet, but that's coming soon. In the meantime, you can try one of the daily dev builds.