mstemm closed this issue 7 years ago
More details on the deployment--the falco containers are not exactly the ones we create, they are ones based on https://github.com/phusion/baseimage-docker, which swaps out debian:unstable for the phusion base image. Probably doesn't change the cpu usage, though.
I tried out the attached ruleset with the workloads we use internally for performance testing, which do include cassandra, and I'm able to see a significant difference in CPU usage between this ruleset and the ruleset that comes with falco 0.5.0. I think the most likely culprit is the additional network rules. I'll do some more investigation to identify the specific rules.
It seems that it's not directly related to workload, and may not be related to the network rules, either. I generated a flame graph for a falco instance that was running this ruleset:
Note that all the time is spent in compare_full_aname(), which is related to this rule:
- macro: interactive
  condition: ((proc.aname=sshd and proc.name != sshd) or proc.name=systemd-logind or proc.name=login)
- rule: root running
  desc: log all root actions
  condition: evt.type = setuid and user.name = root and proc.name != sshd and interactive
  output: "Interactive root (%user.name %proc.name %evt.dir %evt.type %evt.args %fd.name)"
  priority: WARNING
This rule is disabled in the 0.5.0 ruleset. I can see that proc.aname could potentially be expensive, as it has to walk the entire process hierarchy. However, if I start a second falco instance with the same ruleset, the cpu usage is much lower and the flame graph looks very different:
I checked the size of both instances' thread tables and they were about the same size (~1900 threads), so it's not a matter of dropped procexit() events causing one instance to have a much larger set of threads to work through than the other.
I'll try to look at the contents of both hash tables to see if one of them is becoming degenerate with unbalanced buckets.
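As a rough illustration of that kind of check, here is a minimal sketch that reports the longest bucket chain of a hash table. It assumes the table exposes the `std::unordered_map` bucket interface (`bucket_count()`, `bucket_size()`); whether the actual thread table does is an assumption, and `BucketStats` and `bucket_stats` are hypothetical names used only for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>

// Hypothetical summary of how evenly a hash table spreads its elements.
struct BucketStats {
    std::size_t elements;       // total number of stored entries
    std::size_t buckets;        // number of buckets in the table
    std::size_t longest_chain;  // size of the fullest bucket
};

// Works for any container with the std::unordered_map bucket interface.
template <typename Map>
BucketStats bucket_stats(const Map& m) {
    BucketStats s{m.size(), m.bucket_count(), 0};
    for (std::size_t b = 0; b < m.bucket_count(); ++b) {
        s.longest_chain = std::max(
            s.longest_chain, static_cast<std::size_t>(m.bucket_size(b)));
    }
    return s;
}
```

A longest chain close to the element count would point to a degenerate table; a small one would rule that explanation out.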
It's actually an infinite loop in sinsp_filter_check_thread::compare_full_aname:
	// No id specified, search in all of the ancestors
	//
	for(j = 0; mt != NULL; mt = mt->get_parent_thread(), j++)
	{
		if(j > 0)
		{
			res = flt_compare(m_cmpop,
					  PT_CHARBUF,
					  (void*)mt->m_comm.c_str());
			if(res == true)
			{
				return true;
			}
		}
	}
The process state is malformed with a cycle between 4 processes: 21394 (cut) -> 21389 (mongostats2stat) -> 21421 (mongostats2stat) -> 21400 (sh) -> 21394 (cut, beginning of list)
None of those processes actually exist any longer, which explains why a second falco instance doesn't have the same cpu usage. I suspect this is related to dropped events + stale thread state + pid recycling.
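To illustrate how an ancestor walk can survive this kind of corrupted state, here is a minimal sketch that stops when it detects a cycle in the parent chain. It is not the sysdig code and not necessarily how the eventual fix works: `ThreadInfo`, `ThreadTable`, `parent_of`, and `ancestor_has_name` are hypothetical stand-ins, and the cycle check is a slow/fast pointer scheme chosen for this example.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a thread-table entry (not sinsp_threadinfo).
struct ThreadInfo {
    std::int64_t pid;
    std::int64_t ppid;
    std::string comm;
};

using ThreadTable = std::unordered_map<std::int64_t, ThreadInfo>;

// Returns the parent entry, or nullptr if the parent is not in the table.
const ThreadInfo* parent_of(const ThreadTable& table, const ThreadInfo* t) {
    auto it = table.find(t->ppid);
    return it == table.end() ? nullptr : &it->second;
}

// Returns true if some ancestor of `pid` is named `name`.
// `cur` advances one step per iteration and `slow` one step every other
// iteration; if `cur` catches up with `slow`, the chain is cyclic and every
// entry on it has already been compared, so the walk gives up.
bool ancestor_has_name(const ThreadTable& table, std::int64_t pid,
                       const std::string& name) {
    auto it = table.find(pid);
    if (it == table.end()) {
        return false;
    }
    const ThreadInfo* slow = &it->second;
    const ThreadInfo* cur = parent_of(table, slow);
    bool advance_slow = false;
    while (cur != nullptr) {
        if (cur->comm == name) {
            return true;
        }
        if (cur == slow) {
            return false;  // cycle detected: stop instead of spinning
        }
        cur = parent_of(table, cur);
        if (advance_slow) {
            slow = parent_of(table, slow);
        }
        advance_slow = !advance_slow;
    }
    return false;
}
```

With the four-process cycle described above, a lookup for a name that never appears (such as sshd) returns false after at most a couple of laps rather than looping forever.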
FYI, here's a trace file that can be used to reproduce the problem: parent_state_loop.zip
It was created by changing the scap file writer to modify the parent process of a given process to one of its children.
This was fixed in https://github.com/draios/sysdig/pull/753.
@mstemm awesome! Great to hear. Sorry I've been MIA. Will pull down the new version and try it out.
We don't have a new release yet, but that's coming soon. In the meantime, you can try one of the daily dev builds.
@pgray reported high falco cpu usage with the attached falco rules file: falco_rules.yaml.zip
Compared to 0.5.0, a few rules have been disabled and a few programs have been added to the lists of programs that are expected to do things like spawn shells. The big change is that all the network-related rules (XXX unexpected network inbound/outbound traffic) have been uncommented.
We should double-check the network-related rules to make sure they're efficient.