Closed: josacar closed this issue 4 years ago
Also we see this in the syslog:
Jul 29 07:55:04 int-xxx.yyy.zz falco: 07:55:04.291501396: Critical Falco internal: syscall event drop. 12723 system calls dropped in last second.(ebpf_enabled=0 n_drops=12723 n_drops_buffer=12719 n_drops_bug=0 n_drops_pf=4 n_evts=165459)
Jul 29 07:55:05 int-xxx.yyy.zz falco: Falco internal: syscall event drop. 17492 system calls dropped in last second.
Jul 29 07:55:05 int-xxx.yyy.zz falco: 07:55:05.291860123: Critical Falco internal: syscall event drop. 17492 system calls dropped in last second.(ebpf_enabled=0 n_drops=17492 n_drops_buffer=17492 n_drops_bug=0 n_drops_pf=0 n_evts=137234)
Jul 29 07:55:07 int-xxx.yyy.zz falco: Falco internal: syscall event drop. 12107 system calls dropped in last second.
Jul 29 07:55:07 int-xxx.yyy.zz falco: 07:55:06.291860611: Critical Falco internal: syscall event drop. 12107 system calls dropped in last second.(ebpf_enabled=0 n_drops=12107 n_drops_buffer=12104 n_drops_bug=0 n_drops_pf=3 n_evts=222272)
Jul 29 07:55:07 int-xxx.yyy.zz falco: Falco internal: syscall event drop. 61394 system calls dropped in last second.
Jul 29 07:55:07 int-xxx.yyy.zz falco: 07:55:07.292983713: Critical Falco internal: syscall event drop. 61394 system calls dropped in last second.(ebpf_enabled=0 n_drops=61394 n_drops_buffer=61394 n_drops_bug=0 n_drops_pf=0 n_evts=130288)
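As a side note, the per-second drop counts can be pulled out of these syslog lines with standard tools. A minimal sketch, where the sample line is copied from the logs above and the grep pattern assumes this exact `n_drops=N` key format:

```shell
# Sample line copied verbatim from the syslog output above.
line='Jul 29 07:55:04 int-xxx.yyy.zz falco: 07:55:04.291501396: Critical Falco internal: syscall event drop. 12723 system calls dropped in last second.(ebpf_enabled=0 n_drops=12723 n_drops_buffer=12719 n_drops_bug=0 n_drops_pf=4 n_evts=165459)'
# Extract the total drop count; the pattern matches n_drops= but not
# n_drops_buffer= / n_drops_pf=, so only the total is printed.
echo "$line" | grep -o 'n_drops=[0-9]*' | cut -d= -f2
# prints 12723
```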
Looks like version 0.15.3 is not affected: no memory leak in 30 minutes.
I need a few things to help track this down. Would you please provide:
My initial tests show that the memory is stable, so I need a bit more info to dig further and try to recreate it.
/assign @fntlnz
@josacar is there any way we can debug this together? 👼 I wasn't able to reproduce it, and it would be helpful to see the problem live in order to fix it.
I'm fntlnz in the Sysdig Slack; if you are not there yet, you can find a subscribe link here. I wasn't able to find you.
Thanks for taking the time to let us know and for contributing this issue!
/triage needs-information
/triage not-reproducible
@fntlnz What's your timezone? Mine is CEST. I think I can isolate a server and do a call with you.
Ok, @josacar and I spent a couple of hours together today to get more information on this.
I can confirm that I was able to observe the leak on their systems: roughly every 5 minutes about 50MB of memory is added to the heap of the falco process.
I think we can rule out a problem in the kernel module itself.
Here's a gprof of the falco process running in their machines.
Now Jose has started a falco process from a binary with debugging symbols; they will make it core dump after a couple of hours so that we can analyze it.
We tried to use valgrind on it but there were errors with the lua engine. Will try it again if the core dump method doesn't produce good results.
@josacar I found out why valgrind didn't work in your environment to profile Falco.
New valgrind versions ignore the MAP_32BIT flag in mmap; LuaJIT uses it, and Falco uses LuaJIT.
LuaJIT has a build-time define to compile it with Valgrind support:
-DLUAJIT_USE_VALGRIND
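For reference, a sketch of rebuilding LuaJIT with that define enabled. The source path is hypothetical (adjust it to wherever your Falco build tree vendors LuaJIT); passing extra defines through XCFLAGS is LuaJIT's standard build hook:

```shell
# Hypothetical path to the LuaJIT sources inside the Falco build tree.
cd build/luajit-prefix/src/luajit/src
make clean
# LUAJIT_USE_VALGRIND makes LuaJIT cooperate with Valgrind's memory tracking.
make XCFLAGS='-DLUAJIT_USE_VALGRIND'
```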
Here's a representative graph from today's trials with @josacar
Every line in the graph represents a machine running Falco; all the lines under 100M are running Falco 0.15.0, while the brown one with the 400M spike is running 0.16.0.
0.16.0 was started just today, the old 0.15.0 had been running for a while.
All the configurations are the same, the only difference is the falco version, the systems are provisioned using a set of reproducible scripts.
Correction: @josacar reports they see the leak in 0.15.3 too, but it's less frequent than in 0.16.0.
@josacar at this point I think that we need to see a valgrind callgrind.
Callgrind
valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes ./userspace/falco/falco -c /source/falco/falco.yaml -r /source/falco/rules/falco_rules.yaml -r /source/falco/rules/k8s_audit_rules.yaml -r /source/falco/rules/falco_rules.local.yaml
This will generate an output file in the working directory named callgrind.out.<pid>, which you can summarize with callgrind_annotate.
Also, having the valgrind memcheck output would be useful:
valgrind --tool=memcheck ./userspace/falco/falco -c /source/falco/falco.yaml -r /source/falco/rules/falco_rules.yaml -r /source/falco/rules/k8s_audit_rules.yaml -r /source/falco/rules/falco_rules.local.yaml
Ok, @josacar sent me the output of the provided commands; here's a summary:
Valgrind:
Callgrind
At this point @josacar I need you to run two more tools:
Memcheck with --track-origins
valgrind --tool=memcheck --track-origins=yes ./userspace/falco/falco -c /source/falco/falco.yaml -r /source/falco/rules/falco_rules.yaml -r /source/falco/rules/k8s_audit_rules.yaml -r /source/falco/rules/falco_rules.local.yaml
Massif
valgrind --tool=massif ./userspace/falco/falco -c /source/falco/falco.yaml -r /source/falco/rules/falco_rules.yaml -r /source/falco/rules/k8s_audit_rules.yaml -r /source/falco/rules/falco_rules.local.yaml
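Once the massif run completes, its dump (massif.out.<pid>) can be rendered with ms_print, or quickly scanned with awk. A minimal sketch using a fabricated two-line fragment in massif's mem_heap_B format (real massif files contain many more fields; point the command at the real dump instead):

```shell
# Fabricated two-snapshot fragment, only to illustrate the parsing.
cat > massif.sample <<'EOF'
mem_heap_B=104857600
mem_heap_B=419430400
EOF
# Print the peak heap size (in bytes) seen across all snapshots.
awk -F= '/^mem_heap_B/ && $2 > max { max = $2 } END { print max }' massif.sample
# prints 419430400
```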
Thank you
By looking at the massif output that @josacar sent me, I believe the leaks are in the functions that read sockets from procfs.
Namely:
The massif output also seems more focused on unix sockets.
@josacar would you mind telling us if the specific nodes where you are seeing this have a bigger amount of unix sockets open than the other nodes where you don't see this?
# lsof | grep -i unix | wc -l
Instance with highest leak:
# lsof | grep -i unix | wc -l
1827
Second one:
# lsof | grep -i unix | wc -l
2143
Third one:
# lsof | grep -i unix | wc -l
199
And the rest:
# lsof | grep -i unix | wc -l
73
# lsof | grep -i unix | wc -l
51
Ok, so sockets do allocate memory, but they are not the source of the leak: in the graph you can notice that the spike goes down immediately, and that code is called only at startup, so it can't be increasing like that over time regardless of the number of sockets.
I'm doing further analysis on the files, and I just noticed that sinsp_filter_check_fd::allocate_new also seems consistent in allocation and responsible for the spike in the following graph.
If you need me to run massif again for a longer time, LMK.
Yes please, it would be helpful; at least a couple of hours more, @josacar.
Ok, @leodido and I have a fix; we sent a deb with it to @josacar and he is now reporting successful results.
Here's an image from @josacar 's system, blue is falco 0.15.3, yellow is the fix.
We are still waiting for it to run for a while more and will open the PR.
The massif with the latest patch applied looks much better, but there's still a leak: tens of megabytes an hour compared to hundreds before the fix.
A deb with the latest fix from https://github.com/draios/sysdig/pull/1491 compiled into Falco: https://falco-dev-public.s3.amazonaws.com/issue-740/falco-0.740.1-x86_64.deb
Here's the new deb after the latest commits
https://falco-dev-public.s3.amazonaws.com/issue-740/falco-0.740.2-x86_64.deb
The older ones have been deleted!
Ok, @josacar just started the latest patch; the first ten minutes look promising!! :crossed_fingers:
Almost 24 hours running 0.740.2; here's the bad news:
I will run massif if you like @fntlnz
Hey @josacar yeah please run massif
@leodido How many hours do you want me to run it?
24 hours should be enough @josacar thank you!
I sent @fntlnz the massif dumps after running for slightly more than 24 hours:
Ok, there's still something, but the leak has heavily decreased since the first report: it was around 100M/hour, now it's more like 2M/hour. Investigating.
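For context, rate figures like these come from a simple delta between two heap snapshots over the run time. A toy shell sketch with hypothetical numbers (not the actual snapshot values from this issue):

```shell
# Hypothetical heap sizes (MB) at the start and end of a 24-hour massif run.
start_mb=120
end_mb=168
hours=24
# Integer MB/hour growth rate; with these made-up numbers it prints 2.
echo $(( (end_mb - start_mb) / hours ))
# prints 2
```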
Ok here we are with a third version of the package: https://falco-dev-public.s3.amazonaws.com/issue-740/falco-0.740.3-x86_64.deb
Could you test it and provide us a new massif for at least 12 hours, maybe?
Thanks :)
@josacar just updated us in the Falco slack, this is it running for 1 hour.
It looks pretty flat right now, @josacar is going to share an update with massif in 24 hours to see if the report looks good.
We are getting closer!!!!!
Here's the last update from @josacar ! The leak is definitely solved!
Ok here's the last massif output, the leak is definitely solved, now memory goes normally up and down as expected.
Ok, the fix has been merged into Sysdig, let's close this!! https://github.com/draios/sysdig/pull/1491
/close
Should we evaluate https://github.com/marketplace/pigci for Falco?
What happened:
Falco has a memory leak.
Some systems are running Debian Jessie with kernel 3.16.68-2 and Falco 0.16.0. These systems have services like OpenSSH, cron, Redis, MongoDB, or RabbitMQ.
What you expected to happen:
Not to have these memory leaks.
How to reproduce it (as minimally and precisely as possible):
Just start it in the affected instances
Anything else we need to know?:
We are using the default rules and 6 rules that silence some core rules.
Environment:
- Falco version (falco --version): 0.16.0
- OS (cat /etc/os-release): Debian Jessie
- Kernel (uname -a): 3.16.68-2
- Installation method: deb