falcosecurity / falco

Cloud Native Runtime Security
https://falco.org
Apache License 2.0
7.16k stars 884 forks source link

Memory leak on falco 0.16.0 #740

Closed josacar closed 4 years ago

josacar commented 5 years ago

What happened:

Falco has a memory leak :

Screenshot 2019-07-29 at 09 17 56

Screenshot 2019-07-29 at 13 13 36

Some systems are running Debian Jessie and 3.16.68-2 kernel version and falco 0.16.0.

These systems have services like OpenSSH, cron, Redis, MongoDB or RabbitMQ.

What you expected to happen:

Not to have these memory leaks.

How to reproduce it (as minimally and precisely as possible):

Just start it in the affected instances

Anything else we need to know?:

We are using the default rules and 6 rules that silence some core rules.

Environment:

josacar commented 5 years ago

Also we see this in the syslog:

Jul 29 07:55:04 int-xxx.yyy.zz falco: 07:55:04.291501396: Critical Falco internal: syscall event drop. 12723 system calls dropped in last second.(ebpf_enabled=0 n_drops=12723 n_drops_buffer=12719 n_drops_bug=0 n_drops_pf=4 n_evts=165459)
Jul 29 07:55:05 int-xxx.yyy.zz falco: Falco internal: syscall event drop. 17492 system calls dropped in last second.
Jul 29 07:55:05 int-xxx.yyy.zz falco: 07:55:05.291860123: Critical Falco internal: syscall event drop. 17492 system calls dropped in last second.(ebpf_enabled=0 n_drops=17492 n_drops_buffer=17492 n_drops_bug=0 n_drops_pf=0 n_evts=137234)
Jul 29 07:55:07 int-xxx.yyy.zz falco: Falco internal: syscall event drop. 12107 system calls dropped in last second.
Jul 29 07:55:07 int-xxx.yyy.zz falco: 07:55:06.291860611: Critical Falco internal: syscall event drop. 12107 system calls dropped in last second.(ebpf_enabled=0 n_drops=12107 n_drops_buffer=12104 n_drops_bug=0 n_drops_pf=3 n_evts=222272)
Jul 29 07:55:07 int-xxx.yyy.zz falco: Falco internal: syscall event drop. 61394 system calls dropped in last second.
Jul 29 07:55:07 int-xxx.yyy.zz falco: 07:55:07.292983713: Critical Falco internal: syscall event drop. 61394 system calls dropped in last second.(ebpf_enabled=0 n_drops=61394 n_drops_buffer=61394 n_drops_bug=0 n_drops_pf=0 n_evts=130288)
josacar commented 5 years ago

Looks version 0.15.3 is not affected, no memory leak in 30 minutes.

mfdii commented 5 years ago

I need a few things to help track this down. Would you please provide:

My initial tests show that the memory is stable so I need a bit more info to dig further and to try and recreate.

josacar commented 5 years ago
fntlnz commented 5 years ago

/assign @fntlnz

fntlnz commented 5 years ago

@josacar there's any way we can debug this together? 👼 I wasn't able to reproduce and it would be helpful to see the problem live in order to do a fix.

I'm fntlnz in the sysdig slack, if you are not there yet you can find a subscribe link here I wasn't able to find you.

Thanks for taking time of letting us now and for contributing this issue!

/triage needs-information /triage not-reproducible

josacar commented 5 years ago

@fntlnz What's your timezone? Mine is CEST. I think I can isolate a server and do a call with you.

fntlnz commented 5 years ago

Ok me and @josacar spent a couple of hours together today to get more information on this.

I can confirm that I was able to observe the leaks in their systems, it seems that like every 5 minutes a 50MB of memory is added to the heap of the falco process.

I think we can exclude that is a problem related to the module itself.

Here's a gprof of the falco process running in their machines.

expand for the gprof ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls Ts/call Ts/call name 15.92 0.54 0.54 gen_event_filter_expression::compare(gen_event*) 4.02 0.67 0.14 sinsp_filter_check::flt_compare(cmpop, ppm_param_type, void*, unsigned int, unsigned int) 4.02 0.81 0.14 sinsp::next(sinsp_evt**) 3.87 0.94 0.13 gen_event_filter::run(gen_event*) 3.87 1.07 0.13 sinsp_filter_check_fd::extract(sinsp_evt*, unsigned int*, bool) 3.87 1.20 0.13 scap_next 3.57 1.32 0.12 falco_sinsp_ruleset::run(sinsp_evt*, unsigned short) 3.27 1.43 0.11 sinsp_filter_check_event::extract(sinsp_evt*, unsigned int*, bool) 2.98 1.53 0.10 sinsp_parser::reset(sinsp_evt*) 2.83 1.62 0.10 sinsp_parser::process_event(sinsp_evt*) 2.83 1.72 0.10 sinsp_threadinfo::get_main_thread() 2.68 1.81 0.09 sinsp::get_thread(long, bool, bool) 2.23 1.88 0.08 sinsp::find_thread(long, bool) 2.23 1.96 0.08 sinsp_filter_check_fd::compare(sinsp_evt*) 1.93 2.02 0.07 gen_event_filter_check::get_check_id() 1.93 2.09 0.07 sinsp_evt::get_param_as_str(unsigned int, char const**, sinsp_evt::param_fmt) 1.93 2.15 0.07 sinsp_filter_check::compare(sinsp_evt*) 1.79 2.21 0.06 sinsp_evt::get_param(unsigned int) 1.79 2.27 0.06 sinsp_filter_check_fdlist::extract(sinsp_evt*, unsigned int*, bool) 1.64 2.33 0.06 flt_compare(cmpop, ppm_param_type, void*, void*, unsigned int, unsigned int) 1.64 2.38 0.06 sinsp_filter_check_event::compare(sinsp_evt*) 1.49 2.43 0.05 StatsFileWriter::handle() 1.49 2.48 0.05 getfirst 1.04 2.52 0.04 flt_compare_string(cmpop, char*, char*) 1.04 2.55 0.04 sinsp_threadinfo::get_cwd() 0.89 2.58 0.03 do_inspect(falco_engine*, falco_outputs*, sinsp*, falco_configuration&, syscall_evt_drop_mgr&, unsigned long, std::string&, unsigned long, bool, int&) 0.89 2.61 0.03 sinsp_evt::get_num_params() 0.89 2.64 0.03 propagatemark 0.89 2.67 0.03 scap_check_suppressed 0.74 2.70 0.03 sinsp_filter_check::compare(gen_event*) 0.74 2.72 0.03 sinsp_filter_check::extract(gen_event*, unsigned int*, bool) 0.74 2.75 0.03 sinsp::update_k8s_state() 0.60 2.77 0.02 copy_and_sanitize_path(char*, char*, char const*) 0.60 2.79 0.02 sinsp::get_thread_ref(long, bool, bool) 0.60 2.81 0.02 codegrammar 0.60 2.83 0.02 refill_read_buffers 0.45 2.84 0.02 sinsp_threadinfo::get_fd_opencount() 0.45 2.86 0.02 sinsp_filter_check_fd::compare_domain(sinsp_evt*) 0.45 2.87 0.02 sinsp_filter_check_syslog::extract(sinsp_evt*, unsigned int*, bool) 0.45 2.89 0.02 sinsp_parser::parse_openat_dir(sinsp_evt*, char*, long, std::string*) 0.45 2.90 0.02 sinsp_parser::parse_open_openat_creat_exit(sinsp_evt*) 0.45 2.92 0.02 sinsp_filter_check_fd::extract_from_null_fd(sinsp_evt*, unsigned int*, bool) 0.45 2.93 0.02 sinsp_evt::get_type() 0.30 2.94 0.01 falco_engine::process_sinsp_event(sinsp_evt*) 0.30 2.95 0.01 sinsp_parser::parse_rw_exit(sinsp_evt*) 0.30 2.96 0.01 sinsp_parser::parse_clone_exit(sinsp_evt*) 0.30 2.97 0.01 sinsp_parser::retrieve_enter_event(sinsp_evt*, sinsp_evt*) 0.30 2.98 0.01 sinsp_filter_check::parse_field_name(char const*, bool, bool) 0.30 2.99 0.01 sinsp_thread_manager::remove_inactive_threads() 0.30 3.00 0.01 gen_event_filter_check::~gen_event_filter_check() 0.30 3.01 0.01 k8s_audit_filter_check::k8s_audit_filter_check() 0.30 3.02 0.01 sinsp_container_manager::remove_inactive_containers() 0.30 3.03 0.01 sinsp_filter_check_event::extract_abspath(sinsp_evt*, unsigned int*) 0.30 3.04 0.01 sinsp_filter_check_thread::compare(sinsp_evt*) 0.30 3.05 0.01 libsinsp::container_engine::bpm::resolve(sinsp_container_manager*, sinsp_threadinfo*, bool) 0.30 3.06 0.01 libsinsp::container_engine::rkt::match(sinsp_container_manager*, sinsp_threadinfo*, sinsp_container_info*, std::string&, std::string&, bool) 0.30 3.07 0.01 sinsp_evt::get_thread_info(bool) 0.30 3.08 0.01 sinsp_evt::get_ts() 0.30 3.09 0.01 char* std::string::_S_construct(char const*, char const*, std::allocator const&, std::forward_iterator_tag) 0.30 3.10 0.01 std::_Function_handler::_M_invoke(std::_Any_data const&, sinsp_threadinfo&) 0.30 3.11 0.01 std::_Hash_bytes(void const*, unsigned long, unsigned long) 0.30 3.12 0.01 addinstcap 0.30 3.13 0.01 checkaux 0.30 3.14 0.01 codegen 0.30 3.15 0.01 finalfix 0.30 3.16 0.01 hashkey 0.30 3.17 0.01 lh_delete 0.30 3.18 0.01 lj_BC_ISEQS 0.30 3.19 0.01 lj_BC_KNUM 0.30 3.20 0.01 lj_BC_RET 0.30 3.21 0.01 lj_alloc_free 0.30 3.22 0.01 lj_func_newL_gc 0.30 3.23 0.01 lj_record_ins 0.30 3.24 0.01 lj_str_new 0.30 3.25 0.01 lj_tab_newkey 0.30 3.26 0.01 luaL_checklstring 0.30 3.27 0.01 match 0.30 3.28 0.01 match 0.30 3.29 0.01 resizetab 0.30 3.30 0.01 scap_fd_read_unix_sockets_from_proc_fs 0.15 3.31 0.01 flt_compare_buffer(cmpop, char*, char*, unsigned int, unsigned int) 0.15 3.31 0.01 sinsp_fdinfo::add_filename(char const*) 0.15 3.32 0.01 sinsp_fdinfo::get_typestring() 0.15 3.32 0.01 sinsp_parser::event_cleanup(sinsp_evt*) 0.15 3.33 0.01 sinsp_parser::free_event_buffer(unsigned char*) 0.15 3.33 0.01 sinsp_parser::parse_container_evt(sinsp_evt*) 0.15 3.34 0.01 sinsp_exception::sinsp_exception(std::string, int) 0.15 3.34 0.01 sinsp_threadinfo::get_fd(long) 0.15 3.35 0.01 sinsp_filter_factory::sinsp_filter_factory(sinsp*) 0.15 3.35 0.01 sinsp_filter_check_fd::extract_fdname_from_creator(sinsp_evt*, unsigned int*, bool) 0.15 3.36 0.01 sinsp_evt::get_param_value_str(char const*, bool) 0.15 3.36 0.01 sinsp_evt::render_fd(long, char const**, sinsp_evt::param_fmt) % the percentage of the total running time of the time program used by this function. cumulative a running sum of the number of seconds accounted seconds for by this function and those listed above it. self the number of seconds accounted for by this seconds function alone. This is the major sort for this listing. calls the number of times this function was invoked, if this function is profiled, else blank. self the average number of milliseconds spent in this ms/call function per call, if this function is profiled, else blank. total the average number of milliseconds spent in this ms/call function and its descendents per call, if this function is profiled, else blank. name the name of the function. This is the minor sort for this listing. The index shows the location of the function in the gprof listing. If the index is in parenthesis it shows where it would appear in the gprof listing if it were to be printed. Copyright (C) 2012-2014 Free Software Foundation, Inc. Copying and distribution of this file, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved. ```

Now Jose started a falco process from a binary with debugging symbols, they will make it core dump after a couple of hours so that we analyze it.

We tried to use valgrind on it but there were errors with the lua engine. Will try it again if the core dump method doesn't produce good results.

fntlnz commented 5 years ago

@josacar I found why valgrind didn't work in your environment to profile falco.

New valgrind versions ignore the MAP_32BIT flag in mmap, LuaJIT uses it and Falco uses LuaJIT.

https://bugs.kde.org/show_bug.cgi?id=324181

fntlnz commented 5 years ago

LuaJIT has a flag to compile it with Valgrind support.

-DLUAJITUSE_VALGRIND 

https://github.com/LuaJIT/LuaJIT/blob/f0e865dd4861520258299d0f2a56491bd9d602e1/src/lj_mcode.c#L26-L29

fntlnz commented 5 years ago

Here's a representative graph from today's trials with @josacar

image

Every line in the graph represents a machine running Falco, all the lines under 100M are running Falco 0.15.0 while the brown one with the 400M spike is running 0.16.0 .

0.16.0 was started just today, the old 0.15.0 had been running for a while.

All the configurations are the same, the only difference is the falco version, the systems are provisioned using a set of reproducible scripts.

fntlnz commented 5 years ago

Correction: @josacar is reporting they see the leak in 0.15.3 too but it's less frequent than in 0.16.0

fntlnz commented 5 years ago

@josacar at this point I think that we need to see a valgrind callgrind.

Callgrind

valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes  ./userspace/falco/falco   -c /source/falco/falco.yaml -r /source/falco/rules/falco_rules.yaml -r /source/falco//rules/k8s_audit_rules.yaml -r /source/falco/rules/falco_rules.local.yaml

This will generate an output in the working directory, named cachegrind.out.something

Also having the valgrind memcheck would be useful too:

valgrind --tool=memcheck ./userspace/falco/falco -c /source/falco/falco.yaml -r /source/falco/rules/falco_rules.yaml -r /source/falco//rules/k8s_audit_rules.yaml -r /source/falco/rules/falco_rules.local.yaml
fntlnz commented 5 years ago

Ok @josacar sent to me the output of the provided commands, here's a summary:

Valgrind:

Details ``` Wed Aug 7 11:14:30 2019: Loading rules from file etc/falco/falco_rules.local.yaml: [74/1549] ==17858== Conditional jump or move depends on uninitialised value(s) ==17858== at 0xC31C02: ??? (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0xC3656B: lua_getfield (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x63B241: falco_outputs::add_output(falco_outputs::output_config) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x64B6CF: falco_init(int, char**) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x627A23: main (in /tmp/falco/falco/usr/bin/falco) ==17858== ==17858== Conditional jump or move depends on uninitialised value(s) ==17858== at 0xC31C02: ??? (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0xC35EB1: lua_pushstring (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x63B265: falco_outputs::add_output(falco_outputs::output_config) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x64B6CF: falco_init(int, char**) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x627A23: main (in /tmp/falco/falco/usr/bin/falco) ==17858== ==17858== Warning: noted but unhandled ioctl 0x7302 with no size/direction hints. ==17858== This could cause spurious value errors to appear. ==17858== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. ==17858== Warning: noted but unhandled ioctl 0x730a with no size/direction hints. ==17858== This could cause spurious value errors to appear. ==17858== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. ==17858== Warning: noted but unhandled ioctl 0x730b with no size/direction hints. ==17858== This could cause spurious value errors to appear. ==17858== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. ==17858== Warning: noted but unhandled ioctl 0x7301 with no size/direction hints. ==17858== This could cause spurious value errors to appear. ==17858== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. ==17858== Warning: noted but unhandled ioctl 0x730d with no size/direction hints. ==17858== This could cause spurious value errors to appear. ==17858== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. ==17858== Warning: noted but unhandled ioctl 0x7303 with no size/direction hints. ==17858== This could cause spurious value errors to appear. ==17858== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. Wed Aug 7 11:14:32 2019: Starting internal webserver, listening on port 8765 Wed Aug 7 11:15:07 2019: Falco internal: syscall event drop. 85955 system calls dropped in last second.==17858== Conditional jump or move depends on uninitialised value(s) ==17858== at 0xC31C02: ??? (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0xC3656B: lua_getfield (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x63BB64: falco_outputs::handle_msg(unsigned long, falco_common::priority_type, std::string&, std::string&, std::map, std::allocator > >&) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x6451A6: syscall_evt_drop_mgr::perform_actions(unsigned long, scap_stats&, bool) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x645E33: syscall_evt_drop_mgr::process_event(sinsp*, sinsp_evt*) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x64816C: do_inspect(falco_engine*, falco_outputs*, sinsp*, falco_configuration&, syscall_evt_drop_mgr&, unsigned long, std::string&, unsigned long, bool, int&) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x64BAB3: falco_init(int, char**) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x627A23: main (in /tmp/falco/falco/usr/bin/falco) ==17858== 11:15:02.454277108: Critical Falco internal: syscall event drop. 85955 system calls dropped in last second.(ebpf_enabled=0 n_drops=85955 n_drops_buffer=85955 n_drops_bug=0 n_drops_pf=0 n_evts=210043) Wed Aug 7 11:15:10 2019: Falco internal: syscall event drop. 76493 system calls dropped in last second.11:15:03.454337305: Critical Falco internal: syscall event drop. 76493 system calls dropped in last second.(ebpf_enabled=0 n_drops=76493 n_drops_buffer=76493 n_drops_bug=0 n_drops_pf=0 n_evts=138162) Wed Aug 7 11:15:12 2019: Falco internal: syscall event drop. 123725 system calls dropped in last second.11:15:04.454361834: Critical Falco internal: syscall event drop. 123725 system calls dropped in last second.(ebpf_enabled=0 n_drops=123725 n_drops_buffer=123725 n_drops_bug=0 n_drops_pf=0 n_evts=134106) Wed Aug 7 11:15:15 2019: Falco internal: syscall event drop. 2331 system calls dropped in last second.11:15:05.454794937: Critical Falco internal: syscall event drop. 2331 system calls dropped in last second.(ebpf_enabled=0 n_drops=2331 n_drops_buffer=2331 n_drops_bug=0 n_drops_pf=0 n_evts=11155) Wed Aug 7 11:15:17 2019: Falco internal: syscall event drop. 1254 system calls dropped in last second.11:15:06.454814713: Critical Falco internal: syscall event drop. 1254 system calls dropped in last second.(ebpf_enabled=0 n_drops=1254 n_drops_buffer=1254 n_drops_bug=0 n_drops_pf=0 n_evts=35537) Wed Aug 7 11:15:17 2019: Falco internal: syscall event drop. 54 system calls dropped in last second.11:15:08.741225791: Critical Falco internal: syscall event drop. 54 system calls dropped in last second.(ebpf_enabled=0 n_drops=54 n_drops_buffer=54 n_drops_bug=0 n_drops_pf=0 n_evts=150) ==17858== Conditional jump or move depends on uninitialised value(s) ==17858== at 0x76268D: sinsp_parser::set_ipv4_addresses_and_ports(sinsp_fdinfo*, unsigned char*) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x7628CA: sinsp_parser::update_fd(sinsp_evt*, sinsp_evt_param*) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x76BDE5: sinsp_parser::parse_rw_exit(sinsp_evt*) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x7696B5: sinsp_parser::process_event(sinsp_evt*) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x782197: sinsp::next(sinsp_evt**) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x6480FE: do_inspect(falco_engine*, falco_outputs*, sinsp*, falco_configuration&, syscall_evt_drop_mgr&, unsigned long, std::string&, unsigned long, bool, int&) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x64BAB3: falco_init(int, char**) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x627A23: main (in /tmp/falco/falco/usr/bin/falco) ==17858== Wed Aug 7 11:18:02 2019: Falco internal: syscall event drop. 1 system calls dropped in last second.11:18:02.568525217: Critical Falco internal: syscall event drop. 1 system calls dropped in last second.(ebpf_enabled=0 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=15836) Wed Aug 7 11:20:06 2019: Falco internal: syscall event drop. 32241 system calls dropped in last second.11:20:02.088801256: Critical Falco internal: syscall event drop. 32241 system calls dropped in last second.(ebpf_enabled=0 n_drops=32241 n_drops_buffer=32241 n_drops_bug=0 n_drops_pf=0 n_evts=181782) Wed Aug 7 11:20:09 2019: Falco internal: syscall event drop. 107202 system calls dropped in last second.11:20:03.088833334: Critical Falco internal: syscall event drop. 107202 system calls dropped in last second.(ebpf_enabled=0 n_drops=107202 n_drops_buffer=107202 n_drops_bug=0 n_drops_pf=0 n_evts=143094) Wed Aug 7 11:20:11 2019: Falco internal: syscall event drop. 131754 system calls dropped in last second.11:20:04.091879812: Critical Falco internal: syscall event drop. 131754 system calls dropped in last second.(ebpf_enabled=0 n_drops=131754 n_drops_buffer=131754 n_drops_bug=0 n_drops_pf=0 n_evts=131754) Wed Aug 7 11:20:14 2019: Falco internal: syscall event drop. 3588 system calls dropped in last second.11:20:05.101192044: Critical Falco internal: syscall event drop. 3588 system calls dropped in last second.(ebpf_enabled=0 n_drops=3588 n_drops_buffer=3588 n_drops_bug=0 n_drops_pf=0 n_evts=18508) Wed Aug 7 11:23:02 2019: Falco internal: syscall event drop. 1 system calls dropped in last second.11:23:02.462342492: Critical Falco internal: syscall event drop. 1 system calls dropped in last second.(ebpf_enabled=0 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=40387) Wed Aug 7 11:25:04 2019: Falco internal: syscall event drop. 29329 system calls dropped in last second.11:25:01.154299947: Critical Falco internal: syscall event drop. 29329 system calls dropped in last second.(ebpf_enabled=0 n_drops=29329 n_drops_buffer=29327 n_drops_bug=0 n_drops_pf=2 n_evts=244968) Wed Aug 7 11:25:07 2019: Falco internal: syscall event drop. 156262 system calls dropped in last second.11:25:02.154301138: Critical Falco internal: syscall event drop. 156262 system calls dropped in last second.(ebpf_enabled=0 n_drops=156262 n_drops_buffer=156262 n_drops_bug=0 n_drops_pf=0 n_evts=176731) Wed Aug 7 11:25:10 2019: Falco internal: syscall event drop. 89268 system calls dropped in last second.11:25:03.169015620: Critical Falco internal: syscall event drop. 89268 system calls dropped in last second.(ebpf_enabled=0 n_drops=89268 n_drops_buffer=89268 n_drops_bug=0 n_drops_pf=0 n_evts=201983) Wed Aug 7 11:25:11 2019: Falco internal: syscall event drop. 44851 system calls dropped in last second.11:25:04.246601791: Critical Falco internal: syscall event drop. 44851 system calls dropped in last second.(ebpf_enabled=0 n_drops=44851 n_drops_buffer=44851 n_drops_bug=0 n_drops_pf=0 n_evts=49673) Wed Aug 7 11:28:02 2019: Falco internal: syscall event drop. 1 system calls dropped in last second.11:28:02.661140487: Critical Falco internal: syscall event drop. 1 system calls dropped in last second.(ebpf_enabled=0 n_drops=1 n_drops_buffer=0 n_drops_bug=0 n_drops_pf=1 n_evts=9594) ^CWed Aug 7 11:28:27 2019: SIGHUP Received, restarting... ==17858== Conditional jump or move depends on uninitialised value(s) ==17858== at 0xC31C02: ??? (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0xC31ECD: ??? (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0xC4C5B9: ??? (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0xC466A4: ??? (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0xC36EEB: lua_pcall (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x66BC51: falco_engine::print_stats() (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x64BAF0: falco_init(int, char**) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x627A23: main (in /tmp/falco/falco/usr/bin/falco) ==17858== Events detected: 0 Rule counts by severity: Triggered rules by rule name: Syscall event drop monitoring: - event drop detected: 17 occurrences - num times actions taken: 17 ==17858== Conditional jump or move depends on uninitialised value(s) ==17858== at 0xC31C02: ??? (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0xC3656B: lua_getfield (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x63E514: falco_outputs::~falco_outputs() (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x63E938: falco_outputs::~falco_outputs() (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x648A6A: falco_init(int, char**) (in /tmp/falco/falco/usr/bin/falco) ==17858== by 0x627A23: main (in /tmp/falco/falco/usr/bin/falco) ==17858== ==17858== ==17858== HEAP SUMMARY: ==17858== in use at exit: 761,379 bytes in 1,008 blocks ==17858== total heap usage: 4,597,884 allocs, 4,596,876 frees, 489,460,475 bytes allocated ==17858== ==17858== LEAK SUMMARY: ==17858== definitely lost: 512 bytes in 1 blocks ==17858== indirectly lost: 736,776 bytes in 658 blocks ==17858== possibly lost: 0 bytes in 0 blocks ==17858== still reachable: 24,091 bytes in 349 blocks ==17858== of which reachable via heuristic: ==17858== stdstring : 9,842 bytes in 132 blocks ==17858== suppressed: 0 bytes in 0 blocks ==17858== Rerun with --leak-check=full to see details of leaked memory ==17858== ==17858== For counts of detected and suppressed errors, rerun with: -v ==17858== Use --track-origins=yes to see where uninitialised values come from ==17858== ERROR SUMMARY: 24126 errors from 20 contexts (suppressed: 0 from 0) ```

Callgrind

12504

fntlnz commented 5 years ago

At this point @josacar I need you to run two more tools:

Memcheck with track origin

valgrind --tool=memcheck --track-origins=yes ./userspace/falco/falco -c /source/falco/falco.yaml -r /source/falco/rules/falco_rules.yaml -r /source/falco//rules/k8s_audit_rules.yaml -r /source/falco/rules/falco_rules.local.yaml 

Massif

valgrind --tool=massif ./userspace/falco/falco -c /source/falco/falco.yaml -r /source/falco/rules/falco_rules.yaml -r /source/falco//rules/k8s_audit_rules.yaml -r /source/falco/rules/falco_rules.local.yaml 

Thank you

fntlnz commented 4 years ago

By looking at the massif output that @josacar sent to me, I believe that the leaks are in the functinos that read sockets from proc fs

Namely:

The massif also seems more focused on unix sockets,

image

@josacar would you mind telling us if the specific nodes where you are seeing this have a bigger amount of unix sockets open than the other nodes where you don't see this?

# lsof | grep -i unix | wc -l
joseluis-fw commented 4 years ago

Instance with highest leak:

# lsof | grep -i unix | wc -l
1827

Second one:

# lsof | grep -i unix | wc -l
2143

Third one:

# lsof | grep -i unix | wc -l
199

And the rest:

# lsof | grep -i unix | wc -l
73
# lsof | grep -i unix | wc -l
51
fntlnz commented 4 years ago

Ok so sockets seem to allocate memory but those are not the source of the leak, in the graph you can notice that the spike goes immediatelly down and also that code is called only at startup so it can't be increasing like that over time despite the number of sockets.

I'm doing further analysis on the files and I just noticed that also sinsp_filter_check_fd::allocate_new seems consistent in allocation and responsible for the spike in the following graph.

image

josacar commented 4 years ago

If you need me to run massif again for longer time LMK.

fntlnz commented 4 years ago

Yes please it would be helpful, like at least a couple of hours more @josacar

fntlnz commented 4 years ago

Ok, me and @leodido have a fix - we sent a deb with it to @josacar and he is now reporting successful results.

Here's an image from @josacar 's system, blue is falco 0.15.3, yellow is the fix. Untitled

We are still waiting for it to run for a while more and will open the PR.

fntlnz commented 4 years ago

The massif with the last applied patch looks much better but there's still a leak, tens of megabytes an hour compared to hundreds before the fix.

2019-08-22_3436x1333_scrot

fntlnz commented 4 years ago

deb with the latest fix in https://github.com/draios/sysdig/pull/1491 compiled in falco: https://falco-dev-public.s3.amazonaws.com/issue-740/falco-0.740.1-x86_64.deb

fntlnz commented 4 years ago

Here's the new deb after the latest commits

https://falco-dev-public.s3.amazonaws.com/issue-740/falco-0.740.2-x86_64.deb

Older ones had been deleted!

fntlnz commented 4 years ago

Ok @josacar just started the latest patch, it looks like the first ten minutes are promising!! :crossed_fingers:

image

josacar commented 4 years ago

Almost 24 hours running 0.740.2 here are the bad news:

image

I will run massif if you like @fntlnz

leodido commented 4 years ago

Hey @josacar yeah please run massif

josacar commented 4 years ago

@leodido How many hours do you want me to run it?

fntlnz commented 4 years ago

24 hours should be enough @josacar thank you!

josacar commented 4 years ago

I sent @fntlnz massif dumps running slightly more than 24 hours:

image

fntlnz commented 4 years ago

Ok there's still something but the leak is heavily decreased since the first report, it was around 100M/hour now is more like 2M/hour - investigating

leodido commented 4 years ago

Ok here we are with a third version of the package: https://falco-dev-public.s3.amazonaws.com/issue-740/falco-0.740.3-x86_64.deb

Could you test it and provide us a new massif for at least 12 hours, maybe?

Thanks :)

fntlnz commented 4 years ago

@josacar just updated us in the Falco slack, this is it running for 1 hour.

image

It looks pretty flat right now, @josacar is going to share an update with massif in 24 hours to see if the report looks good.

We are getting closer!!!!!

fntlnz commented 4 years ago

Here's the last update from @josacar ! The leak is definitely solved!

image

fntlnz commented 4 years ago

Ok here's the last massif output, the leak is definitely solved, now memory goes normally up and down as expected. 2019-08-29_3437x1346_scrot

fntlnz commented 4 years ago

Ok the fix had been merged on Sysdig, let's close this!! https://github.com/draios/sysdig/pull/1491

/close

krisnova commented 4 years ago

Should we evaluate https://github.com/marketplace/pigci for Falco?