Closed: josacar closed this issue 4 years ago
Also we see this in the syslog:
Jul 29 07:55:04 int-xxx.yyy.zz falco: 07:55:04.291501396: Critical Falco internal: syscall event drop. 12723 system calls dropped in last second.(ebpf_enabled=0 n_drops=12723 n_drops_buffer=12719 n_drops_bug=0 n_drops_pf=4 n_evts=165459)
Jul 29 07:55:05 int-xxx.yyy.zz falco: Falco internal: syscall event drop. 17492 system calls dropped in last second.
Jul 29 07:55:05 int-xxx.yyy.zz falco: 07:55:05.291860123: Critical Falco internal: syscall event drop. 17492 system calls dropped in last second.(ebpf_enabled=0 n_drops=17492 n_drops_buffer=17492 n_drops_bug=0 n_drops_pf=0 n_evts=137234)
Jul 29 07:55:07 int-xxx.yyy.zz falco: Falco internal: syscall event drop. 12107 system calls dropped in last second.
Jul 29 07:55:07 int-xxx.yyy.zz falco: 07:55:06.291860611: Critical Falco internal: syscall event drop. 12107 system calls dropped in last second.(ebpf_enabled=0 n_drops=12107 n_drops_buffer=12104 n_drops_bug=0 n_drops_pf=3 n_evts=222272)
Jul 29 07:55:07 int-xxx.yyy.zz falco: Falco internal: syscall event drop. 61394 system calls dropped in last second.
Jul 29 07:55:07 int-xxx.yyy.zz falco: 07:55:07.292983713: Critical Falco internal: syscall event drop. 61394 system calls dropped in last second.(ebpf_enabled=0 n_drops=61394 n_drops_buffer=61394 n_drops_bug=0 n_drops_pf=0 n_evts=130288)
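As a side note, the per-second drop counts can be pulled out of these syslog lines with standard tools. A minimal sketch, where the sample line is copied from the logs above and the grep pattern assumes this exact `n_drops=N` key format:

```shell
# Sample line copied verbatim from the syslog output above.
line='Jul 29 07:55:04 int-xxx.yyy.zz falco: 07:55:04.291501396: Critical Falco internal: syscall event drop. 12723 system calls dropped in last second.(ebpf_enabled=0 n_drops=12723 n_drops_buffer=12719 n_drops_bug=0 n_drops_pf=4 n_evts=165459)'
# Extract the total drop count; the pattern matches n_drops= but not
# n_drops_buffer= / n_drops_pf=, so only the total is printed.
echo "$line" | grep -o 'n_drops=[0-9]*' | cut -d= -f2
# prints 12723
```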
Looks like version 0.15.3 is not affected: no memory leak in 30 minutes.
I need a few things to help track this down. Would you please provide:
My initial tests show that the memory is stable, so I need a bit more info to dig further and try to recreate it.
/assign @fntlnz
@josacar is there any way we can debug this together? 👼 I wasn't able to reproduce it, and it would be helpful to see the problem live in order to fix it.
I'm fntlnz in the Sysdig Slack; if you are not there yet, you can find a subscribe link here. I wasn't able to find you.
Thanks for taking the time to let us know and for contributing this issue!
/triage needs-information
/triage not-reproducible
@fntlnz What's your timezone? Mine is CEST. I think I can isolate a server and do a call with you.
Ok, @josacar and I spent a couple of hours together today to get more information on this.
I can confirm that I was able to observe the leak on their systems: roughly every 5 minutes about 50MB of memory is added to the heap of the falco process.
I think we can rule out a problem in the kernel module itself.
Here's a gprof of the falco process running in their machines.
Now Jose has started a falco process from a binary with debugging symbols; they will make it core dump after a couple of hours so that we can analyze it.
We tried to use valgrind on it but there were errors with the lua engine. Will try it again if the core dump method doesn't produce good results.
@josacar I found out why valgrind didn't work in your environment to profile Falco.
New valgrind versions ignore the MAP_32BIT flag in mmap; LuaJIT uses it, and Falco uses LuaJIT.
LuaJIT has a build-time define to compile it with Valgrind support:
-DLUAJIT_USE_VALGRIND
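For reference, a sketch of rebuilding LuaJIT with that define enabled. The source path is hypothetical (adjust it to wherever your Falco build tree vendors LuaJIT); passing extra defines through XCFLAGS is LuaJIT's standard build hook:

```shell
# Hypothetical path to the LuaJIT sources inside the Falco build tree.
cd build/luajit-prefix/src/luajit/src
make clean
# LUAJIT_USE_VALGRIND makes LuaJIT cooperate with Valgrind's memory tracking.
make XCFLAGS='-DLUAJIT_USE_VALGRIND'
```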
Here's a representative graph from today's trials with @josacar
Every line in the graph represents a machine running Falco; all the lines under 100M are running Falco 0.15.0, while the brown one with the 400M spike is running 0.16.0.
0.16.0 was started just today, the old 0.15.0 had been running for a while.
All the configurations are the same, the only difference is the falco version, the systems are provisioned using a set of reproducible scripts.
Correction: @josacar reports they see the leak in 0.15.3 too, but it's less frequent than in 0.16.0.
@josacar at this point I think that we need to see a valgrind callgrind.
Callgrind
valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes ./userspace/falco/falco -c /source/falco/falco.yaml -r /source/falco/rules/falco_rules.yaml -r /source/falco/rules/k8s_audit_rules.yaml -r /source/falco/rules/falco_rules.local.yaml
This will generate an output file in the working directory named callgrind.out.<pid>, which you can summarize with callgrind_annotate.
Also, having the valgrind memcheck output would be useful:
valgrind --tool=memcheck ./userspace/falco/falco -c /source/falco/falco.yaml -r /source/falco/rules/falco_rules.yaml -r /source/falco/rules/k8s_audit_rules.yaml -r /source/falco/rules/falco_rules.local.yaml
Ok, @josacar sent me the output of the provided commands; here's a summary:
Valgrind:
Callgrind
At this point @josacar I need you to run two more tools:
Memcheck with --track-origins
valgrind --tool=memcheck --track-origins=yes ./userspace/falco/falco -c /source/falco/falco.yaml -r /source/falco/rules/falco_rules.yaml -r /source/falco/rules/k8s_audit_rules.yaml -r /source/falco/rules/falco_rules.local.yaml
Massif
valgrind --tool=massif ./userspace/falco/falco -c /source/falco/falco.yaml -r /source/falco/rules/falco_rules.yaml -r /source/falco/rules/k8s_audit_rules.yaml -r /source/falco/rules/falco_rules.local.yaml
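Once the massif run completes, its dump (massif.out.<pid>) can be rendered with ms_print, or quickly scanned with awk. A minimal sketch using a fabricated two-line fragment in massif's mem_heap_B format (real massif files contain many more fields; point the command at the real dump instead):

```shell
# Fabricated two-snapshot fragment, only to illustrate the parsing.
cat > massif.sample <<'EOF'
mem_heap_B=104857600
mem_heap_B=419430400
EOF
# Print the peak heap size (in bytes) seen across all snapshots.
awk -F= '/^mem_heap_B/ && $2 > max { max = $2 } END { print max }' massif.sample
# prints 419430400
```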
Thank you
By looking at the massif output that @josacar sent me, I believe the leaks are in the functions that read sockets from procfs.
Namely:
The massif output also seems more focused on unix sockets.
@josacar would you mind telling us if the specific nodes where you are seeing this have a bigger amount of unix sockets open than the other nodes where you don't see this?
# lsof | grep -i unix | wc -l
Instance with highest leak:
# lsof | grep -i unix | wc -l
1827
Second one:
# lsof | grep -i unix | wc -l
2143
Third one:
# lsof | grep -i unix | wc -l
199
And the rest:
# lsof | grep -i unix | wc -l
73
# lsof | grep -i unix | wc -l
51
Ok, so sockets do allocate memory, but they are not the source of the leak: in the graph you can notice that the spike goes down immediately, and that code is called only at startup, so it can't be increasing like that over time regardless of the number of sockets.
I'm doing further analysis on the files, and I just noticed that sinsp_filter_check_fd::allocate_new also seems consistent in allocation and responsible for the spike in the following graph.
If you need me to run massif again for a longer time, LMK.
Yes please, it would be helpful; at least a couple of hours more, @josacar.
Ok, @leodido and I have a fix; we sent a deb with it to @josacar and he is now reporting successful results.
Here's an image from @josacar 's system, blue is falco 0.15.3, yellow is the fix.
We are still waiting for it to run for a while more and will open the PR.
The massif with the latest patch applied looks much better, but there's still a leak: tens of megabytes an hour compared to hundreds before the fix.
A deb with the latest fix from https://github.com/draios/sysdig/pull/1491 compiled into Falco: https://falco-dev-public.s3.amazonaws.com/issue-740/falco-0.740.1-x86_64.deb
Here's the new deb after the latest commits
https://falco-dev-public.s3.amazonaws.com/issue-740/falco-0.740.2-x86_64.deb
The older ones have been deleted!
Ok, @josacar just started the latest patch; the first ten minutes look promising!! :crossed_fingers:
Almost 24 hours running 0.740.2; here's the bad news:
I will run massif if you like @fntlnz
Hey @josacar yeah please run massif
@leodido How many hours do you want me to run it?
24 hours should be enough @josacar thank you!
I sent @fntlnz the massif dumps after running for slightly more than 24 hours:
Ok, there's still something, but the leak has heavily decreased since the first report: it was around 100M/hour, now it's more like 2M/hour. Investigating.
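For context, rate figures like these come from a simple delta between two heap snapshots over the run time. A toy shell sketch with hypothetical numbers (not the actual snapshot values from this issue):

```shell
# Hypothetical heap sizes (MB) at the start and end of a 24-hour massif run.
start_mb=120
end_mb=168
hours=24
# Integer MB/hour growth rate; with these made-up numbers it prints 2.
echo $(( (end_mb - start_mb) / hours ))
# prints 2
```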
Ok here we are with a third version of the package: https://falco-dev-public.s3.amazonaws.com/issue-740/falco-0.740.3-x86_64.deb
Could you test it and provide us a new massif for at least 12 hours, maybe?
Thanks :)
@josacar just updated us in the Falco slack, this is it running for 1 hour.
It looks pretty flat right now, @josacar is going to share an update with massif in 24 hours to see if the report looks good.
We are getting closer!!!!!
Here's the last update from @josacar ! The leak is definitely solved!
Ok here's the last massif output, the leak is definitely solved, now memory goes normally up and down as expected.
Ok, the fix has been merged into Sysdig, let's close this!! https://github.com/draios/sysdig/pull/1491
/close
Should we evaluate https://github.com/marketplace/pigci for Falco?
What happened:
Falco has a memory leak.
Some systems are running Debian Jessie with kernel 3.16.68-2 and Falco 0.16.0. These systems have services like OpenSSH, cron, Redis, MongoDB, or RabbitMQ.
What you expected to happen:
Not to have these memory leaks.
How to reproduce it (as minimally and precisely as possible):
Just start it in the affected instances
Anything else we need to know?:
We are using the default rules and 6 rules that silence some core rules.
Environment:
- Falco version (falco --version): 0.16.0
- OS (cat /etc/os-release): Debian Jessie
- Kernel (uname -a): 3.16.68-2
- Installation method: deb