els0r / goProbe

High-performance IP packet metadata aggregation and efficient storage and querying of flows
GNU General Public License v2.0

goProbe shows elevated memory consumption after writeout #131

Closed · els0r closed this issue 1 year ago

els0r commented 1 year ago

The data in this issue was produced while running an alpha test on a production system.

System Info

Ring buffer configuration

    promisc: true
    ring_buffer:
      block_size: 10485760
      num_blocks: 2

Details

Memory is steadily increasing with every writeout. It can be observed that RES memory sometimes doubles when flows are written out to the database. Below is the output of top -b -d 30 | grep goProbe observed across writeout cycles:

26878 root      20   0 7068504  79236  70580 S   0.0   0.2   0:01.00 goProbe
26878 root      20   0 7068504  79236  70580 S   0.3   0.2   0:01.08 goProbe
26878 root      20   0 7068504 445992 429156 S   0.3   1.4   0:01.16 goProbe <-- WRITEOUT
26878 root      20   0 7068504 445992 429156 S   0.2   1.4   0:01.23 goProbe
26878 root      20   0 7068504 445992 429156 S   0.2   1.4   0:01.30 goProbe
26878 root      20   0 7068504 445992 429156 S   0.2   1.4   0:01.37 goProbe
26878 root      20   0 7076700 453960 449636 S   0.2   1.4   0:01.44 goProbe
26878 root      20   0 7076700 474608 470116 S   0.2   1.5   0:01.49 goProbe <-- WRITEOUT
26878 root      20   0 7076700 474608 470116 S   0.2   1.5   0:01.56 goProbe
26878 root      20   0 7076700 474608 470116 S   0.2   1.5   0:01.62 goProbe 
26878 root      20   0 7076700 514784 511076 S   0.2   1.6   0:01.69 goProbe 
26878 root      20   0 7076700 514784 511076 S   0.2   1.6   0:01.76 goProbe 
26878 root      20   0 7076956   1.1g   1.1g S   0.3   3.5   0:01.84 goProbe <-- WRITEOUT
26878 root      20   0 7076956   1.1g   1.1g S   0.2   3.5   0:01.91 goProbe
26878 root      20   0 7076956   1.1g   1.1g S   0.2   3.5   0:01.98 goProbe
26878 root      20   0 7076956   1.1g   1.1g S   0.2   3.5   0:02.05 goProbe
26878 root      20   0 7076956   1.1g   1.1g S   0.2   3.5   0:02.12 goProbe
26878 root      20   0 7076956   1.1g   1.1g S   0.2   3.5   0:02.18 goProbe
26878 root      20   0 7076956   1.1g   1.1g S   0.2   3.5   0:02.23 goProbe
26878 root      20   0 7076956   1.1g   1.1g S   0.2   3.5   0:02.30 goProbe
26878 root      20   0 7076956   1.1g   1.1g S   0.2   3.5   0:02.36 goProbe
26878 root      20   0 7076956   1.1g   1.1g S   0.2   3.5   0:02.43 goProbe
26878 root      20   0 7076956   1.5g   1.4g S   0.3   4.7   0:02.52 goProbe <-- WRITEOUT
26878 root      20   0 7076956   1.5g   1.5g S   0.2   4.8   0:02.59 goProbe

To be investigated

While the initial memory footprint may be the result of the chosen ring buffer sizes and a large number of flows inside the individual flow maps, the steady increase must have other causes.

@fako1024 : FYI. Ever seen this?

fako1024 commented 1 year ago

@els0r Nope, but given the magnitude it should be easy to track down. I'll try to reproduce it, but the easiest option would probably be to simply run it with profiling enabled and send me the memory profile; this should be clearly visible in inuse_space. If that's not possible in production, no problem, I'll figure something out locally, but otherwise that would speed things up...
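For reference, a minimal sketch of exposing a heap profile from a running Go process (generic net/http/pprof wiring, not goProbe's own profiling setup; address and port are arbitrary):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Serve the profiling endpoints on localhost only.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        // ... the actual capture / writeout logic would run here ...
        select {}
    }

The in-use heap profile can then be pulled with go tool pprof -sample_index=inuse_space http://localhost:6060/debug/pprof/heap.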

fako1024 commented 1 year ago

That being said: My gut tells me it's the flows that are kept at the end of the rotation, maybe (tm) we're not cleaning up the old (then unused) ones upon rotation or something like that (given that a recent profile of an E2E test I ran came up with zero allocations it must be related to the flow map(s))...

fako1024 commented 1 year ago

A very first shot at profiling (using an E2E test suite running 10x) shows no immediate issues (and pretty much saturation):

  25734 fako      20   0 1395308  30836  12672 S 162.5   0.2   0:39.71 e2etest.test <- START
...
  25734 fako      20   0 1757456  49964  14848 S 175.1   0.3  12:24.44 e2etest.test <- END

In between it fluctuates (like Go just does) but only goes as high as about 68 MiB. Memory profile also doesn't show anything suspicious (neither in use nor "stray" allocations):

alloc_space
Showing top 10 nodes out of 11
      flat  flat%   sum%        cum   cum%
   79.54GB 96.79% 96.79%    79.54GB 96.79%  github.com/fako1024/slimcap/capture/afpacket/afring.NewMockSource /home/fako/Develop/go/pkg/mod/github.com/fako1024/slimcap@v0.0.0-20230611191315-b410f996b186/capture/afpacket/afring/afring_mock.go:83
         0     0% 96.79%    79.66GB 96.94%  github.com/els0r/goProbe/pkg/capture.(*Capture).run /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:156
         0     0% 96.79%    79.71GB 96.99%  github.com/els0r/goProbe/pkg/capture.(*Manager).update.func1 /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:235
         0     0% 96.79%    80.11GB 97.48%  github.com/els0r/goProbe/pkg/capture.(*RunGroup).Run.func1 /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/rungroup.go:26
         0     0% 96.79%     1.25GB  1.52%  github.com/els0r/goProbe/pkg/e2etest.newPcapSource.func1 /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/e2etest/e2e_test.go:389
inuse_space
Showing nodes accounting for 512.05kB, 100% of 512.05kB total
      flat  flat%   sum%        cum   cum%
  512.05kB   100%   100%   512.05kB   100%  time.NewTimer /usr/local/go/src/time/sleep.go:87
         0     0%   100%   512.05kB   100%  github.com/els0r/goProbe/pkg/capture.(*Manager).ScheduleWriteouts.func1 /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:96

Now that tells us that things get cleaned up properly once the CaptureManager terminates. Of course that's not yet the described scenario, so I'll try to test what happens if a single instance is long-running / rotating a lot of times...

fako1024 commented 1 year ago

So, the good news is: I'm 99% sure there is no leak (even if we ignore the discussion about "can there be leaks in Go at all"). I've run a continuous memory profile on goProbe (not only at the end) and tracked the inuse_space profile. Everything is neatly cleaned up after each rotation and there is nothing indicating that there are dangling references. But there is a feature that might be the culprit here in combination with a small oversight. There is only one allocation (aside from regexp and static stuff) that is kept (by design) and that's the reusable memory buffer(s) for GPDir / GPFile:

Showing top 10 nodes out of 29
      flat  flat%   sum%        cum   cum%
 1067.36kB 40.11% 40.11%  1067.36kB 40.11%  regexp/syntax.(*compiler).inst /usr/local/go/src/regexp/syntax/compile.go:164
  561.50kB 21.10% 61.21%   561.50kB 21.10%  golang.org/x/net/html.init /home/fako/Develop/go/pkg/mod/golang.org/x/net@v0.10.0/html/entity.go:16
  520.04kB 19.54% 80.75%   520.04kB 19.54%  github.com/els0r/goProbe/pkg/goDB/storage/gpfile.(*MemPoolNoLimit).Get /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/goDB/storage/gpfile/buffers.go:58 (inline)
  512.38kB 19.25%   100%   512.38kB 19.25%  regexp.onePassCopy /usr/local/go/src/regexp/onepass.go:226
         0     0%   100%   520.04kB 19.54%  github.com/els0r/goProbe/pkg/goDB.(*DBWriter).Write /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/goDB/db_writer.go:63
         0     0%   100%   520.04kB 19.54%  github.com/els0r/goProbe/pkg/goDB/storage/gpfile.(*GPDir).Column /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/goDB/storage/gpfile/gpdir.go:445
         0     0%   100%   520.04kB 19.54%  github.com/els0r/goProbe/pkg/goDB/storage/gpfile.(*GPDir).WriteBlocks /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/goDB/storage/gpfile/gpdir.go:201
         0     0%   100%   520.04kB 19.54%  github.com/els0r/goProbe/pkg/goDB/storage/gpfile.New /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/goDB/storage/gpfile/gpfile.go:107
         0     0%   100%   520.04kB 19.54%  github.com/els0r/goProbe/pkg/goprobe/writeout.(*GoDBHandler).HandleWriteout.func1 /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/goprobe/writeout/godb.go:69
         0     0%   100%   520.04kB 19.54%  github.com/els0r/goProbe/pkg/goprobe/writeout.(*GoDBHandler).handleIfaceWriteout /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/goprobe/writeout/godb.go:104
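For illustration, a reusable buffer pool along those lines could look roughly like this (a simplified sketch with made-up names, not the actual MemPoolNoLimit implementation in gpfile): returned buffers are kept around for the next writeout instead of being released to the garbage collector.

    package bufpool

    // memPool hands out byte buffers and keeps returned ones for reuse.
    // Retaining the buffers (by design) trades a higher steady-state RES
    // for fewer allocations on subsequent writeouts.
    type memPool struct {
        buffers chan []byte
        bufSize int
    }

    func newMemPool(maxKept, bufSize int) *memPool {
        return &memPool{buffers: make(chan []byte, maxKept), bufSize: bufSize}
    }

    // Get returns a pooled buffer if one is available, otherwise it allocates.
    func (p *memPool) Get() []byte {
        select {
        case buf := <-p.buffers:
            return buf
        default:
            return make([]byte, p.bufSize)
        }
    }

    // Put keeps the buffer for later reuse (dropped only if the pool is full).
    func (p *memPool) Put(buf []byte) {
        select {
        case p.buffers <- buf:
        default: // pool already holds maxKept buffers, let the GC reclaim this one
        }
    }

Under heavily parallel writeouts, many callers hit Get at once while no buffers have been returned yet, so the number of live buffers grows to roughly one per concurrent writeout, which matches the behaviour described above.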

On top of that, we're running the rotation of all interfaces (in this case 75) in a RunGroup, so it's massively / aggressively parallelized. What I think is happening is that goProbe just very greedily creates memory buffers because it's handling a lot of writeouts in parallel. Since they are kept (to avoid re-allocation on the next run), memory grows a lot (not forever, of course, but probably more than it needs to).

My suggestion would be to make the writeout / rotation in the CaptureManager sequential (or even better: allow for a fixed / maximum number of parallel workers). Not only will that solve this particular problem, but it's probably not the best idea (from an I/O perspective) to run 75 (or even more) writeouts to disk in parallel anyway :see_no_evil: ...
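A bounded-worker variant of that suggestion could look roughly like this (a sketch with hypothetical names, not the actual CaptureManager / RunGroup code): a semaphore channel caps how many rotations / writeouts are in flight at once.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // rotateAndWriteout stands in for the per-interface rotation + writeout.
    func rotateAndWriteout(iface string) {
        time.Sleep(10 * time.Millisecond) // simulate disk I/O
        fmt.Println("wrote out", iface)
    }

    func main() {
        ifaces := []string{"eth0", "eth1", "eth2"} // in the scenario above: 75 interfaces

        const maxWorkers = 4 // fixed upper bound on parallel writeouts
        sem := make(chan struct{}, maxWorkers)
        var wg sync.WaitGroup

        for _, iface := range ifaces {
            iface := iface // capture loop variable
            wg.Add(1)
            sem <- struct{}{} // blocks once maxWorkers writeouts are in flight
            go func() {
                defer wg.Done()
                defer func() { <-sem }()
                rotateAndWriteout(iface)
            }()
        }
        wg.Wait()
    }

This also naturally caps how many of the reusable buffers described above can be requested at the same time.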

Sidenote: If we are lazy we might also switch to the "limited" memory buffer approach (MemPoolLimit) and basically "steer" the maximum number of parallel buffers (which will in turn limit the number of parallel writeouts), but that doesn't feel right IMHO (because it's implicit process flow handling by means of memory limits), plus I'd be afraid that it could cause weird / uncontrollable limitations (or even deadlocks).

@els0r What do you think?

els0r commented 1 year ago

@els0r What do you think?

Thanks for looking into this so quickly!

I think this is spot on. And "leak" was maybe a bit of a charged word; "greedy" does feel more appropriate. Your suggestion sounds good: we use a standard fan-out / fan-in to do the writeout and see what impact this has on timing and memory consumption. As long as the writeout completes in under 10 s, we should be good. If we keep memory consumption below 500 MB, all the better.

Current writeout for all 75 interfaces takes 26 ms (which underlines the need to instrument goProbe with metrics so this can be measured properly).
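Until proper metrics are in place, the writeout duration could be tracked with a standard Prometheus histogram; a sketch under the assumption that the hypothetical goprobe_writeout_duration_seconds metric and the doWriteout placeholder stand in for the real writeout path:

    package main

    import (
        "log"
        "net/http"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // writeoutDuration records how long a full writeout across all interfaces takes.
    var writeoutDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "goprobe_writeout_duration_seconds", // hypothetical metric name
        Help:    "Duration of a full flow writeout across all interfaces.",
        Buckets: prometheus.DefBuckets,
    })

    func main() {
        prometheus.MustRegister(writeoutDuration)
        http.Handle("/metrics", promhttp.Handler())
        go func() { log.Fatal(http.ListenAndServe(":9090", nil)) }()

        for {
            start := time.Now()
            doWriteout() // placeholder for the actual writeout of all interfaces
            writeoutDuration.Observe(time.Since(start).Seconds())
            time.Sleep(5 * time.Minute)
        }
    }

    func doWriteout() { time.Sleep(26 * time.Millisecond) } // simulates the 26 ms observed above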

Do you want to have a shot at this?

fako1024 commented 1 year ago

Do you want to have a shot at this?

Sure, I'll take a look. Out of curiosity: How does the old goProbe fare in this scenario / environment (to get a feeling for the "benchmark"), which I assume is running there anyway?

fako1024 commented 1 year ago

Mh, I may have been a bit too hasty in my assessment. The thing is: the writeouts (which trigger the buffer allocations) already happen sequentially, because they go through the writeout channel that feeds into the goDB handler, which in turn writes them out to disk one by one. Of course the rotations still happen in parallel, so chances are it really is just the vast number of temporary maps being created, but those should be reusable. Before I go on a wild goose chase, though, I'm actually a bit puzzled by the math here:

@els0r : You stated up there that you're running with a block_size of 10485760 (10 MiB) and num_blocks: 2, across (in this case) 75 interfaces.

If that is the case, then your upfront memory consumption should be 75 × 2 × 10 MiB ≈ 1.5 GiB (which is incidentally the last value I'm seeing in your top output, and that last writeout only increased RES from 1.4 to 1.5 GiB). So the overall RES size is not unexpected, but I always thought this would be allocated right away (because the memory gets mmap'ed right away, unlike what happens with pcap). I'll have to dig into that (god forbid this might depend on the kernel; I saw that you're still using a 5.x one, and who knows if e.g. demand paging for the TPacket allocation is handled differently), but it could also be that the interfaces are not all brought up immediately and are instead incrementally ramped up during each writeout (which is also when a config reload happens, right?).
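To make that arithmetic explicit (a small sketch using the figures from this thread):

    package main

    import "fmt"

    func main() {
        const (
            blockSize  = 10 * 1024 * 1024 // block_size: 10485760 (10 MiB)
            numBlocks  = 2                // num_blocks: 2
            interfaces = 75               // interfaces captured on this host
        )
        perIface := blockSize * numBlocks
        total := perIface * interfaces
        fmt.Printf("per interface: %d MiB, all ring buffers: %.1f GiB\n",
            perIface/(1024*1024), float64(total)/(1024*1024*1024))
        // Output: per interface: 20 MiB, all ring buffers: 1.5 GiB
    }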

Can you do me a favor and very closely check the logs (and / or the output directory / DB) for indications that not all interfaces are written right from the get-go? Also, can you try reducing the block size (e.g. to 1 MiB) and see whether that lowers the maximum RES?

If decreasing the block size reduces the maximum size, then it's definitely unrelated to rotation. If all interfaces come up immediately, then it's some funny mmap magic (which is generally not important, just good to know) and we just have to make sure to choose reasonable dimensions for the ring buffer sizes (c.f. #98)...

els0r commented 1 year ago

Heya. You might be spot on with the ring buffer. Apologies, I thought I was at 1 MiB already; I just missed the final zero when recalculating 1024*1024 and comparing it against the config.

Anyhow, below is the memory consumption over a runtime of 1 hour. The plot is synthesized from top -b -d 60 | grep goProbe:

(plot: gpv4_mem_over_time, goProbe memory consumption over the 1 h runtime)

Ring Buffer settings are 1 MiB, 2 blocks, 75 ifaces. The final line after 1 h is

 1333 root      20   0 5712480 184444 160312 S   0.2   0.6   0:10.97 goProbe

My assumption is that the ring buffer isn't fully allocated on start-up, or that top doesn't account for the allocation until packets actually fill the buffer. In any case, memory consumption looks stable.
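One way to check that second assumption (a Linux-only sketch relying on /proc, not on anything goProbe-specific): compare the virtual and the resident size of the process, since mmap'ed ring buffer pages only count towards RES once they have actually been touched by incoming packets.

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    // printMem prints VmSize (mapped) and VmRSS (resident) for this process.
    // Ring buffer pages that are mapped but untouched show up in VmSize only.
    func printMem() {
        f, err := os.Open("/proc/self/status")
        if err != nil {
            fmt.Println("could not read /proc (not on Linux?):", err)
            return
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            line := scanner.Text()
            if strings.HasPrefix(line, "VmSize:") || strings.HasPrefix(line, "VmRSS:") {
                fmt.Println(line)
            }
        }
    }

    func main() { printMem() }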

Sporadic checks for drops showed 0 drops, so the buffer size seems to be sufficient. Will validate this on a larger host with more interfaces.

els0r commented 1 year ago

Out of curiosity: How does the old goProbe fare in this scenario / environment (to get a feeling for the "benchmark"), which I assume is running there anyway?

Unfortunately, it wasn't running there, so I have no way of comparing.

els0r commented 1 year ago

Same graph with a longer runtime to further show the long-term development of memory. The slight, asymptotic increase I can only attribute to flow maps growing slightly larger than their predecessors, leading to a minimal increase in allocated memory for the writeout.

(On that note: you see how ChatGPT generates different plotting code depending on the size of the data set 😆)

(plot: memory consumption over the longer runtime)

fako1024 commented 1 year ago

Thanks a lot for cross-checking, the result looks encouraging. If you look at the time frame from 6000 s onwards, memory does indeed seem to have saturated. It might still be helpful to have it run for a day or so with profiling turned on (simply using our integrated setup, -profiling-output-dir). I'd be interested in the memory / CPU usage over a longer period of time on a busy host, since for me it's impossible to generate adequate debugging information like that on my tiny sensor "fleet". Do you think that's an option @els0r ?

els0r commented 1 year ago

Do you think that's an option @els0r ?

Within the realm of the possible. The CPU footprint is negligible, so I can easily run it for a day.

els0r commented 1 year ago

Profiles were also sent by mail. Below is a quick visual representation of what went down. It didn't run for a full day, but long enough to provide meaningful insights, I would say.

Memory

(memory profile plot)

CPU

(CPU profile plot)

fako1024 commented 1 year ago

@els0r Amazing, thanks for the quick turnaround! Looks really promising (I also checked the inuse_space profile; basically everything was neatly cleaned up except, apparently, for one capture routine):

 2723.33kB 29.23% 29.23%  2723.33kB 29.23%  github.com/els0r/goProbe/pkg/types/hashmap.NewHint /usr/src/packages/src/sdwan/security/pkg/OSAGtmi/goProbe/pkg/types/hashmap/hashmap.go:84

-> This is from one ongoing rotation that seems not to have been cleaned up properly (maybe we have a high-level race condition somewhere that allows killing an interface while it's in rotation). No corruption should happen, but it might lead to inconsistencies (as in: one interface having more blocks than others)... I'll check whether I can find anything obvious that we overlooked, but it's certainly not a big deal.

Aside from that: awesome - goProbe basically spends most of its CPU on disk operations; only about 1% is spent polling, and about 2% on the entire packet grabbing. I wouldn't have thought we'd get there. I'm really looking forward to a host with more throughput, this seems almost boring ^^.

The memory profile is interesting: there are a couple of allocations I wouldn't have guessed to amount to such a large share of the samples (like the block metadata slices), and a few really trivial ones that can easily be avoided. Obviously it doesn't matter a great deal, but I'll still check whether a few things can be avoided / reused. It would definitely be easy to pool / reuse the contiguous memory areas for the hashmap(s) recreated on each rotation and thereby avoid something like half of all allocations.
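The pooling idea at the end of that paragraph could look roughly like this (a sketch with made-up names; the actual hashmap package layout may differ): keep the flat backing slices of the flow maps alive across rotations instead of re-allocating them each time.

    package main

    import "sync"

    // flowEntry is a stand-in for whatever the flow map stores per key.
    type flowEntry struct {
        packets, bytes uint64
    }

    // entryPool keeps backing slices alive across rotations, so each rotation
    // can reuse contiguous memory instead of allocating it from scratch.
    var entryPool = sync.Pool{
        New: func() any {
            s := make([]flowEntry, 0, 4096) // capacity hint; tune to observed map sizes
            return &s
        },
    }

    func rotate() {
        entries := entryPool.Get().(*[]flowEntry)
        *entries = (*entries)[:0] // reuse the capacity, drop the old contents

        // ... fill *entries during this rotation / writeout ...

        entryPool.Put(entries) // hand the backing array back for the next rotation
    }

    func main() { rotate() }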

I'll check a few things and file a PR against this one. Thanks again for the helpful insight into some real-world data! :heart:

els0r commented 1 year ago

No worries. More to come as part of #74. Let's just say I have my eyes on a few interfaces on a high-throughput host 😏. Stay tuned.