Closed — fako1024 closed this issue 1 year ago
Initial attempt is quite successful: running a version of goProbe from https://github.com/els0r/goProbe/tree/49-proof-of-concept-for-slimcap next to one from develop on identical config files yields the exact same packet counts (so capture in general seems to work perfectly). However, despite the perfect match of packet counters there are slight variations in the byte counters across the board, so I've probably messed up the counting somewhere; will investigate.
But this seems very promising: after running for a while, the CPU consumption of the "new" goProbe is already only about half that of the standard one (and that's not even zero-copy):
```
119792 root 20 0 1664992 62608 50316 S 0.0 0.4 0:00.60 goProbe
119793 root 20 0 1593024 57444 43204 S 0.0 0.4 0:01.17 goProbe
```
The way the ring buffer is now set up, I fathom we can also significantly reduce the buffer size per interface and still see no packet drops (which would alleviate the fact that, unlike pcap, AF_PACKET requires us to allocate the full buffer right from the start). I'll have to experiment a bit on a host that has a little more load / traffic than mine, though. Maybe @els0r you could give it a shot (and cut the buffer size in the config file by, say, a factor of 10 for all interfaces)?
I also didn't properly adapt the state machine on this branch; I basically just took the shortest possible route to make it work without breaking anything. Any help you could provide on that would be really great - I'm sure this can be simplified a lot...
Never mind that, found the issue regarding the byte counts: I wasn't taking into account the interface-specific offset to the IP header. The newest commit fixes that. The results for this branch are now 1:1 identical to the ones from develop.
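For reference, the kind of fix involved can be sketched as a link-type-to-offset mapping - a minimal, hypothetical version below (the ARPHRD_* constants are from Linux's if_arp.h; slimcap's actual implementation may differ):

```go
package main

import "fmt"

// ipHeaderOffset returns the number of bytes between the start of a captured
// frame and the IP header for a given ARPHRD_* link type. If byte counters
// are derived from the IP header without applying this offset, packet counts
// still match but byte counts drift - the symptom described above.
// Hypothetical sketch, not slimcap's actual code.
func ipHeaderOffset(linkType int) int {
	switch linkType {
	case 1, 772: // ARPHRD_ETHER / ARPHRD_LOOPBACK: 14-byte Ethernet header
		return 14
	case 65534: // ARPHRD_NONE (e.g. tun devices): raw IP, no link header
		return 0
	default:
		return -1 // unsupported link type
	}
}

func main() {
	fmt.Println(ipHeaderOffset(1)) // Ethernet: 14
}
```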
Alright, the newest commit to this branch has an updated version of slimcap with migration to TPacket V3 and bulk packet processing. Now this is what we were looking for. A quick benchmark shows the following results on a relatively idle host (just a few background tabs running) and on one with a bit more traffic (downloading an ISO image at full speed / 250 Mbps), each measured after the download completed:
IDLE:
```
362416 root 20 0 1632120 27688 24556 S 0.0 0.2 0:00.12 goProbe_slimcap -config goprobe_slimcap.conf
362417 root 20 0 1642440 29832 19188 S 0.0 0.2 0:00.87 goProbe_gopacket -config goprobe_gopacket.conf
```
LOAD:
```
367320 root 20 0 1631800 9276 7104 S 0.0 0.1 0:00.60 goProbe_slimcap -config goprobe_slimcap.conf
367321 root 20 0 1642696 44940 27020 S 0.0 0.3 0:03.60 goProbe_gopacket -config goprobe_gopacket.conf
```
So we are looking at an improvement of about 6x w.r.t. the current develop branch. A quick CPU profile doesn't even yield a proper number of samples because it basically doesn't spend enough time doing anything (or, put differently: I didn't run it long enough under load to collect enough samples), but I think this is the way to go. Will look into some more details / micro-optimizations once I get a proper profile. @els0r happy trails with the state machine - I just circumvented basically everything for now so that it doesn't cause any overhead. This needs some love, of course...
Last update for the day: I finally wrapped my head around the whole buffer / block / frame magic that happens in the background when using the V3 TPacket ring buffer. After some changes and a fix, results are now even a bit faster (x6.3) while using less buffer memory. On top of that, and contrary to the default settings of gopacket with TPacket V3, zero packet drops are observed, while current develop shows 42236 / 2196791 dropped packets during capture (about 2% - although, in all fairness, those drops could probably be fixed or at least reduced by adapting the parameters to similar values there as well):
```
514640 root 20 0 1632364 18336 16332 S 0.0 0.1 0:01.01 goProbe_slimcap -config goprobe_slimcap.conf
514641 root 20 0 1642440 43948 26984 S 0.0 0.3 0:06.31 goProbe_gopacket -config goprobe_gopacket.conf
```
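For the record, the buffer / block / frame relationship in a TPACKET_V3 ring can be sketched as follows (names and numbers are illustrative only, not slimcap's actual values): the total ring is split into blocks, each block holds many frames, and the kernel hands whole blocks to user space - which is what enables the bulk processing.

```go
package main

import "fmt"

// ringGeometry models how a TPACKET_V3 ring buffer is carved up.
// Field names mirror the kernel's tpacket_req3 parameters; the
// concrete values used below are made up for illustration.
type ringGeometry struct {
	blockSize int // tp_block_size: must be a multiple of the page size
	numBlocks int // tp_block_nr: total ring = blockSize * numBlocks
	frameSize int // tp_frame_size: upper bound for a single captured frame
}

// totalSize is the amount of memory mapped for the ring up front
// (the full allocation AF_PACKET requires from the start).
func (g ringGeometry) totalSize() int { return g.blockSize * g.numBlocks }

// framesPerBlock is how many frames the kernel can batch into one block
// before handing it over - the source of the bulk-processing speedup.
func (g ringGeometry) framesPerBlock() int { return g.blockSize / g.frameSize }

func main() {
	g := ringGeometry{blockSize: 1 << 20, numBlocks: 4, frameSize: 2048}
	fmt.Println(g.totalSize(), g.framesPerBlock()) // 4194304 512
}
```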
The latest changes now bring us to a performance increase of >1000% w.r.t. develop in the aforementioned simple scenario (mostly thanks to the assembly magic added today, which further reduces the overhead of the SYSCALL, plus a few other micro-optimizations):
```
682327 root 20 0 1471524 16436 14756 S 0.0 0.1 0:00.57 goProbe_slimcap -config goprobe_slimcap.conf
682328 root 20 0 1642440 49384 31960 S 0.0 0.3 0:06.34 goProbe_gopacket -config goprobe_gopacket.conf
```
In a more aggressive scenario (firing up 8 parallel iperf TCP connections between two hosts on my internal network) the results are similar:
```
682903 root 20 0 1545832 13712 11464 S 0.0 0.1 0:00.17 goProbe_slimcap -config goprobe_slimcap.conf
682904 root 20 0 1642696 33184 15408 S 0.0 0.2 0:01.70 goProbe_gopacket -config goprobe_gopacket.conf
```
In addition, the variant (re-)using a packet buffer for each packet is now basically free of any allocations during packet processing / population:
```
Showing nodes accounting for 3262.98kB, 100% of 3262.98kB total
Showing top 5 nodes out of 13
     flat  flat%   sum%      cum   cum%
1566.21kB 48.00% 48.00% 1566.21kB 48.00% syscall.NetlinkRIB /usr/local/go/src/syscall/netlink_linux.go:87
1184.27kB 36.29% 84.29% 1184.27kB 36.29% runtime/pprof.StartCPUProfile /usr/local/go/src/runtime/pprof/pprof.go:793
 512.50kB 15.71%   100%  512.50kB 15.71% syscall.ParseNetlinkRouteAttr /usr/local/go/src/syscall/netlink_linux.go:166
        0     0%   100% 2078.71kB 63.71% github.com/els0r/goProbe/pkg/capture.(*Capture).initialize /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:495
        0     0%   100% 2078.71kB 63.71% github.com/els0r/goProbe/pkg/capture.(*Capture).process.func2
```
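The allocation-free population mentioned above boils down to reusing a single pre-allocated buffer per capture routine and copying each frame into it, rather than allocating a fresh slice per packet. A minimal sketch (the `packet` type and its fields are made up for illustration):

```go
package main

import "fmt"

// packet re-uses one fixed payload buffer across all captured frames;
// populate copies into it instead of allocating, so steady-state
// processing performs zero heap allocations. Illustrative sketch only.
type packet struct {
	buf    []byte // allocated once, sized to the snap length
	curLen int    // number of valid bytes for the current frame
}

func (p *packet) populate(frame []byte) {
	p.curLen = copy(p.buf, frame) // truncates to the buffer (snap) length
}

func main() {
	p := &packet{buf: make([]byte, 64)} // one allocation, up front
	for _, frame := range [][]byte{{1, 2, 3}, {4, 5, 6, 7}} {
		p.populate(frame) // no allocations inside the loop
		fmt.Println(p.curLen)
	}
}
```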
@els0r as hinted: for further steps it would be great to get some comparisons (both w.r.t. performance and correctness of data) from hosts with a bit more action, ideally including `pprof.Lookup("allocs")` profiles.
As discussed. The numbers above look promising.
Before I dig into a more comparative analysis, a heads-up: the tunnel interfaces currently don't work on the machine I was testing on:
```
[INFO] Fri Feb 17 20:21:29 2023 Added interface 't4....' to capture list.
panic: Link Type 778 not supported (yet)

goroutine 7 [running]:
github.com/fako1024/slimcap/link.LinkType.IpHeaderOffset(...)
	/root/go/pkg/mod/github.com/fako1024/slimcap@v0.0.0-20230217141948-dd176935150a/link/link.go:36
github.com/fako1024/slimcap/capture/afpacket.NewRingBufSource({0xc00002d4a0?, 0xc00050aec0?}, {0xc00006be88, 0x3, 0x0?})
	/root/go/pkg/mod/github.com/fako1024/slimcap@v0.0.0-20230217141948-dd176935150a/capture/afpacket/afpacket_ring.go:54 +0x6f9
github.com/els0r/goProbe/pkg/capture.(*Capture).initialize(0xc000236120)
...
```
According to tcpdump.org:
> If the ARPHRD_ type is ARPHRD_IPGRE (778), the protocol type field contains a [GRE](https://www.rfc-editor.org/rfc/rfc2784.html) protocol type.
I guess this is "just" some more offset handling in slimcap itself.
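That offset handling would have to account for the variable-length GRE header. A sketch, assuming the flag layout from RFC 2784/2890 (the C, K and S bits each add 4 bytes to the 4-byte base header); this is not slimcap's actual implementation:

```go
package main

import "fmt"

// greHeaderLen derives the GRE header length from its first flag byte:
// 4-byte base header (flags/version + protocol type), plus 4 bytes each
// for the optional checksum (C), key (K) and sequence number (S) fields
// per RFC 2784/2890. Sketch of the extra offset handling needed for
// ARPHRD_IPGRE (778).
func greHeaderLen(flags byte) int {
	n := 4
	if flags&0x80 != 0 { // C bit: checksum + reserved
		n += 4
	}
	if flags&0x20 != 0 { // K bit: key
		n += 4
	}
	if flags&0x10 != 0 { // S bit: sequence number
		n += 4
	}
	return n
}

func main() {
	fmt.Println(greHeaderLen(0x00), greHeaderLen(0xb0)) // 4 16
}
```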
So much for that. Will proceed with profiling and status.
Statistics after two write-out periods (i.e. ~10 minutes of runtime):
```
29682 nobody 35 15 1774080 31328 13024 S 0.0 0.1 0:02.19 goProbe
29757 root 20 0 1703548 32788 28684 S 0.0 0.1 0:00.90 goProbe_slimcap
```
Confirms that goProbe + slimcap has a significantly smaller footprint.
As for the status, no packets are being dropped:
```json
{
  "logged_rcvd": 62676,
  "pcap_rcvd": 58510,
  "pcap_drop": 0,
  "pcap_ifdrop": 0,
  "iface_active": 2,
  "iface_total": 2,
  "last_writeout": 71.526597641,
  "ifaces": {
    "lacp0": {
      "state": 3,
      "stats": {
        "pcap": {
          "PacketsReceived": 29143,
          "PacketsDropped": 0,
          "PacketsIfDropped": 0
        },
        "packets_logged": 31214
      }
    },
    "lacp1": {
      "state": 3,
      "stats": {
        "pcap": {
          "PacketsReceived": 29367,
          "PacketsDropped": 0,
          "PacketsIfDropped": 0
        },
        "packets_logged": 31462
      }
    }
  }
}
```
```
- Checking existence of goProbe process.................[ OK ] pid=29682

Interface Capture Statistics:
  last writeout: 72s ago
  packets received: 62.06 K
  dropped by pcap: 0
  dropped by iface: 0

                                                          PKTS RCV  DROP  IF DROP
- lacp0 ................................................[ OK ]  30.77 K  0  0
- lacp1 ................................................[ OK ]  31.29 K  0  0
- Checking interface capture threads....................[ OK ]
```
goQuery output aligns, although not fully. This is mainly because the default goProbe config had quite a few filters set (such as `not ether proto 0x88cc and not stp and not arp and not icmp and not icmp6 and not host 224.0.0.18`).
Will provide the profiles via email.
Argl. Fundamentally rewriting something isn't trivial: please refer to https://github.com/els0r/goProbe/commit/51d740e8909686329b98c55178a7f6d051a37b9b for recent progress.
It is not safe for merging, but provides some background on what happens if you just pull the rug out from under the capture handle.
Not as pretty as I had hoped, but at least it is a little cleaner now in terms of what it does when, and it gets rid of checking its own state in every routine.
It also doesn't block program execution or send on closed channels anymore, so that part should be running smoothly now. Testing is appreciated, of course.
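The "no sends on closed channels" part typically comes down to selecting on a dedicated done channel rather than closing the data channel from the consumer side. A generic sketch of that pattern (not the actual goProbe state machine):

```go
package main

import "fmt"

// forward pushes values to results until done is closed. Selecting on
// done inside the send avoids both blocking program execution on a full
// channel and sending on a closed one - the two failure modes above.
func forward(results chan<- int, done <-chan struct{}, values []int) {
	for _, v := range values {
		select {
		case results <- v:
		case <-done:
			return // capture torn down: stop without touching results
		}
	}
}

func main() {
	results := make(chan int, 2)
	done := make(chan struct{})
	forward(results, done, []int{1, 2})
	close(done)
	forward(results, done, []int{3}) // results is full: exits via done
	fmt.Println(len(results))        // 2
}
```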
Profiles: taken at the end of a comparison run of goProbe on a host with about 200 interfaces:
```
8066 nobody 35 15 15.8g 138332 102912 S 10.9 0.1 1:10.22 goProbe
8272 root 20 0 15.8g 168236 123612 S 9.9 0.1 0:29.74 goProbe_slimcap
```
And that is with a pprof profiler running :) Also, the original goProbe has BPF filters set that already exclude quite some traffic.
This is something for tuning: there are still quite a few decoding errors when capturing on tunnel interfaces. I remember these errors from the early goProbe days, when we had to manually check the fragmentation header.
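That fragmentation check is straightforward for IPv4: a packet is a fragment if either the More Fragments flag is set or the fragment offset is non-zero (bytes 6-7 of the header). A sketch of the manual check, not goProbe's actual decoder:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// isIPv4Fragment reports whether an IPv4 header describes a fragment:
// the flags/fragment-offset field lives in bytes 6-7; masking out the
// DF bit (0x4000) leaves MF (0x2000) and the 13-bit offset (0x1fff).
func isIPv4Fragment(hdr []byte) bool {
	if len(hdr) < 8 {
		return false // too short to contain the fragment field
	}
	return binary.BigEndian.Uint16(hdr[6:8])&0x3fff != 0
}

func main() {
	frag := make([]byte, 20)
	frag[6] = 0x20 // MF set: first fragment of a fragmented packet
	fmt.Println(isIPv4Fragment(frag)) // true
}
```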
In order to properly assess feasibility / performance, we should implement a simple PoC for goProbe using github.com/fako1024/slimcap instead of gopacket.
DoD