Closed — fako1024 closed this issue 1 year ago
Initial attempt is quite successful: running a version of goProbe from https://github.com/els0r/goProbe/tree/49-proof-of-concept-for-slimcap next to one from develop on identical config files yields the exact same packet counts (so capture in general seems to work perfectly). However, despite the perfect match of packet counters there are slight variations in the byte counters across the board, so I've probably messed up the counting somewhere; will investigate.
But this seems very promising: after running for a while, the CPU consumption of the "new" goProbe is already only about half that of the standard one (and that's not even zero-copy):
```
119792 root 20 0 1664992 62608 50316 S 0.0 0.4 0:00.60 goProbe
119793 root 20 0 1593024 57444 43204 S 0.0 0.4 0:01.17 goProbe
```
The way the ring buffer is now set up, I fathom we can also significantly reduce the buffer size per interface and still see no packet drops (which would alleviate the fact that, unlike pcap, AF_PACKET requires us to allocate the full buffer right from the start). I'll have to experiment a bit on a host that has a little more load / traffic than mine, though. Maybe @els0r you could give it a shot (and cut the buffer size in the config file by, say, a factor of 10 for all interfaces)?
I also didn't properly adapt the state machine on this branch; I basically just took the shortest possible route to make it work without breaking anything. Any help you could provide on that would be really great - I'm sure this can be simplified a lot...
Never mind that, found the issue regarding the byte counts: I wasn't taking into account the interface-specific offset to the IP header. The newest commit fixes that. The results for this branch are now 1:1 identical to the ones from develop.
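For reference, the kind of fix involved can be sketched as a link-type-to-offset mapping - a minimal, hypothetical version below (the ARPHRD_* constants are from Linux's if_arp.h; slimcap's actual implementation may differ):

```go
package main

import "fmt"

// ipHeaderOffset returns the number of bytes between the start of a captured
// frame and the IP header for a given ARPHRD_* link type. If byte counters
// are derived from the IP header without applying this offset, packet counts
// still match but byte counts drift - the symptom described above.
// Hypothetical sketch, not slimcap's actual code.
func ipHeaderOffset(linkType int) int {
	switch linkType {
	case 1, 772: // ARPHRD_ETHER / ARPHRD_LOOPBACK: 14-byte Ethernet header
		return 14
	case 65534: // ARPHRD_NONE (e.g. tun devices): raw IP, no link header
		return 0
	default:
		return -1 // unsupported link type
	}
}

func main() {
	fmt.Println(ipHeaderOffset(1)) // Ethernet: 14
}
```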
Alright, the newest commit to this branch has an updated version of slimcap with migration to TPacket V3 and bulk packet processing. Now this is what we were looking for. A quick benchmark shows the following results on a relatively idle host (just a few background tabs running) and on one with a bit more traffic (downloading an ISO image at full speed / 250 Mbps), each measured after the download completed:
IDLE:
```
362416 root 20 0 1632120 27688 24556 S 0.0 0.2 0:00.12 goProbe_slimcap -config goprobe_slimcap.conf
362417 root 20 0 1642440 29832 19188 S 0.0 0.2 0:00.87 goProbe_gopacket -config goprobe_gopacket.conf
```
LOAD:
```
367320 root 20 0 1631800 9276 7104 S 0.0 0.1 0:00.60 goProbe_slimcap -config goprobe_slimcap.conf
367321 root 20 0 1642696 44940 27020 S 0.0 0.3 0:03.60 goProbe_gopacket -config goprobe_gopacket.conf
```
So we are looking at an improvement of about 6x w.r.t. the current develop branch. A quick CPU profile doesn't even yield a proper number of samples because it basically doesn't spend enough time doing anything (or, put differently: I didn't run it long enough under load to collect enough samples), but I think this is the way to go. Will look into some more details / micro-optimizations once I get a proper profile. @els0r happy trails with the state machine - I just circumvented basically everything for now so that it doesn't cause any overhead. This needs some love, of course...
Last update for the day: I finally wrapped my head around the whole buffer / block / frame magic that happens in the background when using the V3 TPacket ring buffer. After some changes and a fix, results are now even a bit faster (x6.3) while using less buffer memory. On top of that, and contrary to the default settings of gopacket with TPacket V3, zero packet drops are observed, while current develop shows 42236 / 2196791 dropped packets during capture (about 2% - although, in all fairness, those drops could probably be fixed or at least reduced by adapting the parameters to similar values there as well):
```
514640 root 20 0 1632364 18336 16332 S 0.0 0.1 0:01.01 goProbe_slimcap -config goprobe_slimcap.conf
514641 root 20 0 1642440 43948 26984 S 0.0 0.3 0:06.31 goProbe_gopacket -config goprobe_gopacket.conf
```
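For the record, the buffer / block / frame relationship in a TPACKET_V3 ring can be sketched as follows (names and numbers are illustrative only, not slimcap's actual values): the total ring is split into blocks, each block holds many frames, and the kernel hands whole blocks to user space - which is what enables the bulk processing.

```go
package main

import "fmt"

// ringGeometry models how a TPACKET_V3 ring buffer is carved up.
// Field names mirror the kernel's tpacket_req3 parameters; the
// concrete values used below are made up for illustration.
type ringGeometry struct {
	blockSize int // tp_block_size: must be a multiple of the page size
	numBlocks int // tp_block_nr: total ring = blockSize * numBlocks
	frameSize int // tp_frame_size: upper bound for a single captured frame
}

// totalSize is the amount of memory mapped for the ring up front
// (the full allocation AF_PACKET requires from the start).
func (g ringGeometry) totalSize() int { return g.blockSize * g.numBlocks }

// framesPerBlock is how many frames the kernel can batch into one block
// before handing it over - the source of the bulk-processing speedup.
func (g ringGeometry) framesPerBlock() int { return g.blockSize / g.frameSize }

func main() {
	g := ringGeometry{blockSize: 1 << 20, numBlocks: 4, frameSize: 2048}
	fmt.Println(g.totalSize(), g.framesPerBlock()) // 4194304 512
}
```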
The latest changes now bring us to a performance increase of >1000% w.r.t. develop in the aforementioned simple scenario (mostly thanks to the assembly magic added today, which further reduces the overhead of the SYSCALL, plus a few other micro-optimizations):
```
682327 root 20 0 1471524 16436 14756 S 0.0 0.1 0:00.57 goProbe_slimcap -config goprobe_slimcap.conf
682328 root 20 0 1642440 49384 31960 S 0.0 0.3 0:06.34 goProbe_gopacket -config goprobe_gopacket.conf
```
In a more aggressive scenario (firing up 8 parallel iperf TCP connections between two hosts on my internal network) the results are similar:
```
682903 root 20 0 1545832 13712 11464 S 0.0 0.1 0:00.17 goProbe_slimcap -config goprobe_slimcap.conf
682904 root 20 0 1642696 33184 15408 S 0.0 0.2 0:01.70 goProbe_gopacket -config goprobe_gopacket.conf
```
In addition, the variant (re-)using a packet buffer for each packet is now basically free of any allocations during packet processing / population:
```
Showing nodes accounting for 3262.98kB, 100% of 3262.98kB total
Showing top 5 nodes out of 13
     flat  flat%   sum%      cum   cum%
1566.21kB 48.00% 48.00% 1566.21kB 48.00% syscall.NetlinkRIB /usr/local/go/src/syscall/netlink_linux.go:87
1184.27kB 36.29% 84.29% 1184.27kB 36.29% runtime/pprof.StartCPUProfile /usr/local/go/src/runtime/pprof/pprof.go:793
 512.50kB 15.71%   100%  512.50kB 15.71% syscall.ParseNetlinkRouteAttr /usr/local/go/src/syscall/netlink_linux.go:166
        0     0%   100% 2078.71kB 63.71% github.com/els0r/goProbe/pkg/capture.(*Capture).initialize /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:495
        0     0%   100% 2078.71kB 63.71% github.com/els0r/goProbe/pkg/capture.(*Capture).process.func2
```
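The allocation-free population mentioned above boils down to reusing a single pre-allocated buffer per capture routine and copying each frame into it, rather than allocating a fresh slice per packet. A minimal sketch (the `packet` type and its fields are made up for illustration):

```go
package main

import "fmt"

// packet re-uses one fixed payload buffer across all captured frames;
// populate copies into it instead of allocating, so steady-state
// processing performs zero heap allocations. Illustrative sketch only.
type packet struct {
	buf    []byte // allocated once, sized to the snap length
	curLen int    // number of valid bytes for the current frame
}

func (p *packet) populate(frame []byte) {
	p.curLen = copy(p.buf, frame) // truncates to the buffer (snap) length
}

func main() {
	p := &packet{buf: make([]byte, 64)} // one allocation, up front
	for _, frame := range [][]byte{{1, 2, 3}, {4, 5, 6, 7}} {
		p.populate(frame) // no allocations inside the loop
		fmt.Println(p.curLen)
	}
}
```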
@els0r as hinted: for further steps it would be great to get some comparisons (both w.r.t. performance and correctness of data) from hosts with a bit more action, ideally including `pprof.Lookup("allocs")` profiles.
As discussed. The numbers above look promising.
Before I dig into a more comparative analysis, a heads-up: the tunnel interfaces currently don't work on the machine I was testing on:
```
[INFO] Fri Feb 17 20:21:29 2023 Added interface 't4....' to capture list.
panic: Link Type 778 not supported (yet)

goroutine 7 [running]:
github.com/fako1024/slimcap/link.LinkType.IpHeaderOffset(...)
	/root/go/pkg/mod/github.com/fako1024/slimcap@v0.0.0-20230217141948-dd176935150a/link/link.go:36
github.com/fako1024/slimcap/capture/afpacket.NewRingBufSource({0xc00002d4a0?, 0xc00050aec0?}, {0xc00006be88, 0x3, 0x0?})
	/root/go/pkg/mod/github.com/fako1024/slimcap@v0.0.0-20230217141948-dd176935150a/capture/afpacket/afpacket_ring.go:54 +0x6f9
github.com/els0r/goProbe/pkg/capture.(*Capture).initialize(0xc000236120)
...
```
According to tcpdump.org:
> If the ARPHRD_ type is ARPHRD_IPGRE (778), the protocol type field contains a [GRE](https://www.rfc-editor.org/rfc/rfc2784.html) protocol type.
I guess this is "just" some more offset handling in slimcap itself.
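That offset handling would have to account for the variable-length GRE header. A sketch, assuming the flag layout from RFC 2784/2890 (the C, K and S bits each add 4 bytes to the 4-byte base header); this is not slimcap's actual implementation:

```go
package main

import "fmt"

// greHeaderLen derives the GRE header length from its first flag byte:
// 4-byte base header (flags/version + protocol type), plus 4 bytes each
// for the optional checksum (C), key (K) and sequence number (S) fields
// per RFC 2784/2890. Sketch of the extra offset handling needed for
// ARPHRD_IPGRE (778).
func greHeaderLen(flags byte) int {
	n := 4
	if flags&0x80 != 0 { // C bit: checksum + reserved
		n += 4
	}
	if flags&0x20 != 0 { // K bit: key
		n += 4
	}
	if flags&0x10 != 0 { // S bit: sequence number
		n += 4
	}
	return n
}

func main() {
	fmt.Println(greHeaderLen(0x00), greHeaderLen(0xb0)) // 4 16
}
```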
So much for that. Will proceed with profiling and status.
Statistics after two write-out periods (i.e. ~10 minutes of runtime):
```
29682 nobody 35 15 1774080 31328 13024 S 0.0 0.1 0:02.19 goProbe
29757 root 20 0 1703548 32788 28684 S 0.0 0.1 0:00.90 goProbe_slimcap
```
Confirms that goProbe + slimcap has a significantly smaller footprint.
As for the status, no packets are being dropped:
```json
{
  "logged_rcvd": 62676,
  "pcap_rcvd": 58510,
  "pcap_drop": 0,
  "pcap_ifdrop": 0,
  "iface_active": 2,
  "iface_total": 2,
  "last_writeout": 71.526597641,
  "ifaces": {
    "lacp0": {
      "state": 3,
      "stats": {
        "pcap": {
          "PacketsReceived": 29143,
          "PacketsDropped": 0,
          "PacketsIfDropped": 0
        },
        "packets_logged": 31214
      }
    },
    "lacp1": {
      "state": 3,
      "stats": {
        "pcap": {
          "PacketsReceived": 29367,
          "PacketsDropped": 0,
          "PacketsIfDropped": 0
        },
        "packets_logged": 31462
      }
    }
  }
}
```
```
- Checking existence of goProbe process.................[ OK ] pid=29682

Interface Capture Statistics:
  last writeout: 72s ago
  packets received: 62.06 K
  dropped by pcap: 0
  dropped by iface: 0

                                                          PKTS RCV  DROP  IF DROP
- lacp0 ................................................[ OK ]  30.77 K  0  0
- lacp1 ................................................[ OK ]  31.29 K  0  0
- Checking interface capture threads....................[ OK ]
```
goQuery output aligns, although not fully. This is mainly because the default goProbe config had quite a few filters set (such as `not ether proto 0x88cc and not stp and not arp and not icmp and not icmp6 and not host 224.0.0.18`).
Will provide the profiles via email.
Argl. Fundamentally rewriting something isn't trivial: please refer to https://github.com/els0r/goProbe/commit/51d740e8909686329b98c55178a7f6d051a37b9b for recent progress.
It is not safe for merging, but provides some background on what happens if you just pull the rug out from under the capture handle.
Not as pretty as I had hoped, but at least it is a little cleaner now in terms of what it does when, and it gets rid of checking its own state in every routine.
It also doesn't block program execution or send on closed channels anymore, so that part should be running smoothly now. Testing is appreciated, of course.
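The "no sends on closed channels" part typically comes down to selecting on a dedicated done channel rather than closing the data channel from the consumer side. A generic sketch of that pattern (not the actual goProbe state machine):

```go
package main

import "fmt"

// forward pushes values to results until done is closed. Selecting on
// done inside the send avoids both blocking program execution on a full
// channel and sending on a closed one - the two failure modes above.
func forward(results chan<- int, done <-chan struct{}, values []int) {
	for _, v := range values {
		select {
		case results <- v:
		case <-done:
			return // capture torn down: stop without touching results
		}
	}
}

func main() {
	results := make(chan int, 2)
	done := make(chan struct{})
	forward(results, done, []int{1, 2})
	close(done)
	forward(results, done, []int{3}) // results is full: exits via done
	fmt.Println(len(results))        // 2
}
```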
Profiles: taken at the end of a comparison run of goProbe on a host with about 200 interfaces:
```
8066 nobody 35 15 15.8g 138332 102912 S 10.9 0.1 1:10.22 goProbe
8272 root 20 0 15.8g 168236 123612 S 9.9 0.1 0:29.74 goProbe_slimcap
```
And that is with a pprof profiler running :) Also, the original goProbe has BPF filters set that already exclude quite some traffic.
This is something for tuning: there are still quite a few decoding errors when capturing on tunnel interfaces. I remember these errors from the early goProbe days, when we had to manually check the fragmentation header.
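That fragmentation check is straightforward for IPv4: a packet is a fragment if either the More Fragments flag is set or the fragment offset is non-zero (bytes 6-7 of the header). A sketch of the manual check, not goProbe's actual decoder:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// isIPv4Fragment reports whether an IPv4 header describes a fragment:
// the flags/fragment-offset field lives in bytes 6-7; masking out the
// DF bit (0x4000) leaves MF (0x2000) and the 13-bit offset (0x1fff).
func isIPv4Fragment(hdr []byte) bool {
	if len(hdr) < 8 {
		return false // too short to contain the fragment field
	}
	return binary.BigEndian.Uint16(hdr[6:8])&0x3fff != 0
}

func main() {
	frag := make([]byte, 20)
	frag[6] = 0x20 // MF set: first fragment of a fragmented packet
	fmt.Println(isIPv4Fragment(frag)) // true
}
```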
In order to properly assess feasibility / performance, we should implement a simple PoC for goProbe using github.com/fako1024/slimcap instead of gopacket.
DoD