Closed fako1024 closed 1 year ago
Note: Efforts in this issue show that right now the fastest way of extracting the required information in goProbe
seems to be to use NextIPPacketZeroCopy()
, which slightly outperforms the functional approach (probably because the latter has some function overhead).
Benchmark stats collection from TestMockPacketCapturePerformance
(run on a rather slow machine to allow for higher granularity):
Initial / Baseline: 143.9 ± 0.4 ns / packet
Switch to `IPLayer`: 131.2 ± 2.9 ns / packet
Switch to zero-copy: 115.3 ± 0.5 ns / packet
New baseline for benchmarks in capture
/ e2etest
:
TestMockPacketCapturePerformance: 146 ± 1 ns / packet
TestBenchmarkCaptureThroughput/random: 355 ± 2 ns / packet
TestBenchmarkCaptureThroughput/non-random: 145 ± 2 ns / packet
First step / commit (444b89480bc7c3ee9d95456312849a774ae73ea0) - Switch packet capture to IP layer / zero-copy mode:
TestMockPacketCapturePerformance: 116 ± 1 ns / packet (~ -20%)
TestBenchmarkCaptureThroughput/random: 319 ± 1 ns / packet (~ -10%)
TestBenchmarkCaptureThroughput/non-random: 112 ± 2 ns / packet (~ -22%)
TestBenchmarkCaptureThroughput/random+return: 319 ± 1 ns / packet
TestBenchmarkCaptureThroughput/non-random+return: 137 ± 1 ns / packet
(additional benchmarks added for comparison with next step)
Second step / commit (https://github.com/els0r/goProbe/commit/d051da3100f948e48cec6932f89e48f138341a6f) - Completely remove GPPacket
and interact directly with GPFlow
s (and only on-demand calculation of reverse hashes):
TestMockPacketCapturePerformance: 116 ± 1 ns / packet
TestBenchmarkCaptureThroughput/random: 322 ± 4 ns / packet
TestBenchmarkCaptureThroughput/non-random: 111 ± 1 ns / packet
TestBenchmarkCaptureThroughput/random+return: 321 ± 5 ns / packet
TestBenchmarkCaptureThroughput/non-random+return: 136 ± 1 ns / packet
Note: The keen observer will realize that there is close to no gain in terms of performance - I had hoped for a bigger impact, but it seems that the unnecessary calculation of the reverse hash is not substantial. However, since it removes a redundant layer (and will enable something else, see later) I think it's still worth doing.
Third step / commit (https://github.com/els0r/goProbe/commit/2fc0e4921aab384382a65937d929e0784a2b1b41) - Replace capture routine select{}
statement with more basic approach with less overhead:
TestMockPacketCapturePerformance: 108 ± 1 ns / packet (~ -7%)
TestBenchmarkCaptureThroughput/random: 314 ± 3 ns / packet (~ -3%)
TestBenchmarkCaptureThroughput/non-random: 103 ± 1 ns / packet (~ -7%)
TestBenchmarkCaptureThroughput/random+return: 313 ± 3 ns / packet (~ -3%)
TestBenchmarkCaptureThroughput/non-random+return: 129 ± 3 ns / packet (~ -5%)
Extra step (no commit, since will be taken care of in linked issue for slimcap
) - Remove debugging statements / checks in afring
source:
TestMockPacketCapturePerformance: 103 ± 1 ns / packet (~ -5%)
TestBenchmarkCaptureThroughput/random: 309 ± 3 ns / packet (~ -2%)
TestBenchmarkCaptureThroughput/non-random: 98 ± 1 ns / packet (~ -5%)
TestBenchmarkCaptureThroughput/random+return: 307 ± 1 ns / packet (~ -2%)
TestBenchmarkCaptureThroughput/non-random+return: 123 ± 1 ns / packet (~ -5%)
Bonus validation: Number of memory allocations in total for TestMockPacketCapturePerformance (alloc_space
+ alloc_objects
):
Showing nodes accounting for 4MB, 100% of 4MB total
flat flat% sum% cum cum%
4MB 100% 100% 4MB 100% github.com/fako1024/slimcap/capture/afpacket/afring.NewMockSource /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring_mock.go:83
0 0% 100% 4MB 100% github.com/els0r/goProbe/pkg/capture.TestMockPacketCapturePerformance /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_test.go:166
0 0% 100% 4MB 100% github.com/fako1024/slimcap/capture/afpacket/afring.NewMockSourceNoDrain /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring_mock_nodrain.go:28
0 0% 100% 4MB 100% testing.tRunner /usr/local/go/src/testing/testing.go:1576
Showing nodes accounting for 1, 100% of 1 total
flat flat% sum% cum cum%
1 100% 100% 1 100% github.com/fako1024/slimcap/capture/afpacket/afring.NewMockSource /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring_mock.go:83
0 0% 100% 1 100% github.com/els0r/goProbe/pkg/capture.TestMockPacketCapturePerformance /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_test.go:166
0 0% 100% 1 100% github.com/fako1024/slimcap/capture/afpacket/afring.NewMockSourceNoDrain /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring_mock_nodrain.go:28
0 0% 100% 1 100% testing.tRunner /usr/local/go/src/testing/testing.go:1576
The whole processing is now free of allocations (of course there are allocations for the flow map, during rotation and writeout, but those can't be avoided for obvious reasons).
Final cross-check: Running goProbe
on my system simultaneously with the old pcap
-style version (and running a couple of high-throughput downloads in the background) I now see a factor of ~x14
in terms of CPU usage:
232237 root 20 0 1547308 27928 21248 S 0.0 0.2 0:00.78 ./goProbe -config /home/fako/Develop/scratch/private/goprobe_afpacket.conf
232238 root 20 0 1640044 50860 34176 S 0.0 0.3 0:11.25 /tmp/goProbe/cmd/goProbe/goProbe -config /home/fako/Develop/scratch/private/goprobe_pcap.conf
This issue tracks potential micro- (assuming those are the only ones left in the end) optimizations to add prior to release (probably in conjunction with #74 ).
NextPacket
population incapture.go
(see e.g. https://github.com/fako1024/slimcap/issues/8)Rationale: The linked issue already improves performance by about 14% with only few changes. It should be noted though that the zero-allocation approach using
NextPacketFn
would be even better, but unfortunately using that via thecapture.Source
interface prevents the compiler from keeping theGPPacket
on the stack (by design, there's nothing we can do about it), which kills any potential performance improvements by almost an order of magnitude. Maybe this could be avoided by skippingGPPacket
entirely and just interacting directly with the flow map (see also below).ZeroCopy()
option / mode for ring buffer source (issue)Rationale: Currently data is copied from the ring buffer, no matter if a buffer has been provided to
NextPacket()
/NextIPPacket()
. Since the data in the ring buffer is invalidated only upon the next call to those methods we can re-add a real zero-copy mode and then doing:where
EPHash
on demand onlyRationale: Right now we determine both
EPHash
andEPHashReverse
in parallel when populatingGPPacket
. However, not only is it probably faster to just swap the relevant parts of the hash to "convert" the forward hash to the reverse hash (in place), it could also be done if required only (i.e. if there is no match for the forward hash in the flow map, because otherwise there's no reason to calculate it at all). We of course can't know which one is more likely to occur (at least not without additional tracking), but statistically speaking we can skip the reverse hash calculation in ~50% of the cases (ok, that's probably not the right number, but we can save a substantial amount of calculations / memory operations).Rationale: We are currently using acapture.Source
(respectivelycapture.SourceZeroCopy
) interface type for thecapture.Capture
type. While being (semi-)generic it also add the interface overhead for each call to any method of the underlying source. At the level we are currently at it might be reasonable to use an explicit type instead (maybe wrapped in a local named type in order to support future changes quickly but allow for specific compiler-level checks & optimizations).select{}
statementRationale: Prior to implementing the two-way lock for the capture routine we had to continuously check if a lock was pending / had to be acted on. Technically this is not required anymore (because the lock is signified via a channel with depth 1, so although being atomic the command can be consumed out-of-band in the capture routine, making a synchronous
select{}
statement obsolete.Rationale: Even though final removal of debugging statements will be done in https://github.com/fako1024/slimcap/issues/6 at a later point in time, it would be good to know how this change will impact performance (positively, I assume) once completed.
...