els0r / goProbe

High-performance IP packet metadata aggregation and efficient storage and querying of flows
GNU General Public License v2.0
12 stars 4 forks source link

Micro-optimization collection issue #83

Closed fako1024 closed 1 year ago

fako1024 commented 1 year ago

This issue tracks potential micro- (assuming those are the only ones left in the end) optimizations to add prior to release (probably in conjunction with #74 ).

Rationale: The linked issue already improves performance by about 14% with only few changes. It should be noted though that the zero-allocation approach using NextPacketFn would be even better, but unfortunately using that via the capture.Source interface prevents the compiler from keeping the GPPacket on the stack (by design, there's nothing we can do about it), which kills any potential performance improvements by almost an order of magnitude. Maybe this could be avoided by skipping GPPacket entirely and just interacting directly with the flow map (see also below).

Rationale: Currently data is copied from the ring buffer, no matter if a buffer has been provided to NextPacket() / NextIPPacket(). Since the data in the ring buffer is invalidated only upon the next call to those methods we can re-add a real zero-copy mode and then doing:

// Populate the packet
data = s.curTPacketHeader.payloadNoCopyAtOffset(uint32(s.ipLayerOffset), uint32(snapLen))
//s.curTPacketHeader.payloadCopyPutAtOffset(data, uint32(s.ipLayerOffset))

where

func (t tPacketHeader) payloadNoCopyAtOffset(offset, to uint32) []byte {
    mac := uint32(*(*uint16)(unsafe.Pointer(&t.data[t.ppos+24])))
    return t.data[t.ppos+mac+offset : t.ppos+mac+to]
}

Rationale: Right now we determine both EPHash and EPHashReverse in parallel when populating GPPacket. However, not only is it probably faster to just swap the relevant parts of the hash to "convert" the forward hash to the reverse hash (in place), it could also be done if required only (i.e. if there is no match for the forward hash in the flow map, because otherwise there's no reason to calculate it at all). We of course can't know which one is more likely to occur (at least not without additional tracking), but statistically speaking we can skip the reverse hash calculation in ~50% of the cases (ok, that's probably not the right number, but we can save a substantial amount of calculations / memory operations).

Rationale: We are currently using a capture.Source (respectively capture.SourceZeroCopy) interface type for the capture.Capture type. While being (semi-)generic it also add the interface overhead for each call to any method of the underlying source. At the level we are currently at it might be reasonable to use an explicit type instead (maybe wrapped in a local named type in order to support future changes quickly but allow for specific compiler-level checks & optimizations).

Rationale: Prior to implementing the two-way lock for the capture routine we had to continuously check if a lock was pending / had to be acted on. Technically this is not required anymore (because the lock is signified via a channel with depth 1, so although being atomic the command can be consumed out-of-band in the capture routine, making a synchronous select{} statement obsolete.

Rationale: Even though final removal of debugging statements will be done in https://github.com/fako1024/slimcap/issues/6 at a later point in time, it would be good to know how this change will impact performance (positively, I assume) once completed.

...

fako1024 commented 1 year ago

Note: Efforts in this issue show that right now the fastest way of extracting the required information in goProbe seems to be to use NextIPPacketZeroCopy(), which slightly outperforms the functional approach (probably because the latter has some function overhead).

fako1024 commented 1 year ago

Benchmark stats collection from TestMockPacketCapturePerformance (run on a rather slow machine to allow for higher granularity):

Initial / Baseline:  143.9 ± 0.4 ns / packet
Switch to `IPLayer`: 131.2 ± 2.9 ns / packet
Switch to zero-copy: 115.3 ± 0.5 ns / packet
fako1024 commented 1 year ago

New baseline for benchmarks in capture / e2etest:

TestMockPacketCapturePerformance:          146 ± 1 ns / packet
TestBenchmarkCaptureThroughput/random:     355 ± 2 ns / packet
TestBenchmarkCaptureThroughput/non-random: 145 ± 2 ns / packet
fako1024 commented 1 year ago

First step / commit (444b89480bc7c3ee9d95456312849a774ae73ea0) - Switch packet capture to IP layer / zero-copy mode:

TestMockPacketCapturePerformance:                 116 ± 1 ns / packet (~ -20%)
TestBenchmarkCaptureThroughput/random:            319 ± 1 ns / packet (~ -10%)
TestBenchmarkCaptureThroughput/non-random:        112 ± 2 ns / packet (~ -22%)
TestBenchmarkCaptureThroughput/random+return:     319 ± 1 ns / packet
TestBenchmarkCaptureThroughput/non-random+return: 137 ± 1 ns / packet

(additional benchmarks added for comparison with next step)

fako1024 commented 1 year ago

Second step / commit (https://github.com/els0r/goProbe/commit/d051da3100f948e48cec6932f89e48f138341a6f) - Completely remove GPPacket and interact directly with GPFlows (and only on-demand calculation of reverse hashes):

TestMockPacketCapturePerformance:                 116 ± 1 ns / packet
TestBenchmarkCaptureThroughput/random:            322 ± 4 ns / packet
TestBenchmarkCaptureThroughput/non-random:        111 ± 1 ns / packet
TestBenchmarkCaptureThroughput/random+return:     321 ± 5 ns / packet
TestBenchmarkCaptureThroughput/non-random+return: 136 ± 1 ns / packet

Note: The keen observer will realize that there is close to no gain in terms of performance - I had hoped for a bigger impact, but it seems that the unnecessary calculation of the reverse hash is not substantial. However, since it removes a redundant layer (and will enable something else, see later) I think it's still worth doing.

fako1024 commented 1 year ago

Third step / commit (https://github.com/els0r/goProbe/commit/2fc0e4921aab384382a65937d929e0784a2b1b41) - Replace capture routine select{} statement with more basic approach with less overhead:

TestMockPacketCapturePerformance:                 108 ± 1 ns / packet  (~ -7%)
TestBenchmarkCaptureThroughput/random:            314 ± 3 ns / packet  (~ -3%)
TestBenchmarkCaptureThroughput/non-random:        103 ± 1 ns / packet  (~ -7%)
TestBenchmarkCaptureThroughput/random+return:     313 ± 3 ns / packet  (~ -3%)
TestBenchmarkCaptureThroughput/non-random+return: 129 ± 3 ns / packet  (~ -5%)
fako1024 commented 1 year ago

Extra step (no commit, since will be taken care of in linked issue for slimcap) - Remove debugging statements / checks in afring source:

TestMockPacketCapturePerformance:                 103 ± 1 ns / packet  (~ -5%)
TestBenchmarkCaptureThroughput/random:            309 ± 3 ns / packet  (~ -2%)
TestBenchmarkCaptureThroughput/non-random:         98 ± 1 ns / packet  (~ -5%)
TestBenchmarkCaptureThroughput/random+return:     307 ± 1 ns / packet  (~ -2%)
TestBenchmarkCaptureThroughput/non-random+return: 123 ± 1 ns / packet  (~ -5%)
fako1024 commented 1 year ago

Bonus validation: Number of memory allocations in total for TestMockPacketCapturePerformance (alloc_space + alloc_objects):

Showing nodes accounting for 4MB, 100% of 4MB total
      flat  flat%   sum%        cum   cum%
       4MB   100%   100%        4MB   100%  github.com/fako1024/slimcap/capture/afpacket/afring.NewMockSource /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring_mock.go:83
         0     0%   100%        4MB   100%  github.com/els0r/goProbe/pkg/capture.TestMockPacketCapturePerformance /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_test.go:166
         0     0%   100%        4MB   100%  github.com/fako1024/slimcap/capture/afpacket/afring.NewMockSourceNoDrain /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring_mock_nodrain.go:28
         0     0%   100%        4MB   100%  testing.tRunner /usr/local/go/src/testing/testing.go:1576
Showing nodes accounting for 1, 100% of 1 total
      flat  flat%   sum%        cum   cum%
         1   100%   100%          1   100%  github.com/fako1024/slimcap/capture/afpacket/afring.NewMockSource /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring_mock.go:83
         0     0%   100%          1   100%  github.com/els0r/goProbe/pkg/capture.TestMockPacketCapturePerformance /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_test.go:166
         0     0%   100%          1   100%  github.com/fako1024/slimcap/capture/afpacket/afring.NewMockSourceNoDrain /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring_mock_nodrain.go:28
         0     0%   100%          1   100%  testing.tRunner /usr/local/go/src/testing/testing.go:1576

The whole processing is now free of allocations (of course there are allocations for the flow map, during rotation and writeout, but those can't be avoided for obvious reasons).

fako1024 commented 1 year ago

Final cross-check: Running goProbe on my system simultaneously with the old pcap-style version (and running a couple of high-throughput downloads in the background) I now see a factor of ~x14 in terms of CPU usage:

 232237 root      20   0 1547308  27928  21248 S   0.0   0.2   0:00.78 ./goProbe -config /home/fako/Develop/scratch/private/goprobe_afpacket.conf
 232238 root      20   0 1640044  50860  34176 S   0.0   0.3   0:11.25 /tmp/goProbe/cmd/goProbe/goProbe -config /home/fako/Develop/scratch/private/goprobe_pcap.conf