els0r / goProbe

High-performance IP packet metadata aggregation and efficient storage and querying of flows
GNU General Public License v2.0
12 stars 4 forks source link

Several race conditions in CaptureManager #117

Closed fako1024 closed 1 year ago

fako1024 commented 1 year ago

After straightening out a couple of issues found in #88 via https://github.com/fako1024/slimcap/issues/33 running the E2E test on a basic flow (single interface, data being piped through a mock ring buffer source from a pcap file) revealed several data races, all of which seem to be related to the CaptureManager (respectively the state machine of the individual captures). The following ones I have found:

Previous read at 0x00c0000d4940 by goroutine 23: github.com/fako1024/slimcap/capture/afpacket/afring.(Source).nextPacket() /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring.go:331 +0x5e github.com/fako1024/slimcap/capture/afpacket/afring.(Source).NextPacket() /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring.go:136 +0x57 github.com/fako1024/slimcap/capture/afpacket/afring.(*MockSource).NextPacket()

:1 +0x6e github.com/els0r/goProbe/pkg/capture.(*Capture).capturePacket() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:446 +0x99 github.com/els0r/goProbe/pkg/capture.(*Capture).process() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:514 +0x125 github.com/els0r/goProbe/pkg/capture.capturing.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:352 +0x39 Goroutine 13 (running) created at: github.com/els0r/goProbe/pkg/capture.(*Capture).Run() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:315 +0x8e github.com/els0r/goProbe/pkg/capture.(*Manager).enable() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:108 +0x464 github.com/els0r/goProbe/pkg/capture.(*Manager).Update.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:210 +0x3e github.com/els0r/goProbe/pkg/capture.(*RunGroup).Run.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/rungroup.go:26 +0x73 Goroutine 23 (running) created at: github.com/els0r/goProbe/pkg/capture.capturing() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:352 +0x15b github.com/els0r/goProbe/pkg/capture.(*Capture).Run.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:324 +0xc9 ``` - [x] Access to flow map length while still running It seems that `process()` is still adding packets to the initial map while `rotate()` checks if there are any flows present at all (this is probably the easiest thing to fix because it seems to me that this pre-check is simply not guarded by the current rotation copy-on-write magic): ``` WARNING: DATA RACE Read at 0x00c000200db0 by goroutine 13: github.com/els0r/goProbe/pkg/capture.(*FlowLog).Len() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/flow_log.go:55 +0xc4 github.com/els0r/goProbe/pkg/capture.(*Capture).rotate() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:534 +0x74 github.com/els0r/goProbe/pkg/capture.captureCommandRotate.execute() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:203 +0x31 github.com/els0r/goProbe/pkg/capture.capturing() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:365 +0x2e2 github.com/els0r/goProbe/pkg/capture.(*Capture).Run.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:324 +0xc9 Previous write at 0x00c000200db0 by goroutine 23: runtime.mapassign_faststr() /usr/local/go/src/runtime/map_faststr.go:203 +0x0 github.com/els0r/goProbe/pkg/capture.(*FlowLog).Add() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/flow_log.go:115 +0x16b github.com/els0r/goProbe/pkg/capture.(*Capture).capturePacket() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:461 +0x40e github.com/els0r/goProbe/pkg/capture.(*Capture).process() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:514 +0x125 github.com/els0r/goProbe/pkg/capture.capturing.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:352 +0x39 Goroutine 13 (running) created at: github.com/els0r/goProbe/pkg/capture.(*Capture).Run() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:315 +0x8e github.com/els0r/goProbe/pkg/capture.(*Manager).enable() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:108 +0x464 github.com/els0r/goProbe/pkg/capture.(*Manager).Update.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:210 +0x3e github.com/els0r/goProbe/pkg/capture.(*RunGroup).Run.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/rungroup.go:26 +0x73 Goroutine 23 (running) created at: github.com/els0r/goProbe/pkg/capture.capturing() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:352 +0x15b github.com/els0r/goProbe/pkg/capture.(*Capture).Run.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:324 +0xc9 ``` - [x] Writeout handler still processing after / during `Close()` The writeout handler seems to still be attempting to perform its routines while / after the channel is being closed. Probably also not too tricky (I think I saw some random errors regarding sending on closed channel, maybe that's related): ``` WARNING: DATA RACE Write at 0x00c000100070 by goroutine 8: runtime.closechan() /usr/local/go/src/runtime/chan.go:357 +0x0 github.com/els0r/goProbe/pkg/goprobe/writeout.(*Handler).Close() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/goprobe/writeout/handler.go:57 +0x6e4 github.com/els0r/goProbe/pkg/e2etest.runGoProbe() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/e2etest/e2e_test.go:231 +0x6c5 github.com/els0r/goProbe/pkg/e2etest.testE2E() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/e2etest/e2e_test.go:162 +0x429 github.com/els0r/goProbe/pkg/e2etest.TestE2EBasic() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/e2etest/e2e_test.go:57 +0x165 testing.tRunner() /usr/local/go/src/testing/testing.go:1576 +0x216 testing.(*T).Run.func1() /usr/local/go/src/testing/testing.go:1629 +0x47 Previous read at 0x00c000100070 by goroutine 16: runtime.chansend() /usr/local/go/src/runtime/chan.go:160 +0x0 github.com/els0r/goProbe/pkg/goprobe/writeout.(*Handler).FullWriteout() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/goprobe/writeout/handler.go:97 +0x2e5 github.com/els0r/goProbe/pkg/goprobe/writeout.(*Handler).HandleRotations.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/goprobe/writeout/handler.go:160 +0x393 Goroutine 8 (running) created at: testing.(*T).Run() /usr/local/go/src/testing/testing.go:1629 +0x805 testing.runTests.func1() /usr/local/go/src/testing/testing.go:2036 +0x8d testing.tRunner() /usr/local/go/src/testing/testing.go:1576 +0x216 testing.runTests() /usr/local/go/src/testing/testing.go:2034 +0x87c testing.(*M).Run() /usr/local/go/src/testing/testing.go:1906 +0xb44 github.com/els0r/goProbe/pkg/e2etest.TestMain() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/e2etest/e2e_test.go:522 +0x12e main.main() _testmain.go:59 +0x324 Goroutine 16 (running) created at: github.com/els0r/goProbe/pkg/goprobe/writeout.(*Handler).HandleRotations() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/goprobe/writeout/handler.go:133 +0xfa github.com/els0r/goProbe/pkg/e2etest.runGoProbe() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/e2etest/e2e_test.go:225 +0x6a7 github.com/els0r/goProbe/pkg/e2etest.testE2E() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/e2etest/e2e_test.go:162 +0x429 github.com/els0r/goProbe/pkg/e2etest.TestE2EBasic() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/e2etest/e2e_test.go:57 +0x165 testing.tRunner() /usr/local/go/src/testing/testing.go:1576 +0x216 testing.(*T).Run.func1() /usr/local/go/src/testing/testing.go:1629 +0x47 ``` - [x] Access to packet stats while capture is running There is a race when accessing the (mock) stats in `rotate()` because packets are being processed (and hence the counter is accessed) at the same time. This is arguably more of a problem in the slimcap mock source (I'll address that one in a separate issue, I guess an atomic counter won't add too much overhead) but I'm still raising it here because it might hint at an issue with the state machine: ``` WARNING: DATA RACE Read at 0x00c0000962d8 by goroutine 21: github.com/fako1024/slimcap/capture/afpacket/socket.(*MockFileDescriptor).IncrementPacketCount() /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/socket/socket_mock.go:48 +0xa9b github.com/fako1024/slimcap/capture/afpacket/afring.(*MockSource).addPacket() /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring_mock.go:159 +0xa69 github.com/fako1024/slimcap/capture/afpacket/afring.(*MockSource).addPacket-fm() :1 +0x84 github.com/fako1024/slimcap/capture/pcap.(*Source).NextPacketFn() /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/pcap/pcap.go:150 +0xd4 github.com/fako1024/slimcap/capture/afpacket/afring.(*MockSource).AddPacketFromSource() /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring_mock.go:110 +0xc5 github.com/fako1024/slimcap/capture/afpacket/afring.(*MockSource).Pipe.func1() /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring_mock.go:193 +0xd6 github.com/fako1024/slimcap/capture/afpacket/afring.(*MockSource).Pipe.func2() /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring_mock.go:204 +0x47 Previous write at 0x00c0000962d8 by goroutine 13: github.com/fako1024/slimcap/capture/afpacket/socket.(*MockFileDescriptor).GetSocketStats() /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/socket/socket_mock.go:59 +0x104 github.com/fako1024/slimcap/event.(*Handler).GetSocketStats() /home/fako/Develop/go/src/github.com/fako1024/slimcap/event/poll.go:81 +0x4f github.com/fako1024/slimcap/capture/afpacket/afring.(*Source).Stats() /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring.go:268 +0xce github.com/fako1024/slimcap/capture/afpacket/afring.(*MockSource).Stats() :1 +0x51 github.com/els0r/goProbe/pkg/capture.(*Capture).tryGetCaptureStats() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:590 +0xc4 github.com/els0r/goProbe/pkg/capture.(*Capture).rotate() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:556 +0x351 github.com/els0r/goProbe/pkg/capture.captureCommandRotate.execute() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:203 +0x31 github.com/els0r/goProbe/pkg/capture.capturing() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:365 +0x2e2 github.com/els0r/goProbe/pkg/capture.(*Capture).Run.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:324 +0xc9 Goroutine 21 (running) created at: github.com/fako1024/slimcap/capture/afpacket/afring.(*MockSource).Pipe() /home/fako/Develop/go/src/github.com/fako1024/slimcap/capture/afpacket/afring/afring_mock.go:191 +0x19e github.com/els0r/goProbe/pkg/e2etest.newPcapSource.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/e2etest/e2e_test.go:393 +0x754 github.com/els0r/goProbe/pkg/e2etest.setupSources.func1.1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/e2etest/e2e_test.go:290 +0x93 github.com/els0r/goProbe/pkg/capture.initializing() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:337 +0xcd github.com/els0r/goProbe/pkg/capture.(*Capture).Run.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:324 +0xc9 Goroutine 13 (running) created at: github.com/els0r/goProbe/pkg/capture.(*Capture).Run() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:315 +0x8e github.com/els0r/goProbe/pkg/capture.(*Manager).enable() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:108 +0x464 github.com/els0r/goProbe/pkg/capture.(*Manager).Update.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:210 +0x3e github.com/els0r/goProbe/pkg/capture.(*RunGroup).Run.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/rungroup.go:26 +0x73 ``` - [x] Concurrent access to logger It seems that there is at least one race condition when interacting with the logger still (although I recall that the new logger should(tm) be race-free and safe for concurrent use): ``` ================== WARNING: DATA RACE Read at 0x00c000022228 by goroutine 176: github.com/els0r/goProbe/pkg/capture.(*Capture).Status() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:595 +0x64 github.com/els0r/goProbe/pkg/capture.(*Manager).Status.func2() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:292 +0x67 github.com/els0r/goProbe/pkg/capture.(*RunGroup).Run.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/rungroup.go:26 +0x73 Previous write at 0x00c000022228 by goroutine 173: github.com/els0r/goProbe/pkg/capture.(*Capture).setState() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:273 +0x14f github.com/els0r/goProbe/pkg/capture.initializing() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:297 +0x35 github.com/els0r/goProbe/pkg/capture.(*Capture).Run.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:291 +0xc9 Goroutine 176 (running) created at: github.com/els0r/goProbe/pkg/capture.(*RunGroup).Run() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/rungroup.go:24 +0xe4 github.com/els0r/goProbe/pkg/capture.(*Manager).Status() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:291 +0x138 github.com/els0r/goProbe/pkg/e2etest.mockIfaces.KillGoProbeOnceDone() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/e2etest/types.go:156 +0x6c github.com/els0r/goProbe/pkg/e2etest.runGoProbe.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/e2etest/e2e_test.go:216 +0x64 Goroutine 173 (running) created at: github.com/els0r/goProbe/pkg/capture.(*Capture).Run() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:282 +0x8e github.com/els0r/goProbe/pkg/capture.(*Manager).enable() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:110 +0x464 github.com/els0r/goProbe/pkg/capture.(*Manager).Update.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:212 +0x3e github.com/els0r/goProbe/pkg/capture.(*RunGroup).Run.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/rungroup.go:26 +0x73 ================== ================== WARNING: DATA RACE Read at 0x00c0000f4490 by goroutine 176: context.(*valueCtx).Value() /usr/local/go/src/context/context.go:623 +0x48 github.com/els0r/goProbe/pkg/logging.getFields() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/logging/init.go:213 +0x7b github.com/els0r/goProbe/pkg/logging.fromContext() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/logging/init.go:263 +0x5f github.com/els0r/goProbe/pkg/logging.FromContext() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/logging/init.go:289 +0x3e github.com/els0r/goProbe/pkg/capture.(*Capture).Status() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:595 +0x84 github.com/els0r/goProbe/pkg/capture.(*Manager).Status.func2() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:292 +0x67 github.com/els0r/goProbe/pkg/capture.(*RunGroup).Run.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/rungroup.go:26 +0x73 Previous write at 0x00c0000f4490 by goroutine 173: context.WithValue() /usr/local/go/src/context/context.go:593 +0x109 github.com/els0r/goProbe/pkg/logging.WithFields() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/logging/init.go:256 +0x508 github.com/els0r/goProbe/pkg/capture.(*Capture).setState() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:273 +0x13b github.com/els0r/goProbe/pkg/capture.initializing() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:297 +0x35 github.com/els0r/goProbe/pkg/capture.(*Capture).Run.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:291 +0xc9 Goroutine 176 (running) created at: github.com/els0r/goProbe/pkg/capture.(*RunGroup).Run() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/rungroup.go:24 +0xe4 github.com/els0r/goProbe/pkg/capture.(*Manager).Status() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:291 +0x138 github.com/els0r/goProbe/pkg/e2etest.mockIfaces.KillGoProbeOnceDone() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/e2etest/types.go:156 +0x6c github.com/els0r/goProbe/pkg/e2etest.runGoProbe.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/e2etest/e2e_test.go:216 +0x64 Goroutine 173 (running) created at: github.com/els0r/goProbe/pkg/capture.(*Capture).Run() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture.go:282 +0x8e github.com/els0r/goProbe/pkg/capture.(*Manager).enable() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:110 +0x464 github.com/els0r/goProbe/pkg/capture.(*Manager).Update.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/capture_manager.go:212 +0x3e github.com/els0r/goProbe/pkg/capture.(*RunGroup).Run.func1() /home/fako/Develop/go/src/github.com/els0r/goProbe/pkg/capture/rungroup.go:26 +0x73 ================== ``` @els0r TBH I'm not sure if that's all of them, but I think we'll have to urgently take a look at them (they are not cosmetic, the E2E test randomly fails even without race detection turned on, and since they are not solely related to the mock sources I'm pretty sure we might see weird issues in the wild at some point)... Maybe something we can discuss in a call as well?
fako1024 commented 1 year ago

Note: This is on my local working branch of #88 and uses the changes in this PR: https://github.com/fako1024/slimcap/pull/34

fako1024 commented 1 year ago

Raised https://github.com/fako1024/slimcap/issues/35 to cover packet counter issue.

fako1024 commented 1 year ago

@els0r I was playing around a little: The first two are quite easily solvable without introducing any performance penalty (albeit with a bit of a cringe because the require additional synchronization outside of the existing state machine / run group stuff). The slimcap counter issue I can easily solve as well. The writeout channel stuff is a bit more contrived from what I see. I'll keep you posted...

fako1024 commented 1 year ago

More issues (or rather details) found when assessing:

els0r commented 1 year ago

First off: thanks for looking into this and embracing the ugly. As discussed, let's try to rework this part and keep only what we need.

The thing with the logger is concerning. I wonder if it has to do with using c.ctx vs. a regular context. Definitely something we should look out for when restructuring the code base.

I want to get rid of storing the context inside the structs capture and captureManager and make it part of the function calls whenever possible.

fako1024 commented 1 year ago

First off: thanks for looking into this and embracing the ugly. As discussed, let's try to rework this part and keep only what we need.

The thing with the logger is concerning. I wonder if it has to do with using c.ctx vs. a regular context. Definitely something we should look out for when restructuring the code base.

I want to get rid of storing the context inside the structs capture and captureManager and make it part of the function calls whenever possible.

Context makes sense given the stack trace from the race, so maybe that's the culprit indeed. Nice digging!! :muscle: