SiaFoundation / coreutils

Implementations of core Sia components
MIT License

Fix deadlock in syncer #100

Open ChrisSchinnerl opened 1 month ago

ChrisSchinnerl commented 1 month ago

A random deadlock that prevents our syncer from shutting down.

https://github.com/SiaFoundation/renterd/actions/runs/11030847304/job/30636343643?pr=1574

stacktrace.txt

Originally posted by @peterjan in https://github.com/SiaFoundation/coreutils/issues/92#issuecomment-2376111458

lukechampine commented 1 month ago

Here's what I have so far:

The Syncer uses a WaitGroup to ensure that all its goroutines exit before Syncer.Close returns. The deadlock happens because wg.Wait is not returning. This could be caused by calling wg.Add more times than wg.Done, but that doesn't seem to be the case; all occurrences of wg.Add are immediately followed by a deferred wg.Done. This suggests that one of the goroutines simply isn't returning.
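For anyone following along, here is a minimal sketch of that pattern. It is not the coreutils code: `worker`, `spawn`, `closeWithin`, and `shutsDown` are hypothetical names made up for illustration, and the timeout exists only so the demo terminates. It shows that a balanced Add/Done pair is not enough; a single goroutine that never returns keeps wg.Wait blocked.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// worker is a hypothetical stand-in for the Syncer, not the real type.
type worker struct {
	wg   sync.WaitGroup
	quit chan struct{}
}

func newWorker() *worker {
	return &worker{quit: make(chan struct{})}
}

// spawn pairs wg.Add with a deferred wg.Done, so the counter is
// balanced as soon as fn returns.
func (w *worker) spawn(fn func(quit <-chan struct{})) {
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		fn(w.quit)
	}()
}

// closeWithin signals shutdown and reports whether every goroutine
// exited before the timeout. A plain Close would call wg.Wait directly
// and hang forever in the failing case.
func (w *worker) closeWithin(timeout time.Duration) bool {
	close(w.quit)
	done := make(chan struct{})
	go func() {
		w.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

// shutsDown starts one goroutine and reports whether shutdown completes.
func shutsDown(respectsQuit bool) bool {
	w := newWorker()
	stuck := make(chan struct{}) // never closed
	w.spawn(func(quit <-chan struct{}) {
		if respectsQuit {
			<-quit
			return
		}
		<-stuck // ignores quit, so the deferred wg.Done never runs
	})
	return w.closeWithin(200 * time.Millisecond)
}

func main() {
	fmt.Println("well-behaved goroutine exits:", shutsDown(true))
	fmt.Println("blocked goroutine exits:", shutsDown(false))
}
```

In the second case the Add/Done calls are still correctly paired; the wait hangs only because the goroutine body never returns.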

Looking at the stack trace, we can see that there is indeed one outstanding goroutine: 573687, which accepts an incoming connection and calls runPeer. This goroutine appears to be blocked on the mux's AcceptStream method. That's strange, because when mux.Close is called, AcceptStream should wake up, observe m.err != nil, and exit. Another strange thing is that the stack trace does not say that this goroutine has been blocked for very long. Maybe this is just an inconsistency in the trace; alternatively, it could mean that the goroutine is being woken repeatedly, but isn't exiting. This in turn would imply that m.err == nil, which would be very strange.
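To make the expected wake-up behavior concrete, here is a simplified model of a condition-variable accept loop. This is an assumption about the general shape only, not the actual mux implementation: `acceptor`, `pending`, `push`, and `errClosed` are hypothetical names. In a loop like this, a waiter that keeps waking without returning can only mean that nothing is pending and the error is still nil.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errClosed = errors.New("closed")

// acceptor is a hypothetical, simplified model; the real mux type differs.
type acceptor struct {
	mu      sync.Mutex
	cond    *sync.Cond
	err     error
	pending []int // IDs of streams waiting to be accepted
}

func newAcceptor() *acceptor {
	a := &acceptor{}
	a.cond = sync.NewCond(&a.mu)
	return a
}

// accept blocks until a stream is pending or the acceptor is closed.
// Every wake-up re-checks both conditions before going back to sleep.
func (a *acceptor) accept() (int, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for len(a.pending) == 0 && a.err == nil {
		a.cond.Wait()
	}
	if a.err != nil {
		return 0, a.err
	}
	id := a.pending[0]
	a.pending = a.pending[1:]
	return id, nil
}

// push queues a stream ID and wakes any waiters.
func (a *acceptor) push(id int) {
	a.mu.Lock()
	a.pending = append(a.pending, id)
	a.mu.Unlock()
	a.cond.Broadcast()
}

// close records the error and wakes all waiters so they can exit.
func (a *acceptor) close() {
	a.mu.Lock()
	if a.err == nil {
		a.err = errClosed
	}
	a.mu.Unlock()
	a.cond.Broadcast()
}

// acceptAfterClose runs accept in a goroutine, closes the acceptor,
// and returns the error that accept reported.
func acceptAfterClose() error {
	a := newAcceptor()
	result := make(chan error, 1)
	go func() {
		_, err := a.accept()
		result <- err
	}()
	a.close()
	return <-result
}

func main() {
	fmt.Println("accept after close:", acceptAfterClose())
}
```

Under this model, close always unblocks accept, regardless of whether the waiter was already asleep when close ran. That is why a goroutine still parked in AcceptStream after mux.Close looks so odd.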

I think the next step is to add some debug logic to the mux package, and use it for renterd's CI until we trigger the bug again. I'll get started on that.

lukechampine commented 1 month ago

Actually, first things first: renterd is using mux@v1.2.0, which is pretty outdated. Let's update it to v1.3.0 and see if the bug recurs.

ChrisSchinnerl commented 1 month ago

> Actually, first things first: renterd is using mux@v1.2.0, which is pretty outdated. Let's update it to v1.3.0 and see if the bug recurs.

I believe that this particular test run already uses v1.3.0 since we updated the dependency on the dev branch 6 days ago and the PR was created 4 days ago.

lukechampine commented 1 month ago

I think not lol

created by go.sia.tech/mux/v2.newMux in goroutine 573687
    /home/runner/go/pkg/mod/go.sia.tech/mux@v1.2.0/v2/mux.go:377 +0x426

peterjan commented 1 month ago

Just for the sake of collecting more datapoints

https://github.com/SiaFoundation/renterd/actions/runs/11127300472/job/30919534124?pr=1596 stacktrace2.txt

lukechampine commented 1 month ago

ok, so it's reproduced on v1.3.0. Good to know.

peterjan commented 6 days ago

Looks like it's happening once in a blue moon though, encountered another:

https://github.com/SiaFoundation/renterd/actions/runs/11612827367/job/32338279894?pr=1643 stacktrace.txt