ChrisSchinnerl opened 1 month ago:
Here's what I have so far:
The Syncer uses a `WaitGroup` to ensure that all of its goroutines exit before `Syncer.Close` returns. The deadlock happens because `wg.Wait` is not returning. This could be caused by calling `wg.Add` more times than `wg.Done`, but that doesn't seem to be the case; all occurrences of `wg.Add` are immediately followed by a deferred `wg.Done`. This suggests that one of the goroutines simply isn't returning.
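To make that invariant concrete, here is a minimal sketch of the pattern described above. It is not the actual Syncer code (the type and function names are placeholders); it just illustrates why a single non-returning goroutine is enough to hang `Close`:

```go
package main

import (
	"fmt"
	"sync"
)

// syncer is a stand-in for the real Syncer; only the WaitGroup pattern matters here.
type syncer struct {
	wg sync.WaitGroup
}

// runPeer stands in for the per-connection goroutine body.
func (s *syncer) runPeer(id int) {
	fmt.Println("peer done:", id)
}

// acceptLoop pairs every wg.Add with a deferred wg.Done in the spawned goroutine.
func (s *syncer) acceptLoop() {
	for i := 0; i < 3; i++ {
		s.wg.Add(1)
		go func(id int) {
			defer s.wg.Done()
			s.runPeer(id)
		}(i)
	}
}

// Close blocks until every goroutine started via Add has called Done.
// If any runPeer call never returns, Close deadlocks in wg.Wait.
func (s *syncer) Close() {
	s.wg.Wait()
}

func main() {
	var s syncer
	s.acceptLoop()
	s.Close()
	fmt.Println("all goroutines exited")
}
```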
Looking at the stack trace, we can see that there is indeed one outstanding goroutine: 573687, which accepts an incoming connection and calls `runPeer`. This goroutine appears to be blocked on the mux's `AcceptStream` method. That's strange, because when `mux.Close` is called, `AcceptStream` should wake up, observe `m.err != nil`, and exit. Another strange thing is that the stack trace does not say that this goroutine has been blocked for very long. Maybe this is just an inconsistency in the trace; alternatively, it could mean that the goroutine is being woken repeatedly, but isn't exiting. This in turn would imply that `m.err == nil`, which would be very strange.
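For context, this is my rough mental model of why `AcceptStream` should return after `Close`. It is a hypothetical sketch, not the mux package's actual implementation: the point is that an accept loop built around a condition variable only exits once it is woken *and* observes `m.err != nil`, so being woken repeatedly while `m.err == nil` would leave it blocked forever:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type stream struct{}

type mux struct {
	mu      sync.Mutex
	cond    *sync.Cond
	err     error
	streams []*stream
}

func newMux() *mux {
	m := &mux{}
	m.cond = sync.NewCond(&m.mu)
	return m
}

// AcceptStream blocks until a stream is available or m.err is set.
func (m *mux) AcceptStream() (*stream, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for len(m.streams) == 0 && m.err == nil {
		// If the goroutine is woken here while m.err is still nil and no
		// stream has arrived, it simply goes back to waiting.
		m.cond.Wait()
	}
	if m.err != nil {
		return nil, m.err
	}
	s := m.streams[0]
	m.streams = m.streams[1:]
	return s, nil
}

// Close sets m.err and wakes all waiters, which should unblock AcceptStream.
func (m *mux) Close() error {
	m.mu.Lock()
	m.err = errors.New("mux closed")
	m.mu.Unlock()
	m.cond.Broadcast()
	return nil
}

func main() {
	m := newMux()
	go m.Close()
	_, err := m.AcceptStream()
	fmt.Println(err) // "mux closed"
}
```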
I think the next step is to add some debug logic to the `mux` package, and use it for `renterd`'s CI until we trigger the bug again. I'll get started on that.
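One possible shape for that debug logic, assuming an accept loop like the hypothetical sketch above (none of these identifiers come from the real `mux` package): count the wakeups that make no progress and log them, so a hung CI run would show whether `AcceptStream` is spinning or never being woken at all.

```go
package main

import (
	"errors"
	"log"
	"sync"
	"time"
)

var errClosed = errors.New("mux closed")

type debugMux struct {
	mu      sync.Mutex
	cond    *sync.Cond
	err     error
	streams []struct{}
}

// waitForStream is an instrumented version of the accept loop's wait: it logs
// every wakeup that finds neither a stream nor an error, which is exactly the
// pathological case hypothesized above.
func (m *debugMux) waitForStream() error {
	m.mu.Lock()
	defer m.mu.Unlock()
	spurious := 0
	for len(m.streams) == 0 && m.err == nil {
		m.cond.Wait()
		if len(m.streams) == 0 && m.err == nil {
			spurious++
			log.Printf("mux debug: woken %d time(s) with err=nil and no streams", spurious)
		}
	}
	return m.err
}

func main() {
	m := &debugMux{}
	m.cond = sync.NewCond(&m.mu)

	// Simulate a few wakeups without progress, then a proper close.
	go func() {
		for i := 0; i < 3; i++ {
			time.Sleep(10 * time.Millisecond)
			m.cond.Broadcast()
		}
		m.mu.Lock()
		m.err = errClosed
		m.mu.Unlock()
		m.cond.Broadcast()
	}()

	log.Println("waitForStream returned:", m.waitForStream())
}
```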
Actually, first things first: `renterd` is using `mux@v1.2.0`, which is pretty outdated. Let's update it to `v1.3.0` and see if the bug recurs.
I believe that this particular test run already uses `v1.3.0`, since we updated the dependency on the `dev` branch 6 days ago and the PR was created 4 days ago.
I think not lol
```
created by go.sia.tech/mux/v2.newMux in goroutine 573687
	/home/runner/go/pkg/mod/go.sia.tech/mux@v1.2.0/v2/mux.go:377 +0x426
```
Just for the sake of collecting more data points:
https://github.com/SiaFoundation/renterd/actions/runs/11127300472/job/30919534124?pr=1596 stacktrace2.txt
OK, so it's reproduced on `v1.3.0`. Good to know.
Looks like it's only happening once in a blue moon though; encountered another one:
https://github.com/SiaFoundation/renterd/actions/runs/11612827367/job/32338279894?pr=1643 stacktrace.txt
A random deadlock that prevents our syncer from shutting down.
https://github.com/SiaFoundation/renterd/actions/runs/11030847304/job/30636343643?pr=1574
stacktrace.txt
Originally posted by @peterjan in https://github.com/SiaFoundation/coreutils/issues/92#issuecomment-2376111458