marekmartins opened 7 months ago
Oh no :(
Can you post a goroutine trace of the leaked goroutines?
Maybe You can give me some pointers on the easiest way to do it, and in a form most useful for You? Then I'll try to look into it more over the next few days.
If you send your program a SIGQUIT, it should dump the stacks of all active goroutines. That would help a lot.
If you cannot do that, you can use delve to attach to a running Go program and get a trace of all goroutines. See https://github.com/go-delve/delve/issues/1475
I think net/http/pprof also supports a debug endpoint that prints a trace of all goroutines (see the sketch below).
These are all equally useful options; please pick whichever is most convenient for you and get back to me as soon as possible.
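For the net/http/pprof route, a minimal setup might look something like this (just a sketch; the localhost:6060 address and the separate debug server are my own choices, not something from your setup):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a separate, localhost-only port so they
	// are never exposed publicly. /debug/pprof/goroutine?debug=2 then gives
	// a full plaintext dump of every goroutine stack.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the application runs as usual ...
	select {}
}
```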
Also just to note we do ensure there are no hanging goroutines after running all tests. So whatever it is, it's something not coming up in the tests.
How do you know it's actually a goroutine leak? It could just be some memory being held somewhere indefinitely.
Actually, I don't know how that could happen either. I definitely need a trace or further info to debug this.
Thanks. I'll try to look into this.
I know this from the Prometheus/Grafana metrics we track. Each night the connection count drops from 50k to roughly 2k, and the number of connection-related goroutines and the associated memory drop with it.
With the newest patch version that drop was not there like it should be; the graph was pretty much flatlined.
EDIT
I deployed it to our test environment, but the load there is so low that it takes a while to get anything meaningful out of it.
I will profile the application there in the next few days, take a goroutine profile using pprof, and share the file with You.
I let it run in test for a little while and I already see a pattern: the goroutine count tends to go up. I profiled it at a moment when there were 8 connections open. We track this via Prometheus metrics and it's very precise.
I added part of the goroutine profile here. I think that for some reason the timeoutLoop goroutine does not always exit when the connection is actually closed, or it somehow gets stuck. The profile shows 102 of them, which seems way off compared to the 8 connections reported by our other metrics. Also, this number did not decrease overnight.
I'll let it run longer and see if that number keeps going up. On Monday I'll have newer numbers.
What kind of goroutine profile is that? What do the percentages mean?
It would help tons if you could get the stacks of those timeoutLoop goroutines. I looked over the code and it isn't clear how it could be getting stuck.
I'm using the net/http/pprof package; for that I have HTTP handlers that give back the different profiles.
This concrete one is the /debug/pprof/goroutine profile.
https://pkg.go.dev/net/http/pprof
I took a new profile and now it shows 123 with no actual connections open.
So it's steadily growing in the test environment.
I added the latest prof file as a zip. Unzip it and check it via this command:
go tool pprof -http=:8080 test.cmh1-3.goroutine.prof
Then You can check it out in the browser. I can also send snapshots of the other profile types like heap, mutex, etc., everything the pprof package supports.
Let me know 🙏
Also, next week I will debug timeoutLoop and try to see if I can reproduce the situation where those goroutines do not exit.
It's confusing, but there are multiple goroutine profiles. Can you go to /debug/pprof/ in a browser and then click "full goroutine stack dump"? That's the one I need, and it's just plaintext. The link should be /debug/pprof/goroutine?debug=2.
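If reaching that endpoint is awkward, the same plaintext dump can also be produced programmatically via runtime/pprof (a sketch; debug level 2 is what produces the full stacks):

```go
package main

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes a plaintext dump of every goroutine stack to stderr,
// equivalent to what /debug/pprof/goroutine?debug=2 returns over HTTP.
func dumpGoroutines() {
	_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}

func main() {
	dumpGoroutines()
}
```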
One possible scenario in which I can see this behavior arising is if you're not calling c.Close/c.CloseNow on every connection after you're done with it. Previously, if an error occurred on the connection, such as a read/write error, it would be automatically closed and cleaned up for you, but this latest version removed that in favor of always requiring c.Close/c.CloseNow.
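As an illustration, a handler under this model might look roughly like the sketch below (my own echo-style example, assuming the nhooyr.io/websocket Accept/Read/Write/CloseNow API; the /ws route and the 5-second write timeout are arbitrary choices):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"nhooyr.io/websocket"
)

func wsHandler(w http.ResponseWriter, r *http.Request) {
	c, err := websocket.Accept(w, r, nil)
	if err != nil {
		return
	}
	// Always release the connection's resources, even when a read or write
	// returns an error; the library no longer closes the connection for you.
	defer c.CloseNow()

	ctx := r.Context()
	for {
		typ, msg, err := c.Read(ctx)
		if err != nil {
			// The client went away or the read failed; the deferred
			// CloseNow still runs and cleans everything up.
			return
		}
		writeCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		err = c.Write(writeCtx, typ, msg)
		cancel()
		if err != nil {
			return
		}
	}
}

func main() {
	http.HandleFunc("/ws", wsHandler)
	log.Fatal(http.ListenAndServe("localhost:8080", nil))
}
```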
Though the docs have always read:
Be sure to call Close on the connection when you are finished with it to release associated resources.
So this behavior was never meant to be relied on. Though I can see how that's confusing given the next line:
On any error from any method, the connection is closed with an appropriate reason.
I'll remove this line from the docs.
The quoted docs are from: https://godocs.io/nhooyr.io/websocket#Conn
You're right. If the connection was closed from the client side and the main read loop ended because of that, then on the server side we did not explicitly close it, since without that it still worked and released the resources on its own.
If that's the case, then sorry for the trouble 😞. I will add the explicit Close for this case next week as well and let You know here whether it resolved the issue.
Also, since I'm maybe not the only one, and the previous behavior was not intentional but was still there, it would be wise to mention this in the release notes as well? To me it seems like an important behavioral change to mention 😅
Awesome! Yup, will add it to the release notes, thanks @marekmartins.
Letting You know that adding the explicit Close seems to have solved the problem for us! Thank You :) Feel free to close this issue.
Awesome, good to hear!
Problem
In our production systems we upgraded to the latest patch version and noticed excessive memory usage; it seemed to be a goroutine leak as far as I could tell. Memory went up, but when the connection count went down, memory stayed at roughly the same level.
Probably it was not cleaned up properly on close etc., but I did not dig any further; for now we rolled back to version v1.8.10 and all is back to normal.
Our use case: we have roughly 50k websockets open at peak time during the day, 500 messages pushed per second, 30 new websockets opened per second, and 30 old websockets closed per second.
Also, for each websocket connection we run a PING-PONGer in a separate goroutine to prevent the websocket from being closed for clients under some circumstances (see the sketch below).
Maybe it's some edge case in our usage, but to us it seems like a regression was introduced. If I can help any more, let me know 🙏
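For context, the per-connection pinger is roughly this shape (a sketch only; keepAlive and the 30-second interval are my own names and choices, assuming Conn.Ping from nhooyr.io/websocket):

```go
package wsutil

import (
	"context"
	"time"

	"nhooyr.io/websocket"
)

// keepAlive pings the peer on a fixed interval until the context is
// cancelled or a ping fails (for example because the connection was closed).
// Typically started as: go keepAlive(ctx, c, 30*time.Second)
func keepAlive(ctx context.Context, c *websocket.Conn, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if err := c.Ping(ctx); err != nil {
				return
			}
		}
	}
}
```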