marekmartins opened 7 months ago
Oh no :(
Can you post a goroutine trace of the leaked goroutines?
Maybe You can give me some pointers on the easiest way to do it, and in a form most useful for You? Then I'll try to look into it more over the next few days.
If you send your program a SIGQUIT, it should dump the stacks of all active goroutines. That would help a lot.
If you cannot do that, you can use delve to attach to a running Go program and get a trace of all goroutines. See https://github.com/go-delve/delve/issues/1475
I think net/http/pprof also supports a debug endpoint that prints a trace of all goroutines (see the sketch below).
These are all equally useful options; please pick whichever is most convenient for you and get back to me as soon as possible.
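For the net/http/pprof route, a minimal setup might look something like this (just a sketch; the localhost:6060 address and the separate debug server are my own choices, not something from your setup):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a separate, localhost-only port so they
	// are never exposed publicly. /debug/pprof/goroutine?debug=2 then gives
	// a full plaintext dump of every goroutine stack.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the application runs as usual ...
	select {}
}
```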
Also just to note we do ensure there are no hanging goroutines after running all tests. So whatever it is, it's something not coming up in the tests.
How do you know it's actually a goroutine leak? It could just be some memory being held somewhere indefinitely.
Actually, I don't know how that could happen either. I definitely need a trace or further info to debug this.
Thanks. I'll try to look into this.
I know this from the Prometheus/Grafana metrics we track. Each night the connection count drops from 50k to roughly 2k, and the number of connection-related goroutines and the associated memory drop with it.
With the newest patch version that drop was not there like it should be; the graph was pretty much flatlined.
EDIT
I deployed it to our test environment, but the load there is so low that it takes a while to get anything meaningful out of it.
I will profile the application there in the next few days, take a goroutine profile using pprof, and share the file with You.
I let it run in test for a little while and I already see a pattern: the goroutine count tends to go up. I profiled it at a moment when there were 8 connections open. We track this via Prometheus metrics and it's very precise.
I added part of the goroutine profile here. I think that for some reason the timeoutLoop goroutine does not always exit when the connection is actually closed, or it somehow gets stuck. The profile shows 102 of them, which seems way off compared to the 8 connections reported by our other metrics. Also, this number did not decrease overnight.
I'll let it run longer and see if that number keeps going up. On Monday I'll have newer numbers.
What kind of goroutine profile is that? What do the percentages mean?
It would help tons if you could get the stacks of those timeoutLoop goroutines. I looked over the code and it isn't clear how it could be getting stuck.
I'm using the net/http/pprof package; for that I have HTTP handlers that give back the different profiles.
This concrete one is the /debug/pprof/goroutine profile.
https://pkg.go.dev/net/http/pprof
I took a new profile and now it shows 123 with no actual connections open.
So it's steadily growing in the test environment.
I added the latest prof file as a zip. Unzip it and check it via this command:
go tool pprof -http=:8080 test.cmh1-3.goroutine.prof
Then You can check it out in the browser. I can also send snapshots of the other profile types like heap, mutex, etc., everything the pprof package supports.
Let me know 🙏
Also, next week I will debug timeoutLoop and try to see if I can reproduce the situation where those goroutines do not exit.
It's confusing, but there are multiple goroutine profiles. Can you go to /debug/pprof/ in a browser and then click "full goroutine stack dump"? That's the one I need, and it's just plaintext. The link should be /debug/pprof/goroutine?debug=2.
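If reaching that endpoint is awkward, the same plaintext dump can also be produced programmatically via runtime/pprof (a sketch; debug level 2 is what produces the full stacks):

```go
package main

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes a plaintext dump of every goroutine stack to stderr,
// equivalent to what /debug/pprof/goroutine?debug=2 returns over HTTP.
func dumpGoroutines() {
	_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}

func main() {
	dumpGoroutines()
}
```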
One possible scenario in which I can see this behavior arising is if you're not calling c.Close/c.CloseNow on every connection after you're done with it. Previously, if an error occurred on the connection, such as a read/write error, it would be automatically closed and cleaned up for you, but this latest version removed that in favor of always requiring c.Close/c.CloseNow.
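As an illustration, a handler under this model might look roughly like the sketch below (my own echo-style example, assuming the nhooyr.io/websocket Accept/Read/Write/CloseNow API; the /ws route and the 5-second write timeout are arbitrary choices):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"nhooyr.io/websocket"
)

func wsHandler(w http.ResponseWriter, r *http.Request) {
	c, err := websocket.Accept(w, r, nil)
	if err != nil {
		return
	}
	// Always release the connection's resources, even when a read or write
	// returns an error; the library no longer closes the connection for you.
	defer c.CloseNow()

	ctx := r.Context()
	for {
		typ, msg, err := c.Read(ctx)
		if err != nil {
			// The client went away or the read failed; the deferred
			// CloseNow still runs and cleans everything up.
			return
		}
		writeCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		err = c.Write(writeCtx, typ, msg)
		cancel()
		if err != nil {
			return
		}
	}
}

func main() {
	http.HandleFunc("/ws", wsHandler)
	log.Fatal(http.ListenAndServe("localhost:8080", nil))
}
```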
Though the docs have always read:
Be sure to call Close on the connection when you are finished with it to release associated resources.
So this behavior was never meant to be relied on. Though I can see how that's confusing given the next line:
On any error from any method, the connection is closed with an appropriate reason.
I'll remove this line from the docs.
The quoted docs are from: https://godocs.io/nhooyr.io/websocket#Conn
You're right. If the connection was closed from the client side and the main read loop ended because of that, then on the server side we did not explicitly close it, since without that it still worked and released the resources on its own.
If that's the case, then sorry for the trouble 😞. I will add the explicit Close for this case next week as well and let You know here whether it resolved the issue.
Also, since I'm maybe not the only one, and the previous behavior was not intentional but was still there, it would be wise to mention this in the release notes as well? To me it seems like an important behavioral change to mention 😅
Awesome! Yup, will add it to the release notes, thanks @marekmartins.
Letting You know that adding the explicit Close seems to have solved the problem for us! Thank You :) Feel free to close this issue.
Awesome, good to hear!
Problem
In our production systems we upgraded to the latest patch version and noticed excessive memory usage; it seemed to be a goroutine leak as far as I could tell. Memory went up, but when the connection count went down, memory stayed at roughly the same level.
Probably it was not cleaned up properly on close etc., but I did not dig any further; for now we rolled back to version v1.8.10 and all is back to normal.
Our use case: we have roughly 50k websockets open at peak time during the day, 500 messages pushed per second, 30 new websockets opened per second, and 30 old websockets closed per second.
Also, for each websocket connection we run a PING-PONGer in a separate goroutine to prevent the websocket from being closed for clients under some circumstances (see the sketch below).
Maybe it's some edge case in our usage, but to us it seems like a regression was introduced. If I can help any more, let me know 🙏
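For context, the per-connection pinger is roughly this shape (a sketch only; keepAlive and the 30-second interval are my own names and choices, assuming Conn.Ping from nhooyr.io/websocket):

```go
package wsutil

import (
	"context"
	"time"

	"nhooyr.io/websocket"
)

// keepAlive pings the peer on a fixed interval until the context is
// cancelled or a ping fails (for example because the connection was closed).
// Typically started as: go keepAlive(ctx, c, 30*time.Second)
func keepAlive(ctx context.Context, c *websocket.Conn, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if err := c.Ping(ctx); err != nil {
				return
			}
		}
	}
}
```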