FrauElster opened 2 months ago
Hey, thanks for the PR. I do think we should take a slightly different approach, though. Instead of piggy-backing on `closeMutex` in `writeClose` and potentially introducing deadlocks due to multiple consumers of the mutex, we move the logic to `writeFrame` instead. Then we can rely on the already existing `writeFrameMu` and set a boolean for the close when `opcode == opClose`.
This has another nice benefit of catching the close message wherever it may originate from.
Edit: Looks like CI failed with a test timeout so this did indeed introduce a deadlock: https://github.com/coder/websocket/actions/runs/10717638649/job/29717749114?pr=476#step:4:97
Sure, great idea. I just had this bug and needed it fixed, and in my experience fixing it and opening a PR is the fastest way. I did not read or understand the whole codebase; I just stepped through it with my debugger and implemented this quick and dirty solution.
Just for clarification: when you say "we", do you mean you or me? Clearly you already have the solution in mind, but if you are busy I can also do it.
As suggested, I moved the writeSent read/write into `writeFrame`. However, I felt bad about failing silently if a double Close occurred, and it is not really a `net.ErrClosed` since the connection could still be open waiting for the close response, so I introduced a new `ErrAlreadyClosed`, but currently it is not handled anywhere.
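For readers following along, here is a minimal, self-contained sketch of that idea. The real `writeFrame` does far more (and `writeFrameMu` in the library is a context-aware mutex, not a plain `sync.Mutex`); the `closeSent` field and the error text below are placeholders, not the library's actual code. The point is only that the close gate lives behind the write mutex, so a second close attempt returns `ErrAlreadyClosed` instead of writing a duplicate frame.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrAlreadyClosed mirrors the new sentinel error described above.
var ErrAlreadyClosed = errors.New("websocket: close frame already sent")

type opcode int

const opClose opcode = 8

// conn is a stripped-down stand-in for the library's *Conn, keeping only the
// pieces needed to illustrate gating the close frame inside writeFrame.
type conn struct {
	writeFrameMu sync.Mutex
	closeSent    bool // guarded by writeFrameMu
}

func (c *conn) writeFrame(op opcode, payload []byte) error {
	c.writeFrameMu.Lock()
	defer c.writeFrameMu.Unlock()

	if op == opClose {
		if c.closeSent {
			// A close frame was already written; surface that instead of
			// silently sending a second one.
			return ErrAlreadyClosed
		}
		c.closeSent = true
	}

	// ... actual frame serialization and write would happen here ...
	_ = payload
	return nil
}

func main() {
	c := &conn{}
	fmt.Println(c.writeFrame(opClose, nil)) // <nil>
	fmt.Println(c.writeFrame(opClose, nil)) // websocket: close frame already sent
}
```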
Very interesting. I came to this PR because I am debugging our tests. In our tests we measure how many bytes we send over a websocket (using this library for both client and server in Go) and how many we get back, for a simple exchange: sending the text message `hi` from the client and then sending `hi` back from the server to the client.

With v1.8.10, before closing was rewritten, we had `{"wsFromClient":16,"wsToClient":8}` bytes transmitted. With v1.8.12, we saw a change to `{"wsFromClient":16,"wsToClient":12}`, which I understood to mean that the increase in `wsToClient` is because the server is now properly closing. But then I noticed that it is not always like that and that sometimes it is `{"wsFromClient":24,"wsToClient":12}`, because it seems the client sends close twice. So I got to this PR, tried it, and with it I am back to `{"wsFromClient":16,"wsToClient":8}`, always. Not sure why not `{"wsFromClient":16,"wsToClient":12}`.
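A back-of-envelope check of those numbers, assuming standard RFC 6455 framing: a client-to-server frame costs 2 header bytes plus a 4-byte masking key plus the payload, so a 2-byte `hi` and a close frame carrying a 2-byte status code are 8 bytes each, giving `wsFromClient` 16, and a duplicate close adds another 8 for the 24 seen above. Server-to-client frames are unmasked (2 header bytes plus payload), so `hi` plus a bare status-code close is 4 + 4 = 8 bytes; the 12-byte case would correspond to the server's close payload being about 4 bytes larger, e.g. a short reason string alongside the status code.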
Nice to hear that it makes things more predictable. :)
@mafredri Yeah, a test would be great. Sadly I am on a business trip until next Monday, so until then this PR would be stalled, unless you wouldn't mind picking it up yourself.
So I had a moment to have a look and wrote several approaches, but I have to admit that I find it kind of hard to test. I am not familiar enough with the code, so I'll describe some assumptions here.
My general idea is to open a server and a client connection, close one, and count the number of close messages received on each end; both counts are asserted to be one.
Assumption 1: The problem is that the exposed methods are high-level and manage the closing handshake themselves without returning it to the caller (e.g. `conn.Read()`), so I cannot count the close messages.
Assumption 2: A connection ignores any messages received after the connection was closed, so they just get discarded and I cannot count more than one received close message. (This behaviour differs from, e.g., Chrome's WebSocket client.)
I am tempted to add a private callback on the connection options, but introducing private dependency injection just to be able to test this seems like overkill and introduces too much complexity for the benefit (IMHO).
A second attempt would be to write a separate minimal websocket client/server that can count frames, but that seems like a lot of work for such a simple test.
I saw that there are some helpers like EchoServer; one could introduce something similar that counts opcodes, but this would also need a separate implementation since it is not in the websocket package. A rough sketch of such a counting helper is below.
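To make that idea concrete, here is a rough, self-contained sketch of such a counter. It assumes you have already captured the raw unmasked server-to-client byte stream (e.g. by teeing the underlying net.Conn in the test server, which is not shown), and the function name is made up for illustration:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// countCloseFrames walks a captured WebSocket byte stream and counts frames
// whose opcode is 0x8 (close). It stops counting if the stream ends
// mid-frame.
func countCloseFrames(stream []byte) int {
	count := 0
	for len(stream) >= 2 {
		opcode := stream[0] & 0x0f
		masked := stream[1]&0x80 != 0
		payloadLen := uint64(stream[1] & 0x7f)
		headerLen := 2

		// Extended payload lengths per RFC 6455.
		switch payloadLen {
		case 126:
			if len(stream) < headerLen+2 {
				return count
			}
			payloadLen = uint64(binary.BigEndian.Uint16(stream[headerLen:]))
			headerLen += 2
		case 127:
			if len(stream) < headerLen+8 {
				return count
			}
			payloadLen = binary.BigEndian.Uint64(stream[headerLen:])
			headerLen += 8
		}
		if masked {
			headerLen += 4 // skip the masking key on client-to-server frames
		}

		if opcode == 0x8 {
			count++
		}

		frameLen := uint64(headerLen) + payloadLen
		if uint64(len(stream)) < frameLen {
			return count
		}
		stream = stream[frameLen:]
	}
	return count
}

func main() {
	// A single unmasked close frame with a 2-byte status code (1000).
	closeFrame := []byte{0x88, 0x02, 0x03, 0xe8}
	fmt.Println(countCloseFrames(closeFrame)) // 1
}
```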
For now I am using my fork (thanks to `go mod replace`, that's pretty easy) to fix the problems I am currently experiencing in production.
Looking forward to getting some feedback on the testing issue.
Another note: your last commit makes my test (`go test -count=10 ./...`) give me the "conn_test.go:623: failed to close WebSocket: failed to read frame header: EOF" again (sometimes). So the race condition seems to be back in the current state.
> Another note: your last commit makes my test (`go test -count=10 ./...`) give me the "conn_test.go:623: failed to close WebSocket: failed to read frame header: EOF" again (sometimes). So the race condition seems to be back in the current state.
Thanks @FrauElster, I can reproduce this. I took a look and I'm pretty sure that the fix simply surfaced a new issue. Calling `(*Conn).CloseRead(ctx)` seems to interfere with the close handshake: it hijacks the reader so that the close handshake isn't propagated during an actual close. This in turn closes the connection early, so that `(*Conn).Close()` runs into `io.EOF`. (Parts of this could be wrong, I'm still investigating exactly what is happening.)
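For context, the typical write-only pattern with `CloseRead` looks roughly like the sketch below (assuming the `github.com/coder/websocket` import path; the URL and error handling are placeholders). The background goroutine started by `CloseRead` owns the reader, so it is the one that ends up consuming the peer's close frame, which is why it can get in the way of the close handshake described above.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coder/websocket"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	// ws://localhost:8080 is a placeholder endpoint.
	c, _, err := websocket.Dial(ctx, "ws://localhost:8080", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close(websocket.StatusInternalError, "deferred close")

	// CloseRead starts a background reader; the returned context is
	// cancelled once the connection is closed. Because that goroutine owns
	// the reader, it is also what sees the peer's close frame.
	ctx = c.CloseRead(ctx)

	if err := c.Write(ctx, websocket.MessageText, []byte("hi")); err != nil {
		log.Fatal(err)
	}

	c.Close(websocket.StatusNormalClosure, "")
}
```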
I appreciate you taking a look and thinking about ways to test for this. If you want, though, I can take over, as I want to look more closely at the way we close connections.
PS. Are you using `CloseRead(ctx)` yourself? And have you tried this branch as it is right now as a fix for the problems you're experiencing? If we disregard the failing test, have you been able to show that this fix doesn't work in your use case?
No, I don't use `CloseRead(ctx)` in my application. I use the simple

```go
msgType, data, err := conn.Read(ctx)
if websocket.CloseStatus(err) != -1 { ... }
```

or

```go
conn.Close(websocket.StatusInternalError, string(reason))
```
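Spelled out a bit more, that read pattern looks roughly like this (a sketch assuming the `github.com/coder/websocket` import path; the helper name `readLoop` is made up for illustration):

```go
package main

import (
	"context"
	"log"

	"github.com/coder/websocket"
)

// readLoop reads until an error occurs, then checks whether the error was a
// close frame via websocket.CloseStatus (which returns -1 for other errors).
func readLoop(ctx context.Context, conn *websocket.Conn) {
	for {
		msgType, data, err := conn.Read(ctx)
		if err != nil {
			if status := websocket.CloseStatus(err); status != -1 {
				log.Printf("peer closed the connection with status %v", status)
			} else {
				log.Printf("read error: %v", err)
			}
			return
		}
		log.Printf("received %v message: %q", msgType, data)
	}
}
```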
@FrauElster I pushed a fix for the `failed to close WebSocket: failed to read frame header: EOF` issue. Let me know if you want me to open a separate PR, since I'm pushing to your fork (I don't mind, but don't want to cause conflicts for you).
@mitar I'm curious whether c3613fc causes any change in the behavior you're seeing? I have noticed that there are still cases where the `writeTimeout` and `readTimeout` contexts aren't cleared and could potentially cause an abrupt closure of the connection. I'm not saying that's what's happening to you, just a potential cause.
Sure, go ahead. My production go.mod is pinned to a commit, so no worries. ;)
Looks like you need to run `go fmt` and commit the result.
Closes #448