getlantern / browsersunbounded

Interoperable browser-based P2P proxies for censorship circumvention
GNU General Public License v3.0
14 stars 0 forks source link

Egress server: Heroku reports H15 errors / 503 responses #181

Open noahlevenson opened 1 year ago

noahlevenson commented 1 year ago

@oxtoacart pointed out that the egress server metrics indicate a bunch of 5XX responses for requests. This correlates with the H15 errors reported in the events.

If you tail the logs for the server, you can observe these things happening from time to time. It looks like this:

2023-08-22T23:43:31.603801+00:00 heroku[router]: at=error code=H15 desc="Idle connection" method=GET path="/ws" host=bf-egress.herokuapp.com request_id=403c23
3f-37c6-4371-ae53-253143b2e767 fwd="68.81.173.37" dyno=web.1 connect=0ms service=100175ms status=503 bytes= protocol=https                                    
2023-08-22T23:43:31.604881+00:00 app[web.1]: 2023/08/22 23:43:31 WebSocket connection 952f97a1-d237-4c45-b4fc-b9fdb3a5876e QUIC listener error (failed to get 
reader: failed to read frame header: EOF), closing!                                                                                                           
2023-08-22T23:43:31.604896+00:00 app[web.1]: 2023/08/22 23:43:31 Closed a WebSocket connection! (44 total)  

After reading about H15 errors, you might think that Heroku is just deciding that some of our WebSocket connections are idle and closing them down. But we implemented a keepalive to prevent this from happening, and at the time, it seems to have worked.

I thought: Is the keepalive now failing? It was set for 30s, so I reduced it to 15s and redeployed. But this didn't solve the problem!

noahlevenson commented 1 year ago

Given the content and order of the log messages when the error is reported, there are two possible interpretations of causality here:

  1. Heroku decides a WebSocket is idle and closes the connection, and then the server detects that the connection no longer exists and cleans it up.

  2. The server detects a dead connection (due to a communications partner who left abruptly) and closes the WebSocket, and then Heroku incorrectly interprets this as an idle condition, reporting an error for no reason.

This SO thread suggests it might be no. 2:

https://stackoverflow.com/questions/32728030/heroku-h15-error-on-web-socket-close

noahlevenson commented 1 year ago

Another clue: If you pay attention to the timestamps in the log messages, it seems that the messages actually print out of order, which obfuscates causality. In other words, if you match up the log message from the egress server indicating WebSocket closure with the log message from Heroku reporting the error, it's actually the egress server message with the earlier timestamp -- even though the Heroku message reliably appears first.

The timestamp reversal does not occur when looking at messages resulting from a client-initiated healthy closure. That is, when a client toggles the widget off, Heroku emits a log message reporting the outcome of the HTTP GET, and then the server emits a log message reporting the death of the WebSocket.

I've only looked at 10 or so message pairings -- but if these invariants hold, it suggests the following idea:

When a client disconnects in a healthy way, Heroku is the first to know about it (which makes sense), and so Heroku reports the outcome of the request first.

When a client disconnects in an unhealthy way, leaving the connection in a half open state, the server is the first to know about it, and so the server formally closes the request. Only then does Heroku see that the request has terminated, and so it reports the outcome -- and for some reason, it reports it as an H15 error.

This suggests that, in the case of the reported error, it's not Heroku closing our WebSockets and our server reacting to it -- rather, it's our server behaving correctly, and Heroku just interpreting things wrong.