getlantern / browsersunbounded

Interoperable browser-based P2P proxies for censorship circumvention
GNU General Public License v3.0

Egress server misbehavior when deployed #70

Closed: noahlevenson closed this issue 1 year ago

noahlevenson commented 1 year ago

After deploying on Google Cloud Run, I observed a complex of [possibly] related bugs that are irreproducible when hosting locally:

  1. Leaking WebSocket connections, or just failing to decrement the number of WebSocket connections
  2. Strange interleaved log messages, even when min and max containers are set to 1
  3. Intermittent periods when clients will continuously report QUIC dial failure

Is this specific to Google Cloud Run? A good first step might be to deploy the egress server on a very simple, un-abstracted VM and see if we can repro the misbehavior.

noahlevenson commented 1 year ago

I just reproduced some weirdness. It matches behavior I saw during my 1:1 testing session with @anacrolix, but this is the first time I've actually pieced together the sequence of events:

  1. I start tailing the logs for the egress server. Bizarrely, the egress server appears to be in a pristine state, with no dangling connections. I load the web widget and toggle it on, and the egress server reports 5 WebSocket connections.

  2. I start the desktop CLI client via ./desktop. My desktop client negotiates 5 parallel connections with my web widget. I see normal activity in the egress server logs.

  3. I curl this big file from Hetzner: curl --proxy http://127.0.0.1:1080 https://speed.hetzner.de/1GB.bin --output /dev/null

  4. All is looking good! The web widget reports about 3 MB/sec throughput, and it's pretty steady.

  5. After receiving ~785/1000 MB, the curl totally stops. The widget throughput drops to 0. I look at the logs, and peculiar things have happened:

In summary, this is what seems to be happening: in the middle of the curl, for reasons unclear, the widget's active WebSocket connection to the egress server experiences a catastrophic error. The widget detects the connection breakage, but the egress server fails to detect it. The widget dials up a new WebSocket connection, which is exactly what it's supposed to do, but now there's a dangling WebSocket connection at the egress server. A few moments later, QUIC's own connection state management kicks in, and both the egress server and the desktop client recognize that they no longer have an e2e connection; the egress server cleans up the QUIC connection as it's supposed to, and the desktop client dials a new QUIC connection as it's supposed to. At this point, however, the curl cannot continue -- which is totally expected -- because we have no Turbo Tunnel-like management of state across ephemeral connections.

I wonder if the strange WebSocket breakage is due to reliable-over-reliable weirdness that's only manifesting now because, with the egress server deployed in the cloud, we're experiencing normal packet loss for the first time.
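
For what it's worth, here's a minimal sketch of how an egress-side liveness check could catch the kind of breakage the server misses above. It assumes the egress server uses nhooyr.io/websocket (consistent with the compression discussion below); the interval, timeout, and onDead hook are made up for illustration, not our actual code:

```go
// Minimal sketch (not the project's code): one way a server could notice a
// dead WebSocket peer even when no close frame ever arrives, using
// nhooyr.io/websocket's Ping.
package egress

import (
	"context"
	"time"

	"nhooyr.io/websocket"
)

func watchLiveness(ctx context.Context, c *websocket.Conn, onDead func()) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Ping blocks until the pong arrives or the context expires.
			pingCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
			err := c.Ping(pingCtx)
			cancel()
			if err != nil {
				// The peer is gone: let the caller decrement its connection
				// count and tear down any QUIC state riding on this socket.
				onDead()
				return
			}
		}
	}
}
```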

egress server:

2023-01-12 00:03:05 2023/01/12 00:03:05 [003] INFO: Copied 213 bytes to client error=<nil>
2023-01-12 00:03:05 2023/01/12 00:03:05 [002] INFO: Received response 200 OK
2023-01-12 00:03:05 2023/01/12 00:03:05 [002] INFO: Copying response to client 200 OK [200]
2023-01-12 00:03:05 2023/01/12 00:03:05 [002] INFO: Copied 213 bytes to client error=<nil>
2023-01-12 00:03:04 2023/01/12 00:03:04 WebSocket connection #4 accepted a new QUIC connection!
2023-01-12 00:03:04 2023/01/12 00:03:04 [002] INFO: Got request /user-data api.getiantem.org GET http://api.getiantem.org/user-data?locale=en-US&timeout=10
2023-01-12 00:03:04 2023/01/12 00:03:04 HTTP proxy just saw a request:
2023-01-12 00:03:04 2023/01/12 00:03:04 [003] INFO: Got request /user-data api.getiantem.org GET http://api.getiantem.org/user-data?locale=en-US&timeout=10
2023-01-12 00:03:04 2023/01/12 00:03:04 HTTP proxy just saw a request:
2023-01-12 00:03:48 2023/01/12 00:03:48 WebSocket connection #1 QUIC listener error (failed to get reader: received close frame: status = StatusNormalClosure and reason = "failed to read: context canceled"), closing!
2023-01-12 00:03:48 2023/01/12 00:03:48 Closed a WebSocket connection! (4 total)
2023-01-12 00:03:48 2023/01/12 00:03:48 WebSocket connection #3 QUIC listener error (failed to get reader: received close frame: status = StatusNormalClosure and reason = "failed to read: context canceled"), closing!
2023-01-12 00:03:48 2023/01/12 00:03:48 Closed a WebSocket connection! (3 total)
2023-01-12 00:03:48 2023/01/12 00:03:48 WebSocket connection #2 QUIC listener error (failed to get reader: received close frame: status = StatusNormalClosure and reason = "failed to read: context canceled"), closing!
2023-01-12 00:03:48 2023/01/12 00:03:48 Closed a WebSocket connection! (2 total)
2023-01-12 00:03:48 2023/01/12 00:03:48 WebSocket connection #4 QUIC listener error (failed to get reader: received close frame: status = StatusNormalClosure and reason = "failed to read: context canceled"), closing!
2023-01-12 00:03:48 2023/01/12 00:03:48 Closed a WebSocket connection! (1 total)
2023-01-12 00:03:48 2023/01/12 00:03:48 WebSocket connection #4 stream error (INTERNAL_ERROR (local): failed to get reader: received close frame: status = StatusNormalClosure and reason = "failed to read: context canceled"), closing QUIC connection!
2023-01-12 00:03:48 2023/01/12 00:03:48 WebSocket connection #5 QUIC listener error (failed to get reader: received close frame: status = StatusNormalClosure and reason = "failed to read: context canceled"), closing!
2023-01-12 00:03:48 2023/01/12 00:03:48 Closed a WebSocket connection! (0 total)
2023-01-12 00:03:48 GET 101 https://broflake-egress-xdy27ofj3a-ue.a.run.app/ws
2023-01-12 00:03:48 GET 101 https://broflake-egress-xdy27ofj3a-ue.a.run.app/ws
2023-01-12 00:03:48 GET 101 https://broflake-egress-xdy27ofj3a-ue.a.run.app/ws
2023-01-12 00:03:48 GET 101 https://broflake-egress-xdy27ofj3a-ue.a.run.app/ws
2023-01-12 00:03:48 GET 101 https://broflake-egress-xdy27ofj3a-ue.a.run.app/ws
2023-01-12 00:03:55 2023/01/12 00:03:55 Accepted a new WebSocket connection! (1 total)
2023-01-12 00:03:55 2023/01/12 00:03:55 Accepted a new WebSocket connection! (2 total)
2023-01-12 00:03:55 2023/01/12 00:03:55 Accepted a new WebSocket connection! (3 total)
2023-01-12 00:03:55 2023/01/12 00:03:55 Accepted a new WebSocket connection! (4 total)
2023-01-12 00:03:55 2023/01/12 00:03:55 Accepted a new WebSocket connection! (5 total)
2023-01-12 00:04:56 2023/01/12 00:04:56 WebSocket connection #5 accepted a new QUIC connection!
2023-01-12 00:05:19 2023/01/12 00:05:19 [004] INFO: Running 0 CONNECT handlers
2023-01-12 00:05:20 2023/01/12 00:05:20 [004] INFO: Accepting CONNECT to speed.hetzner.de:443
2023-01-12 00:09:33 2023/01/12 00:09:33 Accepted a new WebSocket connection! (6 total)
2023-01-12 00:09:49 2023/01/12 00:09:49 [004] WARN: Error copying to client: timeout: no recent network activity
2023-01-12 00:09:49 2023/01/12 00:09:49 [004] WARN: Error copying to client: readfrom tcp [fddf:3978:feb1:d745::c001]:20811->[2a01:4f8:0:59ed::2]:443: timeout: no recent network activity
2023-01-12 00:09:49 2023/01/12 00:09:49 WebSocket connection #5 stream error (timeout: no recent network activity), closing QUIC connection!
2023-01-12 00:09:49 2023/01/12 00:09:49 WebSocket connection #6 accepted a new QUIC connection!

desktop client:

2023/01/11 19:04:50 Too late for genesis message 5412c68c-8063-4b09-bba6-ca8696854296!
2023/01/11 19:04:50 Consumer state 1...
2023/01/11 19:04:50 Too late for genesis message 5412c68c-8063-4b09-bba6-ca8696854296!
2023/01/11 19:04:50 Consumer state 1...
2023/01/11 19:04:50 Too late for genesis message 5412c68c-8063-4b09-bba6-ca8696854296!
2023/01/11 19:04:50 Consumer state 1...
2023/01/11 19:04:50 Too late for genesis message 5412c68c-8063-4b09-bba6-ca8696854296!
2023/01/11 19:04:50 Consumer state 1...
2023/01/11 19:04:51 Consumer state 2...
2023/01/11 19:04:51 Consumer state 2...
2023/01/11 19:04:51 Consumer state 2...
2023/01/11 19:04:51 Consumer state 2...
2023/01/11 19:04:51 Ice gathering complete!
2023/01/11 19:04:51 Consumer state 3...
2023/01/11 19:04:51 Too late for genesis message 0664a237-b829-49dc-a78f-a4eb25706440!
2023/01/11 19:04:51 Consumer state 1...
2023/01/11 19:04:51 Too late for genesis message 0664a237-b829-49dc-a78f-a4eb25706440!
2023/01/11 19:04:51 Consumer state 1...
2023/01/11 19:04:51 Too late for genesis message 0664a237-b829-49dc-a78f-a4eb25706440!
2023/01/11 19:04:51 Consumer state 1...
2023/01/11 19:04:51 Consumer state 2...
2023/01/11 19:04:51 Consumer state 2...
2023/01/11 19:04:51 Consumer state 2...
2023/01/11 19:04:51 Too late for genesis message 9555c653-8260-424a-9aa9-46ec7f72b41f!
2023/01/11 19:04:51 Consumer state 1...
2023/01/11 19:04:51 Ice gathering complete!
2023/01/11 19:04:51 Consumer state 3...
2023/01/11 19:04:51 Ice gathering complete!
2023/01/11 19:04:51 Consumer state 3...
2023/01/11 19:04:51 Ice gathering complete!
2023/01/11 19:04:51 Consumer state 3...
2023/01/11 19:04:52 Peer connection state change: connected
2023/01/11 19:04:52 A datachannel has opened!
2023/01/11 19:04:52 Peer connection state change: connected
2023/01/11 19:04:52 A datachannel has opened!
2023/01/11 19:04:52 Peer connection state change: connected
2023/01/11 19:04:52 A datachannel has opened!
2023/01/11 19:04:52 Peer connection state change: connected
2023/01/11 19:04:52 A datachannel has opened!
2023/01/11 19:04:53 QUIC dial failed (timeout: no recent network activity), retrying...
2023/01/11 19:04:56 Consumer state 4, signaling complete!
2023/01/11 19:04:56 A WebRTC connection has been established!
2023/01/11 19:04:56 Consumer state 2...
2023/01/11 19:04:56 QUIC connection established, ready to proxy!
2023/01/11 19:04:56 Consumer state 4, signaling complete!
2023/01/11 19:04:56 A WebRTC connection has been established!
2023/01/11 19:04:56 Consumer state 4, signaling complete!
2023/01/11 19:04:56 A WebRTC connection has been established!
2023/01/11 19:04:56 Consumer state 4, signaling complete!
2023/01/11 19:04:56 A WebRTC connection has been established!
2023/01/11 19:04:56 Ice gathering complete!
2023/01/11 19:04:56 Consumer state 3...
2023/01/11 19:04:57 Peer connection state change: connected
2023/01/11 19:04:57 A datachannel has opened!
2023/01/11 19:05:01 Consumer state 4, signaling complete!
2023/01/11 19:05:01 A WebRTC connection has been established!
2023/01/11 19:05:19 [001] INFO: Running 0 CONNECT handlers
2023/01/11 19:05:20 [001] INFO: Accepting CONNECT to speed.hetzner.de:443
2023/01/11 19:09:49 QUIC connection error (timeout: no recent network activity), closing!
2023/01/11 19:09:49 [001] WARN: Error copying to client: readfrom tcp 127.0.0.1:1080->127.0.0.1:54634: timeout: no recent network activity
2023/01/11 19:09:49 QUIC connection established, ready to proxy!

widget:

2023/01/11 19:04:51 Producer state 4, signaling complete!
main.js:2 2023/01/11 19:04:51 Peer connection state change: connecting
main.js:2 2023/01/11 19:04:51 Producer state 4, signaling complete!
main.js:2 2023/01/11 19:04:52 Peer connection state change: connected
main.js:2 2023/01/11 19:04:52 Created new datachannel...
main.js:2 2023/01/11 19:04:52 A datachannel has opened!
main.js:2 2023/01/11 19:04:52 A WebRTC connection has been established!
main.js:2 2023/01/11 19:04:52 Producer state 5...
main.js:2 2023/01/11 19:04:52 Peer connection state change: connected
main.js:2 2023/01/11 19:04:52 Created new datachannel...
main.js:2 2023/01/11 19:04:52 A datachannel has opened!
main.js:2 2023/01/11 19:04:52 A WebRTC connection has been established!
main.js:2 2023/01/11 19:04:52 Producer state 5...
main.js:2 2023/01/11 19:04:52 Peer connection state change: connected
main.js:2 2023/01/11 19:04:52 Created new datachannel...
main.js:2 2023/01/11 19:04:52 A datachannel has opened!
main.js:2 2023/01/11 19:04:52 A WebRTC connection has been established!
main.js:2 2023/01/11 19:04:52 Producer state 5...
main.js:2 2023/01/11 19:04:52 Peer connection state change: connected
main.js:2 2023/01/11 19:04:52 Created new datachannel...
main.js:2 2023/01/11 19:04:52 A datachannel has opened!
main.js:2 2023/01/11 19:04:52 A WebRTC connection has been established!
main.js:2 2023/01/11 19:04:52 Producer state 5...
main.js:2 2023/01/11 19:04:56 No answer for genesis message!
main.js:2 2023/01/11 19:04:56 Producer state 1...
main.js:2 2023/01/11 19:04:56 Producer state 2...
main.js:2 2023/01/11 19:04:56 Producer state 3...
main.js:2 2023/01/11 19:04:56 Ice gathering complete!
main.js:2 2023/01/11 19:04:56 Peer connection state change: connecting
main.js:2 2023/01/11 19:04:56 Producer state 4, signaling complete!
main.js:2 2023/01/11 19:04:57 Peer connection state change: connected
main.js:2 2023/01/11 19:04:57 Created new datachannel...
main.js:2 2023/01/11 19:04:57 A datachannel has opened!
main.js:2 2023/01/11 19:04:57 A WebRTC connection has been established!
main.js:2 2023/01/11 19:04:57 Producer state 5...
main.js:2 2023/01/11 19:09:33 Egress consumer WebSocket read error: failed to read: WebSocket closed: unclean connection close: status = StatusAbnormalClosure and reason = ""
main.js:2 2023/01/11 19:09:33 Egress consumer state 0, opening WebSocket connection...
main.js:2 2023/01/11 19:09:33 Egress consumer state 1, WebSocket connection established!
noahlevenson commented 1 year ago

I've been trying to reproduce the WebSocket issue while running a native binary widget, and I can't seem to repro it!

(I hit that big curl ~15 times and it always completes flawlessly)

Since our WebSocket implementation comes from https://github.com/nhooyr/websocket when using the native binary but comes from the browser when running the web widget, a logical hypothesis is that something about Chrome's WebSocket implementation is disagreeing with us.

It's dawned on me that we disabled WebSocket compression at the egress server to deal with https://github.com/getlantern/broflake/issues/45

Might be related!
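
For reference, disabling compression on the Accept side with nhooyr.io/websocket looks roughly like this. This is a sketch, not our actual handler -- wsHandler and the hand-off comment are placeholders, while CompressionDisabled is the library's real option:

```go
// Sketch of disabling permessage-deflate on the server side with
// nhooyr.io/websocket.
package egress

import (
	"net/http"

	"nhooyr.io/websocket"
)

func wsHandler(w http.ResponseWriter, r *http.Request) {
	c, err := websocket.Accept(w, r, &websocket.AcceptOptions{
		CompressionMode: websocket.CompressionDisabled,
	})
	if err != nil {
		return
	}
	defer c.Close(websocket.StatusInternalError, "handler exited")

	// ...hand the connection off to the QUIC/proxy layer here...
}
```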

noahlevenson commented 1 year ago

Couple updates:

  1. I pushed ~30 GB through the widget -- both the native binary and the web wasm -- and I wasn't able to repro the bug. I never saw a dangling connection all day, and I never had a curl fail. I started to think: "Is this issue related to some consistency funkiness with Google Cloud Run? Maybe we were just running instanced containers for a little while even though the Cloud Run console told us we weren't..." Then I tried something different: I proxied my Firefox traffic through the web widget, which I knew would create many QUIC streams at the egress server and a mix of HTTP (not HTTPS) requests. Within a few moments, this reproduced the bug. We should take a look at two things: HTTP (not HTTPS) requests, which have historically caused weird issues, and what happens when we run out of QUIC streams (currently capped at 2^16, I think; see the sketch after this list).

  2. The browser's WebSocket API doesn't even let you fiddle with compression AFAICT, so I'm skeptical that disabling compression at the egress server is consequential here.
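
Re: the stream cap in (1), a minimal sketch of where that limit would be configured on the listening side, assuming quic-go (import path per the current upstream); listenQUIC, packetConn, and tlsConf are placeholders rather than this repo's actual names:

```go
// Sketch: the per-connection stream caps live in quic.Config, so this is
// where to confirm what the egress server actually sets.
package egress

import (
	"crypto/tls"
	"net"
	"time"

	"github.com/quic-go/quic-go"
)

func listenQUIC(packetConn net.PacketConn, tlsConf *tls.Config) error {
	cfg := &quic.Config{
		MaxIncomingStreams:    1 << 16, // concurrent bidirectional streams per QUIC connection
		MaxIncomingUniStreams: 1 << 16, // concurrent unidirectional streams per QUIC connection
		MaxIdleTimeout:        30 * time.Second,
	}

	ln, err := quic.Listen(packetConn, tlsConf, cfg)
	if err != nil {
		return err
	}
	defer ln.Close()

	// ...accept loop over ln.Accept(ctx) would go here...
	return nil
}
```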

anacrolix commented 1 year ago

I still think either the WebRTC connection to the widget or the WebSocket connection from the widget to the egress server is failing, and that failure is not propagating to the QUIC connection sitting across those.

Given that the WebSocket-in-the-browser-via-WASM abstraction is the most dubious, that seems like the most likely culprit. I wonder what would happen if, the next time a curl through a widget I'm running hangs, I killed the tab with the widget.

noahlevenson commented 1 year ago

I think I may have fixed it here: de0927f9d1127931ed48cfb7c0e7feb66f300c55

Today I was able to make the egress server leak WebSocket connections simply by loading a widget in the browser and toggling the switch back and forth very rapidly. This was very illuminating!

I did a bit of code inspection and reasoned that if a WebSocket connection failed during the handshake, we might not have handled the error and released the resource correctly. There was an easy refactor to correct that.
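
To illustrate the shape of that bug (a sketch with hypothetical names, not the actual refactor): once the connection count has been bumped, every error path after it has to release the slot, or a client that dies mid-handshake leaks a WebSocket connection.

```go
// Illustrative sketch of the failure mode described above; handleWS and the
// "hello" step are hypothetical.
package egress

import (
	"net/http"
	"sync/atomic"

	"nhooyr.io/websocket"
)

var connections atomic.Int64

func handleWS(w http.ResponseWriter, r *http.Request) {
	c, err := websocket.Accept(w, r, nil)
	if err != nil {
		return // nothing acquired yet, nothing to release
	}

	connections.Add(1)
	defer connections.Add(-1) // fires on every exit path, including handshake errors

	// Hypothetical post-accept step that can fail, e.g. an application-level hello.
	if _, _, err := c.Read(r.Context()); err != nil {
		c.Close(websocket.StatusPolicyViolation, "handshake failed")
		return // the deferred decrement still runs
	}

	// ...hand the connection off to the QUIC listener here...
	c.Close(websocket.StatusNormalClosure, "")
}
```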

While I was in there, I also simplified the code paths and fixed something else that may have been consequential: when we wrapped each websocket.Conn to make it behave as a net.PacketConn, we were not assigning it a suitably unique address. If that does indeed matter, it's the sort of trouble that's unlikely to manifest in the trivially small networks of local testing, but which would reveal itself during the chaos of real-world group testing (which is exactly what happened).
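
To make that second point concrete, here's a hypothetical sketch of the kind of wrapper in question and where the unique address matters; wsPacketConn, wsAddr, and newWSPacketConn are made-up names, not the repo's actual types:

```go
// A net.PacketConn wrapper over a WebSocket needs a distinct address per
// underlying connection, or two peers can look identical to the QUIC layer.
package egress

import (
	"net"

	"github.com/google/uuid"
	"nhooyr.io/websocket"
)

type wsAddr string

func (a wsAddr) Network() string { return "ws" }
func (a wsAddr) String() string  { return string(a) }

type wsPacketConn struct {
	ws   *websocket.Conn
	addr wsAddr // unique per accepted connection, e.g. a fresh UUID
	// ReadFrom, WriteTo, Close, and deadline methods elided.
}

func (c *wsPacketConn) LocalAddr() net.Addr { return c.addr }

func newWSPacketConn(ws *websocket.Conn) *wsPacketConn {
	return &wsPacketConn{ws: ws, addr: wsAddr("ws-" + uuid.NewString())}
}
```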

Gonna monitor the server for a while and see how things go...

noahlevenson commented 1 year ago

Looking pretty good so far. I haven't been able to make the server leak any WebSocket connections, and it seems to be cleaning up every dangling QUIC stream too.

noahlevenson commented 1 year ago

Closing for now, but we may revisit after observing server behavior at scale...