fails-components / webtransport

Http/3 webtransport support for node
Other
132 stars 20 forks source link

Cannot dial webtransport-go server #258

Closed achingbrain closed 4 months ago

achingbrain commented 4 months ago

I'm trying to dial a very simple echo server using webtransport-go but the opening handshake always times out so the session never becomes ready.

I've put a small repro case together here.

Running npm start will:

  1. Start a webtransport-go echo server as a separate process
  2. Dial the server using the client from @fails-components/webtransport.
  3. Starts a bidirectional stream
  4. Write one buffer
  5. Close the writable
  6. Read from the readable until it closes.

npm run browser starts the webtransport-go server but then spits out some JS you can paste into a browser to run the client portion of the test.

This works in Chrome but in node it fails with a handshake timeout after a short delay.

% npm start 

> webtransport-dial-go-server@1.0.0 start
> node index.js

SERVER start
SERVER ready
CLIENT create session https://127.0.0.1:12345/echo
CLIENT wait for session
WebTransportError: Opening handshake failed.
    at Timeout._onTimeout (file:///Users/alex/Documents/Workspaces/achingbrain/webtransport-go-server/node_modules/@fails-components/webtransport/lib/client.js:91:11)
    at listOnTimeout (node:internal/timers:573:17)
    at process.processTimers (node:internal/timers:514:7) {
  [Symbol(Symbol.toStringTag)]: 'WebTransportError'
}
martenrichter commented 4 months ago

It will take some until I can look into it. You may want to put some debug code into the socket classes on the js side. Then one can see, if at least packets are being sent or received, to sort out if the ports or firewall issues occur.

achingbrain commented 4 months ago

This appears to be a regression as dialling the go-webtransport server works as expected with @fails-components/webtransport@0.3.1.

I'm pretty sure there are no port of firewall issues as the only change to make to get it to work is to change the @fails-components/webtransport version in package.json, remove node_modules and reinstall deps.

martenrichter commented 4 months ago

Sure, this is correct. Let me explain the change. Before 1.0 libquiche used its own sockets, it now uses node.js sockets. So this code is completely replaced by standard node.js code. So the regression may happen in this completely rewritten code. That is why I am asking to put some debugging code in serversocket.js, clientsocket.js etc. Since this is the new code. (Btw. I hoped, that this replacement solves more issues then introducing new issues.)

martenrichter commented 4 months ago

Probably, the problems are caused by these lines: https://github.com/fails-components/webtransport/blob/36b5569b599795d7afeec773351443d47acde35e/transports/http3-quiche/lib/clientsocket.js#L31

achingbrain commented 4 months ago

I've made the following changes:

diff --git a/transports/http3-quiche/lib/clientsocket.js b/transports/http3-quiche/lib/clientsocket.js
index 5bfa22d..a2c0836 100644
--- a/transports/http3-quiche/lib/clientsocket.js
+++ b/transports/http3-quiche/lib/clientsocket.js
@@ -49,6 +49,7 @@ export class Http3WebTransportClientSocket extends Http3WebTransportSocket {
           //
         })
         this.socketInt.on('message', (msg, rinfo) => {
+          console.info('clientsocket incoming', msg, rinfo)
           this.cobj.recvPaket({
             msg,
             rinfo,
@@ -56,6 +57,16 @@ export class Http3WebTransportClientSocket extends Http3WebTransportSocket {
           })
         })
         this.socketInt.bind(this.localPort)
+
+        const send = this.socketInt.send
+
+        // @ts-expect-error incorrect types
+        this.socketInt.send = (...args) => {
+          console.info('clientsocket outgoing', ...args)
+          // @ts-expect-error incorrect types
+          return send.call(this.socketInt, ...args)
+        }
+
         setImmediate(() => this.cobj.onCanWrite())
       })
       .catch((error) => {

If I run the example I get:

% DEBUG=websocket*, DEBUG_TRACE=* npm start

> webtransport-dial-go-server@1.0.0 start
> node index.js

SERVER start
SERVER ready
CLIENT create session https://127.0.0.1:12345/echo
CLIENT wait for session
clientsocket outgoing <Buffer c3 00 00 00 01 08 fa ab e7 73 c0 3f f0 29 00 00 44 d0 75 9d c7 63 76 4f f6 63 ba e6 04 0c 25 0f 01 68 4a 4a ec fd 5e 7d 04 40 71 15 b5 1e ec 7f 80 91 ... 1200 more bytes> 0 1250 12345 127.0.0.1 [Function: bound packetSendCB]
clientsocket incoming <Buffer ca 00 00 00 01 00 04 cd d1 77 00 00 42 3e 63 17 bc e2 1f 6e f4 8a b6 18 a6 6b f0 c1 82 7f 95 ba 28 09 c9 4b 2b b1 8c bb d9 40 1a a9 37 ec 28 ce 3f 47 ... 1202 more bytes> { address: '127.0.0.1', family: 'IPv4', port: 12345, size: 1252 }
clientsocket outgoing <Buffer ee 00 00 00 01 04 cd d1 77 00 00 40 3d 35 d1 0b 09 cc a8 79 84 b7 6e c9 b0 05 82 7b e6 9f 1e 57 18 9d ad 80 8e 02 9b 6b 1e 80 0b 62 50 87 f4 ab 24 4b ... 114 more bytes> 0 164 12345 127.0.0.1 [Function: bound packetSendCB]
clientsocket incoming <Buffer 47 3d 8f fd 7f 4b 4b b2 5e 45 b2 e8 13 c5 37 8e 05 c1 f1 ef 2e 62 39 78 20 9c be a1 c2 d6 5b> { address: '127.0.0.1', family: 'IPv4', port: 12345, size: 31 }
clientsocket incoming <Buffer 43 5a b8 ba 5d bd 9b 12 e4 b4 c0 0c 12 e2 cd fe e1 a9 f6 35 bf 29 cb 43 37 db ee 8c ad 8a 0d 91 f7 f7 41 17 98 43 30 46 d6 ae 2d 80 cc 06 18 ae 4b ad ... 246 more bytes> { address: '127.0.0.1', family: 'IPv4', port: 12345, size: 296 }
clientsocket outgoing <Buffer 4c cd d1 77 00 7c 69 e6 fe b2 8c 33 04 c9 d4 8d 4e d7 85 d6 f6 41 1a af 46 88 fd> 0 27 12345 127.0.0.1 [Function: bound packetSendCB]
WebTransportError: Opening handshake failed.
    at Timeout._onTimeout (file:///Users/alex/Documents/Workspaces/achingbrain/webtransport-go-server/node_modules/@fails-components/webtransport/lib/client.js:91:11)
    at listOnTimeout (node:internal/timers:573:17)
    at process.processTimers (node:internal/timers:514:7) {
  [Symbol(Symbol.toStringTag)]: 'WebTransportError'
}
martenrichter commented 4 months ago

Ok, that means at least network communication works. Then, it must be something within the handshake process. I can debug it at the weekend (Friday is already a holiday). Anyway, there is another possibility: quiche dropped an older WebTransport version; if I had a quick guess, I would look there. You should also consider increasing the quiche logging level; it is now possible to see the C++ logging by passing an argument to create the first object that uses libquiche.

martenrichter commented 4 months ago

Another comment, if you want to see what is going on: https://github.com/fails-components/webtransport/blob/d5205d8d416c2a3b490a2769f61b764f2f4de794/transports/http3-quiche/src/http3client.cc#L510 and the following lines handle the different stages of connecting. So with printfs you can analyze, where it is stuck.

achingbrain commented 4 months ago

I think maybe the bug is that Http3Client::handleConnecting() returns recheck as true to get the client to recheck the session, but in Http3ClientJS::recvPaket the return is negated when set as client_->connectionrecheck_?

That is, after the first packet is received, the recheck variable is set here.

Then after the second packet is received client_->connectionrecheck_ is false so client_->handleConnecting() is not called again.

If I remove the negation it seems to work?

martenrichter commented 4 months ago

Yes, this can be... But if it is wrong, it must also be changed in other places. Are the unit tests all passing regardless of whether it is changed? Then, the packets will probably arrive so fast in unit tests that they do not have to be rechecked. It makes sense, as this is the part I had to rewrite.

achingbrain commented 4 months ago

They all still pass if I remove the negation here but some fail if I also remove the negation here.

martenrichter commented 4 months ago

Ok, then I have to look into it. You nailed it down, but the connection process will probably starve somewhere if some fail. I believe I did not return the correct recheck value for some cases. (You may add a debug message before recheck is set to determine in which situation it breaks.) Anyway, the connection process is seriously broken.

achingbrain commented 4 months ago

I have to agree, I can't make it work in anything but the most trivial demo app so I think there are certainly some timing bugs in there.

For example my ultimate goal here is to integrate @fails-compononents/webtransport with this js-libp2p PR and even after removing the first negation, it's possible for the onReady callback of HttpWTSession to never be invoked, which means the visitor isn't being set so I think it's possible for the connection to get into a state where that branch is missed.

martenrichter commented 4 months ago

I am away from my dev machine. But https://github.com/fails-components/webtransport/blob/d5205d8d416c2a3b490a2769f61b764f2f4de794/transports/http3-quiche/src/http3client.cc#L535 maybe the problem. I believe it must be a true.

achingbrain commented 4 months ago

That seems to make the tests pass with both negations removed.

Does this need to be false too:

https://github.com/fails-components/webtransport/blob/d5205d8d416c2a3b490a2769f61b764f2f4de794/transports/http3-quiche/src/http3client.cc#L572-L575

achingbrain commented 4 months ago

I do wonder if the checkSession method needs a bit of attention, it's possible for it to return true even if the wtsession has not had it's visitor set, which seems to be the reason the onReady is never called in js-land.

Edit: I'm not sure this is true now, I can see the visitor being set but onReady is still not called.

martenrichter commented 4 months ago

That seems to make the tests pass with both negations removed.

Does this need to be false too:

https://github.com/fails-components/webtransport/blob/d5205d8d416c2a3b490a2769f61b764f2f4de794/transports/http3-quiche/src/http3client.cc#L572-L575

No I think this is correct, if we assume that checkSession returns a value for rechecking.

I do wonder if the checkSession method needs a bit of attention, it's possible for it to return true even if the wtsession has not had it's visitor set, which seems to be the reason the onReady is never called in js-land.

That, I agree. We probably need another bool indicating whether a session has been acquired. In the moment, it is only rechecking, if there are other pending streams, but this may change.

martenrichter commented 4 months ago

I am starting to work on the problem. First, without your example, I have recognized that if I only run the bidirectional streams, they also fail, as you said. Btw. so you know a way to influence the order of the tests, if the bidi streams always run first, I would have recognized the problems earlier.

martenrichter commented 4 months ago

Here are my attempts. I haven't tested the Go server yet, but I found another issue: The add-on was not loaded quickly enough. https://github.com/fails-components/webtransport/tree/fix_http3_client_connect

martenrichter commented 4 months ago

Ok, the changes discussed seem to work. But before merging, it needs to be tested in my real-world application, as the fixed timings in the test do not uncover all problems. One thing I recognized is that the output shows the following:

SERVER start
SERVER ready
CLIENT create session https://127.0.0.1:12345/echo
CLIENT wait for session
CLIENT session ready
CLIENT create bidi stream
CLIENT get writer
CLIENT wait for writer
CLIENT write
CLIENT close writer
CLIENT read from stream
CLIENT got from stream { value: Uint8Array(4) [ 0, 1, 2, 3 ], done: false }
CLIENT read from stream

So, the fin is not received. I suspect the echo server is the source of the problem, but I will recheck if a fin is sent and/or received. (These echo applications are not trivial; both Firefox and Chrome had bugs that I had to file and patch, but for go I am afraid I have no motivation).

martenrichter commented 4 months ago

Also, the real-world application seems to work. merging next...

martenrichter commented 4 months ago

And regarding the fin, it does not seem to be reported by the go code.

achingbrain commented 4 months ago

And regarding the fin, it does not seem to be reported by the go code.

Good news on that front - it was user error - the go server needs to explicitly close streams rather than io.Copy doing it implicitly as I'd thought.

Thanks for fixing this up.

martenrichter commented 4 months ago

Ok, that is good.... so no work to do here.