deathandmayhem / jolly-roger

First attempt at starting an audio call fails sometimes #2130

Open · jpd236 opened this issue 4 months ago

jpd236 commented 4 months ago

From the JS console (but probably not helpful since the code is optimized; is it possible to deobfuscate this given the Docker image?):

Uncaught (in promise) AwaitQueueStoppedError: AwaitQueue stopped
    at e.stop (index.js:1:1712)
    at C.close (Transport.js:1:3075)
    at P.debug.direction (76d9292fb3a47a03a0eae3d3561909c11809ddfd.js?meteor_js_resource=true:226:301467)
    at cB (76d9292fb3a47a03a0eae3d3561909c11809ddfd.js?meteor_js_resource=true:70:101972)
    at cM (76d9292fb3a47a03a0eae3d3561909c11809ddfd.js?meteor_js_resource=true:70:103837)
    at f8 (76d9292fb3a47a03a0eae3d3561909c11809ddfd.js?meteor_js_resource=true:70:123356)
    at 76d9292fb3a47a03a0eae3d3561909c11809ddfd.js?meteor_js_resource=true:70:121804
    at eI (76d9292fb3a47a03a0eae3d3561909c11809ddfd.js?meteor_js_resource=true:78:1571)
    at MessagePort.eV (76d9292fb3a47a03a0eae3d3561909c11809ddfd.js?meteor_js_resource=true:78:2154)

Doesn't seem to happen on subsequent attempts - I've only seen it the first time I've tried launching a call after starting the server. It says it'll connect automatically, but it seems to be stuck permanently.

This could be related to my pending single instance changes - I'll need to try to reproduce with the standard network + NLB setup to make sure that isn't it.

ebroder commented 4 months ago

We do ship source maps for JavaScript, which should result in Chrome decoding those errors, but I have seen cases where it doesn't work. Googling around a little bit, it seems like uncaught exceptions thrown inside of promises might just not trigger source map decoding? Which is pretty unfortunate.
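
For what it's worth, those "Uncaught (in promise)" reports are promise rejections that nothing awaited or caught; a global handler can at least intercept and log them explicitly (this is roughly the hook error reporters like Bugsnag listen on), though whether devtools then applies source maps to the printed stack is a separate question. A minimal, untested sketch:

    // Intercept promise rejections that nothing awaited or .catch()'d, such as the
    // AwaitQueueStoppedError above, and log them explicitly.
    window.addEventListener("unhandledrejection", (event: PromiseRejectionEvent) => {
      const reason = event.reason;
      console.error("Unhandled promise rejection:", reason);
      if (reason instanceof Error && reason.stack) {
        // The stack is whatever was captured when the Error object was constructed,
        // which is not necessarily where the rejected promise came from.
        console.error(reason.stack);
      }
    });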

There is still enough information that we can start digging. I'm reasonably certain that exception is getting thrown from here: https://github.com/deathandmayhem/jolly-roger/blob/75273d779f7862c61e6d3bed1035a486266901b7/imports/client/hooks/useCallState.ts#L421

(If you want to check my math: AwaitQueue is an abstraction from the mediasoup people that queues promises so they don't run in parallel. Within mediasoup-client, it's only used by the Transport class.) What I don't understand is why that's throwing from within the Transport.close method, since as far as I can tell, AwaitQueue.stop causes other enqueued promises to reject with that exception.
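
To make those semantics concrete, here's a minimal sketch assuming the awaitqueue package's documented API (push() to serialize async tasks, stop() to reject whatever hasn't finished, and an exported AwaitQueueStoppedError class; worth double-checking against the version mediasoup-client actually pins):

    import { AwaitQueue, AwaitQueueStoppedError } from "awaitqueue";

    const queue = new AwaitQueue();

    // Tasks run one at a time; the second waits for the first to finish.
    const first = queue.push(async () => {
      await new Promise((resolve) => setTimeout(resolve, 1000)); // simulate slow work
      return "first done";
    });
    const second = queue.push(async () => "second done");

    // stop() rejects the promise for every task that hasn't completed yet with
    // AwaitQueueStoppedError - the error in the stack trace above.
    queue.stop();

    for (const promise of [first, second]) {
      promise.catch((err) => console.log(err instanceof AwaitQueueStoppedError, err.message));
    }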

FWIW, I don't seem to be able to reproduce this on our instance of JR. I'm skeptical that NLB is somehow protecting us here, but I'm not entirely sure what would be happening instead. Have you verified that coturn is behaving correctly? I left some notes on https://github.com/deathandmayhem/jolly-roger/issues/1970 about how I test our coturn server.

jpd236 commented 4 months ago

Looks to be working per those steps, at least when I have calls that work:

[screenshot]

FWIW there are two turnConfig events in the WS logs - one for direction "recv" and one for direction "send". Not sure if it matters which I use.

In any case I'm also skeptical that the lack of NLB would matter. And the issue only ever seems to come up on the first attempt to connect after starting a new instance (and I don't think it's 100% consistent there, though I definitely saw it a few times in that scenario).

Only other oddity I've noticed is that sometimes, even when it works, I see that yellow warning bar very briefly, though it does connect fast enough that I can't even read the warning on it.

jpd236 commented 4 months ago

I am also seeing rare page load failures in the web UI where some initial load fails with an ERR_INCOMPLETE_CHUNKED_ENCODING error in the console. I wonder if something is just up on the networking side.

jpd236 commented 4 months ago

Just reproduced again on my first attempt of the night (so it doesn't seem like the difference between haproxy 2.9.0 and 2.9.7 was the issue). FWIW, BugSnag agrees with you on the impacted line of code.

On success I typically see a line in haproxy's logs indicating a turn connection. In the failure case I don't see that.

jpd236 commented 4 months ago

I don't know if this is relevant, but looking at the coturn logs, it seems like every successful session starts with a 401 error before succeeding, e.g.:

154269: (15): INFO: IPv4. tcp or tls connected to: 127.0.0.1:55170
154269: (15): INFO: session 000000000000000015: realm <ange.management> user <>: incoming packet message processed, error 401: Unauthorized
154269: (15): INFO: IPv4. Local relay addr: 127.0.0.1:41748
154269: (15): INFO: session 000000000000000015: new, realm=<ange.management>, username=<snip>, lifetime=600
154269: (15): INFO: session 000000000000000015: realm <ange.management> user <snip>: incoming packet ALLOCATE processed, success
154271: (15): INFO: session 000000000000000015: refreshed, realm=<ange.management>, username=<snip>, lifetime=0
154271: (15): INFO: session 000000000000000015: realm <ange.management> user <snip>: incoming packet REFRESH processed, success
154271: (15): INFO: session 000000000000000015: TCP socket closed remotely 127.0.0.1:55170
154271: (15): INFO: session 000000000000000015: usage: realm=<ange.management>, username=<snip>, rp=3, rb=276, sp=3, sb=280
154271: (15): INFO: session 000000000000000015: peer usage: realm=<ange.management>, username=<snip>, rp=0, rb=0, sp=0, sb=0
154271: (15): INFO: session 000000000000000015: closed (2nd stage), user <snip> realm <ange.management> origin <>, local 127.0.0.1:3478, remote 127.0.0.1:55170, reason: TCP connection closed by client (callback)
154271: (15): INFO: session 000000000000000015: delete: realm=<ange.management>, username=<snip>
ebroder commented 4 months ago

I'm not positive but I suspect that's normal - at least with HTTP, it's standard to try requesting without auth and only provide auth when the server demands it, so it makes sense to me that browser behavior around TURN would be similar.
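
For context on where the credentials come in: they're passed up front in the RTCPeerConnection configuration (presumably what the turnConfig events you mentioned carry), and the 401 challenge/response is then handled inside the browser's own TURN client. A sketch of the shape of that config, with placeholder values only:

    // Placeholder server and credentials; jolly-roger fills these in from its TURN config.
    const pc = new RTCPeerConnection({
      iceServers: [
        {
          urls: "turns:turn.example.com:443?transport=tcp",
          username: "ephemeral-username",
          credential: "ephemeral-password",
        },
      ],
      // Forcing relay-only candidates is a handy way to confirm TURN actually works.
      iceTransportPolicy: "relay",
    });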

Is it easy for you to change https://github.com/deathandmayhem/jolly-roger/blob/main/imports/client/hooks/useCallState.ts#L31 to also pass level: "debug"? That should generate a bunch of extra log entries that might at least tell us more about how things are getting sequenced.

jpd236 commented 4 months ago

Captured the following logs. (FWIW the change you described didn't seem to affect console log output or Bugsnag reporting; I had to change imports/logger.ts's level to have any effect. Something unexpected with the logging library?)

configureLogger.ts:52 [useCallState] Clearing Mediasoup device
configureLogger.ts:52 [useCallState] clearing transport {direction: 'send'}
configureLogger.ts:52 [useCallState] clearing transport {direction: 'recv'}
configureLogger.ts:52 [useCallState] producerTracks {tracks: Array(0)}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-peers', selfPeer: undefined, otherPeers: Array(0)}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-router', router: undefined}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-device', device: undefined}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-transport', direction: 'send', transport: undefined}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-transport', direction: 'recv', transport: undefined}
configureLogger.ts:52 [useCallState] producerTracks {tracks: Array(0)}
configureLogger.ts:52 [useCallState] dispatch {type: 'request-capture'}
configureLogger.ts:52 [useCallState] dispatch {type: 'join-call', audioState: {…}}
configureLogger.ts:52 [useCallState] producerTracks {tracks: Array(1)}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-peers', selfPeer: {…}, otherPeers: Array(0)}
configureLogger.ts:52 [useCallState] server set initial peer state {initialPeerState: 'active'}
configureLogger.ts:52 [useCallState] Fetching mediasoup-client code
configureLogger.ts:52 [useCallState] dispatch {type: 'set-router', router: {…}}
configureLogger.ts:52 [useCallState] Creating new Mediasoup device
configureLogger.ts:52 [useCallState] dispatch {type: 'set-device', device: e}
configureLogger.ts:52 [useCallState] subscribe mediasoup:transports {peerId: 'tFkshY49XT2qTnyqn', rtpCaps: '{"codecs":[{"mimeType":"audio/opus","kind":"audio"…preferredEncrypt":false,"direction":"sendrecv"}]}'}
configureLogger.ts:52 [useCallState] clearing transport {direction: 'send'}
configureLogger.ts:52 [useCallState] clearing transport {direction: 'recv'}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-transport', direction: 'send', transport: undefined}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-transport', direction: 'recv', transport: undefined}
configureLogger.ts:52 [useCallState] Creating new Mediasoup transport {transportId: 'e299e359-02ab-42e8-82c5-d3bdf8efcf24', direction: 'send'}
configureLogger.ts:52 [useCallState] setting transport {direction: 'send', newTransport: C}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-transport', direction: 'send', transport: C}
configureLogger.ts:52 [useCallState] producerTracks {tracks: Array(1)}
configureLogger.ts:52 [useCallState] Creating Mediasoup producer {track: 'c0e7e06a-c82d-487d-9be9-5bcac449ee3c'}
configureLogger.ts:52 [useCallState] Creating new Mediasoup transport {transportId: 'e299e359-02ab-42e8-82c5-d3bdf8efcf24', direction: 'send'}
configureLogger.ts:52 [useCallState] setting transport {direction: 'send', newTransport: C}
configureLogger.ts:52 [useCallState] Creating new Mediasoup transport {transportId: '1c02de70-f42c-4328-a1c2-dd35d2f7b372', direction: 'recv'}
configureLogger.ts:52 [useCallState] setting transport {direction: 'recv', newTransport: C}
index.js:1 Uncaught (in promise) AwaitQueueStoppedError: AwaitQueue stopped
    at e.stop (index.js:1:1712)
    at C.close (Transport.js:1:3075)
    at P.debug.direction (useCallState.ts:421:48)
    at cB (is-plain-object.mjs:1:47)
    at cM (is-plain-object.mjs:1:47)
    at f8 (is-plain-object.mjs:1:47)
    at is-plain-object.mjs:1:47
    at eI (is-plain-object.mjs:1:47)
    at MessagePort.eV (is-plain-object.mjs:1:47)
stop @ index.js:1
close @ Transport.js:1
P.debug.direction @ useCallState.ts:421
cB @ is-plain-object.mjs:1
cM @ is-plain-object.mjs:1
f8 @ is-plain-object.mjs:1
(anonymous) @ is-plain-object.mjs:1
eI @ is-plain-object.mjs:1
eV @ is-plain-object.mjs:1
await in eV (async)
(anonymous) @ useCallState.ts:890
(anonymous) @ useCallState.ts:851
cR @ is-plain-object.mjs:1
f8 @ is-plain-object.mjs:1
(anonymous) @ is-plain-object.mjs:1
eI @ is-plain-object.mjs:1
eV @ is-plain-object.mjs:1
configureLogger.ts:52 [useCallState] dispatch {type: 'set-transport', direction: 'send', transport: C}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-transport', direction: 'recv', transport: C}
configureLogger.ts:52 [useCallState] producerTracks {tracks: Array(1)}
jpd236 commented 4 months ago

There was a new release of mediasoup-client a few days ago but it doesn't seem to have fixed this.

I can consistently reproduce the AwaitQueueStoppedError locally - but not the permanent failure state - as follows:

In this case, I see an additional failure, Meteor method call failed: Mediasoup.Transports.methods.connect. I suspect that the 60 second wait causes something to time out, which causes a second try to occur. If I skip the 60 second wait, the breakpoint doesn't trigger a second time, and there are no issues.

The failure doesn't seem to be fatal in this case, as the call does connect, though it is fatal when I see it in production. But I wonder if fixing it in this case would also fix it there.

> What I don't understand is why that's throwing from within the Transport.close method, since as far as I can tell, AwaitQueue.stop causes other enqueued promises to reject with that exception.

Perhaps what is happening is that there are two calls to stop, and the second one is rejecting work still pending from the first? So the two look like they come from the same place but are in fact two different operations. That would maybe line up with this reproducing when a retry occurs.
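
If that's right, the confusing stack would just be an artifact of where the error object gets constructed. A toy model (not mediasoup-client's actual internals), again assuming the awaitqueue push()/stop() API:

    import { AwaitQueue } from "awaitqueue";

    const queue = new AwaitQueue();

    // Operation A: enqueue work whose promise nobody awaits (fire-and-forget), so a
    // later rejection will surface as "Uncaught (in promise)".
    void queue.push(() => new Promise<void>(() => { /* never settles */ }));

    function closeFromRetry(): void {
      // Operation B, e.g. a retry tearing things down, stops the queue. The
      // AwaitQueueStoppedError that rejects operation A's promise is constructed here,
      // so the reported stack points at this stop/close call rather than at whatever
      // enqueued operation A.
      queue.stop();
    }

    closeFromRetry();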

jpd236 commented 4 months ago

I see in the debug logs that there are two instances of:

configureLogger.ts:52 [useCallState] Creating new Mediasoup transport {transportId: 'e299e359-02ab-42e8-82c5-d3bdf8efcf24', direction: 'send'}

On successful calls this only happens once.