jpd236 opened this issue 4 months ago
We do ship source maps for JavaScript, which should result in Chrome decoding those errors, but I have seen cases where it doesn't work. Googling around a little bit, it seems like uncaught exceptions thrown inside promises might just not trigger source map decoding? Which is pretty unfortunate.
There is still enough information that we can start digging. I'm reasonably certain that exception is getting thrown from here: https://github.com/deathandmayhem/jolly-roger/blob/75273d779f7862c61e6d3bed1035a486266901b7/imports/client/hooks/useCallState.ts#L421
(If you want to check my math: `AwaitQueue` is an abstraction from the mediasoup people that queues promises so they don't run in parallel. Within mediasoup-client, it's only used by the `Transport` class.) What I don't understand is why that's throwing from within the `Transport.close` method, since as far as I can tell, `AwaitQueue.stop` causes other enqueued promises to reject with that exception.
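To make that concrete, here's a toy re-implementation of the queue semantics described above (my own sketch, not mediasoup's actual `AwaitQueue` code): tasks run one at a time, and `stop()` rejects everything still waiting.

```typescript
class MiniAwaitQueue {
  private pending: Array<{
    task: () => Promise<unknown>;
    resolve: (value: unknown) => void;
    reject: (error: Error) => void;
  }> = [];
  private running = false;
  private stopped = false;

  // Enqueue a task; it runs only after all previously enqueued tasks finish.
  push<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.pending.push({
        task,
        resolve: resolve as (value: unknown) => void,
        reject,
      });
      void this.runNext();
    });
  }

  // Reject every task still waiting in the queue. The rejection is created
  // HERE, so its stack trace runs through stop()'s caller (in mediasoup,
  // Transport.close()), not through the code that enqueued the task.
  stop(): void {
    this.stopped = true;
    for (const entry of this.pending.splice(0)) {
      entry.reject(new Error("AwaitQueue stopped"));
    }
  }

  private async runNext(): Promise<void> {
    if (this.running) return;
    this.running = true;
    while (this.pending.length > 0 && !this.stopped) {
      const entry = this.pending.shift()!;
      try {
        entry.resolve(await entry.task());
      } catch (error) {
        entry.reject(error as Error);
      }
    }
    this.running = false;
  }
}
```

Because the rejection is synthesized inside `stop()`, the "Uncaught (in promise)" trace runs through whoever called `stop()` rather than through the code that enqueued the rejected task, which would explain why the trace points at the `close` call site.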
FWIW, I don't seem to be able to reproduce this on our instance of JR. I'm skeptical that NLB is somehow protecting us here, but I'm not entirely sure what would be happening instead. Have you verified that coturn is behaving correctly? I left some notes on https://github.com/deathandmayhem/jolly-roger/issues/1970 about how I test our coturn server.
Looks to be working per those steps, at least when I have calls that work:
FWIW there are two `turnConfig` events in the WS logs - one for direction "recv" and one for direction "send". Not sure if it matters which I use.
In any case I'm also skeptical that the lack of NLB would matter. And the issue only ever seems to come up on the first attempt to connect after starting a new instance (and I don't think it's 100% consistent there, though I definitely saw it a few times in that scenario).
Only other oddity I've noticed is that sometimes, even when it works, I see that yellow warning bar very briefly, though it does connect fast enough that I can't even read the warning on it.
I am also seeing rare page load failures in the web UI where some initial load fails with an ERR_INCOMPLETE_CHUNKED_ENCODING error in the console. I wonder if something is just up on the networking side.
Just reproduced again on my first attempt of the night (so it doesn't seem like the issue was haproxy 2.9.0 vs 2.9.7). FWIW BugSnag agrees with you on the impacted line of code.
On success I typically see a line in haproxy's logs indicating a turn connection. In the failure case I don't see that.
I don't know if this is relevant, but looking at the coturn logs, it seems like every successful session starts with a 401 error before things become successful, e.g.:
154269: (15): INFO: IPv4. tcp or tls connected to: 127.0.0.1:55170
154269: (15): INFO: session 000000000000000015: realm <ange.management> user <>: incoming packet message processed, error 401: Unauthorized
154269: (15): INFO: IPv4. Local relay addr: 127.0.0.1:41748
154269: (15): INFO: session 000000000000000015: new, realm=<ange.management>, username=<snip>, lifetime=600
154269: (15): INFO: session 000000000000000015: realm <ange.management> user <snip>: incoming packet ALLOCATE processed, success
154271: (15): INFO: session 000000000000000015: refreshed, realm=<ange.management>, username=<snip>, lifetime=0
154271: (15): INFO: session 000000000000000015: realm <ange.management> user <snip>: incoming packet REFRESH processed, success
154271: (15): INFO: session 000000000000000015: TCP socket closed remotely 127.0.0.1:55170
154271: (15): INFO: session 000000000000000015: usage: realm=<ange.management>, username=<snip>, rp=3, rb=276, sp=3, sb=280
154271: (15): INFO: session 000000000000000015: peer usage: realm=<ange.management>, username=<snip>, rp=0, rb=0, sp=0, sb=0
154271: (15): INFO: session 000000000000000015: closed (2nd stage), user <snip> realm <ange.management> origin <>, local 127.0.0.1:3478, remote 127.0.0.1:55170, reason: TCP connection closed by client (callback)
154271: (15): INFO: session 000000000000000015: delete: realm=<ange.management>, username=<snip>
I'm not positive but I suspect that's normal - at least with HTTP, it's standard to try requesting without auth and only provide auth when the server demands it, so it makes sense to me that browser behavior around TURN would be similar.
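That matches how TURN's long-term credential mechanism works: the first ALLOCATE is sent without credentials, and the 401 response carries the realm and nonce the client needs to build an authenticated retry. A toy model of that exchange (my own sketch, not a real STUN/TURN implementation; the realm value is taken from the logs above, and the server skips the real MESSAGE-INTEGRITY check):

```typescript
interface AllocateRequest { username?: string; realm?: string; nonce?: string }
interface AllocateResponse { status: 401 | 200; realm?: string; nonce?: string }

function serverHandleAllocate(req: AllocateRequest): AllocateResponse {
  if (!req.username || !req.realm || !req.nonce) {
    // The first request is unauthenticated by design -> challenge the
    // client with the realm and a fresh nonce.
    return { status: 401, realm: "ange.management", nonce: "abc123" };
  }
  // A real server would verify MESSAGE-INTEGRITY (an HMAC keyed by
  // MD5(username:realm:password)); this sketch accepts any credentialed retry.
  return { status: 200 };
}

function clientAllocate(username: string): AllocateResponse[] {
  const responses: AllocateResponse[] = [];
  // Step 1: send ALLOCATE with no credentials.
  const challenge = serverHandleAllocate({});
  responses.push(challenge);
  if (challenge.status === 401) {
    // Step 2: retry with the username plus the realm/nonce from the challenge.
    responses.push(serverHandleAllocate({
      username,
      realm: challenge.realm,
      nonce: challenge.nonce,
    }));
  }
  return responses;
}
```

So a 401 at the start of every session, immediately followed by a successful ALLOCATE, is the expected pattern rather than an error.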
Is it easy for you to change https://github.com/deathandmayhem/jolly-roger/blob/main/imports/client/hooks/useCallState.ts#L31 to also pass `level: "debug"`? That should generate a bunch of extra log entries that might tell us more about at least how things are getting sequenced.
Captured the following logs. (FWIW the change you described didn't seem to do anything to the console log output or Bugsnag reporting; I had to change the level in imports/logger.ts to have any effect. Something unexpected with the logging library?)
configureLogger.ts:52 [useCallState] Clearing Mediasoup device
configureLogger.ts:52 [useCallState] clearing transport {direction: 'send'}
configureLogger.ts:52 [useCallState] clearing transport {direction: 'recv'}
configureLogger.ts:52 [useCallState] producerTracks {tracks: Array(0)}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-peers', selfPeer: undefined, otherPeers: Array(0)}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-router', router: undefined}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-device', device: undefined}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-transport', direction: 'send', transport: undefined}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-transport', direction: 'recv', transport: undefined}
configureLogger.ts:52 [useCallState] producerTracks {tracks: Array(0)}
configureLogger.ts:52 [useCallState] dispatch {type: 'request-capture'}
configureLogger.ts:52 [useCallState] dispatch {type: 'join-call', audioState: {…}}
configureLogger.ts:52 [useCallState] producerTracks {tracks: Array(1)}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-peers', selfPeer: {…}, otherPeers: Array(0)}
configureLogger.ts:52 [useCallState] server set initial peer state {initialPeerState: 'active'}
configureLogger.ts:52 [useCallState] Fetching mediasoup-client code
configureLogger.ts:52 [useCallState] dispatch {type: 'set-router', router: {…}}
configureLogger.ts:52 [useCallState] Creating new Mediasoup device
configureLogger.ts:52 [useCallState] dispatch {type: 'set-device', device: e}
configureLogger.ts:52 [useCallState] subscribe mediasoup:transports {peerId: 'tFkshY49XT2qTnyqn', rtpCaps: '{"codecs":[{"mimeType":"audio/opus","kind":"audio"…preferredEncrypt":false,"direction":"sendrecv"}]}'}
configureLogger.ts:52 [useCallState] clearing transport {direction: 'send'}
configureLogger.ts:52 [useCallState] clearing transport {direction: 'recv'}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-transport', direction: 'send', transport: undefined}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-transport', direction: 'recv', transport: undefined}
configureLogger.ts:52 [useCallState] Creating new Mediasoup transport {transportId: 'e299e359-02ab-42e8-82c5-d3bdf8efcf24', direction: 'send'}
configureLogger.ts:52 [useCallState] setting transport {direction: 'send', newTransport: C}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-transport', direction: 'send', transport: C}
configureLogger.ts:52 [useCallState] producerTracks {tracks: Array(1)}
configureLogger.ts:52 [useCallState] Creating Mediasoup producer {track: 'c0e7e06a-c82d-487d-9be9-5bcac449ee3c'}
configureLogger.ts:52 [useCallState] Creating new Mediasoup transport {transportId: 'e299e359-02ab-42e8-82c5-d3bdf8efcf24', direction: 'send'}
configureLogger.ts:52 [useCallState] setting transport {direction: 'send', newTransport: C}
configureLogger.ts:52 [useCallState] Creating new Mediasoup transport {transportId: '1c02de70-f42c-4328-a1c2-dd35d2f7b372', direction: 'recv'}
configureLogger.ts:52 [useCallState] setting transport {direction: 'recv', newTransport: C}
index.js:1 Uncaught (in promise) AwaitQueueStoppedError: AwaitQueue stopped
at e.stop (index.js:1:1712)
at C.close (Transport.js:1:3075)
at P.debug.direction (useCallState.ts:421:48)
at cB (is-plain-object.mjs:1:47)
at cM (is-plain-object.mjs:1:47)
at f8 (is-plain-object.mjs:1:47)
at is-plain-object.mjs:1:47
at eI (is-plain-object.mjs:1:47)
at MessagePort.eV (is-plain-object.mjs:1:47)
stop @ index.js:1
close @ Transport.js:1
P.debug.direction @ useCallState.ts:421
cB @ is-plain-object.mjs:1
cM @ is-plain-object.mjs:1
f8 @ is-plain-object.mjs:1
(anonymous) @ is-plain-object.mjs:1
eI @ is-plain-object.mjs:1
eV @ is-plain-object.mjs:1
await in eV (async)
(anonymous) @ useCallState.ts:890
(anonymous) @ useCallState.ts:851
cR @ is-plain-object.mjs:1
f8 @ is-plain-object.mjs:1
(anonymous) @ is-plain-object.mjs:1
eI @ is-plain-object.mjs:1
eV @ is-plain-object.mjs:1
configureLogger.ts:52 [useCallState] dispatch {type: 'set-transport', direction: 'send', transport: C}
configureLogger.ts:52 [useCallState] dispatch {type: 'set-transport', direction: 'recv', transport: C}
configureLogger.ts:52 [useCallState] producerTracks {tracks: Array(1)}
There was a new release of mediasoup-client a few days ago but it doesn't seem to have fixed this.
I can consistently reproduce the AwaitQueueStoppedError locally - but not the permanent failure state - by setting a breakpoint on this line, pausing there for about 60 seconds the first time it is hit, and then resuming:

`dispatch({ type: "set-device", device: newDevice });`

In this case, I see an additional failure, `Meteor method call failed: Mediasoup.Transports.methods.connect`. I suspect that the 60 second wait causes something to time out, which causes a second try to occur. If I skip the 60 second wait, the breakpoint doesn't trigger a second time, and there are no issues.
The failure doesn't seem to be fatal in this case as the call does connect, though it is fatal when I am seeing it in production. But I wonder if fixing it in this case would also fix it there.
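The suspected timeout-then-retry sequence can be sketched like this (a guess at the shape of the behavior, not the actual Meteor/jolly-roger code; `connectWithRetry` and its parameters are hypothetical). The key point is that the first attempt is never cancelled, so once a retry fires, both attempts eventually run their side effects:

```typescript
// Models a connect call raced against a timeout, with one retry on timeout.
async function connectWithRetry(
  connect: () => Promise<void>,
  timeoutMs: number,
  attempts: string[],
): Promise<void> {
  const attempt = (label: string): Promise<void> =>
    connect().then(() => {
      // Side effects land here -- e.g. "Creating new Mediasoup transport".
      attempts.push(label);
    });
  const timeout = new Promise<"timeout">((resolve) => {
    setTimeout(() => resolve("timeout"), timeoutMs);
  });
  const first = attempt("first");
  if ((await Promise.race([first, timeout])) === "timeout") {
    // The first attempt is NOT cancelled; it stays in flight and still
    // completes, so the transport-creation side effect happens twice.
    await attempt("retry");
    await first;
  }
}
```

That would line up with the duplicated "Creating new Mediasoup transport" log line below.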
> What I don't understand is why that's throwing from within the `Transport.close` method, since as far as I can tell, `AwaitQueue.stop` causes other enqueued promises to reject with that exception.
Perhaps what is happening is that there are two calls to stop, and the second one is rejecting the first? So they look like they come from the same place but are in fact from two different operations. That would maybe line up with this reproducing when a retry occurs.
I see in the debug logs that there are two instances of:
configureLogger.ts:52 [useCallState] Creating new Mediasoup transport {transportId: 'e299e359-02ab-42e8-82c5-d3bdf8efcf24', direction: 'send'}
On successful calls this only happens once.
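The "two calls to stop" idea can be modeled with a toy transport (again my own sketch, not mediasoup's code): each `close()` rejects whatever a previous `close()` left queued, so the rejection belonging to the first operation carries a stack trace through the exact same `close()` line as the second.

```typescript
class ToyTransport {
  private pendingRejects: Array<(error: Error) => void> = [];

  // close() first rejects whatever a previous close() left queued, then
  // queues its own (slow) teardown. Two overlapping close() calls therefore
  // produce a rejection whose stack runs through this same method both times.
  close(): Promise<string> {
    for (const reject of this.pendingRejects.splice(0)) {
      reject(new Error("AwaitQueue stopped"));
    }
    return new Promise<string>((resolve, reject) => {
      this.pendingRejects.push(reject);
      setTimeout(() => {
        this.pendingRejects = this.pendingRejects.filter((r) => r !== reject);
        resolve("closed");
      }, 20);
    });
  }
}
```

In this model the two rejections are indistinguishable in a stack trace, which would explain why a retry-triggered second operation looks like it came from the original call site.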
From the JS console (but probably not helpful since the code is optimized - is it possible to deobfuscate this given the Docker image?):
Doesn't seem to happen on subsequent attempts - I've only seen it the first time I've tried launching a call after starting the server. It says it'll connect automatically, but it seems to be stuck permanently.
This could be related to my pending single instance changes - I'll need to try to reproduce with the standard network + NLB setup to make sure that isn't it.