ctm / mb2-doc

Mb2, poker software
https://devctm.com
7 stars 2 forks source link

Firefox trouble (with closed web sockets?) #827

Open ctm opened 2 years ago

ctm commented 2 years ago

Look into the panic that comes up in certain Firefox situations.

jrx: Firefox error, maybe due to reboot?? TypeError: n is undefined in https://devctm.com/235.mb2-web.js at line 1

I've seen it myself. It doesn't appear to happen simply due to me installing a new version. That seems to work OK. For me, I've seen it when my laptop has been closed overnight. Simply closing my laptop briefly isn't enough to trigger it.

Unfortunately, when I most recently saw it, I didn't copy down the backtrace, but IIRC, there was no Rust code in the backtrace, which will make this hard to track down.

ctm commented 2 years ago

Shutting my laptop for two minutes is not enough to trigger the bug, which suggests it's not as simple as the lobby ping being scheduled when the laptop is shut.

ctm commented 2 years ago

Logging in locally using both Brave and Firefox and then stopping the local server for over two and a half minutes (the RECONNECT_GRACE period) was not sufficient to trigger this bug (it didn't even cause the session to expire, because, I believe, the timer doesn't run while the server is stopped).

I will try logging into devctm.com on both Brave and Firefox and closing my laptop for at least ten minutes and see if that triggers it.

ctm commented 2 years ago

I had my laptop shut for 22 minutes and Firefox was fine and I didn't get an expired session.

ctm commented 2 years ago

I'm not going to consider this high priority, but I'll try to stay logged in to Firefox, either as deadhead or dh2 round the clock and I'll note when it happens again (and when it doesn't each time I have my laptop closed longer than the time before).

ctm commented 2 years ago

This happened last night. I believe I closed my laptop sometime around 20:45. The console output suggests that the initial error happened at 22:48, so there may be some magic "2 hour" boundary that was hit.

Here's the console output:

Screen Shot 2022-01-30 at 05 22 33

A web search turns up tons of complaints about this error message throughout the years.

Here's a complaint from people using socketio (which mb2 doesn't use).

Here's what might be a relevant Bugzilla issue.

Here's stackoverflow with a suggestion to close the WebSocket:

$(window).on('beforeunload', function(){
    socket.close();
});

I'll add the equivalent code in Rust and see if that helps.

ctm commented 2 years ago

The wasm-sockets crate doesn't allow the underlying WebSocket to be closed. So for now what I've done is made it so that the underlying WebSocket's handlers are dropped and then the WebSocket itself is dropped. There's a chance that will fix the problem with Firefox. If it does, then wasm-sockets won't need to be modified.

So, I'm deploying now and then will shut my laptop for a couple of hours and then check back and see. If the error is still there, I'll hack up my fork of wasm-sockets to allow close and then try again. If that fixes the bug, I'll open an issue in the wasm-sockets repository.

ctm commented 2 years ago

After deploying I had my laptop shut for two hours and twenty minutes and the problem didn't occur. I'll see if it occurs tonight.

ctm commented 2 years ago

Ugh. My attempted solution made things worse:

Screen Shot 2022-01-30 at 12 41 51

I got the above panic after what was then my most recent deploy (where I added some more debug info in an effort to track down the ErrorImpl bug #839).

I've reverted the relevant commit and am redeploying. I'm making this issue high priority because I'd like to rework my attempted solution and have it in place by the time I go to bed this evening.

ctm commented 2 years ago

Oh my. I didn't realize that when send_message is called from within a JavaScript callback, not only is the message sent, but it's "delivered" (acted upon) as well. As such, I should be able to implement the dropping of the wasm-socket EventClient without needing Rc<RefCell<...>>.

That should allow me to get around the panic I introduced, but it also makes me wonder about the way I'm doing the pop-up windows. I'm probably asking to pop them up inside the message receiving handler for the window, which is not what I thought I was doing. I'll have to think about that some more.

ctm commented 2 years ago

I am deploying the new workaround now. I've already verified that it doesn't have the same problem the workaround I had to revert had. It's unlikely that I'll run another two hour and twenty test though, so I'm going to once again remove the high priority label and simply let the new code run overnight.

ctm commented 2 years ago

Nope. That version gave me the same error:

22:33:30.007 The connection to wss://ws.devctm.com/ws/ was interrupted while the page was loading. 235.mb2-web.js:1:28332

It looks like it's about 2 hours after I shut my laptop. I had written myself a note that I had shut it at 20:31, but after I wrote that note I reopened the laptop to start log dh2 in using Brave so I would have something to compare it to.

GamboMouse won at 22:28 and created the new tournament at 22:30. My guess is it's just a coincidence that those times are so close to when the error occurred, but I'll have more details in the morning.

I quickly hacked the ability to close the underlying WebSocket in my fork of wasm-sockets and I'm deploying a version of mb2 to devctm.com now so that I can leave Firefox connected to that version overnight.

ctm commented 2 years ago

I can reproduce this simply by closing my laptop, restarting my wireless access point and reopening my laptop. That's the good news. The bad news is that doing so even with the hacked in close in the unload handler still exhibits the bug.

I am going to add a few lines of logging and run my experiment again.

For the record, here's how I got to this level of knowledge:

I woke up in the middle of the night (3:47 AM?), opened my laptop and neither Firefox nor Brave had lost the session. My notes say:

shutting laptop at 0128 0347 no error but no lost session either closing again at 0348

Specifically, I had a hard time getting back to sleep so I reopened my laptop and read news. I'm not actually sure when I closed it for the final time. I thought I had updated my notes before I did so, but either I forgot to do so or I botched the update. I was trying (unsuccessfully!) not to interfere with my sleep, and luckily the exact details don't actually matter.

This morning, when I woke up, I thought a bit about the possibility that what has been happening has been my access point (or something else) dropping the connection at the same time each night. Ironically, I might not have thought of that if I had reread my notes in this issue, but (again luckily), I tried it and sure enough, I got the error, although by then my laptop had been shut for close to two hours. So, before writing this comment I tried it again and had it reproduced in just about two minutes.

ctm commented 2 years ago

So, the page unload handler never gets called. What follows is what I got on the console, with all of the info! invocations happening before I closed my laptop:

06:41:18.994 Navigated to https://devctm.com/
06:41:19.009 INFO front-ends/web/src/reconnecting_web_socket.rs:330 _unload_fn: about to call close 235.mb2-web.js:1:24929
06:41:19.010 INFO front-ends/web/src/reconnecting_web_socket.rs:141 about to call self.close() from update 235.mb2-web.js:1:24929
06:41:19.011 INFO front-ends/web/src/reconnecting_web_socket.rs:250 inside fn close 235.mb2-web.js:1:24929
06:41:19.011 INFO front-ends/web/src/reconnecting_web_socket.rs:252 Some(mut ws) 235.mb2-web.js:1:24929
06:41:19.011 INFO front-ends/web/src/reconnecting_web_socket.rs:257 about to call ws.ws.close() 235.mb2-web.js:1:24929
06:41:19.012 INFO front-ends/web/src/reconnecting_web_socket.rs:259 r: Ok(()) 235.mb2-web.js:1:24929
06:41:19.012 INFO front-ends/web/src/reconnecting_web_socket.rs:143 close called 235.mb2-web.js:1:24929
06:41:19.012 INFO front-ends/web/src/reconnecting_web_socket.rs:332 _unload_fn: close called 235.mb2-web.js:1:24929
06:41:19.019
GEThttps://devctm.com/
[HTTP/2 304 Not Modified 137ms]

06:41:19.341
GEThttps://devctm.com/spinner.css
[HTTP/2 304 Not Modified 136ms]

06:41:19.344
GEThttps://devctm.com/mb2-web.js
[HTTP/2 304 Not Modified 124ms]

06:41:19.466
GEThttps://devctm.com/235.mb2-web.js
[HTTP/2 304 Not Modified 118ms]

06:41:19.468
XHRGEThttps://devctm.com/mb2-web.wasm
[HTTP/2 304 Not Modified 87ms]

06:41:19.618
GEThttps://devctm.com/favicon.ico
[HTTP/2 200 OK 0ms]

06:41:20.040
XHRGEThttps://devctm.com/mb2-web.wasm
[HTTP/2 304 Not Modified 92ms]

06:41:20.188
GETwss://ws.devctm.com/ws/
[HTTP/1.1 101 Switching Protocols 303ms]

06:42:59.045 The connection to wss://ws.devctm.com/ws/ was interrupted while the page was loading.

So, calling close in the unload handler has no effect because the unload handler is never called. At this point I probably need to dig deep into Firefox's Bugzilla database and see if I can see this reported in there. If not, I believe I have enough knowledge to create a simple test case and that's what I should do, but fighting with Firefox is getting old…

ctm commented 2 years ago

I spent a ridiculous amount of time rummaging around in Firefox's Bugzilla database and although I never found an exact match for what I'm seeing, It's clear that Firefox's WebSockets work differently from every other browser's and that some of these differences have been causing trouble for developers for years.

I like Firefox in principle and I very much like Rust, but I'm incredibly short on free time, so I've added a warning modal dialog that Firefox users will get and I'm deploying that now and closing this issue.

At some point, when I have more free time, or I can pay someone else to do it, I may try to write up a Bugzilla report that shows how to reproduce this, however, it appears that one of the things that has kept various WebSocket Bugzilla issues from getting fixed is that inherently it's necessary to have a WebSocket server somewhere and hence all WebSocket errors are harder to document and more easily ignored by the Firefox developers. In this case, the only way I currently know to reproduce the error requires restarting my wireless access point, so if I were to stop what I'm doing now and come up with a complicated way others can reproduce it, I strongly suspect the bug report would languish for years.

The bottom line is that web sockets can have error handlers attached to them, but Firefox WebSockets have at least one (and probably more than one) error condition that completely ignores the handler and simply blows up the application that is using the WebSocket. In mb2's case there appears to be no workaround.

erb2000 commented 2 years ago

This is very troubling. But, it might just be that I drank too much tonight. Sorry to bother you.

On Mon, Jan 31, 2022 at 6:03 PM Clifford T. Matthews < @.***> wrote:

I spent a ridiculous amount of time rummaging around in Firefox's Bugzilla database and although I never found an exact match for what I'm seeing, It's clear that Firefox's WebSockets work differently from every other browser's and that some of these differences have been causing trouble for developers for years.

I like Firefox in principle and I very much like Rust, but I'm incredibly short on free time, so I've added a warning modal dialog that Firefox users will get and I'm deploying that now and closing this issue.

At some point, when I have more free time, or I can pay someone else to do it, I may try to write up a Bugzilla report that shows how to reproduce this, however, it appears that one of the things that has kept various WebSocket Bugzilla issues from getting fixed is that inherently it's necessary to have a WebSocket server somewhere and hence all WebSocket errors are harder to document and more easily ignored by the Firefox developers. In this case, the only way I currently know to reproduce the error requires restarting my wireless access point, so if I were to stop what I'm doing now and come up with a complicated way others can reproduce it, I strongly suspect the bug report would languish for years.

The bottom line is that web sockets can have error handlers attached to them, but Firefox WebSockets have at least one (and probably more than one) error condition that completely ignores the handler and simply blows up the application that is using the WebSocket. In mb2's case there appears to be no workaround.

— Reply to this email directly, view it on GitHub https://github.com/ctm/mb2-doc/issues/827#issuecomment-1026296986, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAU2OVJVCQVSQOHYQDBNCPLUY4IKRANCNFSM5MZW3FRQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you are subscribed to this thread.Message ID: @.***>

ctm commented 2 years ago

@erb2000 Well, it's been a bit of a time sink, but I don't think it's greatly hurt anyone. If people want to use Firefox, they still can, but with the new disclaimer, I don't feel like I have to spend more time on this right now.

Cheers!

ctm commented 1 month ago

The underlying Firefox issue has been assigned, so I'm reopening this issue and bumping it to high priority, at least until I test to see if I can still reproduce the problem and then add a comment to the Bugzilla issue. I'd love to get rid of my Firefox disclaimer.

ctm commented 1 month ago

The problem is still there. I logged in to craftpoker.com as deadhead on Brave and dh2 on Firefox, closed my laptop, restarted my DSL modem (I no longer have a separate wireless access point), then, after it had restarted, I opened my laptop and tried to send a message from both. Brave took a little while to come back, but it did come back, whereas Firefox immediately spit out a completely useless error message in user-visible alert with the explanation in the console.

Here's the alert that a user would see:

Screenshot 2024-05-04 at 06 28 36

and here's the much more readable explanation in the Firefox console:

Screenshot 2024-05-04 at 06 29 03
ctm commented 1 month ago

FWIW, 300.mb2-web.js is an artifact of the stack that I use, which uses yarn to package up all the WASM and JavaScript that makes up mb2's front-end. It's almost all WASM that comes from Rust with a tiny bit of JavaScript glue that I've written myself as well as the uikit and dragee JavasScript packages and anything generated from about a dozen developer dependencies. Since 300.mb2-web.js is minimized, mentioning n and line 1 doesn't really point out much. There is probably a way I could build so that file isn't minimized, but I certainly don't know it off the top of my head. That aspect of my stack is opaque to me.

ctm commented 1 month ago

I'm stripping this of high priority for now. I'm also stripping bug, because it's not mb2's bug.