Firefox didn't show the table for a newly started tournament

ctm commented 4 years ago

Starting a brand new tournament, all three browsers pop-up tables that no longer have unwanted scroll bars, but Safari and Brave show the table while Firefox doesn't. That strikes me as amazingly strange, but it's happened 3 out of 3 times.

Since this is reproducible, it's probably easily fixable, but these various browser differences are really driving me nuts. OTOH, because I wasn't testing anything other than Brave, we have a backlog of differences to deal with and presumably once we've cleaned up all the obvious ones, there will only occasionally be new ones to deal with.

ctm commented 4 years ago

Interestingly, while testing a different bug-fix I had started a tournament with deadhead and brave on Brave, and firefox on Firefox and safari on Safari and all the lobbies showed the table, I think. I'm not 100% sure that I checked both Brave lobbies, nor do I remember the exact order I had players join.

I then logged in one player per browser, Safari first, Firefox in the middle and finally Brave. I then started and sure enough Firefox didn't show the tournament table. So, it continues to appear reproducible. I'll add some debug lines and figure out what's going on.

ctm commented 4 years ago

Uh oh. The JavaScript console had:

Error: closure invoked recursively or destroyed already 1.mb2-web.js:1:14751
    St https://localhost:8081/1.mb2-web.js:1
    __wbindgen_throw https://localhost:8081/mb2-web.js:1
    h7c386864a9275cbf https://localhost:8081/mb2-web.wasm:5114255
    <dyn core::ops::function::FnMut<(&A,)>+Output = R as wasm_bindgen::closure::WasmClosure>::describe::invoke::haffd1a5e2a97f296 https://localhost:8081/mb2-web.wasm:5003323
    O https://localhost:8081/1.mb2-web.js:1
    u https://localhost:8081/1.mb2-web.js:1
    ln https://localhost:8081/1.mb2-web.js:1
    __wbg_open_2f60def672712c09 https://localhost:8081/mb2-web.js:1
    ha34352e85c9fe382 https://localhost:8081/mb2-web.wasm:4647319
    <mb2_web::lobby::Lobby as yew::html::Component>::update::h052e3951034b66ec https://localhost:8081/mb2-web.wasm:1708424
    <yew::html::scope::UpdateComponent<COMP> as yew::scheduler::Runnable>::run::hab2b2b98205da578 https://localhost:8081/mb2-web.wasm:3622133
    hfa40a234f52d6336 https://localhost:8081/mb2-web.wasm:4156528
    h78fd53a1a1121199 https://localhost:8081/mb2-web.wasm:4985997
    hf50b1ffd77642f69 https://localhost:8081/mb2-web.wasm:4988164
    h0f30a232f318f7fd https://localhost:8081/mb2-web.wasm:5035319
    h5f440fe7cc8d4f63 https://localhost:8081/mb2-web.wasm:4981254
    h32bf4ed7f94d6203 https://localhost:8081/mb2-web.wasm:4994377
    h6098128644840d16 https://localhost:8081/mb2-web.wasm:4540378
    hd662fc3918e18391 https://localhost:8081/mb2-web.wasm:4990972
    he505a85d21001deb https://localhost:8081/mb2-web.wasm:5098364
    <dyn core::ops::function::FnMut<(&A,)>+Output = R as wasm_bindgen::closure::WasmClosure>::describe::invoke::haffd1a5e2a97f296 https://localhost:8081/mb2-web.wasm:5003304
    O https://localhost:8081/1.mb2-web.js:1
    u https://localhost:8081/1.mb2-web.js:1

which led me to this reddit article.

My guess is I can figure out what specifically is going wrong (since it's reproducible) and figure out a workaround, but this particular bug makes me very nervous, because it's the sort of bug that I didn't think Rust would allow me to create.

ctm commented 4 years ago

This is kind of grim, in that what's happening is an almost silent dropping of one of the messages that comes from the server into the lobby through the WebSocket. Here are some info lines:

INFO:mb2_web::lobby -- data: Ok(Private(SitAt(EventPlayer(TournamentId(1), PlayerId(13)), TableSessionId(10576095826332076396)))) 1.mb2-web.js:1:15688
INFO:mb2_web::lobby -- msg: WsReady(Ok(Private(SitAt(EventPlayer(TournamentId(1), PlayerId(13)), TableSessionId(10576095826332076396))))) 1.mb2-web.js:1:15688
INFO:mb2_web::lobby -- about to open 1.mb2-web.js:1:15688
Error: closure invoked recursively or destroyed already 1.mb2-web.js:1:14751
INFO:mb2_web::lobby -- _delme: Ok(Some(Window { obj: EventTarget { obj: Object { obj: JsValue(Window) } } })) 1.mb2-web.js:1:15688
INFO:mb2_web::lobby -- data: Ok(Public(ChatMessage(PlayerId(12), "whee"))) 1.mb2-web.js:1:15688
INFO:mb2_web::lobby -- msg: WsReady(Ok(Public(ChatMessage(PlayerId(12), "whee")))) 1.mb2-web.js:1:15688

The above is from Firefox, when it is glitching. Safari and Brave don't glitch and they don't drop the message:

NFO:mb2_web::lobby -- data: Ok(Private(SitAt(EventPlayer(TournamentId(1), PlayerId(12)), TableSessionId(8468895295255539722))))
14:48:49.417 1.mb2-web.js:1 INFO:mb2_web::lobby -- msg: WsReady(Ok(Private(SitAt(EventPlayer(TournamentId(1), PlayerId(12)), TableSessionId(8468895295255539722)))))
14:48:49.417 1.mb2-web.js:1 INFO:mb2_web::lobby -- about to open
14:48:49.424 1.mb2-web.js:1 INFO:mb2_web::lobby -- _delme: Ok(Some(Window { obj: EventTarget { obj: Object { obj: JsValue(Window) } } }))
14:48:49.425 1.mb2-web.js:1 INFO:mb2_web::lobby -- data: Ok(Public(PlayersAdded(TableId(1), [PlayerId(12), PlayerId(11), PlayerId(13)])))
14:48:49.425 1.mb2-web.js:1 INFO:mb2_web::lobby -- msg: WsReady(Ok(Public(PlayersAdded(TableId(1), [PlayerId(12), PlayerId(11), PlayerId(13)]))))
14:50:12.325 1.mb2-web.js:1 INFO:mb2_web::lobby -- msg: WsAction(UpdateChat("w"))
14:50:12.472 1.mb2-web.js:1 INFO:mb2_web::lobby -- msg: WsAction(UpdateChat("wh"))
14:50:12.600 1.mb2-web.js:1 INFO:mb2_web::lobby -- msg: WsAction(UpdateChat("whe"))
14:50:12.756 1.mb2-web.js:1 INFO:mb2_web::lobby -- msg: WsAction(UpdateChat("whee"))
14:50:12.975 1.mb2-web.js:1 INFO:mb2_web::lobby -- msg: WsAction(SendChat)
14:50:12.978 1.mb2-web.js:1 INFO:mb2_web::lobby -- data: Ok(Public(ChatMessage(PlayerId(12), "whee")))
14:50:12.978 1.mb2-web.js:1 INFO:mb2_web::lobby -- msg: WsReady(Ok(Public(ChatMessage(PlayerId(12), "whee"))))
14:50:12.981

and

[Info] INFO:mb2_web::lobby -- msg: WsReady(Ok(Private(SitAt(EventPlayer(TournamentId(1), PlayerId(11)), TableSessionId(15438155183687356131))))) (1.mb2-web.js, line 1)
[Info] INFO:mb2_web::lobby -- about to open (1.mb2-web.js, line 1)
[Info] INFO:mb2_web::lobby -- _delme: Ok(Some(Window { obj: EventTarget { obj: Object { obj: JsValue(Window) } } })) (1.mb2-web.js, line 1)
[Info] INFO:mb2_web::lobby -- data: Ok(Public(PlayersAdded(TableId(1), [PlayerId(12), PlayerId(11), PlayerId(13)]))) (1.mb2-web.js, line 1)
[Info] INFO:mb2_web::lobby -- msg: WsReady(Ok(Public(PlayersAdded(TableId(1), [PlayerId(12), PlayerId(11), PlayerId(13)])))) (1.mb2-web.js, line 1)
[Info] INFO:mb2_web::lobby -- data: Ok(Public(ChatMessage(PlayerId(12), "whee"))) (1.mb2-web.js, line 1)

In both the other browsers, the PlayersAdded message came in and that's what caused the table to be shown. Since that message never comes in on Firefox, the table doesn't get shown. There is no error directly associated with the message not coming in, however, indirectly there's some error associated with our call to open_with_target_url_and_features:

                    info!("about to open");
                    let _delme = window.open_with_url_and_target_and_features(
                        &url,
                        &table_to_track,
                        "width=700,height=880",
                    );
                    info!("_delme: {:?}", _delme);

The closure panic happens after about to open is written and before _delme: ... is written.

I've spent a bunch of time looking at wasm-bindgen/src/closure.rs, but there's no way I'm going to come up to speed on wasm-bindgen internals quickly to be of use here, so I'll probably have to bite the bullet and try to write a "minimal" test case, although that's daunting because of how many moving parts there might be. I'm a bit flabbergasted that when this occurs, the pop-up opens and works perfectly fine, it's just that the message is silently dropped.

My biggest concern isn't what I'm seeing happening, it's what similar stuff could happen that would be exceedingly hard to track down. IOW, if we truly silently have messages dropped, it'll just cause random glitches and our software will just look buggy as hell.

ctm commented 4 years ago

FWIW, just firing up a game with two players on Firefox reproduces the problem.

Using open_with_url still has the problem. Changing from &url to "https://duckduckgo.com" still has the problem.

Getting rid of the call to window.open_with_url... makes the problem go away.

So, there's a chance I'll be able to cook up a small test case tomorrow.

ctm commented 4 years ago

Not surprisingly, calling window.open_with_url before starting the yew app doesn't trigger the bug. i.e., the following opens the web page w/o the closure complaint:

#[wasm_bindgen]
pub fn run_app() -> Result<(), JsValue> {
    // FWIW, using panic::set_hook doesn't seem to be helping.
    std::panic::set_hook(Box::new(console_error_panic_hook::hook));
    web_logger::init();

    let window = yew::utils::window();
    let _ = window.open_with_url("https://duckduckgo.com");

    yew::start_app::<App>();
...

ctm commented 4 years ago

Adding the open to create (where it certainly would never belong) doesn't invoke the bug:

    fn create(_: Self::Properties, link: ComponentLink<Self>) -> Self {
        let window = yew::utils::window();
        let _ = window.open_with_url("https://duckduckgo.com");

        Self {
            ws_service: WebSocketService::new(),

ctm commented 4 years ago

If I add a call to open to the first update message where we get info back from the websocket, it does happen:

            Msg::WsReady(response) => {
                if !self.ddg {
                    info!("in this section");
                    let window = yew::utils::window();
                    let _ = window.open_with_url("https://duckduckgo.com");
                    self.ddg = true;
                }

                match response {
                Ok(PenMessage::Private(Private::LoginResult(result))) => match result {

FWIW, when recompilation was relatively rare I built production by default, now in testing I'm building dev and it still happens. The turnaround time is much quicker, so that's helping me reduce this. Not adding a comment each time I try something will speed it up even further, so I'll now just use git to record all the various hacks I make and probably won't update this issue for a while, although I'll be so happy if I can get the problem to occur before there's any user input that I may add a comment when (if?) that happens.

ctm commented 4 years ago

I decided to download Firefox 55.0 to see how it compared to 74.0 and ... the bug doesn't appear there. I'm going to jump up to 70.0 and look there. I'll then do a binary search to figure out where the change occurred, although there's a chance that I can get a false positive since memory corruption can be pretty darned mobile.

Whee! (Not!)

ctm commented 4 years ago

70.0 has the bug. Trying 63.0, which has the bug. Tryihg 59.0, which has it. Trying 57.0, which has the bug. Trying 56.0, which also has it.

I've double checked and 55.0 doesn't. There are various non whole number releases between 55.0 and 56.0. I may check them later.

ctm commented 4 years ago

Assuming the releases that have the letter b after the decimal point are beta releases, then the only releases after 55.0 and before 56.0 are 55.0.1, 55.0.2, 55.0.3 and the 12 beta releases of 56.0 (i.e. 56.0b1, 56.0b2, ..., 56.0b12). I'll start with 56.0b1, which has the bug. Now I'll check 55.0.3 which doesn't have it.

55.0: August 3rd, 2017 56.0b1: August 8th, 2017 55.0.3: August 24th, 2017 56.0: September 26th, 2017

So, it strongly looks like a bug was introduced in with the development of 56. Time to file a bug report, but ... who's going to believe me?

ctm commented 4 years ago

Yesterday evening, at the end of my day, I started writing up the bug report and then realized I could grab the source for 55.0 and 56.0b1 and look for an obvious big change and ... I was overwhelmed. As such, I saved the text to the bug-report I started writing and resigned myself to building a minimal actix-web server to reproduce the problem, which I'll do today.

Once I have the minimal actix-web server reproducing the problem, I'll see if there's anything I can hack in actix-web's WebSocket support to get it to the point where the bug (presumably in Firefox) doesn't manifest, so that when I'm done I'll have a self-contained rust crate with a client and server that shows the problem, along with the specific change to the WebSocket protocol that triggers it.

Along the way I may find that some of my guesses have been wrong, but if so, that's still likely to lead me closer to the truth.

In some sense this is a blast from the past and I've already mentally resigned to needing to find a workaround since it may take months to get this properly resolved.

ctm commented 4 years ago

Ugh! My stripped down actix-web based server does not result in the bug. OTOH, my stripped down actix-web based version isn't 100% the same, for example it doesn't listen for https, since we don't use https in this context. OTOH, something is different.

I should be able to remove things from my "real" server (it's already hacked up a bit to prevent the http -> https redirect, for example) until the bug no longer manifests itself, since eventually if I take enough away there should be no difference between the stripped version and what I'm left with.

Like yesterday, I'll probably go silent on this ticket as I hack around a bunch. I'm leaving messages to myself in my git commits to my firefox-open-url branch just so I can go back and double-check results, because what I'm finding is very surprising.

ctm commented 4 years ago

Oops. I previously entered this on the wrong issue (#90):

D'oh. I think the more likely scenario is simply if a web-socket receives a message while a new window is being opened, we get that closure complaint and that message is dropped.

It's very embarrassing to have not thought of that earlier, but this is something I should be able to trivially test.

Then just now I added an update:

Sure enough, that's what's happening and of course I can use the tiny tungstenite server to demonstrate this, because it had absolutely nothing to do with actix-web, it's just that I overlooked (blindingly obvious!) fact that my "real" server sends a second message shortly after the login result.

It's hard for me to believe that this is has anything to do with yew, so my next step is to write a web-sys test case.

I now suspect that when openurl... is called, there is.something like a window of vulnerability in the Firefox implementation where two threads (or thread-like contexts; I don't know what's under the hood) both receive the incoming message, so the complaint about the closure running twice is correct, it really is incorrectly running twice, and if the incoming message is received at all, it's received on the wrong thread.

This could be a misunderstanding on my part about tricky issues surrounding openurl, although if that's the case, it would have been nice for there to be a warning in the documentation (or even better yet a failure or at least warning when compiling). OTOH, since neither Safari, Chrome, Brave (yes I know, it's essentially Chrome), nor Firefox 55 exhibit the behavior we're seeing, my guess is it's a bug in Firefox, just not the bug I was thinking I was seeing yesterday evening (when I was tired from about 12 hours of hacking away at this).

ctm commented 4 years ago

FWIW, I've created a test case and will submit a bug report to the Firefox team.

My guess is I can workaround either by creating a dummy message that's ignored when received and sending that message in between the login response and the table creation message, or by adding some sort of delay. I'll futz around with a workaround before reporting the bug because there's a chance that doing so will change my opinion of what's happening or provide me with additional info that I should include.

ctm commented 4 years ago

Ugh. While starting the bug report process, I found this bug which is similar, but is different from what I'm seeing in yew. However, my test case calls open_with_url... from within the callback which means it could be triggering that (8 year old!) bug.

In the actual program open_with_url... is called in the update function after the callback has completed and there's only one message that then comes in while the pop-up is being created, so assuming the callback is only called when there's a message to be read, I don't think it can be called twice from a single incoming message, but perhaps I'm missing something. Perhaps the value returned by the callback is passed through a communication channel and can arrive and be acted on before the closure fully exits. I'll have to poke around a bit more.

ctm commented 4 years ago

My guess now is that the eight year old bug is indeed what's biting us. It looks like the data is moved out of the closure using a mechanism that could allow the yew reactor to continue to run before the closure has been completely exited.

If I understand the description of the eight year old bug, normally callbacks are effectively serialized, not because there's a strict guarantee, per-se, but because it was implemented in a way that the "right thing" almost always happened.

I'm going to make another entry in my test case that does use yew. I'm not sure where I'll go from there, but I might add a comment to the eight year old bug and also send a warning to the yew developers. Unfortunately, this makes something that has been 100% rock solid on all other browsers (AFAICT) flaky in a way that could waste other people's time or affect people's confidence.

ctm commented 4 years ago

After sleeping on this, my plan is to proceed by doing three things

detecting Firefox and putting up a warning saying something like "It is strongly recommended that you use a different browser"
add a resync command for the lobby and calling it on Firefox after the pop-up is up
explore what it would take to add a mutex to a flavor of callbacks that could be used in this case

I'll do the first one "now". It's basically drop-dead simple, except I'm going to use it as an excuse to bring UIkit in. I did not want to add a JS framework this early into development, but Covid-19 has already altered a bunch of my other plans, so I'll bite the bullet and do it, even though we may want to rip it out and go with something else once we know who the front-end person will be.

I'll do the second one after I've hit the Unsupervised play-testing milestone. By then I'll have tweaked some of the commands associated with ring games anyway.

I'll be happy to work on the third if I can get some assistance from someone who really knows this space well. I like giving back to the community, but this is a situation where a crappy "solution" is worse than no solution at all and I simply don't think in Rust once it comes to things like pinning. I strongly suspect that will change in the next year, but right now it's best to be realistic rather than improperly ambitious.

ctm commented 4 years ago

I'm deploying a version that disallows Firefox. It's a stop-gap as per the above. I have now changed the label from bug to enhancement, the enhancement being for us to include a workaround for the Firefox bug.

ctm commented 4 years ago

I decided to make it an actual error rather than a warning. I love Firefox and the Mozilla team, but I absolutely do not want to have to chase down bugs related to missing packets only to find out that someone was using Firefox and not mentioning it. My current plan is still to add a workaround shortly after getting Unsupervised Play-Testing, but I made the alert be a little more open ended.

ctm commented 4 years ago

Perhaps the best workaround would simply be to implement lobby sessions and the automatic reconnection of websockets and then either capture that error somehow and close the websocket or preemptively close the websocket and then open it again when it's safe. The former seems to be the best approach, because the latter suffers from the question of "when is it safe?" and the behavior we're dealing with is a bug in Firefox that we have no control of.

ctm commented 3 years ago

I'm going to take a quick poke at this today. I think the worst-case scenario would be for me to detect the error and the simply reopen the lobby. There's a slight chance that I can figure some sort of other workaround, but I'll only pursue it if the answer is handed to me quickly once I start testing under Firefox again.

ctm commented 3 years ago

Ugh. Firefox 86.0 gives me:

Uncaught ReferenceError: WebAssembly is not defined
    wasm https://localhost:8081/mb2-web.js:2
    wasm https://localhost:8081/mb2-web.js:2
    e https://localhost:8081/mb2-web.js:2
    e https://localhost:8081/mb2-web.js:2
    8349 https://localhost:8081/mb2-web.js:2
    ci https://localhost:8081/mb2-web.js:2
    <anonymous> https://localhost:8081/mb2-web.js:2
    <anonymous> https://localhost:8081/mb2-web.js:2

when I try to use mb2 on it with the Firefox detection bypassed. My guess is it's turning off WebAssembly due to some perceived error, perhaps because I'm running locally using the certificate for devctm.com, which generates an initial complaint.

I suspect we're running into some variant of this complaint and if I had more time I could probably figure it out, but the truth is if we had been using Firefox from day one, I'd have had a much easier time because I'd have noticed the regression soon after it appeared. However, the reason we're not using Firefox is because of their bug that they've had for nine years, so …

… I can't really justify working on this today.

ctm commented 3 years ago

I took a look at this again last night and this morning and it appears that my Firefox (i.e., the one on my laptop) simply won't run anything that uses WebAssembly. I tried deleting it and reinstalling it and that didn't help, although I didn't look for and nuke preferences.

FTR, I tried a simple Rust blur-demo and then tried the non-Rust Doom 3 port.

I did a bunch of web searching and I'm not seeing a bunch of other people with this complaint, so my guess is it's just my version that's broken. I'll try other computers in the household and see if they're too affected. My guess is they won't be.

ctm commented 3 years ago

Sure enough, my iMac's Firefox works fine. I'll see what I can do to reset things on my development laptop.

ctm commented 3 years ago

Resetting Firefox worked, so I may take a look at my Firefox restriction again.

ctm commented 3 years ago

I added a gross, almost 900 line diff, hack that appears to work. The essence of the hack is the client decides if it's running under Firefox or not and all the login-type messages now include an additional bool that is true if the lobby connection should be tagged as a Firefox connection.

The, whenever NickMapper sends a SitAt message, it creates a Vec<PenMessage> in the new firefox_replay_queue HashMap. Then, anytime NickMapper has a message to deliver to a lobby, it first checks to see if there's an entry in firefox_replay_queue and if so, it just queues the message for deferred play. Finally, whenever open is called within the lobby, it also adds a timeout triggered to go off immediately and the timeout then sends a FirefoxResume message which tells NickMapper to drain and empty the queue.

Due to the way demos multiplex the lobby connection there's some tomfoolery that wouldn't have been required otherwise.

FWIW, although the diff is almost 900 lines, almost all of it is trivial changes that cause the firefox bool to be passed around.

I'm deploying now.

ctm / mb2-doc

Firefox didn't show the table for a newly started tournament #98