Ensembl / ensembl-dauphin-style-compiler

"dauphin" style-compiler for genome browser
0 stars 0 forks source link

memory bounds issue unknown cause #19

Closed ens-ds23 closed 6 months ago

ens-ds23 commented 2 years ago

image

ens-ds23 commented 2 years ago

What were you testing when this happened @azangru? Was it interstitials? Were you loading/unloading the browser? I suspect the only way this could happen would be at the Rust/JS boundary, so knowing your activity would be useful.

Most likely is either:

  1. the timer listener. This code has the scariest lifetime code because it involves keeping closures around between ticks.
  2. incomplete/improper unregistering.
azangru commented 2 years ago

What were you testing when this happened

I was switching between the apps. I believe I saw this error after a jump from genome browser page to species selector page (to get a new species), and then back to the genome browser. So out of your two options, the incomplete/improper unregistering is the most likely.

What was interesting was that that error didn't kill the genome browser code. But with currently undefined behaviour upon DOM nodes unmount, I can't even begin to imagine what was happening under the hood.

ens-ds23 commented 2 years ago

That sounds plausible (and is less of a worry than other possibilities). Lifetime management can be fun between JS and Rust on shared objects which both need to maintain a reference to long-term (with one language having directly managed memory and the other being garbage collected).

Fortunately it's mercifully rare to need to do so (it's usually a matter of passing on data and washing your hands of it, in one direction or the other). The exceptions are repeated callbacks (listeners and timers) which are very short pieces of code, typically just a jump out to a method from a captured reference. I suspect a callback was somehow run on something which no longer existed in some sense perhaps due to a race (either way around). That would also explain why your stack trace is so shallow.

When I get a good test case for unregistering and reregistering, hopefully I'll be able to cause it to happen again.

azangru commented 2 years ago

I've caught this error again, this time on video, by switching between the apps on http://fix-bootstrap-load-focus-object.review.ensembl.org. In fact, it can be produced there quite reliably, following the steps captured on the video.

It looks to be caused by faulty/missing cleanup logic.

https://user-images.githubusercontent.com/6834224/146358488-74f5b002-f303-4aa9-a6d6-8cd7521b5018.mp4

ens-ds23 commented 2 years ago

Ah that's great. Reproducible errors are much nicer. Thanks for spotting that!

ens-ds23 commented 2 years ago

Can now reproduce semi-reliably on locally running docker instance of site.

ens-ds23 commented 2 years ago

I've got a handle on this now and have a "fix" which is bad code (typescript) but illustrates the origin of the bug and the way it should be fixed (by someone who knows typescript) and which reliably makes the bug go away. I need to untangle all of the changes first as my setup is kind of "exotic" now as this was difficult to track down.

The problem originates in ensembl-genome-browser where we don't currently distinguish the init() method, which should be run once on loading the browser's autogerneated js file and only once, from more general init and teardown of the browser which will happen from time to time even on a single "page". Of course, that autogenerated js file should also only be included once. It was consuming lots of CPU as far as I can tell, recompiling the WASM in the browser whenever we switched views.

But, more importantly in terms of why not to do it more than once, the init method sets up various memory management thingies which shouldn't be redone. The bug was triggered when an overhanging callback (usually an AJAX response) which was setup before init was erroneously rerun was called after a rerun "on another browser". If you switched quickly you could set off an AJAX, tear down a browser and bring up another before the call returned. This being networking, the reliability of this varied with the weather! Of course the memory footprint once reinitialised probably completely different, so the callback was jumping into random code.

So we need to distinguish init-the-code from init-a-browser. The latter can (and should) be done many times (after the former is run once and only once). I'll paste in the ts for someone who knows what they're doing to write properly once I've untangled my changes.

In the process I tidied lots of my own code, though, and cast a sidelong glance at a 3rd-party library, now no-longer used. None of that was the cause, but it was good to do it. I also sped up the Docker process by a lot (often about 3 mins now). So it's all good.

The Rust looks in an OK place for multiple browsers (separated in time), though it's probably not perfect. The WebGL side is the worst, I think, but it's still mostly there. There's a separate ticket for doing a general audit of this stuff which doesn't seem as scary as it did yesterday. But there are probably other causes of panics on unload lurking in the code. But I'm 90% sure we have this one, at least.

ens-ds23 commented 2 years ago

Patch in https://github.com/Ensembl/ensembl-genome-browser/pull/5