WebReflection / coincident

An Atomics based Proxy to simplify, and synchronize, Worker related tasks.

Coincident deadlocks if there is a high volume of messages from a worker to the main thread #26

Closed e-nikolov closed 11 months ago

e-nikolov commented 11 months ago

While working on a pyscript app (source) I noticed that the worker would randomly stop responding after a while depending on the volume of messages being sent from the worker to the main thread. Sometimes it's after a couple of minutes, other times within seconds of starting.

When the worker thread is stuck and I pause its execution with the Chrome Dev Tools, it is always paused on this line of coincident:

const waitInterrupt = (sb, delay, handler) => {
  while (wait(sb, 0, 0, delay) === 'timed-out') //<---
    handler();
};

and the wait function always returns 'timed-out'.

I saw this earlier issue https://github.com/WebReflection/coincident/issues/18, which seems related, but in my case, it happens even if the main thread does not call any worker functions.

I think I have a minimal reproduction here: https://github.com/e-nikolov/coincident-repro

There are three scenarios that only print to the dev console, to keep things simple:

In the first two demos the worker gets stuck pretty quickly. In the second demo I see the warning message about a possible deadlock from https://github.com/WebReflection/coincident/issues/18 even though I don't have any py -> js -> py calls like in that case.

WebReflection commented 11 months ago

the while loop will always keep going until the main thread has finished, that's the whole goal of that loop, it allows interrupts (checks) in the process ... now, I haven't checked your code (yet?) but are you passing a custom interrupt?

in short, that is expected to keep timing out if your callback on the main takes a long time.

About your demo, you seem to use a non-async function that invokes the worker from the main ... you can do that, but you better use async and await when a worker function is meant to be called ... have you tried that?

About a minimal reproducible example: it would be great to have something that uses just this module, is JS based (you don't seem to use TS much anyway in your files), and that I can reproduce in here; otherwise I can't test a project that depends on this project in here, so it becomes a bit cumbersome to help you out there.

WebReflection commented 11 months ago

P.S. forgot to mention: the deadlock warning happens after your worker has been waiting for the main for a second ... but I am not even sure I am able to reproduce your issue, and it's not mentioned which browser does that ... Firefox, as an example, branches out for waitAsync, as it's half broken there (by not exposing it).
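
For context on that branching, a minimal detection sketch (illustrative only, not coincident's actual source; the real main-thread fallback is more involved, since Atomics.wait cannot block the main thread):

// Illustrative: how the waitAsync branch can be detected.
const hasWaitAsync = typeof Atomics.waitAsync === 'function';

const waitValue = (i32, index, expected) => hasWaitAsync
  ? Atomics.waitAsync(i32, index, expected).value // string or Promise
  : Atomics.wait(i32, index, expected);           // blocks: workers only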

WebReflection commented 11 months ago

had a quick look and I'm still not sure what I am looking at or where the issue is ... could you please tell me just one single failing expectation and what you are doing there?

example:

With a step-by-step reproduction I am sure I can both work on this and add an integration test to avoid further regressions, thanks!

e-nikolov commented 11 months ago

the while loop will always keep going until the main thread has finished, that's the whole goal of that loop, it allows interrupts (checks) in the process ... now, I haven't checked your code (yet?) but are you passing a custom interrupt?

I don't have a custom interrupt.

in short, that is expected to keep timing out if your callback on the main takes a long time.

In the demos here the main does not do much, but there are many calls from the python side.

About your demo, you seem to use a non-async function that invokes the worker from the main ... you can do that, but you better use async and await when a worker function is meant to be called ... have you tried that?

I tried both async and no async, the behavior was the same.

I tried firefox, chrome, vivaldi, edge and they all have the same issue for me.

WebReflection commented 11 months ago

if the main doesn't do much, you should not see any deadlock warning ... if the main keeps resolving ASAP without waiting, I expect a queue of callbacks to be added to the stack, which might sum up to more than a second, hence you see the deadlock warning ... the deadlock warning doesn't even mean "there is a deadlock", it means something ain't likely right ... but overall you can ignore that warning if the code is meant to be stuck for a second on the main ... the other way around (main calling a worker utility) is never gonna show that warning, because long tasks in the worker are expected, while long tasks from the main are "shenanigans" or kinda unexpected, as the worker exists to help off-load long tasks elsewhere, not to make the main stuck for a second or more ... do you see that warning at all? I still would love to understand the basic reproduction steps without PyScript, Polyscript, or anything else in between, and clear steps of what's going on and where or how it fails, thanks!

e-nikolov commented 11 months ago

When I open this demo - https://e-nikolov.github.io/coincident-repro/demo.html

What I expect to see is the following output in the dev tools console:

main-demo.ts
worker-demo.py
js_func: 1000
js_func: 2000
...
js_func: 100000

What I see is that at some point it stops printing anything, indicating that the worker is stuck. Additionally, any function call I make to the python worker is pending forever.

For example if I type await call_py_func() in the Console, I see the output

call_py_func: 1
all_py_func: 2

When I call that function from the passive worker demo I get the correct output:

call_py_func: 1
all_py_func: 2
true

WebReflection commented 11 months ago

let me try to rephrase ... your demos have a lot of indirection that is just noise for the purpose of this bug, plus await call_py_func() makes little sense if that function is not async ... can you please stick with a single demo's logic and expectations so that I can reproduce in here what it is that is failing?

I really need a plain English explanation of how to reproduce it. I am not asking you to show me live code or files, I need a step-by-step description of how to reproduce this, thanks for your understanding.

1 demo, 1 list of things you do on main, things you do on worker, and your expectations.

e-nikolov commented 11 months ago

No problem, I'll do my best. In the meantime, can you share what you see in the dev console when you open this demo: https://e-nikolov.github.io/coincident-repro/demo.html

WebReflection commented 11 months ago

P.S. I don't know what passive worker means either, but that's why I am asking you to please provide a step-by-step reproduction of the issue, so that I can remove terminology, Vite, TypeScript, and your server from the things that matter in here, and focus on the concrete issue within this project, not through remote live examples ... once again, thanks for filing this, I do want stuff to never end up in an unexpected deadlock (I want to warn, if I can anyhow understand what might happen, and suggest how to fix it) so that we all benefit from the change upstream :wink:

WebReflection commented 11 months ago

can you share what you see in the dev console when you open this demo

I see this and nothing else ... the resource monitor of my PC also tells me there's nothing wrong, nothing stuck, not a thread blocked, and so on.

Screenshot from 2023-10-09 16-03-49

e-nikolov commented 11 months ago

I see, the demo prints once every 1000 calls because otherwise there was too much output, but I guess in your case it dies before reaching that so it never prints anything. I'll adjust it.

WebReflection commented 11 months ago

please don't ... I need to understand what you are doing there, not another live demo I can't use in here due to the amount of setup needed to reproduce it!

WebReflection commented 11 months ago

OK, after a hard refresh (the SW can be troubling) I see this ... is that when it dies?

Screenshot from 2023-10-09 16-06-37

e-nikolov commented 11 months ago

To summarize what the demo does:

What I expect to see in the dev console is the output from js_func() being called 100 000 times:

js_func: 1000
js_func: 2000
...
js_func: 100000

What I actually see is the output for the first couple of thousand calls and then nothing.

WebReflection commented 11 months ago

Thank You ... so ... if I got this correctly, this is the index file:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <script type="module">
    console.log('main');
    import coincident from '../../es.js';
    const proxy = coincident(new Worker('./worker.js', {type: 'module'}));
    let func_calls = 0;
    proxy.func = () => {
      func_calls++
      if (func_calls % 1000 == 0) {
        console.log(`func: ${func_calls}`);
      }
    }
  </script>
</head>
</html>

and this is the worker file:

console.log('worker');

import coincident from '../../es.js';
const proxy = coincident(self);

for (let i = 0; i < 100000; i++)
  proxy.func();

I have witnessed the count go up to 100000, but indeed after a refresh I get intermittent results ... I can finally dig into this issue!

WebReflection commented 11 months ago

OK, "here be dragons" ... if I delay by a setTimeout (note not queueMicrotask and not Promise.resolve()) the loop every XXX iterations (where XXX is 100 of them here) I get consistent results ... would you please test that's the case for you too? Bear in mind the test takes much longer this way ...

console.log('worker.js');

import coincident from '../../es.js';
const proxy = coincident(self);

(async () => {
  for (let i = 0; i < 100000; i++) {
    if (i % 100) await new Promise($ => setTimeout($, 0));
    proxy.func();
  }
})();

the issue seems to be related to the Garbage Collector: the heap grows exponentially without having any chance, within the threshold you chose for the loop, to collect enough, so it kinda bails out out of the blue.

Screenshot from 2023-10-09 16-53-34

Screenshot from 2023-10-09 16-54-02

Screenshot from 2023-10-09 16-54-46

Please note this is not even a memory leak as the heap stays OK after 80K iterations

Screenshot from 2023-10-09 16-59-14

WebReflection commented 11 months ago

after this investigation I have very few options:

The second point means everything will instantly become less snappy and less driven by real-world use cases, but maybe (as the underlying issue is not with this module) more predictable or robust; the first point means me "wasting" a lot of time on all the issues I am going to open, including eventually removing this library completely from the equation and creating a fine-tuned reproducible issue based on JS only ...

The last alternative is that you just don't do that in real-world use cases, though I also struggle to see why you even tried to do that in the first place; I'd be more than happy to learn the reasons behind your greedy loop, if not just to test how Atomics can break in the wild ... it's also very possible it's this library doing something in between that makes browsers fail, but as I can consistently reproduce success just by adding a setTimeout delay every X calls, I am pretty sure the issue is not in here, but rather in browser vendors' use cases and tests ...

That being said, this library might as well try to avoid too-greedy operations out of the box, but I'd feel a little dirty patching around an underlying issue just for the sake of it, although it might be worth it until browser issues are filed and more answers are gathered within that process.

What do you think? Or better ... what's your poison of choice in here?

WebReflection commented 11 months ago

P.S. one thing I haven't tried is to play around with that while loop and see if it indeed introduced unexpected behavior ... I should likely do that anyway before filing any issue whatsoever ...

WebReflection commented 11 months ago

I am going to CC @hoodmane here, because this might either be unlocked by him or be interesting for him too.

the relevant bit is here https://github.com/WebReflection/coincident/issues/26#issuecomment-1753176596

WebReflection commented 11 months ago

the bit I am probably missing here is how to explicitly free SharedArrayBuffers after these are not used anymore, hopefully helping the GC keep up or avoiding this kind of unpredictable situation.

WebReflection commented 11 months ago

one thing I haven't tried is to play around with that while loop and see if it indeed introduced unexpected behavior ... I should likely do that anyway before filing any issue whatsoever ...

I did, and as soon as I log the behavior I can't reproduce the issue anymore, no idea why, but basically a console.log in the while loop fixes the issue ... I really have no idea how to fix this in JS at this point, if not by slowing everything down for everyone.
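
For clarity, this is the kind of change meant here, applied to the waitInterrupt snippet quoted at the top of the issue (illustrative only):

const waitInterrupt = (sb, delay, handler) => {
  while (wait(sb, 0, 0, delay) === 'timed-out') {
    console.log('still waiting'); // merely logging here makes the hang disappear
    handler();
  }
};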

mchilvers commented 11 months ago

Does setTimeout with a timeout of zero produce the consistent results? (i.e. is it a matter of getting something into the event loop, or of the actual time delay?)
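
For reference, the two yielding strategies being compared (a generic sketch, not tied to coincident):

// A macrotask: ends the current task, lets the event loop turn and the GC run.
const macrotaskYield = () => new Promise(resolve => setTimeout(resolve, 0));

// A microtask: runs before the current task ends, so the event loop never turns.
const microtaskYield = () => Promise.resolve();

// In the repro above, only the setTimeout (macrotask) variant gave consistent results.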

e-nikolov commented 11 months ago

I really appreciate your effort! I was going a little crazy trying to figure out what was going on in my main project. I recently updated another dependency (PeerJS) that somehow increased how often this silent deadlock(?) was occurring. It's possible that the latest release of PeerJS improved its performance, which put more pressure on the worker.

The last alternative is that you just don't do that in real-world use cases, though I also struggle to see why you even tried to do that in the first place; I'd be more than happy to learn the reasons behind your greedy loop, if not just to test how Atomics can break in the wild

In my project, I don't have a loop like that, but there are many coroutines in the Python worker that send data to the main thread which in turn forwards it to other hosts via PeerJS, which in effect might be similar to a greedy loop.

My project is a browser port of https://github.com/lschoe/mpyc - a Python framework for Secure Multiparty Computation. It allows a group of parties to jointly compute a function that takes a secret input from each party without the parties learning each other's secrets. The port uses PyScript and PeerJS to run it in the browser and to replace its TCP transport with a WebRTC based transport.

To compute the joint function, MPyC uses an interactive protocol that splits secret inputs and secret intermediate state into shares, distributes the shares among other parties and uses the shares from other parties to continue the computation of the function.

Many of those operations happen in Python coroutines and MPyC does not currently try to limit the number of concurrent coroutines. This might be problematic when it runs in a worker. Initially I thought it made sense to run it in a Worker in order to not lock up the main thread with the computations, but perhaps the communication between the main thread and the worker is a bigger bottleneck. I'll have to investigate.

Another thing I tried earlier was to reduce the wait time of cross-thread functions by making sure they call a handler function via setTimeout() and immediately return. Most such functions in my case don't return anything, so there was nothing to wait for. But somehow this change made the deadlock issue worse, not sure why.
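
For illustration, that fire-and-forget shape might look roughly like this (hypothetical names, not the project's actual code; sendViaPeerJS is an illustrative placeholder):

// Hypothetical sketch of the "schedule and return immediately" pattern.
proxy.forwardShare = (peerId, payload) => {
  // Defer the real work so the cross-thread call returns as soon as possible.
  setTimeout(() => sendViaPeerJS(peerId, payload), 0);
  // No return value, so (in theory) there is nothing for the worker to wait for.
};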

WebReflection commented 11 months ago

@mchilvers

Does setTimeout with a timeout of zero produce the consistent results?

anything that lets the GC "breathe" produces consistent results in my case ... a console.log would do too, and I am not sure why that would even be ... this looks like a very internal thing around Atomics, and I think tomorrow I'll try my best to reproduce the issue without using this library, because if I can do that, it means all browsers have bugs around these primitives, which are indeed not so common out there or not so tested, but if I can't, I'll be back to my roots trying to figure out what it is that this library is doing wrong.

So far my conclusions:

This is both super interesting, as a bug, and super annoying, for unimaginable edge cases like the one shown in here.

I do expect no real-world consequences for real-world scenarios, but I also do expect predictable and expected results out of the Web standards we're building our projects on.

e-nikolov commented 11 months ago

OK, "here be dragons" ... if I delay by a setTimeout (note not queueMicrotask and not Promise.resolve()) the loop every XXX iterations (where XXX is 100 of them here) I get consistent results ... would you please test that's the case for you too? Bear in mind the test takes much longer this way ...

I tried with your snippet

console.log('worker.js');

import coincident from 'coincident';
const proxy = coincident(self);

(async () => {
    for (let i = 0; i < 100000; i++) {
        if (i % 100) await new Promise($ => setTimeout($, 0));
        proxy.func();
    }
})(); 

It still fails at some point for me, but it takes a lot longer.

WebReflection commented 11 months ago

It still fails at some point for me, but it takes a lot longer.

I should've mentioned: the caveat of setTimeout is that you gotta stay on the page to look at the results, because if you leave the page, the browser might pause it and resume it too late, when everything else is already broken due to that timeout waiting.

So I am going to ask: did you get bored waiting to see all results in the console without ever leaving that page and without letting your laptop sleep while waiting for that 100000 goal? ... I did, which is why I've said testing will take much longer.

This, though, still shows that the issue is not the code, but a browser-internal thing ... I can't catch any different error or result from those Atomics waits, but I am not excluding that I am doing something really wrong there ... maybe I should try comlink and see if results there are different, as that's the only "competitor" I know out there.

e-nikolov commented 11 months ago

So I am going to ask: did you get bored waiting to see all results in the console without ever leaving that page and without letting your laptop sleep while waiting for that 100000 goal? ... I did, which is why I've said testing will take much longer.

Yeah, I am waiting on the page, sometimes even clicking on it. I ran it 20 times and only saw it reach 100 000 2-3 times on Chrome-based browsers. On Firefox it seems to complete most of the time, but it's ~3 times slower than Chrome.

hoodmane commented 11 months ago

the bit I am probably missing here is how to explicitly free SharedArrayBuffers after these are not used anymore

The V8 people I've talked to think creating a SharedArrayBuffer is expensive. I think the correct thing to do here is to use a freelist. For the four-byte I32_BYTES you can have one free list. The result buffers are a bit trickier because they can have various sizes, but you can start with, say, 256 bytes and then size the buffer as the first power of 2 that fits:

const freeLists = Array.from({length: 24}, () => []);

function allocateBuffer (spaceNeeded) {
    // Bucket by the first power of two (starting at 256 bytes) that fits.
    const powerOfTwoNeeded = Math.max(24 - Math.clz32(spaceNeeded - 1), 0);
    let buffer = freeLists[powerOfTwoNeeded].pop();
    if (buffer) {
        return buffer;
    }
    // Freelist was empty, create new buffer
    return new Int32Array(new SharedArrayBuffer(256 << powerOfTwoNeeded));
}

function freeBuffer (buffer) {
    // Return the buffer to the bucket it was allocated from.
    const bucket = 24 - Math.clz32(buffer.byteLength - 1);
    freeLists[bucket].push(buffer);
}

hoodmane commented 11 months ago

Though if you only need one buffer at a time, this is overkill: just make a single buffer and double the size whenever more space is needed.
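
A minimal sketch of that simpler single-buffer approach (illustrative only):

// One shared buffer, doubled whenever the requested size no longer fits.
let shared = new Int32Array(new SharedArrayBuffer(256));

function ensureCapacity (bytesNeeded) {
    let byteLength = shared.buffer.byteLength;
    if (byteLength < bytesNeeded) {
        while (byteLength < bytesNeeded) byteLength *= 2;
        shared = new Int32Array(new SharedArrayBuffer(byteLength));
    }
    return shared;
}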

WebReflection commented 11 months ago

@hoodmane thanks ... it is a one-off operation each time, I can recycle the shared buffer that waits just for the length of the result though, that would save some SAB creation (if I am not doing that already) ... I'll check and see if anything improves.

WebReflection commented 11 months ago

@e-nikolov I think I've found the culprit in this MR https://github.com/WebReflection/coincident/pull/27 but please confirm that if you return a value from the remote callback you currently never have issues ... example:

  <script type="module">
    console.log('main');
    import coincident from '../../es.js';
    const proxy = coincident(new Worker('./worker.js', {type: 'module'}));
    let func_calls = 0;
    proxy.func = () => {
      func_calls++
      if (func_calls % 1000 == 0) {
        console.log(`func: ${func_calls}`);
      }
      // HERE
      return func_calls;
    }
  </script>

I think there's some internal behavior that's hard to reason about or reproduce, but basically it looks like passing a SharedArrayBuffer and notifying it at index 0 might in some cases fail the communication channel if that index was not explicitly set or initialized before such notification.

It looks like the MR found the right workaround that would make even callbacks that return undefined absolutely consistent and reliable.
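
In plain Atomics terms, the workaround amounts to always storing a value at the waited index before notifying, so the waiter never observes an uninitialized slot (a stripped-down sketch of the pattern, not the actual patch):

// Both sides share the same SharedArrayBuffer.
const sab = new SharedArrayBuffer(Int32Array.BYTES_PER_ELEMENT);
const i32 = new Int32Array(sab);

// Worker side: block until index 0 is no longer 0.
// Atomics.wait(i32, 0, 0);

// Main side: store a value *before* notifying, so the waiter never races
// against a slot that was never explicitly written.
// Atomics.store(i32, 0, 1);
// Atomics.notify(i32, 0);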

/cc @hoodmane

WebReflection commented 11 months ago

to whom it may concern, it looks like there's some branching logic at point 19 of the specification https://tc39.es/ecma262/multipage/structured-data.html#sec-dowait and I think that might be the one causing issues, but I am not really planning to re-compile V8, JSC and SpiderMonkey to validate my suspicion.

e-nikolov commented 11 months ago

@e-nikolov I think I've found the culprit in this MR https://github.com/WebReflection/coincident/pull/27 but please confirm that if you return a value from the remote callback you currently never have issues ... example:

Indeed, if I add an explicit return it doesn't lock up anymore in the minimal reproduction. I'll check if it also solves it for the bigger project or if there is another issue there.

WebReflection commented 11 months ago

I've filed an issue: I could create a minimal viable example based on primitives only, and I can reproduce the issue even there, or confirm that by just setting some value the issue doesn't exist anymore:

https://bugs.chromium.org/p/chromium/issues/detail?id=1491342

WebReflection commented 11 months ago

Firefox issue here: https://bugzilla.mozilla.org/show_bug.cgi?id=1858092

WebReflection commented 11 months ago

Last, but not least, WebKit (Safari) bug landed: https://bugs.webkit.org/show_bug.cgi?id=262948

WebReflection commented 11 months ago

@e-nikolov FYI the latest PyScript should have fixed the problem out of the npm module ... please let me know if, even without returning a value, you have any issue.

At the technical level, not returning any value means 1 roundtrip instead of 2, so expect a performance boost whenever a sync callback does not actually return anything; returning a value instead takes double the time plus the serialization needed to pass values across realms, so expect a (hopefully mild) performance penalty there.

WebReflection commented 11 months ago

After the ugliest exchange one could ever expect from Firefox maintainers, there is an answer that seems both reasonable and to work reliably across all major browsers: https://bugzilla.mozilla.org/show_bug.cgi?id=1858092#c7

/cc @hoodmane the TL;DR is to double the buffer size when using extra arguments of Atomics.wait(one, two, ...extra) so that the third argument can be used to control the state of the operation, while the first index can be used to read the result.

This effectively avoids ugly checks around the length of an answer, where 0 is perfectly valid, instead of the -1 "wonder/hack", and 8 bytes instead of just 4 don't look like the end of the world for the initial roundtrip.
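
Roughly, the two-slot layout looks like this (a sketch of the idea, not the exact implementation; resultLength stands in for whatever length is being sent):

// 8 bytes: index 0 carries the result length, index 1 carries the "ready"
// state that the worker actually waits on.
const sab = new SharedArrayBuffer(2 * Int32Array.BYTES_PER_ELEMENT);
const i32 = new Int32Array(sab);
const NOT_READY = 0;

// Worker: block on the state slot, then read the length (0 is now valid).
// Atomics.wait(i32, 1, NOT_READY);
// const length = Atomics.load(i32, 0);

// Main: write the length first, then flip the state slot and notify it.
// Atomics.store(i32, 0, resultLength);
// Atomics.store(i32, 1, 1);
// Atomics.notify(i32, 1);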

hoodmane commented 11 months ago

I guess that's what I did in synclink (https://github.com/pyodide/synclink/blob/main/src/sync_task.ts#L59): the size buffer is 8 bytes, 4 bytes for the actual size and 4 bytes to indicate readiness.

WebReflection commented 11 months ago

@hoodmane then it was me taking your demo on interrupts too lightly: you were doing the right thing, while I was using the wait without extra arguments, so once the delay and the control value were added it started randomly failing.

Good thing it's solved now in the best possible way, thanks.