the while loop will always keep going until the main thread has finished, that's the whole goal of that loop, it allows interrupts (checks) in the process ... now, I haven't checked your code (yet?) but are you passing a custom interrupt?
in short, that is expected to keep timing out if your callback on the main takes a long time.
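To illustrate the idea (a sketch of the general pattern, not this library's actual source; the interrupt callback is just a placeholder name):

```js
// sketch only: a worker-side wait loop that retries Atomics.wait with a
// timeout so a custom interrupt check can run between attempts while the
// main thread is still busy producing the answer
const waitWithInterrupt = (i32, index, expected, interrupt, timeout = 1000) => {
  while (Atomics.wait(i32, index, expected, timeout) === 'timed-out') {
    // the main thread has not answered yet: give the interrupt a chance
    if (interrupt) interrupt();
  }
};
```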
About your demo, you seem to use a non-async function that invokes the worker from the main ... you can do that, but you better use async and await when a worker function is meant to be called ... have you tried that?
About a minimal reproducible example: it would be great to have something that uses just this module; it's JS based (you don't seem to use TS much anyway in your files) and I can reproduce that in here. Otherwise I can't test a project that depends on this project in here, so it becomes a bit cumbersome to help you out there.
P.S. forgot to mention, the deadlock warning happens after your worker has been waiting for the main for a second ... but I am not even sure I am able to reproduce your issue, also it's not mentioned which browser does that ... Firefox branches out as an example for waitAsync, as it's half broken there (by not exposing it).
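For what it's worth, a quick check like the following (assuming nothing beyond the standard API) tells you whether the current browser exposes it at all:

```js
// rough feature check: if Atomics.waitAsync is not exposed (as in Firefox at
// the time), anything relying on it needs a fallback path
const hasWaitAsync = typeof Atomics.waitAsync === 'function';
```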
had a quick look and still not sure what I am looking at or where the issue is ... could you please tell me just one single failing expectation and what you are doing there?
example: sync / proxy a function that returns to that worker.

With a step-by-step reproduction I am sure I can both work on this and put an integration test to avoid further regressions, thanks!
the while loop will always keep going until the main thread has finished, that's the whole goal of that loop, it allows interrupts (checks) in the process ... now, I haven't checked your code (yet?) but are you passing a custom interrupt?
I don't have a custom interrupt.
in short, that is expected to keep timing out if your callback on the main takes long time.
In the demos here the main does not do much, but there are many calls from the Python side.
About your demo, you seem to use a non async function that invokes the worker from the main ... you can do that, but you better use async and await when a worker function is meant to be called ... have you tried that?
I tried both async and no async, the behavior was the same.
I tried Firefox, Chrome, Vivaldi, and Edge, and they all have the same issue for me.
if the main doesn't do much, you should not see any deadlock warning ... if the main keeps resolving ASAP without waiting, I expect a queue of callbacks to be added to the stack which might sum up to more than a second, hence you see the deadlock warning ... the deadlock warning doesn't even mean "there is a deadlock", it means something ain't likely right ... but overall you can ignore that warning if the code is meant to be stuck for a second on the main ... the other way around (main calling a worker utility) is never gonna show that warning because long tasks in the worker are expected, long tasks from the main are "shenanigans" or kinda unexpected, as workers exist to help off-loading long tasks elsewhere, not to make the main stuck for a second or more ... do you see that warning at all? I still would love to understand the basic reproduction steps without PyScript, Polyscript, or anything else in between, and clear steps of what's going on and where or how it fails, thanks!
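Just to illustrate what I mean (the handler below is made up, it is not from your demo):

```js
// illustrative only: a main-thread handler that keeps the worker's sync call
// waiting for more than a second will trigger the "possible deadlock" warning
// even though nothing is actually deadlocked
proxy.slowFunc = () => {
  const start = Date.now();
  while (Date.now() - start < 1500); // busy-block the main thread for ~1.5s
};
```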
When I open this demo - https://e-nikolov.github.io/coincident-repro/demo.html
What I expect to see is the following output in the dev tools console:
main-demo.ts
worker-demo.py
js_func: 1000
js_func: 2000
...
js_func: 100000
What I see is that at some point it stops printing anything, indicating that the worker is stuck. Additionally, any function call I make to the python worker is pending forever.
For example if I type await call_py_func() in the Console, I see the output
call_py_func: 1
call_py_func: 2
When I call that function from the passive worker demo I get the correct output:
call_py_func: 1
call_py_func: 2
true
let me try to rephrase ... your demos have a lot of indirections that are just noise for the purpose of this bug, and await call_py_func() makes little sense if that function is not async ... can you please stick with a single demo logic and expectations so that I can reproduce in here what is failing?
I really need a plain English explanation of how to reproduce it; I am not asking you to show me live code or files, I need a step-by-step description to reproduce this, thanks for your understanding.
1 demo, 1 list of things you do on main, things you do on worker, and your expectations.
No problem, I'll do my best. In the meantime, can you share what you see in the dev console when you open this demo: https://e-nikolov.github.io/coincident-repro/demo.html
P.S. I don't know what passive worker means either, but that's why I am asking to please provide a step-by-step reproduction of the issue, so that I can remove terminology, Vite, TypeScript, and your server from the things that matter in here, and focus on the concrete issue within this project, not through remote live examples ... once again, thanks for filing this, I do want stuff to never end up in an unexpected deadlock (I want to warn if I can anyhow understand what might happen and suggest how to fix it) so that we all benefit from the change upstream :wink:
can you share what you see in the dev console when you open this demo
I see this and nothing else ... the resource monitor of my PC also tells me there's nothing wrong, nothing stuck, not a thread blocked, and so on.
I see, the demo prints once every 1000 calls because otherwise there was too much output, but I guess in your case it dies before reaching that so it never prints anything. I'll adjust it.
please don't ... I need to understand what you are doing there, not another live demo I can't use in here due to the amount of setup needed to reproduce it!
OK, after a hard refresh (SW can be troubling) I see this ... is that where it dies?
To summarize what the demo does:

Main Thread: defines a js_func() that counts how many times it was called and logs the count every 1000 calls.

Worker Thread: calls the main thread's js_func() 100 000 times in a loop.
What I expect to see in the dev console is the output from js_func() being called 100 000 times:
js_func: 1000
js_func: 2000
...
js_func: 100000
What I actually see is the output for the first couple of thousand calls and then nothing.
Thank You ... so ... if I got this correctly, this is the index file:
```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <script type="module">
      console.log('main');
      import coincident from '../../es.js';
      const proxy = coincident(new Worker('./worker.js', {type: 'module'}));
      let func_calls = 0;
      proxy.func = () => {
        func_calls++;
        if (func_calls % 1000 == 0) {
          console.log(`func: ${func_calls}`);
        }
      };
    </script>
  </head>
</html>
```
and this is the worker file:
```js
console.log('worker');
import coincident from '../../es.js';
const proxy = coincident(self);

for (let i = 0; i < 100000; i++)
  proxy.func();
```
I have witnessed the count up to 100000, but indeed after a refresh I have intermittent results ... I can finally dig into this issue!
OK, "here be dragons" ... if I delay by a setTimeout
(note not queueMicrotask
and not Promise.resolve()
) the loop every XXX iterations (where XXX is 100 of them here) I get consistent results ... would you please test that's the case for you too? Bear in mind the test takes much longer this way ...
```js
console.log('worker.js');
import coincident from '../../es.js';
const proxy = coincident(self);

(async () => {
  for (let i = 0; i < 100000; i++) {
    if (i % 100) await new Promise($ => setTimeout($, 0));
    proxy.func();
  }
})();
```
the issue seems to be related to the Garbage Collector where the heap grows exponentially without having any chance, within the threshold you chose for a loop, to collect enough so it kinda bails out out of the blue.
Please note this is not even a memory leak as the heap stays OK after 80K iterations
after this investigation I have very few options:

1. file issues to the browser vendors about this behavior
2. use a setTimeout behind the scene for every Atomics iteration to let the GC breathe between invokes and minimize surprises

The second point means everything will instantly become less snappy, less real-world use cases driven, but maybe (as the underlying issue is not with this module) more predictable or robust; the first point means me "wasting" a lot of time for all the issues I am going to open, including eventually removing this library completely from the equation and creating a fine-tuned reproducible issue based on JS only ...
The last alternative is that you just don't do that in real-world use cases; I also struggle to see why you even tried to do that in the first place, but I'd be more than happy to learn the reasons around your greedy loop, if not just to test how Atomics can break in the wild ... it's also very possible it's this library doing something in between that makes browsers fail, but as I can consistently reproduce success just by adding a setTimeout delay every X calls, I am pretty sure the issue is not in here, rather in browser vendors' use cases and tests ...
That last thing being said, this library might as well try to avoid too-greedy operations out of the box, but I'd feel a little dirty patching an underlying issue just for the sake of it, although it might be worth it until browser issues are filed and more answers are gathered within that process.
What do you think? Or better ... what's your poison of choice in here?
P.S. one thing I haven't tried, is to play around that while loop and see if indeed it introduced an unexpected behavior ... I should likely do that anyway before filing any issue whatsoever ...
I am going to put @hoodmane as CC here, because this might be either unlocked from him or interesting for him too.
the relevant bit is here https://github.com/WebReflection/coincident/issues/26#issuecomment-1753176596
the bit I am probably missing here is how to explicitly free a SharedArrayBuffer after these are not used anymore, hopefully helping the GC keep up or avoiding this kind of unpredictable situation.
one thing I haven't tried, is to play around that while loop and see if indeed it introduced an unexpected behavior ... I should likely do that anyway before filing any issue whatsoever ...
I did, and as soon as I log the behavior I can't reproduce the issue anymore, no idea why, but basically a console.log in the while loop fixes the issue ... I really have no idea how to fix this in JS at this point if not by slowing everything else down for everything.
Does setTimeout with a timeout of zero produce the consistent results? (so is it a matter of getting something into the event loop or the actual time delay).
I really appreciate your effort! I was going a little crazy trying to figure out what was going on in my main project. I recently updated another dependency (PeerJS) that somehow increased how much this silent deadlock(?) was occurring. It's possible that the latest release of PeerJS improved its performance which put more pressure on the worker.
The last alternative, is that you just don't do that on real-world use cases, which I also struggle to see why you even tried to do that in the first place, but I'd be more than happy to learn reasons around your greedy loop if not just to test how Atomics can break in the wild
In my project, I don't have a loop like that, but there are many coroutines in the Python worker that send data to the main thread which in turn forwards it to other hosts via PeerJS, which in effect might be similar to a greedy loop.
My project is a browser port of https://github.com/lschoe/mpyc - a python framework for Secure Multiparty Computations. It allows a group of parties to jointly compute a function that takes a secret input from each party without the parties learning each other's secrets. The port uses PyScript and PeerJS to run it in the browser and to replace its TCP transport with a WebRTC based transport.
To compute the joint function, MPyC uses an interactive protocol that splits secret inputs and secret intermediate state into shares, distributes the shares among other parties and uses the shares from other parties to continue the computation of the function.
Many of those operations happen in Python coroutines and MPyC does not currently try to limit the number of concurrent coroutines. This might be problematic when it runs in a worker. Initially I thought it made sense to run it in a Worker in order to not lock up the main thread with the computations, but perhaps the communication between the main thread and the worker is a bigger bottleneck. I'll have to investigate.
Another thing I tried earlier was to reduce the wait time of cross-thread functions by making sure they call a handler function via setTimeout() and return immediately. Most such functions in my case don't return anything, so there was nothing to wait for. But somehow this change made the deadlock issue worse; I'm not sure why.
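Roughly, that change looked like this (the names here are illustrative, not my actual code):

```js
// illustrative fire-and-forget variant: the exposed function only schedules
// the real work and returns nothing, so the worker has nothing to wait for
proxy.send_to_peer = (data) => {
  setTimeout(() => forwardOverPeerJS(data), 0); // hypothetical handler
};
```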
@mchilvers
Does setTimeout with a timeout of zero produce the consistent results?
anything that lets the GC "breathe" produces consistent results in my case ... a console.log would do too, and I am not sure why that would even be ... this looks like a very internal thing around Atomics, and I think tomorrow I'll try my best to reproduce the issue without using this library, because if I can do that, it means all browsers have bugs around these primitives that are indeed not so common out there or not so tested, but if it doesn't, I'd be back to my roots trying to figure out what it is that this library is doing wrong.
So far my conclusions: the GC somehow plays a role in the postMessage / Atomics dance behind the scene, otherwise I'd always see the same failure. This is both super interesting, as a bug, and super annoying, for unimaginable edge cases like the one shown in here.
I do expect no real-world consequences for real-world scenarios but I also do expect predictable and expected results out of the Web standards we're building on our projects.
OK, "here be dragons" ... if I delay by a setTimeout (note not queueMicrotask and not Promise.resolve()) the loop every XXX iterations (where XXX is 100 of them here) I get consistent results ... would you please test that's the case for you too? Bear in mind the test takes much longer this way ...
I tried with your snippet
```js
console.log('worker.js');
import coincident from 'coincident';
const proxy = coincident(self);

(async () => {
  for (let i = 0; i < 100000; i++) {
    if (i % 100) await new Promise($ => setTimeout($, 0));
    proxy.func();
  }
})();
```
It still fails at some point for me, but it takes a lot longer.
It still fails at some point for me, but it takes a lot longer.
I should've mentioned: a caveat of setTimeout is that you gotta stay on the page to look at the results, because if you leave the page the browser might pause it and resume it too late, when everything else is already broken due to that timeout waiting.
So I am going to ask: did you get bored to see all results in console without ever leaving that page and without letting your laptop sleep while waiting for that 100000 goal? ... I did, which is why I've said testing will take much longer.
This, though, still shows that the issue is not in this code, but a browser internal thing ... I can't catch any different error or result from those awaits in Atomics, but I am not excluding that I am doing something really wrong there ... maybe I should try comlink and see if results there are different, as that's the only "competitor" I know out there.
So I am going to ask: did you get bored to see all results in console without ever leaving that page and without letting your laptop sleep while waiting for that 100000 goal? ... I did, which is why I've said testing will take much longer.
Yeah, I am waiting on the page, sometimes even clicking on it. I ran it 20 times and only saw it reach 100 000 two or three times on Chrome-based browsers. On Firefox it seems to complete most of the time, but it's ~3 times slower than Chrome.
the bit I am probably missing here, is how to explicitly free SharedArrayBuffer after these are not used anymore
The v8 people I've talked to think creating a SharedArrayBuffer is expensive. I think the correct thing to do here is to use a freelist. For the four-byte I32_BYTES you can have one free list. The result buffers are a bit trickier because they can have various sizes, but you can start with say 256 bytes and then size the buffer as the first power of 2 that fits:
```js
// one free list per power-of-two bucket, starting at 256 bytes
const freeLists = Array.from({length: 24}, () => []);

function allocateBuffer(spaceNeeded) {
  // smallest power of two (times 256 bytes) that fits the requested space
  const powerOfTwoNeeded = Math.max(24 - Math.clz32(spaceNeeded - 1), 0);
  const buffer = freeLists[powerOfTwoNeeded].pop();
  if (buffer) {
    return buffer;
  }
  // Freelist was empty, create new buffer
  return new Int32Array(new SharedArrayBuffer(256 << powerOfTwoNeeded));
}

function freeBuffer(buffer) {
  // put the buffer back into the bucket it was allocated from
  const bucket = Math.max(24 - Math.clz32(buffer.byteLength - 1), 0);
  freeLists[bucket].push(buffer);
}
```
Though if you only need one buffer at a time, this is overkill: just make a single buffer and double the size whenever more space is needed.
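Something along these lines (a sketch, untested):

```js
// sketch of the single-buffer alternative: keep one SharedArrayBuffer around
// and only replace it, doubling its size, when more space is needed
let shared = new Int32Array(new SharedArrayBuffer(256));

function ensureCapacity(bytesNeeded) {
  let size = shared.byteLength;
  while (size < bytesNeeded) size *= 2;
  if (size !== shared.byteLength) {
    shared = new Int32Array(new SharedArrayBuffer(size));
  }
  return shared;
}
```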
@hoodmane thanks ... it is a one-off operation each time, I can recycle the shared buffer that waits just for the length of the result though, that would save some SAB creation (if I am not doing that already) ... I'll check and see if anything improves.
@e-nikolov I think I've found the culprit in this MR https://github.com/WebReflection/coincident/pull/27 but please confirm me that if you return a value from the remote callback you currently never have issues ... example:
```html
<script type="module">
  console.log('main');
  import coincident from '../../es.js';
  const proxy = coincident(new Worker('./worker.js', {type: 'module'}));
  let func_calls = 0;
  proxy.func = () => {
    func_calls++;
    if (func_calls % 1000 == 0) {
      console.log(`func: ${func_calls}`);
    }
    // HERE
    return func_calls;
  };
</script>
```
I think there's some internal behavior that's hard to reason about or reproduce, but basically it looks like passing a SharedArrayBuffer and notifying it at index 0 might in some cases fail the communication channel if that index was not explicitly set or initialized before such notification.
It looks like the MR found the right workaround that would make even callbacks that return undefined absolutely consistent and reliable.
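In other words (a sketch of the idea, not the exact patch):

```js
// always store a value at the index that is about to be notified, instead of
// relying on the buffer's implicit zero initialization, before waking the
// waiting worker
const i32 = new Int32Array(new SharedArrayBuffer(4)); // shared answer slot
const answer = (length) => {
  Atomics.store(i32, 0, length); // explicit write, even when length is 0
  Atomics.notify(i32, 0);        // wake whoever is waiting on index 0
};
```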
/cc @hoodmane
to whom it might concern, it looks like there's some branching logic at point 19 of the specification https://tc39.es/ecma262/multipage/structured-data.html#sec-dowait and I think that might be the one causing issues, but I am not really planning to re-compile v8, jsc and spidermonkey to validate my suspicion.
@e-nikolov I think I've found the culprit in this MR https://github.com/WebReflection/coincident/pull/27 but please confirm me that if you return a value from the remote callback you currently never have issues ... example:
Indeed, if I add an explicit return it doesn't lock up anymore in the minimal reproduction. I'll check if it also solves it for the bigger project or if there is another issue there.
I've filed an issue: I could create a minimal viable example based on primitives only, and I can reproduce the issue even there, or confirm that by just setting some value the issue doesn't exist anymore:
https://bugs.chromium.org/p/chromium/issues/detail?id=1491342
Firefox issue here: https://bugzilla.mozilla.org/show_bug.cgi?id=1858092
Last, but not least, WebKit (Safari) bug landed: https://bugs.webkit.org/show_bug.cgi?id=262948
@e-nikolov FYI the latest PyScript should have the problem fixed out of the npm module ... please let me know if even without returning a value you have any issue.

At the technical level, not returning any value means 1 roundtrip instead of 2, so expect a performance boost when a sync callback does not actually return anything; callbacks that do return a value still need double the time plus the serialization to pass values cross realms, so expect a (hopefully mild) performance penalty there.
After the ugliest exchange ever one could expect from Firefox maintainers, there is an answer that seems to be both reasonable and working reliably for all major browsers: https://bugzilla.mozilla.org/show_bug.cgi?id=1858092#c7
/cc @hoodmane the TL;DR is to double the buffer size when using the extra arguments of Atomics.wait(one, two, ...extra), so that the third argument can be used to control the state of the operation, while the first index can be used to read the result. This effectively avoids ugly checks around the length of an answer, where 0 is perfectly valid, instead of the -1 "wonder/hack", and 8 bytes instead of just 4 don't look like the end of the world for the initial roundtrip.
I guess that's what I did in synclink: https://github.com/pyodide/synclink/blob/main/src/sync_task.ts#L59 The size buffer is 8 bytes: 4 bytes for the actual size, 4 bytes to indicate readiness.
@hoodmane then it was me taking your demo on interrupts too lightly, as you were doing the right thing while I was using the wait without extra arguments, so once the delay and the control value were added it started randomly failing.
Good thing it's solved now in the best possible way, thanks.
While working on a pyscript app (source) I noticed that the worker would randomly stop responding after a while depending on the volume of messages being sent from the worker to the main thread. Sometimes it's after a couple of minutes, other times within seconds of starting.
When the worker thread is stuck, if I pause its execution with the Chrome Dev Tools, it is always on this line of coincident, and the wait function always returns 'timed-out'.
I saw this earlier issue https://github.com/WebReflection/coincident/issues/18, which seems related, but in my case, it happens even if the main thread does not call any worker functions.
I think I have a minimal reproduction here: https://github.com/e-nikolov/coincident-repro
There are three scenarios that only print to the dev console to keep things simple:
In the first two demos the worker gets stuck pretty quickly. In the second demo I see the warning message about a possible deadlock from https://github.com/WebReflection/coincident/issues/18 even though I don't have any py -> js -> py calls like in that case.