Threads.expose is called to late - "Did not receive an init message from worker after 10000ms" on clean 1.8.16 install

anonymous-14523 commented 2 years ago

Clean install of etherpad-lite 1.8.16 fails to run on alpine edge in an lxc container on armhf architecture. The same command chain on a x86_64 succeeds (sic!).

I'm quite lost where to look for a solution to this problem, especially that the SBC (banana pro) that etherpad fails to start on is slow.

What I see on the first request from any web browser is:

[2022-01-26 12:38:09.239] [INFO] server - Etherpad is running
[2022-01-26 12:38:35.968] [ERROR] console - Error: Timeout: Did not receive an init message from worker after 10000ms. Make sure the worker calls expose().
    at Timeout._onTimeout (/var/etherpad-lite-1.8.16/src/node_modules/threads/dist/master/spawn.js:35:53)
    at listOnTimeout (node:internal/timers:557:17)
    at processTimers (node:internal/timers:500:7)

The error repeats multiple times at once. The error appears almost immediately after entering container IP in a web browser, not after the 10 seconds that the message mentions.

Node.js versions tested: 16.13.2 and 17.4.0

To Reproduce

lxc-create -t download -n ep
1. Distribution: alpine
2. Release: edge
3. Architecture: armhf
lxc-start ep && lxc-attach ep
apk update
apk add nodejs npm build-base python3 sqlite # extra packages are needed due to ep dependencies
wget https://github.com/ether/etherpad-lite/archive/1.8.16.zip
unzip 1.8.16.zip
cd etherpad-lite-1.8.16
cp settings.json.template settings.json
vi settings.json # added an admin user and changed the port to 80
./bin/run.sh --root

Server

Etherpad version: 1.8.16 (I hit the problem first with current development branch, then tried a clean install once all else failed)
OS: alpine in lxc (run in debian testing, but I doubt this matters)
Hardware: armhf, sun7i-a20-bananapro
Node.js version: v16.13.2 (upgrading to 17.4.0 did not help either)
npm version: 8.3.0
Is the server free of plugins: yes

Additional For some reason installing dependencies compiles sqlite3 (for node-pre-gyp fails to find precompiled binaries), and so installing dependencies takes way too long.

I ended up hitting this issue on an upgrade from etherpad-lite1.8.0-beta.1 running on node v12.13.1 in another alpine container on the same machine that now fails to run current etherpad-lite & node. Simple upgrade failed with the same results, so I tried a clean install and it fails the same way…

deurbroucq commented 2 years ago

I am getting the same error on Rocky Linux 8.5 with Etherpad version 1.8.16 and Node v14.17.5.

rhansen commented 2 years ago

I ended up hitting this issue on an upgrade from etherpad-lite1.8.0-beta.1 running on node v12.13.1 in another alpine container on the same machine

@anonymous-14523 To confirm I understand correctly:

Etherpad 1.8.0-beta.1 on Node.js v12.13.1 works on armhf.
Etherpad 1.8.16 on Node.js v16.13.2 (or v17.4.0) does not work on armhf.

Is that correct? If so, can you try bisecting to identify the commit causing it to fail?

I am getting the same error on Rocky Linux 8.5 with Etherpad version 1.8.16 and Node v14.17.5.

@deurbroucq Also you also running on armhf hardware?

rhansen commented 2 years ago

The error appears almost immediately after entering container IP in a web browser, not after the 10 seconds that the message mentions.

I wonder if the worker code is throwing an error before it calls expose(). On the armhf machine, can you try running the following:

$ (cd src && node)
Welcome to Node.js 16.13.2.
Type ".help" for more information.
> require('./node/utils/MinifyWorker')

You should see an exception like this:

Uncaught Error: expose() called in the master thread.
    at Object.expose (/home/rhansen/floss/etherpad-lite/src/node_modules/threads/dist/worker/index.js:146:15)
    at Object.<anonymous> (/home/rhansen/floss/etherpad-lite/src/node/utils/MinifyWorker.js:30:9)
    at Module._compile (internal/modules/cjs/loader.js:1085:14)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
    at Module.load (internal/modules/cjs/loader.js:950:32)
    at Function.Module._load (internal/modules/cjs/loader.js:790:12)
    at Module.require (internal/modules/cjs/loader.js:974:19)
    at require (internal/modules/cjs/helpers.js:93:18)

If you see some other exception, then the timeout error message is probably a red herring.

rhansen commented 2 years ago

A possible workaround is to set the following in your settings.json:

  "minify": false,

anonymous-14523 commented 2 years ago

(you can skip to past the last <hr>)

To confirm I understand correctly:

Etherpad 1.8.0-beta.1 on Node.js v12.13.1 works on armhf. Etherpad 1.8.16 on Node.js v16.13.2 (or v17.4.0) does not work on armhf.

Yes.

Is that correct? If so, can you try bisecting to identify the commit causing it to fail?

Hm... this is theoretically possible, for I tested right now that 1.8.0-beta.1 still works with updated node (v16.13.2).

The practical problem that refrains me from this is that new node (1.8.16) needs something that needs sqlite3. Now the problem is that node-pre-gyp finds no pre-build binaries for sqlite3 and compiles it itself, and so installing dependencies takes half an hour (that arm is old and sqlite3 alone compiles about 3 minutes on my desktop…).

> require('./node/utils/MinifyWorker')

This seems fine, I get similar error (line numbers slightly differ, though):

Welcome to Node.js v16.13.2.
Type ".help" for more information.
require('./node/utils/MinifyWorker')
Uncaught Error: expose() called in the master thread.
at Object.expose (/var/etherpad-lite/src/node_modules/threads/dist/worker/index.js:146:15)
at Object.<anonymous> (/var/etherpad-lite/src/node/utils/MinifyWorker.js:30:9)
at Module._compile (node:internal/modules/cjs/loader:1101:14)
at Object.Module._extensions..js (node:internal/modules/cjs/loader:1153:10)
at Module.load (node:internal/modules/cjs/loader:981:32)
at Function.Module._load (node:internal/modules/cjs/loader:822:12)
at Module.require (node:internal/modules/cjs/loader:1005:19)
at require (node:internal/modules/cjs/helpers:102:18)

A possible workaround is to set the following in your settings.json:
  "minify": false,

Wow! This worked! :+1:

Also: this significantly speeds up loading time: tell me, does etherpad cache minifed js, or minifies it frequently?

(I'm fine with this workround, so the issue might be closed.)

deurbroucq commented 2 years ago

@rhansen No, I am not running on armhf. I am running on x86_64. However, the minify workaround in settings.json worked for me too! Thank you!

rhansen commented 2 years ago

Also: this significantly speeds up loading time:

Do you mean the time it takes the server to log "Etherpad is running", or the time it takes a browser to finish loading a pad?

does etherpad cache minifed js, or minifies it frequently?

The results are cached, so pad load time should only be poor the first time any user visits any pad.

No, I am not running on armhf. I am running on x86_64.

Hmm, that's unexpected. Is your machine slow? I wonder if both of you are seeing genuine timeouts due to machine slowness. Would both of you try setting the THREADS_WORKER_INIT_TIMEOUT environment varible to something like 60000 (60 seconds) to see if it's just being slow? For example:

THREADS_WORKER_INIT_TIMEOUT=60000 ./src/bin/fastRun.sh

anonymous-14523 commented 2 years ago

Also: this significantly speeds up loading time:

Do you mean the time it takes the server to log "Etherpad is running", or the time it takes a browser to finish loading a pad?

Time it takes the browser to load the pad (or the admin interface). It's feasible that the delay I saw earlier was the time it took to minify the js, and while annoying, it was below the 10 second timeout.

does etherpad cache minifed js, or minifies it frequently?

The results are cached, so pad load time should only be poor the first time a user visits a pad.

That makes sense. Interestingly, with minify set to false, in the var directory I still see lots of fresh minified_… files (in pairs - plain & deflated).

Hmm, that's unexpected. Is your machine slow? I wonder if both of you are seeing genuine timeouts due to machine slowness. Would both of you try setting the THREADS_WORKER_INIT_TIMEOUT environment varible to something like 60000 (60 seconds) to see if it's just being slow? For example:
THREADS_WORKER_INIT_TIMEOUT=60000 ./src/bin/fastRun.sh

Actually, I can answer this one without testing anything. The message appears within half a second from the moment I press return in URL bar of the browser. No chance for 10s timeout to pass.

Misc

I still see some Did not receive an init message from worker after 10000ms when shutting down etherpad-lite – that is, when I hit ctrl+c on the shell where I run bin/run.sh. This does not seem to break anything for subsequent start.

./src/bin/fastRun.sh binary has /bin/bash in shebang, while it needs only /bin/sh – minimalist environments such as those in containers do not necessarily have bash installed, you could consider changing that ;-)

Thanks for helping me out!

webzwo0i commented 2 years ago

I also witness this error. The instance is fast enough to handle >300 users with a lot of parallel edits, but is indeed slow when it comes to i/o (and to some extent also CPU).

When expose isn't called, I cannot find a way to reinitialize the worker thread. So we would need to remove the unexposed threads from the pool and re-add them. Is it worth the effort? Though, I see no way to catch this error.

bisect shows 085bc8cbb3160b392c3f67c69c4851868818b6e2 is the problematic commit, however it's hard to reproduce reliably, even without any plugins installed. Note that commenting await Promise.all(Object.keys(defs.plugins).map((p) => hooks.aCallAll(init_${p}, {}))); won't fix the problem, but adding something like await fs.stat('.'); before await Promise.all... does. So I assume this has something to do with node's event loop. But as said, it's hard to reproduce so it might not be the real cause.

@anonymous-14523 @deurbroucq In case you can reliably trigger this error, can you please test the following: Can you add await fs.stat('.'); right before https://github.com/ether/etherpad-lite/blob/develop/src/static/js/pluginfw/plugins.js#L101 ie don't comment the Promise.all line but only add fs.stat in the line above.

For future reference: The workers are created when pluginUtils.extractHooks handles ep_etherpad-lite/Minify. However, Threads.expose is called not before the expressPreSession hook in src/node/hooks/express/static.js is called, right before the fs calls in getTar. So this again seems to indicate that execution of Threads.expose is scheduled late in the event loop, but before handling outstanding i/o events.

webzwo0i commented 2 years ago

Btw. I witnessed this on v14, the tests above are with v12. So I don't think this is related to node version. Also, everything happend on x86_64.

anonymous-14523 commented 2 years ago

@anonymous-14523 @… In case you can reliably trigger this error, can you please test the following: Can you add await fs.stat('.'); right before https://github.com/ether/etherpad-lite/blob/develop/src/static/js/pluginfw/plugins.js#L101 ie don't comment the Promise.all line but only add fs.stat in the line above.

Sure. I tested again with the change and the problem disappeared in all places where I've noticed it before.

I tested this with with minify=true, but on a version with some plugins. As I mentioned earlier, with minify=false ep runs correctly but still fails to … receive an init message from worker upon closing, at least by hitting ctrl+c. With the await fs.stat('.'); it both runs with minify=true and closes without that … receive … stuff. No changes in node or etherpad versions on my side.

If you're interested in console output, I put here some screen hardcopy (with useless printf-like noise that I added a couple weeks ago when I was trying to see what is going on): https://pastebin.com/RUyw0g0j

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

rhansen commented 2 years ago

Is this still a problem? If so, would you please try the following patch:

diff --git a/src/node/utils/Minify.js b/src/node/utils/Minify.js
index 2e8a2d960..8fc231d01 100644
--- a/src/node/utils/Minify.js
+++ b/src/node/utils/Minify.js
@@ -35,7 +35,7 @@ const logger = log4js.getLogger('Minify');

 const ROOT_DIR = path.join(settings.root, 'src/static/');

-const threadsPool = new Threads.Pool(() => Threads.spawn(new Threads.Worker('./MinifyWorker')), 2);
+let threadsPool;

 const LIBRARY_WHITELIST = [
   'async',
@@ -271,6 +271,9 @@ const requireDefinition = () => `var require = ${RequireKernel.kernelSource};\n`

 const getFileCompressed = async (filename, contentType) => {
   let content = await getFile(filename);
+  if (threadsPool == null) {
+    threadsPool = new Threads.Pool(() => Threads.spawn(new Threads.Worker('./MinifyWorker')), 2);
+  }
   if (!content || !settings.minify) {
     return content;
   } else if (contentType === 'application/javascript') {
@@ -322,5 +325,6 @@ exports.minify = (req, res, next) => minify(req, res).catch((err) => next(err ||
 exports.requestURIs = requestURIs;

 exports.shutdown = async (hookName, context) => {
-  await threadsPool.terminate();
+  if (threadsPool != null) await threadsPool.terminate();
+  threadsPool = null;
 };

anonymous-14523 commented 2 years ago

Is this still a problem? If so, would you please try the following patch:
diff --git a/src/node/utils/Minify.js b/src/node/utils/Minify.js

1) I'm totally happy with the workaround proposed by webzwo0i. With the await fs.stat('.'); added where suggested the case is closed for me.

2) I can tell without testing that your patch won't help. The error occurs not only upon minifying, but also in other places – in my case it repeatedly appears upon shutting down ep as well. So the root cause definitely is not in Minify.js.

rhansen commented 2 years ago

I'm totally happy with the workaround proposed by webzwo0i. With the await fs.stat('.'); added where suggested the case is closed for me.

I'm unlikely to merge that workaround because it's a fragile duct tape fix that masks the real problem. If you don't want to maintain that workaround yourself as new versions of Etherpad are released (including the possibility that a future change may cause that workaround to stop working), then I'd prefer to properly fix the bug.

I can tell without testing that your patch won't help. The error occurs not only upon minifying, but also in other places – in my case it repeatedly appears upon shutting down ep as well. So the root cause definitely is not in Minify.js.

I haven't studied the threads library closely, but I suspect the display of the timeout error is disconnected in time from the error event itself. When shutting down Etherpad, we shut down the thread pool. The act of shutting down the thread pool probably surfaces the error that was actually generated long ago (e.g., it finally awaits a Promise that was rejected long ago). If you can, please try the patch.

anonymous-14523 commented 2 years ago

If you can, please try the patch.

OK, I finally found time for this.

I first tested a clean 1.8.18. I still see the Did not receive… message when I try to connect from a web browser (and ep does not work), but when I hit ctrl+c there is no Did not receive… more.

After applying the patch ep works and minifies js. Note that I did not test anything beyond creating a pad and typing random characters. I still have some time to waste this week, so if you want me to do some extra tests, let me know. I still fail to see how being unable to minify could end up with Did not receive… upon shutdown, but this message is not printed upon closing 1.8.18 anyway, so I cannot tell if the problem is gone in general or only at minify.

The output I got this time: https://pastebin.com/Rf59BfUp (unpatched version starts at line 1, patched at line 155).

baulaergrungim commented 1 year ago

It's been a while so I'll also throw my 2 cents, a clean install on raspberry pi 2 gives that error for me as well. Admin is then practically empty, like almost plain html, no css, plugin list isn't populated, settings tab also shows empty settings. Every time I tried to do anything on admin page, I got that error in the console, and on ctrl+c I also got the same thing:

[2022-09-23 22:07:51.810] [ERROR] server - Error: Timeout: Did not receive an init message from worker after 10000ms. Make sure the worker calls expose().
    at Timeout._onTimeout (/var/www/etherpad-lite/src/node_modules/threads/dist/master/spawn.js:35:53)
    at listOnTimeout (internal/timers.js:557:17)
    at processTimers (internal/timers.js:500:7)

I'm not too versed in fixing stuff etc, but at least this commit is a workaround for me sudo git reset --hard a796811

Upon picking that revision, I don't have issues running Etherpad. I would love to use something a bit more recent, at least, but it's just aesthetics to get rid of this: [WARN] console - Update available: Download the actual version 1.8.17

roughnecks commented 1 year ago

I have the same error running 1.8.18 on the ansible Matrix playbook: https://github.com/spantaleev/matrix-docker-ansible-deploy

Error: Timeout: Did not receive an init message from worker after 10000ms. Make sure the worker calls expose().

andsarr commented 1 year ago

I had the same problem with version 1.8.18. I created a linux/arm64 using Docker Buildx. Using the workaround "minify": false solved the problem.

ichdasich commented 5 months ago

So, let me chip in on this one; Also hit this with an instance running on a spinning-rust backed VM. Etherpad would become unavailable with this error when a user opened a pad after the service was idle for some time; Notably, restarting the service helped and did not see timeouts; I guess this hit when a user opened a pad after the service was inactive for a while (=files migrated out of the fs cache), and minify took too long.

(Debian 12 bookworm, node 18)

Fixed this by disabling minify and increasing timeouts; But, to unstale this again... might be nice to have this fixed. ;-)

SamTV12345 commented 5 months ago

So, let me chip in on this one; Also hit this with an instance running on a spinning-rust backed VM. Etherpad would become unavailable with this error when a user opened a pad after the service was idle for some time; Notably, restarting the service helped and did not see timeouts; I guess this hit when a user opened a pad after the service was inactive for a while (=files migrated out of the fs cache), and minify took too long.

(Debian 12 bookworm, node 18)

Fixed this by disabling minify and increasing timeouts; But, to unstale this again... might be nice to have this fixed. ;-)

Is there a way to reproduce this easily? I have enabled minify and it never occurred on my Apple M2 Mini. I'd be glad to work on this task and see this fixed.

ichdasich commented 5 months ago

Hrmmm... i will have to take a look. My guess is that this occurs due to a combination of 'latency heavy I/O' and 'FS cache eviction'. So the challenge would be finding that sweetspot of resource starvation and making it reproducible.

What could be tried is artificially reducing the timeout values and adding resource limits to permissible IOPS/BW for a VM. I can give that a shot in a bit.

ether / etherpad-lite

Threads.expose is called to late - "Did not receive an init message from worker after 10000ms" on clean 1.8.16 install #5365