Hey @kprimice. It would be useful to get more information about the environment you are running your Ghost instance in. Thanks
Hey @naz. I've currently got something similar happening on my system, and it was suggested to me that I share it with you.
CODE: ERR_WORKER_INIT_FAILED
MESSAGE: Worker initialization failure: EMFILE
Error [ERR_WORKER_INIT_FAILED]: Worker initialization failure: EMFILE
at Worker.[kOnExit] (internal/worker.js:195:26)
at Worker.<computed>.onexit (internal/worker.js:141:20)
[2021-01-01 06:52:32] ERROR "PUT /ghost/api/v3/admin/settings/" 400 8ms
I'm currently running, ahem, five instances on a single $5 Linode, as well as a sixth instance that is stopped. Ubuntu 16.04, RAM: 431 MB / 964 MB, Disk: 13G / 20G, ulimit -a --> open files (-n) 1024. I should apparently also mention that "none of the processes are near the limit. Most node processes are in the upper 20s and Mysql is in the low 100s".
Four of the instances, plus the stopped instance, are for development purposes, so neither their admins nor their sites get hit very often. The final instance is my live blog though, and although it's not exactly getting a lot of hits at the moment, I do access the admin quite often.
Let me know if you have any other questions :)
I'm seeing this same issue with my Digital Ocean $5 Ghost installation droplet. The Ghost site is not in public use yet (it's in private mode). It runs just fine on its own for a while, then it crashes and 502 Bad Gateway errors begin. The logs show the same error as in the messages above.
1 GB Memory / 25 GB Disk Ghost Version: 3.40.2 Node Version: 12.18.0 Database: MySQL 5.7.32-0ubuntu0.18.04.1 (Ubuntu)
Restarting ghost "fixes" the issue, until it happens again :)
@stromfeldt @jliikala thanks for your input guys. I was not able to see the open file handle count increase after quick testing on my local instance. This might be a bit of a timesink of a problem to investigate. It would be cool if one of you could track how the Ghost process behaves and spot when the file handle count is growing (this SO answer could be useful for debugging purposes). Thanks!
Found a similar problem that was reported in one of the dependencies. A solution was to increase the maximum number of open files. Still, it would be interesting to know whether there's an accidental leak happening.
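If anyone wants to try that, here's a minimal sketch of raising the limit for a systemd-managed Ghost install (the unit name ghost_example-com is an assumption; check yours with systemctl list-units | grep ghost). Create a drop-in override:

# /etc/systemd/system/ghost_example-com.service.d/override.conf
[Service]
LimitNOFILE=65535

then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ghost_example-com

LimitNOFILE raises the per-process open file limit, which is exactly what the EMFILE error is hitting.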
How do I track file handles for an extended period of time? As well, I should apparently reiterate that "it seems like the number of active handles with Ghost is really low (10% of the limit)".
@stromfeldt can't think of any specific tool for that. You could create a cron record that would run a small monitoring script every minute or so. The script would find all open file handles for running node processes and append the output to a file:
lsof -i -n -P | grep "node.*" | wc -l >> ghost-handles.txt
It might be useful to add a date to each record so you can correlate it with Ghost's logs for any errors etc.
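For example, a sketch of a one-minute crontab entry that prepends a UTC timestamp (the output path is illustrative; note that % has to be escaped as \% inside a crontab):

* * * * * echo "$(date -u '+\%Y-\%m-\%dT\%H:\%M:\%S') $(lsof -i -n -P | grep 'node.*' | wc -l)" >> /var/log/ghost-handles.txt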
This happened to me as well, also on a Digital Ocean droplet configured like @jliikala describes. A bit of additional information, in case it's useful: I created this droplet from DO's default Ghost image, I'm using the Attila theme, and I'm using Mailgun for mail (and emails have been going out correctly, so I believe it's configured correctly).
I'm having exactly the same issue as everyone else including the error message. I'm also on a Digital Ocean droplet with Mailgun integration. Restarting temporarily resolves the problem for a day or two.
I'm not sure if it's related, but I disabled analytics under Labs > Members > Email > Enable newsletter open-rate analytics when the feature was introduced, yet strangely I still see "emails opened" under member details.
Hey @onurozer, this one could be worth investigating and filing as a separate bug. Would appreciate it if you could do it!
Very frustrating to find my site throwing a 502 every few days lately... Has increasing the maximum allowed open files fixed it for anyone here?
@zackify I agree, I downgraded to v3.38.3 while I'm waiting for this bug to be fixed. Honestly, nothing significant has changed between 3.38.3 and 3.41.2 that would warrant tolerating the 502s.
@willkramer Does this mean that we know the bug was introduced after v3.38.3?
It hasn’t happened since I’ve turned off click tracking for newsletters... unfortunate but better than crashing every week
@steakscience tracking for newsletters was a developer-only feature in v3.38.3 and was exposed in the general interface in v3.39.0.
I assume this bug has been around but wasn't discovered until v3.39.0.
@zackify it's good to know that we can work around the issue by turning off the newsletter tracking feature. In the interim, you can enable newsletter tracking with Mailgun directly and check the stats from Mailgun's admin interface. In the Mailgun admin, go to Sending > Domain settings and enable "Open tracking."
I think the newsletter option may be a random coincidence rather than a correlation. Toggling the checkbox only changes a flag that is set on outgoing emails sent via Mailgun's API; it does not turn anything on or off to do with workers or jobs.
In the Mailgun admin, go to Sending > Domain settings and enable "Open tracking."
As far as Ghost is concerned, this would have the same effect as toggling the option on in Ghost.
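For context, here's a rough illustration of the flag in question (not Ghost's actual code); with mailgun-js it's the o:tracking-opens option on each outgoing message:

const mailgun = require('mailgun-js')({apiKey: 'key', domain: 'example.com'}); // illustrative credentials

// the admin checkbox only flips this per-message flag;
// it doesn't start or stop any background workers or jobs
mailgun.messages().send({
    from: 'newsletter@example.com',
    to: 'member@example.com',
    subject: 'Weekly newsletter',
    html: '<p>Hello!</p>',
    'o:tracking-opens': 'no'
}, (err, body) => {
    if (err) console.error(err);
});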
Can everyone who is reporting this issue please confirm their debug info:
The original reporter has deleted that part of the issue template, making this incredibly hard to debug. Thanks to @jliikala for providing this, but if we could get a few more that would really help with reproduction.
Thanks for looking into this!
Ghost Version: 3.40.5 Node Version: v12.20.1 Browser/OS: Safari Version 14.0.2 on macOS Big Sur Database: mysql Ver 14.14 Distrib 5.7.32
Also opened an issue with further, potentially related detail.
Thanks for looking into it!
Ghost Version: 3.40.2 Node Version: 12.18.0 Server OS: Ubuntu 18.04.04 Database: mysql 14.14 Browser: happens on all, last verified on Chrome for Mac 88.0.4324.77 and Firefox for Mac 78.6.1esr
Using the standard DigitalOcean droplet
Ghost Version: 3.41.2 (but happens on previous versions as well) Node Version: 12.18.0 Server OS: Ubuntu 18.04 Database: mysql Ver 14.14 Distrib 5.7.32
Never mind... it's down again after turning off email analytics. Here's the log. I'm going to start losing search engine rankings with how often my site is down :(
Ghost-CLI version: 1.15.3 Ghost version: 3.40.5 Node Version: v14.15.3 Server OS: Ubuntu 18.04.4 LTS Database: mysql Ver 14.14
Thanks for all the additional debug info. If anyone else encounters this issue and wants to add theirs, it's always helpful.
I realise this is frustrating, but all of the members and email features are currently beta features, so this kind of thing is expected.
A potential short-term workaround is to disable email analytics in config.production.json. There are two options:
emailAnalytics: false
turns off the background jobs and hides any analytics-related things in the admin.
backgroundJobs: {emailAnalytics: false}
turns off the background jobs but will still show historic stats and set the open tracking flag on emails if that's turned on. As @kevinansfield mentioned, turning off open tracking does not turn off the job.
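For reference, a sketch of how those would look in config.production.json (key placement assumed to mirror defaults.json, which is linked further down the thread; you'd normally set just one of the two):

"emailAnalytics": false,
"backgroundJobs": {
    "emailAnalytics": false
},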
As for what's causing the bug, I'm pretty sure it's something to do with how log files are handled inside the workers as hinted at here.
I'm curious if there's any common pattern with the errors, has anyone noticed if they happen at a specific time of day, at a common interval, or a certain time after an email is sent? Also curious if you guys have a lot of members/emails.
Also, curious if there's anything weird in your log directory.
I realise this is frustrating, but all of the members and email features are currently beta features, so this kind of thing is expected.
I'm glad we can help iron it out before the v4 release!
Same problem. I get 502 Bad Gateway every few days on a newly (and successfully) installed 1-Click DO Droplet ($5).
Here is my log: https://imgur.com/WKBXCL0
@ErisDS
I'm curious if there's any common pattern with the errors, has anyone noticed if they happen at a specific time of day, at a common interval, or a certain time after an email is sent? Also curious if you guys have a lot of members/emails.
For me, this has happened 3 times: Jan 3rd at 13:54, Jan 14th at 00:54 and Jan 24th at 16:17. So roughly every 10 days, but not exactly. The site itself is still under development, so I only have a few test subscribers there receiving the emails. It hasn't happened since I disabled analytics from the UI, but probably because I've also restarted Ghost on a few occasions lately 😄
Also, curious if there's anything weird in your log directory.
Just two files, https_mydomain_com_production.error.log and https_mydomain_com_production.log. The first has a filesize of 5MB, the latter is 290MB.
Hi,
I may have some valuable information if my issue here is of any relevance.
@ErisDS Here are some patterns I found with the errors
Worker for job \"email-analytics-fetch-latest\" exited with code 1
Happened again today, 7 days after the last Bad Gateway error. This is starting to get unacceptable.
Happened to me today as well.
I even increased the open file limit. I can’t trust that my site will stay up and now I check it every morning and have to run ghost reboot often...
Why can’t the code be set up to catch that worker failing instead of bringing down the whole app?
My nginx log is full of:
6706#6706: *39938 upstream prematurely closed connection while reading response header from upstream, client: 162.xxxxx, server: www.xxxx.de, request: "GET /robots.txt HTTP/1.1", upstream: "http://127.0.xxxxx8/robots.txt", host: "
@steakscience Thanks for that information!
@zackify & @reloaddoc Please take the time to read the thread before commenting. I have already provided an explanation, simple way to prevent it and a request for useful debugging information:
https://github.com/TryGhost/Ghost/issues/12496#issuecomment-773585550
Thank you for the workaround, somehow I missed that!
Thanks. There is no emailAnalytics line in my config.production.json, though.
@reloaddoc Follow this format, but make it false
https://github.com/TryGhost/Ghost/blob/main/core/shared/config/defaults.json#L119
Thank you. Fingers crossed.
Running into this issue too, going to set up a healthcheck to ensure that I catch my site going down - it looks like it's been down at least since yesterday 🙃
Has anyone had success upgrading from the $5 DO droplet to $10+ droplets to fix this? It seems like it might be a memory issue, and I've tried bumping my ulimit as well, seemingly to no effect.
Reproduction case for this issue contributed by @daniellockyer:
- create a job.js file as below, but require logging from Ghost's shared logging module
- create a main.js file as below (or more jobs, just to make it quicker)
- run ls -la /proc/$(pidof node)/fd | wc -l and observe the count grow, indicating the problem

main.js file:
const JobManager = require('@tryghost/job-manager');

// create a manager instance; console serves as a minimal logger here
// (the exact constructor options vary between job-manager versions)
const jobManager = new JobManager(console);

jobManager.addJob({
    at: 'every 1 second',
    job: './job.js'
});
job.js file:
const {parentPort} = require('bthreads');
const {logging} = require('ghost-ignition');
const logger = logging({...conf to test...});

(async () => {
    logger.info('job ran');
    parentPort.postMessage('done');
})();
Configure logging to use file logs:
"logging": {
"path": "./content/logs/",
"level": "debug",
"transports": ["stdout", "file"]
},
TL;DR summary: the issue boils down to orphaned open file handles left behind after worker thread termination.
Each worker thread that is started when a new recurring job is registered creates a new instance of a logger through const {logging} = require('ghost-ignition');. Under the hood, Ignition's logging wraps an instance of node-bunyan. Bunyan has a known problem with stream closure on process termination (the same applies to worker threads). Every time a job terminates, a file handle is left open, and the instance eventually runs out of available handles (on cheap DO boxes the limit is quite small).
Solving internal problems in bunyan is a rabbit hole; they haven't been fixed in 9 years and probably won't be within any reasonable time. In my opinion, the only way out of this situation is working around the limitation while keeping the functionality the same, or very close to it. From some experimentation I was able to confirm that the leak does not happen if we don't use the internal logger and instead send the information to be logged as a message through parentPort. To get functionality close to how logger.info works in current jobs, bree would need to be modified to accept a custom logger formatter, similar to the existing error handler. This way there would be a single logger instance used within the job service (a singleton), and leaks like this one would be eliminated completely.
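A minimal sketch of that workaround (the message shape here is an assumption, not the final implementation): the worker posts log data to the parent instead of opening its own logger, and the parent's singleton logger writes it.

job.js in the worker thread:

const {parentPort} = require('bthreads');

(async () => {
    // post log data to the parent instead of creating a logger here,
    // so no file handle is ever opened (or leaked) inside the worker
    parentPort.postMessage({type: 'log', level: 'info', message: 'job ran'});
    parentPort.postMessage('done');
})();

On the parent side, the job service's worker message handler would route {type: 'log'} messages to its single logger instance.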
Some of the dependencies used in the jobs were still creating logger instances; that is being fixed at the moment.
Testing scenario for a local Ghost instance:
- enable testmode and add file-based logging by adding the following to the config file:
"server": {
"testmode": true
},
"logging": {
"path": "./content/logs/",
"level": "debug",
"transports": ["stdout", "file"]
},
- check the open file handle count for the Ghost process ($GHOST_PID) by running:
GHOST_PID=$(lsof -i -n -P | grep 2368 | awk '{print $2}')
ls -la /proc/$GHOST_PID/fd | wc -l
- schedule the graceful-job to run every 10 seconds by running:
curl http://localhost:2368/ghost/api/schedule/every%2010%20seconds/graceful-job
If the number steadily increases after multiple worker runs, there's still a leak present. For a ballpark: on my local instance the file handle count usually hovers between 70 and 80.
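To keep an eye on the count across multiple runs, one simple option (a sketch; assumes GHOST_PID was set as above in the same shell):

watch -n 10 "ls -la /proc/$GHOST_PID/fd | wc -l"

This re-runs the count every 10 seconds, matching the job interval, so a steady climb is easy to spot.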
Hi all! 👋🏻 We've released 3.41.7 with a fix for this so I'd recommend upgrading to this version with Ghost-CLI. If you continue to spot anything off - please do let us know! 🙂
Thank you for addressing this issue. I have been experiencing this for some time and finally got around to working on the issue when I stumbled on this. I have upgraded to latest release and will keep an eye on it!
Context
Every 2 weeks or so my Ghost website (hosted on a Digital Ocean droplet) crashes with the ERR_WORKER_INIT_FAILED: EMFILE error message quoted at the top of this thread.
Reproduction scenario
A possible scenario confirming the issue is described in this comment below.
Environment
Most cases reported come from: Ghost Version: 3.41.2 Node Version: 12.18.0 Server OS: Ubuntu 18.04 Database: mysql Ver 14.14 Distrib 5.7.32