TryGhost / Ghost

Independent technology for modern publishing, memberships, subscriptions and newsletters.
https://ghost.org
MIT License

Worker initialization failure: EMFILE #12496

Closed: kprimice closed this issue 3 years ago

kprimice commented 3 years ago

Context

Every 2 weeks or so my Ghost website (hosted on a Digital Ocean droplet) crashes with this error message:

[2020-12-27 05:30:19] ERROR

CODE: ERR_WORKER_INIT_FAILED
MESSAGE: Worker initialization failure: EMFILE

Error [ERR_WORKER_INIT_FAILED]: Worker initialization failure: EMFILE
    at Worker.[kOnExit] (internal/worker.js:229:26)
    at Worker.<computed>.onexit (internal/worker.js:165:20)

[2020-12-27 05:30:19] INFO Worker for job "email-analytics-fetch-latest" exited with code 0

Reproduction scenario

A possible scenario confirming the issue is described in this comment below.

Environment

Most cases reported come from: Ghost Version: 3.41.2, Node Version: 12.18.0, Server OS: Ubuntu 18.04, Database: mysql Ver 14.14 Distrib 5.7.32

naz commented 3 years ago

Hey @kprimice. It would be useful to get more information about the environment you are running your Ghost instance in. Thanks

stromfeldt commented 3 years ago

Hey @naz. I've currently got something similar happening on my system, and it was suggested to me that I share it with you.

CODE: ERR_WORKER_INIT_FAILED
MESSAGE: Worker initialization failure: EMFILE

Error [ERR_WORKER_INIT_FAILED]: Worker initialization failure: EMFILE
    at Worker.[kOnExit] (internal/worker.js:195:26)
    at Worker.<computed>.onexit (internal/worker.js:141:20)

[2021-01-01 06:52:32] ERROR "PUT /ghost/api/v3/admin/settings/" 400 8ms

I'm currently running, ahem, five instances on a single $5 Linode, as well as a sixth instance that is stopped. Ubuntu 16.04, RAM: 431 MB / 964 MB, Disk: 13G / 20G, ulimit -a --> open files (-n) 1024. I should apparently also mention that "none of the processes are near the limit. Most node processes are in the upper 20s and Mysql is in the low 100s".

Four of the instances, plus the stopped instance, are for development purposes, so neither their admins nor their sites get hit very often. The final instance is my live blog, though, and although it's not exactly getting a lot of hits at the moment, I do access its admin quite often.

Let me know if you have any other questions :)

jliikala commented 3 years ago

I'm seeing this same issue with my $5 Digital Ocean Ghost installation droplet. The Ghost site is not in public use yet (it's in private mode). It runs just fine on its own for a while, then it crashes and 502 Bad Gateway errors begin. Logs show the same error as in the messages above.

1 GB Memory / 25 GB Disk, Ghost Version: 3.40.2, Node Version: 12.18.0, Database: MySQL 5.7.32-0ubuntu0.18.04.1 (Ubuntu)

Restarting ghost "fixes" the issue, until it happens again :)

naz commented 3 years ago

@stromfeldt @jliikala thanks for your input, guys. I was not able to see the open file handle count increase after quick testing on my local instance. This might be a bit of a timesink of a problem to investigate. It would be great if one of you could track how the Ghost process behaves and spot when the file handle count is growing (this SO answer could be useful for debugging purposes). Thanks!

naz commented 3 years ago

Found a similar problem that was reported in one of the dependencies. A solution there was to increase the maximum number of open files. Still, it would be interesting to know whether an accidental leak is happening.

stromfeldt commented 3 years ago

How do I track file handles for an extended period of time? As well, I should apparently reiterate that "it seems like the number of active handles with Ghost is really low (10% of the limit)".

naz commented 3 years ago

@stromfeldt can't think of any specific tool for that. You could create a cron entry that runs a small monitoring script every minute or so. The script would count the open file handles for running node processes and append the output to a file:

lsof -i -n -P | grep "node.*" | wc -l >> ghost-handles.txt

It might be useful to add a date to each record so it can be compared against Ghost's logs and correlated with any errors.
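
For anyone wanting to set that up, a minimal sketch of such a cron-driven monitor (the script path, log path, and one-minute interval are only placeholders):

    #!/bin/bash
    # hypothetical monitor script, e.g. saved as /usr/local/bin/ghost-handles.sh
    # prefix each count with a timestamp so it can be lined up with Ghost's logs
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(lsof -i -n -P | grep 'node' | wc -l)" >> /var/log/ghost-handles.txt

    # crontab entry (crontab -e) to run it every minute:
    # * * * * * /usr/local/bin/ghost-handles.sh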

jonpincus commented 3 years ago

This happened to me as well, also on a Digital Ocean droplet configured like @jliikala describes. A bit of additional information, in case it's useful: I created this droplet from DO's default Ghost image, I'm using the Attila theme, and I'm using Mailgun for mail (emails have been going out correctly, so I believe it's configured correctly).

onurozer commented 3 years ago

I'm having exactly the same issue as everyone else including the error message. I'm also on a Digital Ocean droplet with Mailgun integration. Restarting temporarily resolves the problem for a day or two.

I'm not sure if it's related, but I disabled analytics under Labs > Members > Email > Enable newsletter open-rate analytics when the feature was introduced, yet strangely I still see "emails opened" under member details.

naz commented 3 years ago

but I disabled analytics under Labs > Members > Email > Enable newsletter open-rate analytics when the feature was introduced, yet strangely I still see "emails opened" under member details.

Hey @onurozer , this one could be worth investigating and filing as a separate bug. Would appreciate if you could do it!

zackify commented 3 years ago

Very frustrating to find my site throwing a 502 every few days lately... Has increasing the max open files limit fixed it for anyone here?

willkramer commented 3 years ago

@zackify I agree, I downgraded to v3.38.3 while I'm waiting for this bug to be fixed. Honestly, nothing significant has changed between 3.38.3 and 3.41.2 that would warrant tolerating the 502s.

steakscience commented 3 years ago

@willkramer Does this mean that we know the bug was introduced after v3.38.3?

zackify commented 3 years ago

It hasn’t happened since I’ve turned off click tracking for newsletters... unfortunate but better than crashing every week

willkramer commented 3 years ago

@steakscience tracking for newsletters was a developer only feature in v3.38.3 and was exposed to the general interface in v3.39.0.

I assume this bug has been around but wasn't discovered until v3.39.0.

@zackify it's good to know that we can work around the issue by turning off the newsletter tracking feature. In the interim, you can enable newsletter tracking with Mailgun directly and check the stats from Mailgun's admin interface. In the Mailgun admin, go to Sending > Domain settings and enable "Open tracking."

kevinansfield commented 3 years ago

I think the newsletter option may be a random coincidence rather than a correlation. Toggling the checkbox only changes a flag that is set on outgoing emails sent via Mailgun's API; it does not turn on or off anything to do with workers or jobs.

In the Mailgun admin, go to Sending > Domain settings and enable "Open tracking."

As far as Ghost is concerned, this would have the same effect as toggling the option on in Ghost.

ErisDS commented 3 years ago

Can everyone who is reporting this issue please confirm their debug info:

The original reporter has deleted that part of the issue template, making this incredibly hard to debug. Thanks to @jliikala for providing this, but if we could get a few more that would really help with reproduction.

onurozer commented 3 years ago

Thanks for looking into this!

Ghost Version: 3.40.5, Node Version: v12.20.1, Browser/OS: Safari Version 14.0.2 on macOS Big Sur, Database: mysql Ver 14.14 Distrib 5.7.32

I also opened an issue with potentially related further detail.

jonpincus commented 3 years ago

Thanks for looking into it!

Ghost Version: 3.40.2, Node Version: 12.18.0, Server OS: Ubuntu 18.04.04, Database: mysql 14.14, Browser: happens on all; last verified on Chrome for Mac 88.0.4324.77 and Firefox for Mac 78.6.1esr

steakscience commented 3 years ago

Using the standard DigitalOcean droplet

Ghost Version: 3.41.2 (but happens on previous versions as well), Node Version: 12.18.0, Server OS: Ubuntu 18.04, Database: mysql Ver 14.14 Distrib 5.7.32

zackify commented 3 years ago

Nevermind... it's down again after turning off email analytics. Here's the log: [screenshot of the error log]. I'm going to start losing search engine rankings with how much my site is down :(

Ghost-CLI version: 1.15.3, Ghost version: 3.40.5, Node Version: v14.15.3, Server OS: Ubuntu 18.04.4 LTS, Database: mysql Ver 14.14

ErisDS commented 3 years ago

Thanks for all the additional debug info. If anyone else encounters this issue and wants to add theirs, it's always helpful.

I realise this is frustrating, but all of the members and email features are currently beta features, so this kind of thing is expected.

A potential short-term workaround is to disable email analytics in config.production.json. There are two options:

As @kevinansfield mentioned, turning off the open tracking does not turn off the job.

As for what's causing the bug, I'm pretty sure it's something to do with how log files are handled inside the workers as hinted at here.

I'm curious if there's any common pattern with the errors, has anyone noticed if they happen at a specific time of day, at a common interval, or a certain time after an email is sent? Also curious if you guys have a lot of members/emails.

Also, curious if there's anything weird in your log directory.

steakscience commented 3 years ago

I realise this is frustrating, but all of the members and email features are currently beta features, so this kind of thing is expected.

I'm glad we can help iron it out before the v4 release!

reloaddoc commented 3 years ago

Same problem. I get a 502 Bad Gateway every few days on a new, successfully installed 1-Click DO droplet ($5).

Here is my log: https://imgur.com/WKBXCL0

jliikala commented 3 years ago

@ErisDS

I'm curious if there's any common pattern with the errors, has anyone noticed if they happen at a specific time of day, at a common interval, or a certain time after an email is sent? Also curious if you guys have a lot of members/emails.

For me, this has happened 3 times: Jan 3rd at 13:54, Jan 14th at 00:54 and Jan 24th at 16:17. So approximately every 10 days, but not exactly. The site itself is still under development, so I only have a few test subscribers there receiving the emails. It hasn't happened since I disabled analytics from the UI, but probably because I've also restarted ghost on a few occasions lately 😄

Also, curious if there's anything weird in your log directory.

Just two files, https_mydomain_com_production.error.log and https_mydomain_com_production.log. The first has a file size of 5 MB; the latter is 290 MB.

samism commented 3 years ago

Hi,

I may have some valuable information if my issue here is of any relevance.

steakscience commented 3 years ago

@ErisDS Here are some patterns I found with the errors

[screenshot of error timing patterns]

reloaddoc commented 3 years ago

Happened again today, 7 days after the last Bad Gateway error. This is starting to get unacceptable.

zackify commented 3 years ago

Happened to me today as well.

I even increased the open file limit. I can’t trust that my site will stay up and now I check it every morning and have to run ghost reboot often...

zackify commented 3 years ago

Why can't the code be set up to catch the worker failing instead of bringing down the whole app?

reloaddoc commented 3 years ago

My nginx log is full of:

6706#6706: *39938 upstream prematurely closed connection while reading response header from upstream, client: 162.xxxxx, server: www.xxxx.de, request: "GET /robots.txt HTTP/1.1", upstream: "http://127.0.xxxxx8/robots.txt", host: "

ErisDS commented 3 years ago

@steakscience Thanks for that information!

@zackify & @reloaddoc Please take the time to read the thread before commenting. I have already provided an explanation, a simple way to prevent it, and a request for useful debugging information:

https://github.com/TryGhost/Ghost/issues/12496#issuecomment-773585550

zackify commented 3 years ago

Thank you for the work around, somehow I missed that!

reloaddoc commented 3 years ago

@steakscience Thanks for that information!

@zackify & @reloaddoc Please take the time to read the thread before commenting. I have already provided an explanation, a simple way to prevent it, and a request for useful debugging information:

#12496 (comment)

Thanks. There is no emailAnalytics line in my config.production.json, though.

steakscience commented 3 years ago

@reloaddoc Follow this format, but make it false https://github.com/TryGhost/Ghost/blob/main/core/shared/config/defaults.json#L119
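
For reference, a sketch of the key to add to config.production.json (existing keys omitted; this assumes the same top-level emailAnalytics key as in defaults.json), followed by a ghost restart so it takes effect:

    {
        "emailAnalytics": false
    }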

reloaddoc commented 3 years ago

@reloaddoc Follow this format, but make it false https://github.com/TryGhost/Ghost/blob/main/core/shared/config/defaults.json#L119

Thank you. Fingers crossed.

kristianfreeman commented 3 years ago

Running into this issue too, going to set up a healthcheck to ensure that I catch my site going down - it looks like it's been down at least since yesterday 🙃

Has anyone had success upgrading from the $5 DO droplet to $10+ droplets to fix this? It seems like it might be a memory issue, and I've tried bumping my ulimit as well, seemingly to no effect.
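
For anyone taking the healthcheck route as a stopgap, a rough sketch (the URL, paths, and schedule are placeholders; it assumes the script runs as the system user that owns the Ghost install):

    #!/bin/bash
    # hypothetical healthcheck, e.g. saved as /usr/local/bin/ghost-healthcheck.sh
    # if the site stops responding, restart Ghost from its install directory
    if ! curl -sf --max-time 10 https://example.com/ > /dev/null; then
        cd /var/www/ghost && ghost restart
    fi

    # crontab entry to check every 5 minutes:
    # */5 * * * * /usr/local/bin/ghost-healthcheck.sh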

naz commented 3 years ago

Reproduction case for this issue contributed by @daniellockyer:

  1. Enable file-based logging in config.development.json
  2. Add the job.js file as below, but require logging from Ghost's shared logging module
  3. Modify ./core/server/services/jobs/job-service.js to spin up a job every second (or more often, just to make it quicker) with the contents of main.js below
  4. Run Ghost
  5. Run ls -la /proc/$(pidof node)/fd | wc -l and observe the count grow, indicating the problem

main.js file:

const JobManager = require('@tryghost/job-manager');
// jobManager below is the job manager instance already created in job-service.js
jobManager.addJob({
    at: 'every 1 second',
    job: './job.js'
});

job.js file:

const {parentPort} = require('bthreads');
const {logging} = require('ghost-ignition');

// a new logger instance is created on every run of this job
const logger = logging({...conf to test...});

(async () => {
    logger.info('job ran');
    parentPort.postMessage('done');
})();

Configure logging to use file logs:

     "logging": {
        "path": "./content/logs/",
        "level": "debug",
        "transports": ["stdout", "file"]
     },

naz commented 3 years ago

TL;DR: the issue boils down to orphaned open file handles left behind after worker thread termination.

Full context

Each worker thread started when a new recurring job is registered creates a new logger instance through const {logging} = require('ghost-ignition');. Under the hood, ignition's logging wraps an instance of node-bunyan. node-bunyan has a known problem with stream closure on process termination (the same applies to worker threads). Every time a job terminates, a file handle is left open, and the instance eventually runs into its open file handle limit (on cheap DO boxes that limit is quite small).
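
To illustrate the mechanism (this is not Ghost's actual job code; bunyan is used directly and the log path is made up), each job run opens a fresh file stream that nothing ever closes:

    // illustrative sketch of the leak pattern only
    const bunyan = require('bunyan');

    // a new logger, and with it a new open file descriptor, is created on
    // every worker run
    const log = bunyan.createLogger({
        name: 'email-analytics-fetch-latest',
        streams: [{path: '/var/www/ghost/content/logs/jobs.log'}]
    });

    log.info('job ran');
    // the worker exits without the stream ever being closed, so the file
    // descriptor stays open in the process; enough runs and EMFILE is hit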

Way forward with current issue

Solving internal problems in bunyan is a rabbit hole; the issue hasn't been fixed in 9 years and probably won't be within any reasonable time. In my opinion, the only way out of this situation is working around the limitation while keeping the functionality the same as, or very close to, the current behaviour. From some experimentation I was able to confirm that the leak does not happen if we do not use an internal logger and instead send the information to be logged as a message through parentPort. To keep the functionality close to how logger.info works in current jobs, bree would need to be modified to accept a custom logger formatter, similar to the existing error handler. This way there would be a single logger instance used within the job service (a singleton), and leaks like this one would be eliminated completely.
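
A rough sketch of that direction (not the actual implementation; the message shape is made up): the job posts what it wants logged back to the parent, and one shared logger in the job service writes it.

    // inside the job (worker thread): no logger is created here at all
    const {parentPort} = require('bthreads');

    (async () => {
        // hand whatever needs logging back to the parent as a message
        parentPort.postMessage({level: 'info', message: 'job ran'});
        parentPort.postMessage('done');
    })();

    // in the job service (main thread), a single shared logger instance
    // would handle those messages, roughly:
    //
    //     worker.on('message', (message) => {
    //         if (message && message.level) {
    //             logging[message.level](message.message);
    //         }
    //     });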

naz commented 3 years ago

Some of the dependencies used in the jobs were still creating logger instances; that is being solved at the moment.

naz commented 3 years ago

Testing scenario for local Ghost instance:

  1. Put the instance into test mode and add file-based logging by adding the following to the config file:
     "server": {
        "testmode": true
     },
     "logging": {
        "path": "./content/logs/",
        "level": "debug",
        "transports": ["stdout", "file"]
     },
  2. Start Ghost instance
  3. Find the process id of the instance and record it. For example, save it into $GHOST_PID by running:
    GHOST_PID=$(lsof -i -n -P | grep 2368 | awk '{print $2}')
  4. List the number of open file handles held by the Ghost process by running:
    ls -la /proc/$GHOST_PID/fd | wc -l
  5. To create a file handle leak, schedule a job that uses the logging module to run at a short interval. For example, schedule a graceful-job to run every 10 seconds by running:
    curl http://localhost:2368/ghost/api/schedule/every%2010%20seconds/graceful-job
  6. Run the file handle counting command from step 4 to check the count after jobs terminate.

If the number steadily increases after multiple worker runs, there's still a leak present. For a ballpark: the file handle count on my local instance usually hovers between 70 and 80.
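
A simple way to keep an eye on the count over time (assumes Linux and the $GHOST_PID recorded in step 3):

    watch -n 10 "ls /proc/$GHOST_PID/fd | wc -l"

    # or append timestamped samples to a file for later comparison with the logs:
    while true; do echo "$(date '+%H:%M:%S') $(ls /proc/$GHOST_PID/fd | wc -l)" >> handle-counts.txt; sleep 10; done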

daniellockyer commented 3 years ago

Hi all! 👋🏻 We've released 3.41.7 with a fix for this so I'd recommend upgrading to this version with Ghost-CLI. If you continue to spot anything off - please do let us know! 🙂
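
For anyone unsure of the exact command, upgrading with Ghost-CLI is typically done from the Ghost install directory, as the user that owns it:

    ghost update
    # or pin a specific version, e.g.:
    # ghost update 3.41.7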

learyjk commented 3 years ago

Thank you for addressing this issue. I have been experiencing this for some time and finally got around to working on it when I stumbled on this thread. I have upgraded to the latest release and will keep an eye on it!