OpenNews / etherpad-lite

Etherpad: Really real-time collaborative document editing
http://etherpad.org
Apache License 2.0

resolve Heroku startup issues #1

Open ryanpitts opened 6 years ago

ryanpitts commented 6 years ago

TL;DR: On Heroku's daily restart, this app sometimes crashes because it doesn't start up quickly enough.

Quick background on how OpenNews etherpad works

Overview: We run an etherpad-lite instance on Heroku, which is publicly available at https://etherpad.opennews.org. The instance uses a Postgres db and a Standard 1X dyno (512MB RAM, 1x CPU share). In testing via etherpad-load-test, these resources are more than enough to handle our normal traffic.

Deployment details: We use SSL for our etherpad-lite instance, which means that to run on Heroku, we need our own forked version of etherpad-lite that includes heroku-ssl-redirect. That means we also use a forked version of etherpad-lite-heroku, which pulls in our version of the etherpad software as a submodule.
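
For readers unfamiliar with the force-redirect problem: Heroku terminates SSL at its router and forwards plain HTTP to the dyno, passing the original scheme in the `X-Forwarded-Proto` header. A minimal sketch of a redirect built on that header follows. It is not the code in the OpenNews fork or in heroku-ssl-redirect, and the name `forceHttps` is hypothetical.

```javascript
// Sketch only: a connect/Express-style middleware that redirects plain-HTTP
// requests to HTTPS. `forceHttps` is a hypothetical name, not part of
// etherpad-lite or heroku-ssl-redirect.
function forceHttps(req, res, next) {
  // Heroku's router sets X-Forwarded-Proto to the scheme the client used.
  const proto = req.headers['x-forwarded-proto'];

  // No header (e.g. local development) or already HTTPS: continue to the app.
  if (!proto || proto === 'https') {
    return next();
  }

  // Rebuild the same URL on https:// and send a permanent redirect.
  const location = 'https://' + req.headers.host + req.url;
  res.writeHead(301, { Location: location });
  res.end();
}
```

Because it only uses `writeHead` and `end`, the function should work with a plain Node `http` server as well as with Express, provided it is registered before any other route.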

The etherpad-lite-heroku wrapper is what actually gets deployed to Heroku, where it runs a launch script that does some config and starts the etherpad service.

The reboot problem

Heroku restarts your app dynos once a day for maintenance, which is normally a fine thing. However, etherpad-lite occasionally takes too long to start back up, resulting in:

[Image: screenshot taken 2017-08-10 at 11:48:51 am]

During the etherpad-lite startup process, it checks in with npm on a whole list of dependencies. A number of them appear to be outdated, which I think might be the root of the problem here. The software seems to work fine once it's actually running, but sometimes the reboot itself takes long enough that Heroku throws a timeout and the app crashes again.

The short-term fix is manually restarting our Heroku instance—usually this only requires 1 or 2 restarts, but occasionally takes 10 or so. The most problematic times are mid-morning and midday (when there's more overall web traffic, which is what makes me think those dependency checks are the problem). The long-term fix, of course, involves updating the etherpad-lite software.

Some logs

I've been able to fork and modify these etherpad apps, get them running on Heroku, and do a certain amount of troubleshooting, but I'm about at my limit of feeling comfortable ripping into node software. Here are a few logs that hopefully tell some tales:

knowtheory commented 6 years ago

Hey Ryan, here are a couple thoughts:

  1. Having read through the logs you posted, even the successful run is barely under 60s.
  2. You can bandaid this problem by asking to bump the 60s limit up to 120s.
  3. I'm curious whether it'd be possible to either prepackage all of the components so that dependency checking is unnecessary, or if disabling dependency checking is a viable option.
  4. How much have you dug into other ways of providing an HTTPS endpoint, other than maintaining your own fork of etherpad lite?

ryanpitts commented 6 years ago

1 & 2. AHHHHH I didn't realize that was a request I could make.

  3. Yes! That's exactly what I was thinking, plus there are a number of things (e.g. "Jade has been renamed to pug, please install the latest version of pug instead of jade") that seem like they just need to be updated.
  4. I spent a decent amount of time investigating options for getting HTTPS working on Heroku, and the big sticking point was the force redirect. So ... some? At least? I fully recognize the limits of my node abilities though. It may be that there's a better way than https://github.com/OpenNews/etherpad-lite/commit/d14eb4942a3709491d683ce31e7512bc28e838ff

ryanpitts commented 6 years ago

noting that Heroku has increased the app's boot timeout to 120 seconds for now
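
With the limit at 120 seconds, it may still be worth logging how much headroom each boot leaves, since the successful runs mentioned earlier were barely under 60 seconds. A rough sketch follows; `bootMargin` is a hypothetical helper, not an etherpad-lite or Heroku API.

```javascript
// Sketch only: report how much of the boot timeout a startup used.
// `bootMargin` is a hypothetical helper, not an etherpad-lite or Heroku API.
function bootMargin(elapsedSeconds, limitSeconds) {
  const remaining = limitSeconds - elapsedSeconds;
  return {
    ok: remaining > 0,                   // false means the boot would have timed out
    remaining: remaining,                // seconds of headroom (negative if over)
    used: elapsedSeconds / limitSeconds  // fraction of the limit consumed
  };
}

// Example: call once the server is listening. process.uptime() returns the
// number of seconds the current Node process has been running.
// const report = bootMargin(process.uptime(), 120);
// console.log('boot used ' + Math.round(report.used * 100) + '% of the limit');
```

Note that `process.uptime()` only counts time inside the Node process, so anything the launch script does before Node starts is not included.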

knowtheory commented 6 years ago

Okay! So, to some extent, having waited a little bit has made this all easier.

Here are some observations about etherpad-lite (and we can get into the heroku wrapper as well).

npm layout

etherpad-lite uses npm in a slightly non-standard way. The root of the etherpad-lite repo is the primary working space, where you find the bin commands and so on. For npm's purposes, however, the root of the project is actually the src directory. That's where etherpad-lite keeps its package.json and its node_modules directory, and node_modules is then symlinked from src into the repo root.

Why does this matter? Well, when it comes to mismatches between local dev & deployment to Heroku, I had initially assumed that restoring to a fresh / clean startup just required clearing out the node_modules directory in the project root. NOPE. gotta kill it in the src directory.
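
A sketch of the clean-up this implies, assuming the layout described above (a real `src/node_modules` directory plus a `node_modules` symlink in the repo root). The function name `cleanNodeModules` is hypothetical, and `fs.rmSync` needs a reasonably recent Node.

```javascript
const fs = require('fs');
const path = require('path');

// Sketch only: remove installed modules from an etherpad-lite checkout.
// Assumes src/node_modules is the real directory and <root>/node_modules is
// a symlink to it. `cleanNodeModules` is a hypothetical name.
function cleanNodeModules(repoRoot) {
  const removed = [];
  const rootLink = path.join(repoRoot, 'node_modules');
  const srcModules = path.join(repoRoot, 'src', 'node_modules');

  // Remove the root-level symlink itself (not its target), if present.
  // A real directory at this path is left untouched.
  try {
    if (fs.lstatSync(rootLink).isSymbolicLink()) {
      fs.unlinkSync(rootLink);
      removed.push(rootLink);
    }
  } catch (err) {
    if (err.code !== 'ENOENT') throw err; // a missing link is fine
  }

  // Remove the real install directory under src/.
  if (fs.existsSync(srcModules)) {
    fs.rmSync(srcModules, { recursive: true, force: true });
    removed.push(srcModules);
  }
  return removed;
}
```

The function returns the paths it removed, so calling it on an already clean checkout returns an empty array.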

Old versions of Node & npm

The heroku wrapper is pegged to the 0.10.x series of node (and the accompanying version of npm). That's probably been fine (and i'm impressed with how easy it is to get etherpad-lite up and running across node versions), but the 0.10.x versions of node are now quite out of date and out of LTS.

Additionally, the 5.x series for npm now caches the results of dependency checking & resolution in a package-lock.json file which substantially improves npm install speed.

Old project dependencies

A lot of npm packages have changed since node 0.10.x, in particular because of a couple of security changes: some of Node's core classes were deemed to be potential vectors for exploits. As a consequence there's been a lot of stuff to upgrade.

Why does etherpad-lite take so long to start up?

Mostly because of npm dependency checking and resolving what to install from a cold-boot.

What can we do about it?

Well, mostly we should try and make npm do less work.

Okay so what should we actually do?

My immediate thought was to try to upgrade to npm 5.x and commit a package-lock.json file, and see if that helps.

Okay, how long is that going to take?

So this is where waiting things out a little helped. The main etherpad-lite repository just cut a new release 4 days ago: https://github.com/ether/etherpad-lite/commit/32027134cbe4e37ced89091bf05e9fd07980ca12

I've merged that up into this repository, added the package-lock.json and pushed it up to here and to staging.

Bump into anything weird?

Yeah. The other part of deployment was updating the heroku wrapper. The heroku wrapper is the thing that dictates what node version to run, and consequently what version of npm is being run by default. (you can run npm 5.x with older versions, but like, what the heck, let's try and upgrade all the way)

It's entirely unclear to me what the heck the dependencies listed in the heroku wrapper's package.json are about, and i deleted them to seemingly no effect. I did that and bumped up the node version to 8.x (which gets npm 5.x by default).
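
Heroku's Node.js buildpack reads the Node version from the `engines.node` field of the deployed package.json. A sketch of that bump, written as a small function over a parsed package.json, follows. `setNodeEngine` is a hypothetical helper, and the input object is made up for illustration rather than copied from the wrapper.

```javascript
// Sketch only: return a copy of a parsed package.json with engines.node set.
// `setNodeEngine` is a hypothetical helper name.
function setNodeEngine(pkg, nodeRange) {
  return Object.assign({}, pkg, {
    engines: Object.assign({}, pkg.engines, { node: nodeRange })
  });
}

// Illustrative input; not the wrapper's real package.json.
const before = { name: 'example-wrapper', engines: { node: '0.10.x' } };
const after = setNodeEngine(before, '8.x');
console.log(JSON.stringify(after.engines)); // {"node":"8.x"}
```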

The one snafu is that ueberdb, the database driver that etherpad-lite relies upon, hasn't cut a release for the aforementioned security changes in node. Under the hood, node changed the way its Buffer class works, and that breaks the ability of ueberdb's release version to connect to postgres databases.

This is particularly peculiar, because they've merged changes to handle this into their master branch. They just haven't cut a release for a number of months.

Pegging this etherpad-lite repository to ueberdb's master branch fixes the issue (and is currently deployed to staging).
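
For context on the Buffer change mentioned above: one well-known change of this kind is the deprecation of the `new Buffer(...)` constructor in favor of `Buffer.from()` and `Buffer.alloc()`, because `new Buffer(number)` could expose uninitialized memory. Whether that is the exact change that broke ueberdb's postgres connection is an assumption; the sketch only shows the replacement calls.

```javascript
// Sketch only: the Buffer APIs that replaced the deprecated constructor.
// Old, deprecated style (shown for comparison, not executed):
//   var a = new Buffer('abc', 'utf8');
//   var b = new Buffer(16); // contents were not guaranteed to be zeroed

// Build a buffer from existing data.
const fromString = Buffer.from('abc', 'utf8');

// Allocate a zero-filled buffer of a given size.
const zeroed = Buffer.alloc(16);

console.log(fromString.toString('hex'));          // 616263
console.log(zeroed.every((byte) => byte === 0));  // true
```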

So it's fixed?

Yeah, mostly, i think! The app starts up fast on staging, and skips most of the dependency checking.

What are we going to have to keep track of going forward?

Well, first things first, we should probably peg to a specific commit sha for ueberdb. I need to check/read up on what the npm syntax for that is.
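
For reference, npm accepts a GitHub dependency of the form `github:<user>/<repo>#<commit-ish>`, where the commit-ish can be a full sha. A sketch follows, with a placeholder sha and a hypothetical helper name `pinToCommit`. The dependency key `ueberDB` is an assumption and should be checked against the real src/package.json.

```javascript
// Sketch only: pin one dependency of a parsed package.json to a git commit.
// `pinToCommit` is a hypothetical helper name.
function pinToCommit(pkg, depName, githubRepo, sha) {
  const deps = Object.assign({}, pkg.dependencies);
  deps[depName] = 'github:' + githubRepo + '#' + sha;
  return Object.assign({}, pkg, { dependencies: deps });
}

// Placeholder sha (40 zeros); substitute the commit that was actually tested.
const PLACEHOLDER_SHA = '0'.repeat(40);

// The dependency key 'ueberDB' is an assumption, not taken from the repo.
const pinned = pinToCommit({ dependencies: {} }, 'ueberDB', 'Pita/ueberDB', PLACEHOLDER_SHA);
console.log(pinned.dependencies.ueberDB);
```

After changing the dependency this way, package-lock.json would need to be regenerated and committed so that the deployed app resolves the same commit.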

Mostly we're going to need to keep on top of further changes going forward. It is vastly preferable not to have to peg to a github repository for ueberdb. Additionally when further changes come down the pike, it'll be important to update the package-lock.json and push that out to the app.

knowtheory commented 6 years ago

Checking in with @ryanpitts to see if we can close this ticket! :)

ryanpitts commented 6 years ago

ooh yes, we should close it, but we should also migrate the awesome notes you wrote up somewhere

knowtheory commented 6 years ago

Looks like there's some action towards getting a package released on the ueberdb side of things too! https://github.com/Pita/ueberDB/issues/101

knowtheory commented 6 years ago

btw, a new version of ueberDB2 was released a few months ago, so we should bump off of the github repo to the NPM version: https://www.npmjs.com/package/ueberdb2

ryanpitts commented 6 years ago

sweet! I'm going to make a calendar reminder to follow up here after SRCCON though :)

ryanpitts commented 5 years ago

noting that we're pegged to the latest ueberdb2 now https://github.com/OpenNews/etherpad-lite/commit/645d3569a5e9c4759d7c216f3baee7cf30c49c96

I think this issue could be closed now? @knowtheory what do you think?

knowtheory commented 5 years ago

Close!

Last recommendation i have now is that ueberdb2 has finally started cutting releases again, so we can stop pegging to their shas, and just point to the main released version (which appears to be 0.4.0), which the main etherpad-lite repo points to as well.

I would specify that version, give it a push to the staging server to test 4realz and then if that works push it to prod.

ryanpitts commented 5 years ago

yep, we're pointing to ueberdb2 0.4.0 as well https://github.com/OpenNews/etherpad-lite/blob/master/src/package.json#L58