OpenUserJS / OpenUserJS.org

The home of FOSS user scripts.
https://openuserjs.org/
GNU General Public License v3.0

Server down/stalled... #944

Closed Martii closed 8 years ago

Martii commented 8 years ago

I'm unable to get into the VPS to restart it and it's spinning in a web browser. NOTE: This is purely a VPS issue with our provider and not the project nor the node configuration.

Messaged @sizzlemctwizzle Cc: @jonleibowitz


Last script update on local pro at 2016-04-05T12:07:05.214Z

Refs:

Martii commented 8 years ago

Found a break in the system... either a hiccup in whatever is causing this... or my distro update on the laptop, which may have a compatible client to connect to the latest Debian... or sizzle... although I did try a Windows VM (virtual machine) and PM (physical machine), a Debian VM, an ArchLinux VM, an ArchLinux PM, and other Linux PMs too, and those failed... so not entirely sure. (too many inet issues today everywhere)

I have already seen a 503 with toobusy-js on login that is probably GH's issue (still guessing here)... leaving this open for a more detailed investigation over the next few days. Apologies for this unscheduled outage... definitely out of my control at this time.

Btw dist-upgrade yielded no further updates. :\

Martii commented 8 years ago

Looks like it's still down.

Martii commented 8 years ago

No news yet.

Martii commented 8 years ago

PENDING!... got 5 "too busy"s on login... but I now have access... we'll see how long this stays up on the VPS.

Martii commented 8 years ago

One server restart detected... investigating.

lelinhtinh commented 8 years ago

503 ...

Martii commented 8 years ago

I know... it's going to take a bit to resolve this... something is chewing up memory and causing the VPS to crash... this was happening before the 503 addition of toobusy-js... I'm probably going to take the server down, do some recompilations, and see if that helps. i.e. that is why it's in PENDING status right now.

Patience please. :)

Martii commented 8 years ago

Downgrading node didn't help... seems that malloc (or whichever low-level lib is being used) isn't freeing up memory in the distro/VPS.

I'm going to try disabling script minification just to be sure... with an environment variable to be added... don't worry, I'll have it pass through to the unminified source so it doesn't break scripts.
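
For anyone curious, this amounts to something like the sketch below (the DISABLE_SCRIPT_MINIFY flag and the use of uglify-js are assumptions for illustration, not necessarily what actually lands in the repo):

```js
// Hypothetical sketch: skip minification when an env flag is set and fall
// back to the unminified source so installs keep working either way.
var uglifyJS = require('uglify-js'); // assumption: uglify-js 2.x era API

function prepareScriptSource(aSource) {
  if (process.env.DISABLE_SCRIPT_MINIFY === 'true') {
    return aSource; // pass through untouched
  }

  try {
    return uglifyJS.minify(aSource, { fromString: true }).code;
  } catch (aE) {
    return aSource; // never break an install over a minify failure
  }
}
```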

Martii commented 8 years ago

Still losing memory, although slower, with script minification disabled... i.e. the VPS is going to crash again... watching it right now go down, down, up, down, down, down, up, down, down, down, etc... until eventually there is zero free memory.

So I've systematically ruled out our project and reaffirmed this is a distro/VPS issue. :\

Martii commented 8 years ago

And there it goes. :\

Martii commented 8 years ago

Have to AFK for a few hours... will be back to try some other things as soon as I can. :\ Leaving the site OFFLINE for the moment.

Martii commented 8 years ago

@sizzlemctwizzle and anyone watching, So I've put up a constant 503 on all routes at the moment... it's not very pretty but it will at least let everyone know that "we're busy... try again later" (better than nothing). This is hard-coded into app.js with a manual FORCE_BUSY='true' in the env and not here on GH dev yet... still running some tests to see if this portion stays up. So far we are at a constant ~6% memory usage... will monitor this for a few hours... sleep in between... wake up... see if Debian has an update that fixes this before I make more reports, etc.
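
Roughly what the hard-coded busy switch boils down to (sketch only, since the actual app.js change isn't on GH dev yet; the middleware shape here is illustrative):

```js
// Illustrative sketch: when FORCE_BUSY is set in the env, answer every
// request with a 503 before any route handler runs.
if (process.env.FORCE_BUSY === 'true') {
  app.use(function (aReq, aRes, aNext) {
    aRes.status(503).send('We\'re busy... try again later.');
  });
}
```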

I've tried many different versions of node; all result in the same issue with this kernel image on the VPS... i.e. memory gets eaten up. Using the precompiled node binaries, the server lasts for less than 5 minutes... with a manual build from node source I can sometimes get about 45 minutes of uptime. NEITHER OPTION IS SUITABLE as I can't babysit the server that constantly.

I've also looked into backing out/rolling back the last dist-upgrade and of course the old packages aren't available on the official repos... so that will fail.

Only three options are left that I can think of...

  1. Run the kernel recovery and, assuming that works, see if it fixes this. There is exactly one snapshot, and only one total of any snapshot, of this bad VM... so twiddling can be undone now. Eventually there will be a decent snapshot that we can roll back to.
  2. Recreate the VM from scratch and see if it still has this memory leak... if it does, switch distros in a new VM. Some of this is beyond my access, and @sizzlemctwizzle did some configuration that I'm not aware of (yet?).
  3. Wait......................... (as for this option... adding tracking upstream... I'll have to create an issue on Debian first, then nodejs Cc: @mikeal ... after slumber though)

Just a sidenote... all script sources are intact as far as I can see in local pro, i.e. this is not a DB issue. (also added the HOST label here on GH as you might have noticed already)

Martii commented 8 years ago

~6.5% peak memory usage with styling applied to 503's

Manually enabling /about routes to test stability


Reinstalled all deps, and their deps, and so on... no dist-upgrade available.
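
Since these next comments flip routes back on one at a time, here is a rough sketch of how that kind of gating could be expressed (ENABLED_ROUTES, app, and the router names are hypothetical stand-ins, not the repo's actual wiring):

```js
// Hypothetical sketch: only mount a router when its prefix is listed in an
// ENABLED_ROUTES env var, e.g. ENABLED_ROUTES="/about,/users"
var enabledRoutes = (process.env.ENABLED_ROUTES || '').split(',');

function mountIfEnabled(aPrefix, aRouter) {
  if (enabledRoutes.indexOf(aPrefix) > -1) {
    app.use(aPrefix, aRouter);
    return;
  }

  app.use(aPrefix, function (aReq, aRes) {
    aRes.status(503).send('Route temporarily disabled for stability testing.');
  });
}

mountIfEnabled('/about', aboutRouter);
mountIfEnabled('/users', usersRouter);
```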

Martii commented 8 years ago

~6.5% nominal and ~15% peak memory usage with /about routes ... no leaks detected

Manually enabling /users route to test stability

Martii commented 8 years ago

~7.1% nominal and ~8.6% peak memory usage with /users route ... slightly slower to release memory on /users/username/comments ... this will be cumulative during testing.

Manually enabling /forum route to test stability

Martii commented 8 years ago

~6.4% nominal and ~7.4% peak memory usage with /forum route

Manually enabling all other discussion routes, except /scripts route issue discussions, to test stability

Martii commented 8 years ago

~7.1% nominal and ~8.8% peak memory usage for global discussions

Manually enabling /group route, excluding API search, to test stability

Martii commented 8 years ago

~7.5% nominal and ~8.9% peak memory usage for /groups route

Manually enabling /libs route, excluding the general / route ... this doesn't include script installations just yet but does show the Source Code tab ... to test stability

Martii commented 8 years ago

~7.5% nominal and ~7.7% peak memory usage for /libs route

Manually enabling /scripts route, excluding the general / route ... this also doesn't include script installations just yet but does show the Source Code tab... to test stability

Martii commented 8 years ago

~9.3% nominal and ~14% peak (spiked) with average ~9.5% peak memory usage ... slow to release on spikes

Manually enabling /meta route (doesn't include oujs - Meta View since that sends the header with a .user.js) to test stability

Martii commented 8 years ago

~9.8% nominal and ~16.7% peak (spiked) with average 10% peak memory usage ... fast to release on spikes and nominal

Manually enabling /install and /src routes ... we are in READ ONLY mode right now for script sources, i.e. no storing of new versions and of course no postings yet... presuming anyone is still sessioned... Minification of script source is also skipped. oujs - Meta View is now receiving.

This will be a longer test for stability.

Martii commented 8 years ago

Disabling prior routes and the previous route... CPU leak detected as well... this wasn't happening before the distro upgrade so it shouldn't be us, but at least I know where to look in depth.

~20% nominal and ~41% peak memory usage... loses stability after about 0.75 hours.

Martii commented 8 years ago

Manually enabling / route

Martii commented 8 years ago

Re-enabled /meta, /install and /src routes for testing an alternate script serving method instead of the res.pipe method... memory consumption has dropped to 13% on initial start... CPU is still a little high though... this might be promising but time will tell.
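
For reference, the difference being tested is roughly streaming versus buffering the source before responding; a minimal sketch, assuming a readable stream from the data store (function names are illustrative):

```js
// Streaming (prior approach): pipe chunks straight into the response.
function sendScriptStreamed(aStream, aRes) {
  aRes.set('Content-Type', 'text/javascript; charset=UTF-8');
  aStream.pipe(aRes);
}

// Buffered (alternate approach): collect all chunks, then send one response.
function sendScriptBuffered(aStream, aRes) {
  var chunks = [];

  aStream.on('data', function (aChunk) {
    chunks.push(aChunk);
  });
  aStream.on('end', function () {
    aRes.set('Content-Type', 'text/javascript; charset=UTF-8');
    aRes.send(Buffer.concat(chunks));
  });
  aStream.on('error', function () {
    aRes.status(500).end();
  });
}
```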

Martii commented 8 years ago

Creeping up again (~31%)... and CPU is sometimes above 100% which doesn't make much sense to me. ref for strikeout

Martii commented 8 years ago

60% memory usage

Martii commented 8 years ago

I know this is currently against STYLEGUIDE.md but manually trying ES6 let in this area... restarted server.

Martii commented 8 years ago

Nominal around 34-35%

Martii commented 8 years ago

Temporarily re-enabled all routes (i.e. you can log in and do everything as usual, as before this issue)... if this crashes again I'm pretty much left with disabling the /install and /src routes... /meta should be okay since I trimmed that up in a prior issue a while back.


P.S. toobusy-js still active... you may still get these... working on this next.

JasonBarnabe commented 8 years ago

Traffic to Greasy Fork has increased 10-fold over the past couple days, with Chrome script update checks going from 10% of traffic to 90%. I imagine this may be the thing affecting you as well.

Martii commented 8 years ago

> Traffic to Greasy Fork has increased 10-fold over the past couple days, with Chrome script update checks going from 10% of traffic to 90%. I imagine this may be the thing affecting you as well.

Yah, I checked your site as well and got a "too busy" a few times... I'm hoping that Tampermonkey is still caching @requires somewhat (and all other @keys too) Cc: @derjanb ... that was an issue back in the USO days. Thanks for the heads up.

Martii commented 8 years ago

Died again... was doing well between 40-52% memory usage ... restarted.

Martii commented 8 years ago

Died again... setting lag to 60ms
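
For context, toobusy-js measures event-loop lag and starts shedding load once it exceeds maxLag, so lowering the value makes the 503s kick in sooner. A minimal sketch of the 60ms setting (the exact wiring in the app may differ):

```js
var toobusy = require('toobusy-js');

// Start answering 503 once the event loop lags more than 60ms.
toobusy.maxLag(60);

app.use(function (aReq, aRes, aNext) {
  if (toobusy()) {
    aRes.status(503).send('We\'re busy... try again later.');
    return;
  }
  aNext();
});
```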

Martii commented 8 years ago

I'm probably going to move the git master/branch check to the /admin route so it is explicitly called and doesn't accidentally trip the server on a visit to /about ... I've done that a few times myself during heavy traffic/mem loads since inception, and there's little error protection in that package (even though there should be). Need slumber again first. ;)
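
In other words, instead of the branch check running as a side effect of rendering /about, it would only run when /admin is explicitly requested; a sketch with a hypothetical getGitBranch() helper standing in for whatever package actually does the lookup:

```js
// Hypothetical sketch: confine the git branch lookup to the /admin route so
// a casual /about visit under heavy load can't trip the server.
app.get('/admin', function (aReq, aRes) {
  getGitBranch(function (aErr, aBranch) { // hypothetical helper
    if (aErr) {
      aRes.status(500).send('Unable to read the git branch.');
      return;
    }
    aRes.render('admin', { branch: aBranch });
  });
});
```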

Martii commented 8 years ago

Here's a 7-day activity graph of what I've been dealing with:

activity

Notice April 5th is when the shiz hit the fan... that's when at least 3 events happened that are dragging the server under.

I appreciate everyone's patience on this, and eventually things should get better.

Martii commented 8 years ago

So this is a snapshot of the duration when I cut off all scripts that didn't have an appropriate @updateURL as well as the 9 scripts that I added one to:

postactivity

... the red arrow is roughly when I cut off those scripts... we've returned somewhat to nominal.
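
A rough sketch of that cutoff, assuming a parsed metadata block is available on the request; the aReq.script shape and the exact rule are illustrative, not the server's actual check:

```js
// Illustrative sketch: refuse update-check traffic for scripts whose
// @updateURL is missing or doesn't point back at this site.
function blockBadUpdateURL(aReq, aRes, aNext) {
  var updateURL = aReq.script && aReq.script.meta && aReq.script.meta.updateURL;

  if (!updateURL || updateURL.indexOf('https://openuserjs.org/') !== 0) {
    aRes.status(403).end(); // cut these off until traffic returns to nominal
    return;
  }
  aNext();
}
```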

derjanb commented 8 years ago

@Martii I'm so sorry. It seems that under certain conditions (which in theory should never occur) the update check runs into an endless loop. Affected people should be able to work around the issue by setting the "Script Update" -> "Check Interval" option to any value other than "Never".

I'm going to prepare a new stable release now. Sorry again for any inconvenience. :(

See also: https://greasyfork.org/en/forum/discussion/comment/22528#Comment_22528

derjanb commented 8 years ago

http://tampermonkey.net/changelog.php?version=4.0.25&show=dhdg

Waiting for the Chrome store to publish it...

derjanb commented 8 years ago

OK, it's published. Chrome should auto-update Tampermonkey soon, but you can also drag and drop the new release[1] to Chrome's extension page.

Sorry again for any inconvenience.

[1] https://tampermonkey.net/test/versions/stable/4.0.25/fa9adf10f06/dhdgffkkebhmkfjojejmpbldmpobfkfo_main.crx

Martii commented 8 years ago

> OK, it's published.

Thanks for looking into this.

> Sorry again for any inconvenience.

Mistakes happen... just glad you are receptive to investigating. Thank you so much. :)

Related to my patch at #955 ... it would seem from the CPU usage that my "forced GC" worked, i.e. for the running-out-of-memory issue, which is caused by overloading the CPU bandwidth... that is unexpected at this time... but a pleasant surprise. See today's current graph: Cc: @mikeal for node

postactivity4

We are still spiking a bit (see greenish as nominal and reddish as spiked below) but hopefully that will subside in about a week, after everyone affected updates and when I up the lag time a little at a time.

postactivity4a

Martii commented 8 years ago

@derjanb One important question though... will TM users need to reinstall userscripts to have TM's update check use the Accept header, or is this automatically retried after a period?

Ref:

derjanb commented 8 years ago

@Martii The good news is that after more people install the latest TM version, your traffic should be back at the March level. Even if users import a self-modified Tampermonkey settings file, the update check will now never run more than once per hour unless it was triggered by a user action. The bad news is that checking the meta data first has been broken since the TM 3.12 release. :( So, thanks for bringing this up. I hope it's not a big issue since 3.12 was released 2015-10-24.

Sorry again. Some large parts of TM 4.0 were re-written from scratch and it looks like I should have done even more testing.

Martii commented 8 years ago

@derjanb

> the update check will now never run more than once per hour unless it was triggered by a user action.

Good to know.

> Some large parts of TM 4.0 were re-written from scratch...

Hmm... now that you mention that... just checked my Chromium, which I use a couple of times a day, and it had TM v3.12.58 installed and not updated... uninstalled and reinstalled from the Chrome Web Store and now it's v4.0.25 ... I rarely tweak settings in Chromium but if it's not updating the extension itself... hmm... time to check Chrome under Windows...

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36 (Distro Chromium)

... uh oh ...

same TM v3.12.58 on:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36 (Windows Chrome)

... more testing.

I test the browsers against OUJS almost daily with oujs - Meta View... that sends out the Accept header with the appropriate value to see if any browser is failing with XHR... it used to be a GM_XHR unit test but I wanted to keep it friendly across the history of browsers and .user.js engines. One never knows what an update brings sometimes. :)

Thanks for the intel.
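
For anyone following along, the Accept header negotiation under discussion boils down to something like this on the serving side (a sketch only; the real handler in the repo is more involved):

```js
// Sketch: when a .user.js engine asks for metadata only, send just the
// ==UserScript== block instead of the full source.
function sendUserScript(aReq, aRes, aSource) {
  var accept = aReq.headers.accept || '';
  var meta = null;

  aRes.set('Content-Type', 'text/javascript; charset=UTF-8');

  if (accept.indexOf('text/x-userscript-meta') > -1) {
    meta = /\/\/ ==UserScript==[\s\S]*?\/\/ ==\/UserScript==/.exec(aSource);
    aRes.send(meta ? meta[0] : '');
    return;
  }

  aRes.send(aSource);
}
```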

derjanb commented 8 years ago

> I rarely tweak settings in Chromium but if it's not updating... hmm.

These days Chrome updates extensions only on restart. This means the extension needs to be downloaded (5,700,000 weekly users × ~1 MB) and then Chrome needs to be restarted. Also, especially when a new major version is released, I release it only to a small but slowly increasing percentage of users. This helps avoid making every TM user angry and giving a bad rating in case there are still some issues. In this case the percentage was at 45% when I was notified of the update problem.

> One never knows what an update brings sometimes. :)

I do have a lot of userscript environment tests as well. Unfortunately it's not easy to test the extension code (like updates, @connect, ...), but I believe I'll have to manage that somehow.

Martii commented 8 years ago

> These days Chrome updates extensions only on restart.

Leaving a browser open isn't on my normal TODO list. ;) i.e. I usually close all browsers when I go AFK unless I'm monitoring via web somewhere.

> I release it only to a small but slowly increasing percentage of users.

Good to know... suppose this could have been a bit worse had everyone gotten it all at once... but we're all the better for surprises like these... hardens up sites and extensions/add-ons I think. ;) :)

Martii commented 8 years ago

Capping isn't working. I have one more idea to control the memory usage, but it will take a bit to rewrite that section. In the meantime, here's the last 6 hours that I've been dealing with. Note the green smudge arrow is when we stopped serving scripts with improper or missing @updateURLs:

postactivity5

Martii commented 8 years ago

Guhh... rebooted the server a few times and the environment variables never showed up... finally got it to work with those... however... the project itself is BUSY right now until further notice.

derjanb commented 8 years ago

@Martii Is there any news on this topic? Is the amount of traffic shrinking?

Martii commented 8 years ago

@derjanb No change... I'm temporarily blocking the offenders in the firewall and working on a different solution... can't chat right now but will later. (up to 11 btw)

Martii commented 8 years ago

@derjanb Still getting hammered but there is a request limiter in place now. A few brief spikes today but they were quashed. Unblocked those IPs in the firewall for the test data... seems we are more hardened now.

The 11 known IPs are the major issue... the remaining spikes from Chrome-based UAs could be installs... or could not be... I don't know exactly how your TM text/x-userscript works when text/x-userscript-meta fails. Perhaps you could enlighten TM users on your "wiki" page for the UserScript metadata block keys with the logic process. There are a few repeats, but now those get quashed as well, so the hundreds of tiny stings we are getting get sent to file 13... err, 429 ;) :)

postactivity7
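
For the curious, a hand-rolled sketch of per-IP limiting that answers 429 when a client hammers the install routes (illustrative only, with assumed window and hit values, not the exact limiter now in place):

```js
// Illustrative sketch: naive in-memory per-IP counter; answer 429 when a
// client exceeds the allowed number of requests per window.
var hits = {};
var WINDOW_MS = 5 * 60 * 1000; // 5 minute window (assumed value)
var MAX_HITS = 30;             // per IP per window (assumed value)

setInterval(function () {
  hits = {}; // reset every window
}, WINDOW_MS);

function rateLimit(aReq, aRes, aNext) {
  var ip = aReq.ip;

  hits[ip] = (hits[ip] || 0) + 1;
  if (hits[ip] > MAX_HITS) {
    aRes.status(429).send('Too many requests... slow down please.');
    return;
  }
  aNext();
}

app.use('/install', rateLimit);
app.use('/src', rateLimit);
```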