Closed: Martii closed this issue 8 years ago.
Found a break in the system... either a hiccup in whatever is causing this... or my distro update on the laptop, which may have a client compatible with the latest Debian... or sizzle... although I did try a Windows VM (virtual machine) and PM (physical machine), a Debian VM, an ArchLinux VM, an ArchLinux PM, and other Linux PMs too, and those failed... so not entirely sure. (Too many inet issues everywhere today.)
I have already seen a 503 with toobusy-js on login, which is probably GH's issue (still guessing here)... leaving this open for a more detailed investigation over the next few days. Apologies for this unscheduled outage... it's definitely out of my control at this time.
Btw dist-upgrade yielded no further updates. :\
Looks like it's still down.
No news yet.
PENDING!... got 5 "too busy"s on login... but I now have access... we'll see how long this stays up on the VPS.
One server restart detected... investigating.
503 ...
I know... it's going to take a bit to resolve this... something is chewing up memory and causing the VPS to crash... this was happening before the addition of the toobusy-js 503s... I'm probably going to take the server down, do some recompilations, and see if that helps, i.e. that is why it's in PENDING status right now.
Patience please. :)
Downgrading node didn't help... it seems that malloc (or whichever low-level lib is being used) isn't freeing up memory in the distro/VPS.
I'm going to try disabling script minification just to be sure... with an environment variable to be added... don't worry, I'll have it pass through to the unminified source so it doesn't break scripts.
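Roughly what that toggle could look like, as a minimal sketch; the DISABLE_SCRIPT_MINIFY variable name and the use of uglify-js are assumptions here, not the project's actual code:

```js
// Sketch only: gate minification behind a hypothetical env var and always
// fall back to the unminified source so installs keep working.
var uglifyJS = require('uglify-js'); // assumed minifier, uglify-js 2.x API

function prepareScriptSource(source) {
  if (process.env.DISABLE_SCRIPT_MINIFY === 'true') {
    return source; // pass-through: serve the unminified source as-is
  }

  try {
    return uglifyJS.minify(source, { fromString: true }).code;
  } catch (err) {
    return source; // on any minifier error, don't break the script
  }
}
```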
Still losing memory, although more slowly, with script minification disabled... i.e. the VPS is going to crash again... watching it right now go down, down, up, down, down, down, up, down, down, down, etc... until eventually there is zero free memory.
So I've systematically ruled out our project and reaffirmed that this is a distro/VPS issue. :\
And there it goes. :\
Have to AFK for a few hours... will be back to try some other things as soon as I can. :\ Leaving the site OFFLINE for the moment.
@sizzlemctwizzle and anyone watching,
So I've put up a constant 503 on all routes at the moment... it's not very pretty, but it will at least let everyone know that "we're busy... try again later" (better than nothing). This is hard-coded into app.js with a manual FORCE_BUSY='true' in the env, and not here on GH dev yet... still running some tests to see if this portion stays up. So far we are at a constant ~6% memory usage... will monitor this for a few hours... sleep in between... wake up... see if Debian has an update that fixes this before I make more reports, etc.
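For context, a hard-coded "everything is 503" switch of this sort can be as small as the following; this is just a sketch of the idea, not the actual app.js change:

```js
// Sketch only: short-circuit every route with a 503 while FORCE_BUSY is set.
var express = require('express');
var app = express();

if (process.env.FORCE_BUSY === 'true') {
  app.use(function (req, res) {
    res.status(503).send('We\'re busy... try again later.');
  });
}

// ...normal route registration would follow here when not forced busy.

app.listen(process.env.PORT || 8080);
```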
I've tried many different versions of node; all result in the same issue with this kernel image on the VPS, i.e. memory gets eaten up. Using the precompiled node binaries, the server lasts for less than 5 minutes... with a manual build from node source I can sometimes get about 45 minutes of uptime. NEITHER OPTION IS SUITABLE, as I can't babysit the server that constantly.
I've also looked into backing out/rolling back the last dist-upgrade, and of course the old packages aren't available in the official repos... so that will fail.
Only three options are left that I can think of...
Just a sidenote... all script sources are intact as far as I can see on local pro, i.e. this is not a DB issue. (Also created the HOST label here on GH, as you might have noticed already.)
~6.5% peak memory usage with styling applied to 503's
Manually enabling /about routes to test stability
Reinstalled all deps, and their deps, and so on... no dist-upgrade available.
~6.5% nominal and ~15% peak memory usage with /about routes ... no leaks detected
Manually enabling /users route to test stability
~7.1% nominal and ~8.6% peak memory usage with /users route ... slightly slower to release memory on /users/username/comments ... this will be cumulative during testing.
Manually enabling /forum route to test stability
~6.4% nominal and ~7.4% peak memory usage with /forum route
Manually enabling all other discussion routes, except the /scripts issue discussions, to test stability
~7.1% nominal and ~8.8% peak memory usage for global discussions
Manually enabling /group route excluding api search to test stability
~7.5% nominal and ~8.9% peak memory usage for /groups route
Manually enabling /libs route excluding general / route ... this doesn't include script installations just yet but does show Source Code tab ... to test stability
~7.5% nominal and ~7.7% peak memory usage for /libs route
Manually enabling /scripts route excluding general / route ... this also doesn't include script installations just yet but does show Source Code tab... to test stability
~9.3% nominal and ~14% peak (spiked) with average ~9.5% peak memory usage ... slow to release on spikes
Manually enabling /meta route (doesn't include oujs - Meta View since that sends the header with a .user.js) to test stability
~9.8% nominal and ~16.7% peak (spiked) with average 10% peak memory usage ... fast to release on spikes and nominal
Manually enabling /install and /src routes ... we are in READ ONLY mode right now for script sources, i.e. no storing of new versions and of course no postings yet... presuming anyone is still sessioned... Minification of script source is also skipped. oujs - Meta View is now receiving.
This will be a longer test for stability.
Disabling the prior routes and the previous route... a CPU leak has been detected as well... this wasn't happening before the distro upgrade, so it shouldn't be us, but at least I know where to look in depth.
~20% nominal and ~41% peak memory usage... loses stability after about 0.75 hours.
Manually enabling / route
Re-enabled the /meta, /install and /src routes to test an alternate script serving method instead of the res.pipe method... memory consumption has dropped to 13% on initial start... CPU is still a little high though... this might be promising, but time will tell.
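To illustrate the difference being tested (a generic sketch with a hypothetical stream source, not the project's actual change): streaming pipes the stored source straight into the response, while the alternate approach buffers it first and sends it in one shot.

```js
// Illustrative only; `sourceStream` is a placeholder for however the stored
// script source is read.

// a) Streaming: pipe straight into the response.
function serveScriptStreamed(sourceStream, res) {
  res.set('Content-Type', 'text/javascript; charset=UTF-8');
  sourceStream.pipe(res);
}

// b) Buffered: collect the chunks first, then send once.
function serveScriptBuffered(sourceStream, res) {
  var chunks = [];
  sourceStream.on('data', function (chunk) { chunks.push(chunk); });
  sourceStream.on('error', function () { res.status(500).end(); });
  sourceStream.on('end', function () {
    res.set('Content-Type', 'text/javascript; charset=UTF-8');
    res.send(Buffer.concat(chunks));
  });
}
```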
Creeping up again (~31%)... and CPU is sometimes above 100%, which doesn't make much sense to me. (ref for strikeout)
60% memory usage
I know this is currently against STYLEGUIDE.md, but I'm manually trying ES6 let in this area... restarted the server.
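For anyone unfamiliar with why that might matter here, a small illustrative example (the helpers are stand-ins, and this is not the actual changed code): let confines a binding to its block, so large temporaries aren't kept reachable through the enclosing function scope the way var bindings are.

```js
// Illustrative only: `transform` and `emit` are stand-in helpers.
function transform(chunk) { return chunk.toString().toUpperCase(); }
function emit(value) { console.log(value.length); }

function processChunks(chunks) {
  for (let i = 0; i < chunks.length; ++i) {
    let buf = transform(chunks[i]); // block-scoped; gone after each iteration's block
    emit(buf);
  }
  // With `var buf`, the binding would still exist here, keeping the last
  // (possibly large) value reachable until the whole function returns.
}

processChunks(['a', 'b', 'c']);
```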
Nominal around 34-35%
Temporarily re-enabling all routes (i.e. you can log in and do things as usual, as before this issue)... if this crashes again I'm pretty much left with disabling the /install and /src routes... /meta should be okay since I trimmed that up in a prior issue a while back.
P.S. toobusy-js still active... you may still get these... working on this next.
Traffic to Greasy Fork has increased 10-fold over the past couple days, with Chrome script update checks going from 10% of traffic to 90%. I imagine this may be the thing affecting you as well.
Yah, I checked your site as well and got a "too busy" a few times... I'm hoping that TamperMonkey is still caching @requires somewhat (and all other @ keys too) Cc: @derjanb ... that was an issue back in the USO days. Thanks for the heads up.
Died again... was doing well between 40-52% memory usage ... restarted.
Died again... setting lag to 60ms
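For reference, the lag knob being adjusted is toobusy-js's maxLag; a minimal sketch of the middleware wiring (the wiring itself is assumed, only the maxLag call is the documented toobusy-js API):

```js
// Sketch only: shed load with a 503 once event-loop lag exceeds 60ms.
var express = require('express');
var toobusy = require('toobusy-js');
var app = express();

toobusy.maxLag(60); // allow up to 60ms of event-loop lag before refusing requests

app.use(function (req, res, next) {
  if (toobusy()) {
    res.status(503).send('Server is too busy right now... try again later.');
  } else {
    next();
  }
});

app.listen(process.env.PORT || 8080);
```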
I'm probably going to move the git master/branch check to the /admin route so it is explicitly called and doesn't accidentally trip the server on a visit to /about ... I've done that a few times myself during heavy traffic/memory loads since inception, and there's little error protection in that package (even though there should be). Need slumber again first. ;)
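One way such a move could look, sketched with a plain child_process call to git (the project's real check lives in a package, and the route shape here is an assumption):

```js
// Sketch only: run the branch check on an explicit /admin visit rather than
// on every /about hit.
var express = require('express');
var exec = require('child_process').exec;
var app = express();

app.get('/admin', function (req, res) {
  exec('git rev-parse --abbrev-ref HEAD', function (err, stdout) {
    if (err) {
      return res.status(500).send('Branch check failed.'); // guard the error path
    }
    res.send('Serving from branch: ' + stdout.trim());
  });
});

app.listen(process.env.PORT || 8080);
```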
Here's a current 7-day activity graph of what I've been dealing with and combating:
Notice that April 5th is when the shiz hit the fan... that's when at least 3 events happened, and it's dragging the server under.
I appreciate everyone's patience on this, and eventually things should get better.
So this is a snapshot of the duration when I cut off all scripts that didn't have an appropriate @updateURL, as well as the 9 scripts that I added one to:
... the red arrow is roughly when I cut off those scripts... we've returned somewhat to nominal.
@Martii I'm so sorry. It seems that under certain conditions (which in theory should never occur) the update check runs into an endless loop. Affected people should be able to work around the issue by setting the "Script Update" -> "Check Interval" option to any value other than "Never".
I'm going to prepare a new stable release now. Sorry again for any inconvenience. :(
See also: https://greasyfork.org/en/forum/discussion/comment/22528#Comment_22528
http://tampermonkey.net/changelog.php?version=4.0.25&show=dhdg
Waiting for the Chrome store to publish it...
OK, it's published. Chrome should auto-update Tampermonkey soon, but you can also drag and drop the new release[1] to Chrome's extension page.
Sorry again for any inconvenience.
> OK, it's published.
Thanks for looking into this.
> Sorry again for any inconvenience.
Mistakes happen... just glad you are receptive to investigating. Thank you so much. :)
Related to my patch at #955 ... judging by the CPU usage, it would seem that my "forced GC" worked on the running-out-of-memory issue, which is caused by overloading the CPU bandwidth... that is unexpected at this time... but a pleasant surprise. See today's current graph: Cc: @mikeal for node
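For context, node only allows a manual collection when started with the --expose-gc flag; a minimal sketch of that general "forced GC" mechanism (this is not the #955 patch itself, and the interval is arbitrary):

```js
// Sketch only: periodically request a full collection when --expose-gc is set,
// i.e. the process was started as `node --expose-gc app.js`.
if (typeof global.gc === 'function') {
  setInterval(function () {
    global.gc(); // trades some CPU for a lower resident memory footprint
  }, 5 * 60 * 1000); // every 5 minutes; chosen arbitrarily for the sketch
}
```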
We are still spiking a bit (see greenish as nominal and reddish as spiked below), but hopefully that will subside in about a week, after everyone affected updates and as I raise the lag time a little at a time.
@derjanb
One important question though... will TM users need to reinstall userscripts to have the TM update check send the Accept header, or is this automatically retried after a period?
Ref:
@Martii The good news is that, after more people install the latest TM version, your traffic should be back at the March level. Even if users import a self-modified Tampermonkey settings file, the update check will now never run more than once per hour unless it was triggered by a user action. The bad news is that checking the metadata first has been broken since the TM 3.12 release. :( So, thanks for bringing this up. I hope it's not a big issue, since 3.12 was released 2015-10-24.
Sorry again. Some large parts of TM 4.0 were re-written from scratch and it looks like I should have done even more testing.
@derjanb
> the update check will now never run more than once per hour unless it was triggered by a user action.
Good to know.
> Some large parts of TM 4.0 were re-written from scratch...
Hmm... now that you mention that... I just checked my Chromium, which I use a couple of times a day, and it had TM v3.12.58 installed and not updated... uninstalled and reinstalled from the Chrome Web Store and now it's v4.0.25 ... I rarely tweak settings in Chromium, but if it's not updating the extension itself... hmm... time to check Chrome under Windows...
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36 (Distro Chromium)
... uh oh ...
same TM v3.12.58 on:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36 (Windows Chrome)
... more testing.
I test the browsers with OUJS almost daily with oujs - Meta View... that sends out the Accept header with the appropriate value to see if any browser is failing with XHR... it used to be GM_XHR as a Unit Test, but I wanted to keep it friendly across the history of browsers and .user.js engines. One never knows what an update brings sometimes. :)
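For anyone curious what that kind of check amounts to, a small illustrative userscript snippet (the URL is a placeholder and this is not the actual oujs - Meta View code):

```js
// Illustrative only: request metadata with an explicit Accept header via the
// GM_xmlhttpRequest API that userscript engines provide.
// (A real userscript would need `// @grant GM_xmlhttpRequest` in its metadata block.)
GM_xmlhttpRequest({
  method: 'GET',
  url: 'https://openuserjs.org/meta/SomeUser/Some_Script.meta.js', // placeholder path
  headers: {
    'Accept': 'text/x-userscript-meta' // ask for just the metadata block
  },
  onload: function (response) {
    console.log(response.status, response.responseText.slice(0, 80));
  }
});
```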
Thanks for the intel.
> I rarely tweak settings in Chromium but if it's not updating... hmm.
These days Chrome updates extensions only on restart. This means the extension needs to be downloaded (5,700,000 weekly users * ~1 MB) and then Chrome needs to be restarted. And especially when a new major version is released, I'm releasing it only to a small but slowly increasing percentage of users. This helps avoid making every TM user angry and leaving a bad rating in case there are still some issues. In this case the percentage was 45% when I was notified of the update problem.
> One never knows what an update brings sometimes. :)
I do have a lot of userscript environment tests as well. Unfortunately it's not easy to test the extension code (like updates, @connect, ...), but I believe I have to manage that somehow.
> These days Chrome updates extensions only on restart.
Leaving a browser open isn't on my normal TODO list. ;) e.g. I usually close all browsers when I go AFK unless I'm monitoring via web somewhere.
> I'm releasing it only to a small but slowly increasing percentage of users.
Good to know... I suppose this could have been a bit worse had everyone gotten it all at once... but we're all the better for surprises like these... they harden up sites and extensions/add-ons, I think. ;) :)
Capping isn't working. I have one more idea to control the memory usage, but it will take a bit to rewrite that section. In the meantime, here's the last 6 hours that I've been dealing with. Note the green smudge arrow is when we stopped serving scripts with improper or missing @updateURLs:
Guhh... rebooted the server a few times and the environment variables never showed up... finally got it to work with those... however... the project itself is BUSY right now until further notice.
@Martii Is there any news on this topic? Is the amount of traffic shrinking?
@derjanb No change... I'm temporarily blocking the offenders in the firewall and working on a different solution... can't chat right now but will later. (Up to 11 offending IPs now, btw.)
@derjanb Still getting hammered but there is a request limiter in place now. A few brief spikes today but they were quashed. Unblocked those IPs in the firewall for the test data... seems we are more hardened now.
The 11 known IPs are the major issue... the remaining spikes from Chrome-based UAs could be installs... or could not be... I don't know exactly how your TM text/x-userscript request works when text/x-userscript-meta fails. Perhaps you could enlighten TM users on your "wiki" page for the UserScript metadata block keys with the logic process. There are a few repeats, but now those get quashed as well, so the hundreds of tiny stings we are getting get sent to file 13... err, 429. ;) :)
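For the curious, the request limiter amounts to a per-IP counter that answers 429 once a threshold is exceeded; a rough sketch with illustrative numbers and an in-memory store (not the deployed configuration):

```js
// Sketch only: naive per-IP rate limiting; numbers and storage are illustrative.
var express = require('express');
var app = express();

var WINDOW_MS = 60 * 1000; // 1 minute window
var MAX_REQUESTS = 60;     // per IP per window
var hits = {};             // ip -> { count: Number, windowStart: Number }

app.use(function (req, res, next) {
  var now = Date.now();
  var entry = hits[req.ip];
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    entry = { count: 0, windowStart: now };
  }
  entry.count += 1;
  hits[req.ip] = entry;

  if (entry.count > MAX_REQUESTS) {
    return res.status(429).send('Too many requests... try again later.');
  }
  next();
});

app.listen(process.env.PORT || 8080);
```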
I'm unable to get into the VPS to restart it, and it's just spinning in a web browser. NOTE: This is purely a VPS issue with our provider and not the project or the node configuration.
Messaged @sizzlemctwizzle Cc: @jonleibowitz
Last script update on local pro at 2016-04-05T12:07:05.214Z
Refs:
425
q.XMLHTTPREQUEST.RETRIES in .user.js