cjlee112 / spnet

selected papers network web engine
http://thinking.bioinformatics.ucla.edu/2011/07/02/open-peer-review-by-a-selected-papers-network/
GNU General Public License v2.0

spnetwork down? #114

Open johncarlosbaez opened 9 years ago

johncarlosbaez commented 9 years ago

When I go to

https://selectedpapers.net/

I now get this error message:

Not Found

The requested URL /cgi-bin/spnet-autostart.cgi was not found on this server.

cjlee112 commented 9 years ago

Hi John, I restarted the server and added some logging to try to figure out why it keeps running out of memory, even though our code monitors its memory usage very closely.

cjlee112 commented 9 years ago

Note to self: maybe cherrypy.engine.exit() still doesn't let the watchmem.py process terminate because the web server thread is not "daemonized"? Maybe I need to use daemon=True in web.Server.start() as well as in watchmem.py?
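For reference, a minimal standard-library sketch of the behavior the note is asking about. At exit, the Python interpreter waits for every non-daemon thread to finish, while daemon threads are abandoned at shutdown. `serve_forever` and `start_server_thread` are hypothetical stand-ins, not the project's `web.Server` or `watchmem.py` code, and the sketch assumes Python 3 (the `daemon=` constructor argument was added in 3.3; older code sets `thread.daemon = True` before `start()`).

```python
import threading
import time


def serve_forever(stop_event):
    """Hypothetical stand-in for a blocking web server loop."""
    while not stop_event.is_set():
        time.sleep(0.1)


def start_server_thread(daemon):
    """Start the loop in a background thread and return (thread, stop_event).

    With daemon=False the interpreter will not exit until the loop ends;
    with daemon=True the thread is dropped when the main thread exits.
    """
    stop_event = threading.Event()
    thread = threading.Thread(
        target=serve_forever, args=(stop_event,), daemon=daemon
    )
    thread.start()
    return thread, stop_event


if __name__ == "__main__":
    thread, stop_event = start_server_thread(daemon=True)
    print("daemon thread:", thread.daemon)
    # The process can exit here even though serve_forever is still looping.
```

If the server thread were started with `daemon=False` and never told to stop, the process would hang at exit, which would match the stalled-shutdown hypothesis in the note.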

cjlee112 commented 9 years ago

OK, our new logging code has revealed what's going on. Our memory-watching and restart code is in fact working properly, but apparently we're not running it frequently enough to forestall webfaction killing our processes. Specifically, it checks every 60 seconds and shuts down the server if memory usage exceeds 150 MB. The logging data show that all of that is working correctly. The problem is that occasionally the server's memory usage shoots from below our threshold (150 MB) to over webfaction's kill threshold (512 MB) in less than one minute, and webfaction kills all our processes before our code has a chance to shut itself down. Note that if webfaction did nothing (i.e. didn't kill all our processes), there would be no problem at all, because our code would shut down (freeing all the memory) and restart within 60 seconds.

So, in terms of an immediate fix to stop this pattern from recurring, it's just a question of ensuring that our watchmem code intervenes BEFORE webfaction's kill code.
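A minimal sketch of what a tighter polling loop could look like. This is not the project's actual `watchmem.py`; the function names, the 5-second interval, and the use of `/proc/<pid>/statm` (Linux only) are all assumptions made for illustration, and the code targets Python 3. Only the 150 MB threshold comes from the discussion above.

```python
import os
import time

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
LIMIT_BYTES = 150 * 1024 * 1024  # 150 MB threshold from the discussion
INTERVAL_SECONDS = 5             # assumed value, shorter than the current 60


def read_rss_bytes(pid="self"):
    """Return the resident set size in bytes from /proc/<pid>/statm (Linux)."""
    with open(f"/proc/{pid}/statm") as f:
        fields = f.read().split()
    return int(fields[1]) * PAGE_SIZE  # second field is resident pages


def watch(limit_bytes, interval, read_rss, on_exceed,
          max_checks=None, sleep=time.sleep):
    """Poll read_rss() and call on_exceed(rss) the first time it is over limit.

    Returns the offending reading, or None if max_checks polls all passed.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        rss = read_rss()
        if rss > limit_bytes:
            on_exceed(rss)
            return rss
        checks += 1
        sleep(interval)
    return None


if __name__ == "__main__":
    # Hypothetical handler: a real one would shut the server down cleanly.
    watch(LIMIT_BYTES, INTERVAL_SECONDS, read_rss_bytes,
          lambda rss: print("over limit:", rss), max_checks=1)
```

A shorter interval only narrows the window. If memory can jump from under 150 MB to over 512 MB faster than the interval, the host's kill can still win the race.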

(As for the deeper problem of why Python never releases memory, lots of other people have encountered it, e.g. see http://revista.python.org.ar/2/en/html/memory-fragmentation.html and http://stackoverflow.com/questions/3737268/memory-consumption-in-cherrypy, so we just have to work around it. Longer term, we should probably switch the web server to WSGI or some other deployment option, rather than keeping Python permanently in memory as we currently do.)

metaweta commented 9 years ago

I'm getting the same error right now; it's been down for at least four hours.

cjlee112 commented 9 years ago

Restarted fine by simply killing the watchmem.py process, which had apparently stalled. That is the exact opposite of the problem this issue was created to track (the process getting killed unexpectedly).

glangmead commented 9 years ago

I'm getting the same error, now and during my attempts over the last day.

cjlee112 commented 9 years ago

restarted the server, thanks!

cben commented 8 years ago

Down again, same error.

cben commented 8 years ago

When Webfaction kills everything, doesn't it run anything to let you serve again without manual intervention? Sounds barbarous.

Perhaps you could self-police better using an rlimit instead of periodic checking? Or just check every 5 seconds.

Do you have any uptime monitor? https://uptimerobot.com is unsophisticated but free.

cjlee112 commented 8 years ago

@cben Thanks! Restarted. It would take a bit of work to get webfaction's autostart to work for us, since we'd have to write code to check whether our mongodb is up, start the DB if it isn't, and then start the web server... I just don't have time to deal with this now. When Python eats up too much memory, webfaction kills ALL processes, not just Python... grr.
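The check-then-start sequence described above could be sketched roughly as follows. Everything here is an assumption for illustration: the helper names, the start commands, and the web server port are hypothetical, and "up" is approximated by whether something accepts TCP connections on the port (27017 is MongoDB's default). Written for Python 3.

```python
import socket
import subprocess


def port_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ensure_running(host, port, start_cmd, run=subprocess.Popen):
    """Launch start_cmd only if nothing is listening on host:port.

    Returns True if the command was launched, False if the service was up.
    """
    if port_open(host, port):
        return False
    run(start_cmd)
    return True


if __name__ == "__main__":
    # Hypothetical script paths; substitute whatever actually starts each service.
    # Database first, then the web server (8000 is a placeholder port).
    ensure_running("127.0.0.1", 27017, ["/path/to/start-mongodb.sh"])
    ensure_running("127.0.0.1", 8000, ["/path/to/start-webserver.sh"])
```

A fuller version would wait for the database port to open before launching the web server, since `Popen` returns as soon as the start command has been spawned.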

anpc commented 8 years ago

Down again.

cjlee112 commented 8 years ago

grrr, something has changed either on webfaction or in Google+'s data feed that is causing this to occur way more frequently. Also, the autostart.cgi that is supposed to be able to restart us automatically is not working for me... I've made a defensive change to restart the server hourly, to keep Python's memory usage from growing and growing. I'll return to this issue once webfaction fixes their autostart support for me.

GeekyPeas commented 8 years ago

Still down?

cjlee112 commented 8 years ago

@GeekyPeas Thanks. Restarted.

GeekyPeas commented 8 years ago

Sorry, but I think it is still down...

cjlee112 commented 8 years ago

@GeekyPeas Thanks again.
@cben I have now implemented an rlimit to try to stop this from happening, but I'm not sure that will accomplish much. The problem is Linux doesn't support rlimit on RSS, only VSZ, whereas Webfaction is watching (and killing us for) RSS. So setting a limit on VSZ may not catch the RSS problem that Webfaction is nuking us for. Sigh.
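For reference, a minimal sketch of capping virtual address space with the standard `resource` module. It is not the code actually deployed here, and the helper names and the 512 MB figure in the usage lines are assumptions. `RLIMIT_AS` limits virtual size, which is the VSZ-not-RSS mismatch described above. Python's `resource` module does define `RLIMIT_RSS`, but current Linux kernels do not enforce it.

```python
import resource


def clamp_soft_limit(requested, hard):
    """A soft limit may not exceed the hard limit (RLIM_INFINITY = no cap)."""
    if hard == resource.RLIM_INFINITY:
        return requested
    return min(requested, hard)


def cap_address_space(max_bytes):
    """Set this process's RLIMIT_AS soft limit; return (soft, hard) in effect.

    Once the cap is hit, large allocations fail with MemoryError instead of
    the process growing further. This bounds virtual size, not resident size.
    """
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    soft = clamp_soft_limit(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
    return resource.getrlimit(resource.RLIMIT_AS)


if __name__ == "__main__":
    # Assumed cap for illustration; virtual size normally exceeds RSS, so the
    # right value depends on how the two diverge for this server.
    print(cap_address_space(512 * 1024 * 1024))
```

Because virtual size is usually larger than resident size, a cap chosen this way can either trigger too early or fail to stop an RSS spike, which is the uncertainty raised in the comment.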

cjlee112 commented 8 years ago

Grr. Various problems are preventing our fixes from working:

Sigh.

cjlee112 commented 8 years ago

and of course webfaction hasn't done anything about my request that they fix their autostart to actually work.

cben commented 8 years ago

uptimerobot has a "keyword" monitor type that greps the response for a string you specify.

evelynmitchell commented 8 years ago

Down again.