johncarlosbaez opened this issue 9 years ago
Hi John, I restarted the server and added some logging to try to figure out why it keeps running out of memory despite our code monitoring its memory usage very closely.
Note to self: maybe cherrypy.engine.exit() still doesn't allow the watchmem.py process to terminate because the web server thread is not "daemonized"? Maybe I need to use daemon=True in web.Server.start() as well as in watchmem.py?
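For reference, a minimal sketch of the daemon-thread distinction being suspected here (the serve_forever stand-in is hypothetical; the real loop lives inside CherryPy). A non-daemon thread keeps the interpreter alive after the main thread returns, while a daemon=True thread does not, which is why a non-daemonized server thread could prevent the process from terminating after cherrypy.engine.exit():

```python
import threading
import time

def serve_forever():
    # Hypothetical stand-in for the web server's blocking loop.
    while True:
        time.sleep(1)

# A non-daemon thread keeps the interpreter alive even after the main
# thread finishes; a daemon thread does not. The suspected fix is to
# start the server thread (and watchmem.py's worker) with daemon=True
# so that cherrypy.engine.exit() can actually bring the process down.
server_thread = threading.Thread(target=serve_forever, daemon=True)
server_thread.start()

print(server_thread.daemon)  # True: will not block interpreter exit
```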
OK, our new logging code has revealed what's going on. Our memory-watching and restart code is in fact working properly, but apparently we're not running it frequently enough to forestall webfaction killing our processes. Specifically, it checks every 60 seconds and shuts down the server if memory usage exceeds 150 MB, and the logging data show that all of this is working correctly.

The problem is that occasionally the server's memory usage shoots from below our threshold (under 150 MB) to over the kill threshold (512 MB) in less than one minute, and webfaction kills all our processes before our code has a chance to shut itself down. Note that if webfaction did nothing (i.e. didn't kill all our processes), there would be no problem at all, because our code would shut down (freeing all the memory) and restart within less than 60 seconds.
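The monitoring logic described above can be sketched roughly like this (a simplified stand-in for watchmem.py, not its actual code; the threshold and interval are the figures from this comment). The failure mode is visible in the loop: a spike that crosses from under 150 MB to over 512 MB between two checks is invisible to it.

```python
import time

THRESHOLD_MB = 150   # our shutdown threshold
CHECK_SECS = 60      # check interval; webfaction's kill threshold is 512 MB

def watch(get_rss_mb, shutdown, sleep=time.sleep):
    """Poll memory usage and shut down once it crosses the threshold.

    get_rss_mb and shutdown are injected so the loop stays testable;
    in the real server, shutdown would be something like
    cherrypy.engine.exit() followed by a restart.
    """
    while True:
        if get_rss_mb() > THRESHOLD_MB:
            shutdown()
            return
        sleep(CHECK_SECS)

# Simulated run: memory climbs past the threshold on the third check.
readings = iter([80, 120, 400])
events = []
watch(get_rss_mb=lambda: next(readings),
      shutdown=lambda: events.append("shutdown"),
      sleep=lambda s: None)  # skip real sleeping in the simulation
print(events)  # ['shutdown']
```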
So, in terms of an immediate fix to stop this pattern from recurring, it's just a question of ensuring that our watchmem code intervenes BEFORE webfaction's kill code.
(In terms of the deeper problem of why Python never releases memory back to the OS: this is something lots of other people have encountered, e.g. see http://revista.python.org.ar/2/en/html/memory-fragmentation.html and http://stackoverflow.com/questions/3737268/memory-consumption-in-cherrypy, so we just have to work around it. Longer term, we should probably switch the web server to WSGI or some other deployment option, rather than keeping a long-running Python process permanently in memory as we are currently doing.)
I'm getting the same error right now; it's been down for at least four hours.
Restarted fine by simply killing the watchmem.py process, which apparently was stalled: the exact opposite of the problem this issue was created to track (a process getting killed unexpectedly).
I'm getting the same error, both now and during my attempts over the last day.
restarted the server, thanks!
Down again, same error.
When Webfaction kills everything, doesn't it run anything to let you serve again without manual intervention? Sounds barbarous.
Perhaps you could self-police better using an rlimit instead of periodic checking? Or just check every 5 seconds.
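For what it's worth, the rlimit idea can be sketched with Python's stdlib resource module (Unix-only; the 512 MB figure is webfaction's kill threshold from above, used here purely as an illustration). Once the soft limit on address space is set, an allocation that would exceed it fails with MemoryError instead of growing until an external killer steps in:

```python
import resource

LIMIT_BYTES = 512 * 1024 * 1024  # illustrative cap matching webfaction's 512 MB

soft, hard = resource.getrlimit(resource.RLIMIT_AS)
# Lower only the soft limit, and never raise it above the existing hard limit.
new_soft = LIMIT_BYTES if hard == resource.RLIM_INFINITY else min(LIMIT_BYTES, hard)
resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))

try:
    blob = bytearray(1024 * 1024 * 1024)  # try to allocate 1 GB
    refused = False
except MemoryError:
    refused = True

# Restore the original soft limit so the rest of the process is unaffected.
resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
print(refused)
```

Note the caveat discussed later in this thread: RLIMIT_AS caps virtual address space (VSZ), not the RSS figure webfaction actually watches.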
Do you have any uptime monitor? https://uptimerobot.com is unsophisticated but free.
@cben Thanks! Restarted. It would take a bit of work to get webfaction's autostart to work for us, since we'd have to write code to check whether our mongodb is up, start the DB if not, and then start the web server... I just don't have time to deal with this now. When Python eats up too much memory, webfaction kills ALL our processes, not just Python... grr.
Down again.
Grrr, something has changed, either on webfaction's side or in Google+'s data feed, that is causing this to occur far more frequently. Also, the autostart.cgi that is supposed to restart us automatically is not working for me... I've made a defensive change to restart the server hourly to prevent Python's memory usage from growing and growing. I'll return to this issue once webfaction fixes their autostart support for me.
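The hourly-restart workaround can be sketched like this (a hedged stand-in, not the actual change; the interval is the "hourly" figure from this comment). Re-exec'ing the interpreter gives the new process a fresh heap, which sidesteps Python's reluctance to return freed memory to the OS:

```python
import os
import sys
import threading

RESTART_SECS = 3600  # restart hourly, per the defensive change described above

def restart_self():
    # Replace this process in place with a fresh copy of the same script.
    # The new process starts with a clean heap, reclaiming everything
    # Python accumulated but never handed back to the OS.
    os.execv(sys.executable, [sys.executable] + sys.argv)

def schedule_restart(seconds=RESTART_SECS):
    timer = threading.Timer(seconds, restart_self)
    timer.daemon = True  # don't let the pending timer block a clean shutdown
    timer.start()
    return timer
```

In the real server this would be scheduled after startup (and the timer cancelled on a deliberate shutdown); an external cron job that kills and relaunches the process would achieve the same effect from outside.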
Still down?
@GeekyPeas Thanks. Restarted.
Sorry, but I think it is still down...
@GeekyPeas Thanks again.
@cben I have now implemented an rlimit to try to stop this from happening, but I'm not sure it will accomplish much. The problem is that Linux doesn't enforce an rlimit on RSS (RLIMIT_RSS exists but modern kernels ignore it), only on virtual address space (VSZ, via RLIMIT_AS), whereas Webfaction is watching (and killing us for) RSS. So setting a limit on VSZ may not catch the RSS growth that Webfaction is nuking us for. Sigh.
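To make the VSZ-vs-RSS mismatch concrete, here is a small parser for the VmSize (VSZ) and VmRSS lines of Linux's /proc/&lt;pid&gt;/status (the sample numbers are invented for illustration). VSZ counts all mapped address space, including pages never actually touched, so it normally sits well above RSS; that gap is why a cap on VSZ is only a loose proxy for the RSS figure Webfaction actually enforces:

```python
import re

def mem_stats(status_text):
    """Parse VmSize (VSZ) and VmRSS (both in kB) from /proc/<pid>/status text."""
    stats = {}
    for key in ("VmSize", "VmRSS"):
        match = re.search(key + r":\s+(\d+)\s+kB", status_text)
        stats[key] = int(match.group(1))
    return stats

# Illustrative snapshot (numbers invented): VSZ far above RSS.
sample = "VmSize:\t  480000 kB\nVmRSS:\t  140000 kB\n"
stats = mem_stats(sample)
print(stats)  # {'VmSize': 480000, 'VmRSS': 140000}
```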
Grr. Various problems are preventing our fixes from working.
Sigh.
And of course webfaction hasn't done anything about my request that they fix their autostart to actually work.
uptimerobot has a "keyword" monitor type that greps the response for a string you specify.
Down again.
When I go to
https://selectedpapers.net/
I now get this error message: