stevekinney opened this issue 10 years ago
Update: Power cycling worked, which leads me to believe that we do have a memory leak on our hands.
Hi Steve,
A little more information about the system in its failing state will help determine where to look. The next time you experience this failure, could you run the following commands and share the contents of the created files?
$ COLUMNS=512 top -bcn 1 > top.txt
$ free -t -m > free.txt
$ ps -aeF > ps.txt
Also, knowing the time since the last deployment would help @mzgoddard and me estimate severity. Do you know how long it had been since you last deployed prior to the incident you reported here?
@jugglinmike So, it looks like we're coming across a daily memory leak issue. I rebooted the server yesterday and it was out of memory again today.
Here is the console message:
The last deployment was the last merge into master. But it's run out of memory since the last time I rebooted the server, which was yesterday.
Thoughts?
/cc @escoleman3 @kgotchet @jlefeber @mzgoddard
@stevekinney The next time this happens (tomorrow morning, by the sound of it) and before rebooting the server, could you grab the stats I mentioned in my previous comment?
Yup, I couldn't log in because the key on the server was from my CEE iMac, which I don't have anymore. So, I need @escoleman3 to pop in my personal key. I rebooted because someone needed to use it in the next two hours.
So, @jugglinmike—the server went down twice today. I believe @escoleman3 reset it once this morning. I'm including the information you requested.
Thanks, @stevekinney. @mzgoddard and I have run through the data, and we think we understand the problem. This is our theory:
It looks like the "top" server is failing occasionally and leaving its child processes (the activity servers) orphaned. The forever module is correctly restarting the top-level server, and it is spawning new activity servers. This repeats over time, until the environment is filled with zombie servers.
This highlights two separate problems: the failures themselves (#1) and the orphaned processes they leave behind (#2).
#1 is likely caused by a memory leak, and resolving it may require additional forensics. #2 can be resolved by maintaining a list of child process IDs on disk and killing those processes on startup.
#1 is definitely the trickier problem, but (if we've interpreted all this correctly) resolving #2 will result in improved application behavior: the app will continue to fail intermittently, but it will immediately restart itself cleanly. The site will suffer little downtime (and it will be resolved automatically), though it will kick active users and lose saved activity results.
I'm going to begin work on a fix for #2 tomorrow, as it seems to be the low-hanging fruit here.
Does this make sense to you?
@jugglinmike I noticed the site was down today, but the server was up. So I ran the deploy.sh script and got this error that I think we talked about in #65. I'm going to try to power cycle the server in the meantime.