stevekinney opened this issue 10 years ago
Update: Power cycling worked, which leads me to believe that we do have a memory leak on our hands.
Hi Steve,
A little more information about the system in its failing state will help determine where to look. The next time you experience this failure, could you run the following commands and share the contents of the created files?
$ COLUMNS=512 top -bcn 1 > top.txt
$ free -t -m > free.txt
$ ps -aeF > ps.txt
Also, knowing the time since the last deployment would help @mzgoddard and me estimate severity. Do you know how long it had been since you last deployed prior to the incident you reported here?
@jugglinmike So, it looks like we're coming across a daily memory leak issue. I rebooted the server yesterday and it was out of memory again today.
Here is the console message:
The last deployment was the last merge into master. But it's run out of memory since the last time I rebooted the server, which was yesterday.
Thoughts?
/cc @escoleman3 @kgotchet @jlefeber @mzgoddard
@stevekinney The next time this happens (tomorrow morning, by the sound of it) and before rebooting the server, could you grab the stats I mentioned in my previous comment?
Yup, I couldn't log in because the key on the server was from my CEE iMac, which I don't have anymore. So, I need @escoleman3 to pop in my personal key. I rebooted because someone needed to use it in the next two hours.
So, @jugglinmike—the server went down twice today. I believe @escoleman3 reset it once this morning. I'm including the information you requested.
Thanks, @stevekinney. @mzgoddard and I have run through the data, and we think we understand the problem. This is our theory:
It looks like the "top" server is failing occasionally and leaving its child processes (the activity servers) orphaned. The forever module is correctly restarting the top-level server, and it is spawning new activity servers. This repeats over time, until the environment is filled with zombie servers.
This highlights two separate problems: the failures themselves (#1) and the orphaned processes they leave behind (#2).
#1 is likely caused by a memory leak, and resolving it may require additional forensics. #2 can be resolved by maintaining a list of child process IDs on disk and killing those processes on startup.
#1 is definitely the trickier problem, but (if we've interpreted all this correctly) resolving #2 will result in improved application behavior: the app will continue to fail intermittently, but it will immediately restart itself cleanly. The site will suffer little downtime (and it will be resolved automatically), though it will kick active users and lose saved activity results.
I'm going to begin work on a fix for #2 tomorrow, as it seems to be the low-hanging fruit here.
Does this make sense to you?
@jugglinmike I noticed the site was down today, but the server was up. So I ran the deploy.sh script and got this error that I think we talked about in #65. I'm going to try to power cycle the server in the meantime.