ckelner / ice-custom

Netflix Ice, customized for TWC-specific changes. See https://github.com/TheWeatherCompany/ice for contributions back to Netflix.

Ice processor daemon stopped processing #4

Closed: ckelner closed this issue 9 years ago

ckelner commented 9 years ago

Contacted by Landon that the latest data was not there.

Logged into the server and checked the service status for ice-processor; it was not running:

$ sudo service ice-processor status
Checking ice-processor...                         Process dead but pidfile exists

Checked the log and found no errors; the last message was from Jan 25:

2015-01-25 17:01:30,800 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - data has been processed. ignoring all files at 2015-01
2015-01-25 17:01:30,800 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - AWS usage processed.

No indication of why.

I started the processor again:

$ sudo service ice-processor start
Starting ice-processor...                         
PID: 1036
Ok

Disk space looks ok:

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_c6amibase-lv_root
                      6.7G  3.8G  2.5G  61% /
tmpfs                 1.8G     0  1.8G   0% /dev/shm
/dev/xvda1            485M   85M  375M  19% /boot

Not a whole lot of memory though:

$ free -m
             total       used       free     shared    buffers     cached
Mem:          3641       3567         74          0          2        170
-/+ buffers/cache:       3394        247
Swap:          815        178        637
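
Given how tight memory is, one thing worth checking (a guess at the cause, not something confirmed by the logs) is whether the kernel OOM killer terminated the JVM:

# Look for OOM-killer activity; a killed JVM shows up as "Killed process <pid> (java)"
dmesg | grep -iE 'oom|killed process'
sudo grep -i 'killed process' /var/log/messages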
ckelner commented 9 years ago

First order of business will be to put Ice on larger (more memory) servers.
Second order of business is to build some logic to make sure the processor daemon stays running. (I thought the Linux 'service' pattern was supposed to ensure this, but I need to do some reading and will probably pick @clstokes' brain for help.)
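
A plain SysV init script only starts and stops the process; it doesn't supervise it. A minimal cron-driven watchdog (just a sketch of one option, not what we deployed; the path is hypothetical and it assumes the init script's status action exits non-zero when the process is dead) could look like:

#!/bin/bash
# /usr/local/bin/ice-processor-watchdog.sh (hypothetical path)
# Run from root's crontab, e.g.: */5 * * * * /usr/local/bin/ice-processor-watchdog.sh
# Restart ice-processor if its init script reports it as not running.
if ! service ice-processor status > /dev/null 2>&1; then
    logger -t ice-watchdog "ice-processor is not running; restarting"
    service ice-processor start
fi

Something like monit or an Upstart respawn stanza would be sturdier than cron, but this is the general idea.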

ckelner commented 9 years ago

Third order of business: look at the west coast server.

ckelner commented 9 years ago

Same story on the west coast server: no indication of why it died.

ckelner commented 9 years ago

I am rolling out m3.larges in place of c3.larges. The m3s have twice as much memory as the c3s.

ckelner commented 9 years ago

If for some reason that doesn't suffice, I will look at moving to r3.large.

ckelner commented 9 years ago

Memory snapshot from the us-west-2 server this morning:

free -m
             total       used       free     shared    buffers     cached
Mem:          3641       3197        444          0        113        354
-/+ buffers/cache:       2729        912
Swap:          815        362        453

and the processes themselves:

ps auxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     31698  0.1 26.1 5485080 973932 ?      SNl  01:13   1:17 java -Xmx2G -Xms2048m -XX:MaxPermSize=2048m -jar ice-processor.jar port=1234
tomcat    1707  0.1 44.3 5653424 1654456 ?     Sl   Jan23  12:58 /usr/lib/jvm/java/bin/java -Djavax.sql.DataSource.Factory=org.apache.commons.dbcp.BasicDataSourceFactory -Xmx2G -Xms2048...
clstokes commented 9 years ago

If -Xmx2G was the same setting as before, then increasing the memory of the box isn't going to change anything. If memory is the problem, we need to increase the JVM's max heap with that flag (i.e. -Xmx4G).

Let's also add -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the script starting the JVM so we can see its memory usage.
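
As a concrete sketch, the processor launch line would end up looking something like this (the 4G heap is the suggested value above; the -Xloggc path is optional and hypothetical):

java -Xmx4G -Xms2048m -XX:MaxPermSize=2048m \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/ice-processor-gc.log \
  -jar ice-processor.jar port=1234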

ckelner commented 9 years ago

@clstokes the box only has 3.75 GB to give, and two processes are each consuming upwards of 2 GB. Does your statement still hold true given that?

I'll add those flags. I'm working on a script to push memory info into CloudWatch so we can get a better picture.
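
A rough sketch of what such a script could look like (the namespace, metric name, and paths are hypothetical; it assumes the AWS CLI is installed and the instance can call cloudwatch:PutMetricData):

#!/bin/bash
# Push the memory-used percentage to CloudWatch; run periodically from cron.
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

# Total memory from the "Mem:" row; used memory (excluding buffers/cache)
# from the "-/+ buffers/cache:" row of free -m.
TOTAL=$(free -m | awk '/^Mem:/ {print $2}')
USED=$(free -m | awk '/buffers\/cache/ {print $3}')
USED_PCT=$(awk -v u="$USED" -v t="$TOTAL" 'BEGIN {printf "%.1f", u * 100 / t}')

aws cloudwatch put-metric-data \
  --namespace "TWC/Ice" \
  --metric-name MemoryUsedPercent \
  --dimensions InstanceId="$INSTANCE_ID" \
  --value "$USED_PCT" \
  --unit Percent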

ckelner commented 9 years ago

I should say, the box only had 3.75 GB (c3.large); it's now running as an m3.large (7.5 GB of memory).

clstokes commented 9 years ago

I didn't realize there were two 2G processes. Were both processes dead?

The max memory flag would still need to be increased if memory is indeed our problem.

ckelner commented 9 years ago

No, only one process was dead. Tomcat was still running, but the process we daemonized (seen here: https://github.com/TheWeatherCompany/grid-config-mgmt/blob/master/provisioners/puppet/modules/grid-ice/files/ice-processor) was dead, with no indication of why. The log just stopped. The daemon script reported: Process dead but pidfile exists
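
For reference, that status message just means the pidfile is still on disk but no process is running under the recorded PID; the check boils down to something like this (a simplified sketch, not the exact script linked above):

# Simplified version of the "status" check
PIDFILE=/var/run/ice-processor.pid   # hypothetical path
if [ -f "$PIDFILE" ] && ! kill -0 "$(cat "$PIDFILE")" 2> /dev/null; then
    echo "Process dead but pidfile exists"
fi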

ckelner commented 9 years ago

@clstokes suggested we use Boundary to monitor the memory. Example usage for config-mgmt here: https://github.com/TheWeatherCompany/grid-config-mgmt/blob/master/provisioners/puppet/roles/grid/prod-grid-console-web-east-runtime.pp#L154-L158

ckelner commented 9 years ago

Memory seems to be slowly creeping up. (screenshot: 2015-01-30 at 8:26:25 PM)

ckelner commented 9 years ago

I've got Boundary alarms in place to alert on memory once it reaches a certain threshold.

ckelner commented 9 years ago
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
tomcat    1807  0.2 26.5 5656172 2001568 ?     Sl   Jan29   7:22 /usr/lib/jvm/java/bin/java -Djavax.sql.DataSource.Factory=org.apache.commons.dbcp.BasicDataSourceFactory -Xmx2G -Xms2048
root      1829  0.3 30.6 5475700 2308772 ?     SNl  Jan29   9:27 java -Xmx2G -Xms2048m -XX:MaxPermSize=2048m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar ice-processor.ja
ckelner commented 9 years ago

(screenshot: 2015-02-02 at 9:25:38 AM)

ckelner commented 9 years ago

Looks like they've settled in around 65% or so for the time being. (screenshot: 2015-02-04 at 7:33:43 AM)

ckelner commented 9 years ago

We've added a Boundary HTTP status check as a means of monitoring the process as well. Everything has been stable for a while now.
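
For reference, the same kind of check can be reproduced by hand with curl. This assumes the processor answers HTTP on the port=1234 argument seen in the ps output above (an assumption; adjust host, port, and path to whatever the Boundary check actually targets):

# Manual equivalent of an HTTP status check (hypothetical endpoint)
if ! curl -sf --max-time 5 -o /dev/null http://localhost:1234/; then
    logger -t ice-check "ice-processor HTTP check failed"
fi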