ckelner / ice-custom

Netflix Ice. Customized version for TWC specific changes. See: https://github.com/TheWeatherCompany/ice for contributions back to Netflix.
0 stars 0 forks source link

Ice stopped reporting data #14

Open ckelner opened 9 years ago

ckelner commented 9 years ago

Grid Billing Dashboard (Ice) Outage

Incident Report: 2015-05-04

Grid Billing Reports (emails) experienced a “bug” where expenditure was not being displayed to end users (zero values). This issue spawned debugging that lead to the discovery that the production Grid Billing Dashboard had ceased to update as seen below in comparison to staging: fixed1

Notification

Grid has no monitoring established for determining if the data displayed in Ice is accurate or up to date. Grid was not notified directly about this outage, rather notification came from an end-user about a dependent product (Grid Billing Reports).

History

In the past we’ve seen ice-processor (service/process) crash which causes “flat lining” on the dashboard (i.e. the date stays current, but shows zero value), and that may be easier to monitor by watching the process, however in this case the process was still running, but simply no longer functioning as expected.

Monitoring

Grid no longer has any sufficient monitoring for the Grid Billing Dashboard (Ice). Grid was using Boundary, but with the lack of automatic updates to alarms and dashboards the Boundary environment has gone stale. From Boundary we were monitoring ELB node health and using the HTTP Status plugin to monitor ice-processor, the service in Ice that failed in this outage, however the monitor wouldn’t have triggered in this scenario because the service was still running and responding.

Findings

Undetermined reason for failure. It appears that the ice-processor service/application had an internal error that caused it to stop processing AWS DBR data. The ice-processor process continued running both in us-west-1 and us-west-2 but remained non functional. For further detail see the Debugging section of this document.

Action Items

To solve the immediate issue, production servers were re-deployed. By re-deploying Grid Billing Dashboard (Ice) started processing data again and the May data showed up.

Long term monitoring solution

Cameron Stokes, Patrick Flaherty, and Chris Kelner met on 2015/05/06 @ 12pm: We decided that we would implement a roll up field asked by Landon Williams and monitor that field for $0.00 sum and send an alert email if it is $0.00 sum.

Original long term solution thoughts/ideas

Grid has read and been told several times over that “sometimes data gets wonky” with Ice. There hasn’t been any technical explanation as to why and we haven’t had the time or knowledge to understand it. Several users we’ve spoken to virtually and in person have suggest restarting Ice on a semi-regular basis as well as clearing the S3 bucket of any data and letting Ice rebuild itself. Grid doesn’t think that this is a “solution” but rather a band-aid, but perhaps it is time we give in and implement some sort of data dump and restart logic.

Monitoring the data is probably the best solution we can come up with. Ice-Miner already performs part of this job for us by requesting data from ice via HTTP requests to the Ice “API” (not really an API, but we treat it as such). See: https://github.com/TheWeatherCompany/ice-miner. Specifically we query Ice for data from the 1st of the month at midnight to the current date/time as seen here: https://github.com/TheWeatherCompany/ice-miner/blob/master/ice.py#L47-L50. Here we can see that Ice-Miner “detected” the outage and its logic could be used as part of a data monitoring solution:

…
Updating worksheet 'April Tracking' monthly spend...
Complete. Great success! @ 2015-04-30 19:35:48 EDT
POST failed to ice url: http://internal-grid-ice-1199471559.us-east-1.elb.amazonaws.com:8080/ice/dashboard/getData returned with status: 500 @ 2015-04-30 20:30:01 EDT
POST failed to ice url: http://internal-grid-ice-1199471559.us-east-1.elb.amazonaws.com:8080/ice/dashboard/getData returned with status: 500 @ 2015-04-30 21:30:01 EDT
...

Debugging

Checked to make sure ice-processor was still running:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
tomcat    1900  0.1 28.2 5665092 2129844 ?     Sl   Apr07  77:26 /usr/lib/jvm/java/bin/java -Djavax.sql.DataSource.Factory=org.apache.commons.dbcp.BasicDataSourceFactory -Xmx2G -Xms2048m -XX:MaxPermSize=2048m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -classpath :/usr/share/tomcat6/bin/bootstrap.jar:/usr/share/tomcat6/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar -Dcatalina.base=/usr/share/tomcat6 -Dcatalina.home=/usr/share/tomcat6 -Djava.endorsed.dirs= -Djava.io.tmpdir=/var/cache/tomcat6/temp -Djava.util.logging.config.file=/usr/share/tomcat6/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager org.apache.catalina.startup.Bootstrap start
root      1924  0.2 30.8 5474672 2329504 ?     SNl  Apr07  95:44 java -Xmx2G -Xms2048m -XX:MaxPermSize=2048m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar ice-processor.jar port=1234

Checked to make sure disk space was ok:

Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_c6amibase-lv_root
                      6.5G  2.6G  3.7G  42% /
tmpfs                 3.6G     0  3.6G   0% /dev/shm
/dev/xvda1            477M  102M  350M  23% /boot
/dev/mapper/vg_ice-lv_ice
                       30G  1.5G   27G   6% /mnt/ice_data

Last log messages were from four days prior to the incident on us-east-1. Normally activity is present in the logs every few hours as Netflix Ice processes new copies of the AWS DBR.

2015-04-21 19:12:55,348 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file charge.manifest
Exception in thread "com.netflix.ice.processor.BillingFileProcessor" java.lang.NullPointerException
    at org.apache.catalina.webresources.AbstractArchiveResource.getContent(AbstractArchiveResource.java:177)
    at org.apache.catalina.loader.WebappClassLoaderBase.findResourceInternal(WebappClassLoaderBase.java:2562)
    at org.apache.catalina.loader.WebappClassLoaderBase.findClassInternal(WebappClassLoaderBase.java:2385)
    at org.apache.catalina.loader.WebappClassLoaderBase.findClass(WebappClassLoaderBase.java:854)
    at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1264)
    at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1147)
    at org.apache.log4j.spi.LoggingEvent.<init>(LoggingEvent.java:165)
    at org.apache.log4j.Category.forcedLog(Category.java:391)
    at org.apache.log4j.Category.log(Category.java:856)
    at org.slf4j.impl.GrailsLog4jLoggerAdapter.logMessage(GrailsLog4jLoggerAdapter.java:192)
    at org.slf4j.impl.GrailsLog4jLoggerAdapter.error(GrailsLog4jLoggerAdapter.java:167)
    at org.apache.commons.logging.impl.SLF4JLog.error(SLF4JLog.java:213)
    at org.codehaus.groovy.grails.exceptions.DefaultStackTraceFilterer.filter(DefaultStackTraceFilterer.java:106)
    at org.codehaus.groovy.grails.exceptions.DefaultStackTraceFilterer.filter(DefaultStackTraceFilterer.java:85)
    at org.slf4j.impl.GrailsLog4jLoggerAdapter.cleanIfException(GrailsLog4jLoggerAdapter.java:178)
    at org.slf4j.impl.GrailsLog4jLoggerAdapter.logMessage(GrailsLog4jLoggerAdapter.java:191)
    at org.slf4j.impl.GrailsLog4jLoggerAdapter.error(GrailsLog4jLoggerAdapter.java:167)
    at com.netflix.ice.common.Poller.doWork(Poller.java:63)
    at com.netflix.ice.common.Poller.access$000(Poller.java:28)
    at com.netflix.ice.common.Poller$1.run(Poller.java:88)
    at java.lang.Thread.run(Thread.java:745)
1304980.734: [GC [PSYoungGen: 609916K->17768K(649728K)] 1497789K->906971K(2048000K), 0.0291970 secs] [Times: user=0.04 sys=0.00, real=0.03 secs]

Additionally there was a May 03 2015 log on us-east-1, but the file was empty: screen shot 2015-05-11 at 10 56 51 pm

At this point, to no longer delay fixing production, new Grid Billing Dashboard (Ice) servers were deployed and the old servers preserved for continued debugging. The new servers worked fine without issue. Process info output for ice-reader:

cat /proc/1900/status
Name:   java
State:  S (sleeping)
Tgid:   1900
Pid:    1900
PPid:   1
TracerPid:  0
Uid:    91  91  91  91
Gid:    91  91  91  91
Utrace: 0
FDSize: 256
Groups: 91 
VmPeak:  5695544 kB
VmSize:  5665092 kB
VmLck:         0 kB
VmHWM:   2130356 kB
VmRSS:   2129844 kB
VmData:  5460128 kB
VmStk:        88 kB
VmExe:         4 kB
VmLib:     20476 kB
VmPTE:      4620 kB
VmSwap:     7784 kB
Threads:    54
SigQ:   0/58731
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000002
SigCgt: 2000000181005ccd
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed:   7fff
Cpus_allowed_list:  0-14
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:  0
voluntary_ctxt_switches:    2
nonvoluntary_ctxt_switches: 2

Process info output for ice-processor:

cat /proc/1924/status
Name:   java
State:  S (sleeping)
Tgid:   1924
Pid:    1924
PPid:   1
TracerPid:  0
Uid:    0   0   0   0
Gid:    0   0   0   0
Utrace: 0
FDSize: 256
Groups: 
VmPeak:  5474836 kB
VmSize:  5474672 kB
VmLck:         0 kB
VmHWM:   2329744 kB
VmRSS:   2329504 kB
VmData:  5415880 kB
VmStk:        88 kB
VmExe:         4 kB
VmLib:     15500 kB
VmPTE:      4836 kB
VmSwap:      888 kB
Threads:    20
SigQ:   0/58731
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000003
SigCgt: 2000000181005ccc
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed:   7fff
Cpus_allowed_list:  0-14
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:  0
voluntary_ctxt_switches:    11
nonvoluntary_ctxt_switches: 8

Log files on the us-west-2 server were different for ice-processor:

[centos@ip-10-239-114-44 ~]$ ls -la /var/log/ice-processor/
total 25288
drwxr-xr-x.  2 root root    4096 May  3 03:38 .
drwxr-xr-x. 10 root root    4096 May  3 03:38 ..
-rw-r--r--.  1 root root 4970631 May  5 04:10 console.log
-rw-r--r--.  1 root root 2624231 Apr 12 03:19 console.log-20150412.gz
-rw-r--r--.  1 root root 5038224 Apr 19 03:17 console.log-20150419.gz
-rw-r--r--.  1 root root 6858184 Apr 26 03:37 console.log-20150426.gz
-rw-r--r--.  1 root root 6382134 May  3 03:38 console.log-20150503.gz
Looking at the contents of console.log was simply the RI pricing processing, no actual DBR processing.  This ranged from May 03 to May 05:
2015-05-03 04:02:40,823 [com.netflix.ice.basic.BasicReservationService] INFO  basic.BasicReservationService  - start polling reservation prices. it might take a while...
2015-05-03 04:02:42,810 [com.netflix.ice.basic.BasicReservationService] INFO  basic.BasicReservationService  - Setting RI price for ap-southeast-2 HEAVY c1.medium 669.0 0.049
2015-05-03 04:02:42,814 [com.netflix.ice.basic.BasicReservationService] INFO  basic.BasicReservationService  - Setting RI price for ap-southeast-2 HEAVY c1.medium.rhel 669.0 0.109
2015-05-03 04:02:42,814 [com.netflix.ice.basic.BasicReservationService] INFO  basic.BasicReservationService  - Setting RI price for ap-southeast-2 HEAVY c1.medium.sles 1021.0 0.06
...
2015-05-05 04:10:51,952 [com.netflix.ice.basic.BasicReservationService] INFO  basic.BasicReservationService  - Setting RI price for us-west-1 HEAVY t1.micro.windows 100.0 0.014
2015-05-05 04:10:51,952 [com.netflix.ice.basic.BasicReservationService] INFO  basic.BasicReservationService  - Setting RI price for us-west-1 HEAVY t1.micro.sqlserverweb 410.0 0.024
2015-05-05 04:10:51,952 [com.netflix.ice.basic.BasicReservationService] INFO  basic.BasicReservationService  - Setting RI price for us-west-1 HEAVY t1.micro.windows 440.0 0.0
2015-05-05 04:10:51,952 [com.netflix.ice.basic.BasicReservationService] INFO  basic.BasicReservationService  - Setting RI price for us-west-1 HEAVY t1.micro.sqlserverweb 978.0 0.0
There was nothing more in this log.
Looking at the contents of console.log-20150503.gz we start to get some insight.  We’ve trimmed the log some to help with verbosity, but included large chunks for clarity sake:
2015-05-01 10:10:43,430 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - AWS usage processed.
2015-05-01 10:15:43,431 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - trying to list objects in billing bucket twc_consolidated_billing using assume role, and external id ice 
2015-05-01 10:15:44,097 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - found 114 in billing bucket twc_consolidated_billing
2015-05-01 10:15:44,097 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-billing-csv-2013-02.csv
...
2015-05-01 10:15:44,097 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-billing-csv-2015-04.csv
2015-05-01 10:15:44,097 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-billing-detailed-line-items-2013-02.csv.zip
...
2015-05-01 10:15:44,098 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-billing-detailed-line-items-with-resources-and-tags-AWSMarketplace-2014-08.csv.zip
2015-05-01 10:15:44,098 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-cost-allocation-2013-02.csv
2015-05-01 10:15:44,098 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-cost-allocation-2013-03.csv
2015-05-01 10:15:44,098 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-cost-allocation-2013-04.csv
...
2015-05-01 10:15:44,098 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-cost-allocation-2015-03.csv
2015-05-01 10:15:44,098 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-cost-allocation-2015-04.csv
2015-05-01 10:15:44,098 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-cost-allocation-AWSMarketplace-2014-06.csv
2015-05-01 10:15:44,098 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-cost-allocation-AWSMarketplace-2014-07.csv
2015-05-01 10:15:44,098 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-cost-allocation-AWSMarketplace-2014-08.csv
2015-05-01 10:15:44,098 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file aws-programmatic-access-test-object
2015-05-01 10:15:44,098 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file charge.manifest
2015-05-01 10:15:44,161 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - data has been processed. ignoring 013328811177-aws-billing-detailed-line-items-2014-01.csv.zip...
2015-05-01 10:15:44,161 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - data has been processed. ignoring all files at 2014-01
2015-05-01 10:15:44,183 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - data has been processed. ignoring 013328811177-aws-billing-detailed-line-items-2014-02.csv.zip...
2015-05-01 10:15:44,183 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - data has been processed. ignoring all files at 2014-02
2015-05-01 10:15:44,194 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - data has been processed. ignoring 013328811177-aws-billing-detailed-line-items-2014-03.csv.zip...
...
2015-05-01 10:15:44,363 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - data has been processed. ignoring 013328811177-aws-billing-detailed-line-items-2015-04.csv.zip...
2015-05-01 10:15:44,363 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - data has been processed. ignoring all files at 2015-04
2015-05-01 10:15:44,363 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - AWS usage processed.
2015-05-01 10:20:44,363 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - trying to list objects in billing bucket twc_consolidated_billing using assume role, and external id ice 
2015-05-01 10:20:45,189 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - found 114 in billing bucket twc_consolidated_billing
2015-05-01 10:20:45,189 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-billing-csv-2013-02.csv
...
2015-05-01 10:20:45,189 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-billing-csv-2015-04.csv
2015-05-01 10:20:45,189 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-billing-detailed-line-items-2013-02.csv.zip
...
2015-05-01 10:20:45,190 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-billing-detailed-line-items-with-resources-and-tags-AWSMarketplace-2014-07.csv.zip
2015-05-01 10:20:45,190 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-billing-detailed-line-items-with-resources-and-tags-AWSMarketplace-2014-08.csv.zip
2015-05-01 10:20:45,190 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-cost-allocation-2013-02.csv
...
2015-05-01 10:20:45,190 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-cost-allocation-2015-04.csv
2015-05-01 10:20:45,190 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-cost-allocation-AWSMarketplace-2014-06.csv
2015-05-01 10:20:45,190 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-cost-allocation-AWSMarketplace-2014-07.csv
2015-05-01 10:20:45,190 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file 013328811177-aws-cost-allocation-AWSMarketplace-2014-08.csv
2015-05-01 10:20:45,190 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file aws-programmatic-access-test-object
2015-05-01 10:20:45,190 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - ignoring file charge.manifest
2015-05-01 10:20:45,258 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - data has been processed. ignoring 013328811177-aws-billing-detailed-line-items-2014-01.csv.zip...
2015-05-01 10:20:45,258 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - data has been processed. ignoring all files at 2014-01
...
2015-05-01 10:55:52,875 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - data has been processed. ignoring 013328811177-aws-billing-detailed-line-items-2015-04.csv.zip...
2015-05-01 10:55:52,875 [com.netflix.ice.processor.BillingFileProcessor] INFO  processor.BillingFileProcessor  - data has been processed. ignoring all files at 2015-04
Exception in thread "com.netflix.ice.processor.BillingFileProcessor" java.lang.NullPointerException
    at org.apache.catalina.webresources.AbstractArchiveResource.getContent(AbstractArchiveResource.java:177)
    at org.apache.catalina.loader.WebappClassLoaderBase.findResourceInternal(WebappClassLoaderBase.java:2562)
    at org.apache.catalina.loader.WebappClassLoaderBase.findClassInternal(WebappClassLoaderBase.java:2385)
    at org.apache.catalina.loader.WebappClassLoaderBase.findClass(WebappClassLoaderBase.java:854)
    at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1264)
    at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1147)
    at org.apache.log4j.spi.LoggingEvent.<init>(LoggingEvent.java:165)
    at org.apache.log4j.Category.forcedLog(Category.java:391)
    at org.apache.log4j.Category.log(Category.java:856)
    at org.slf4j.impl.GrailsLog4jLoggerAdapter.logMessage(GrailsLog4jLoggerAdapter.java:192)
    at org.slf4j.impl.GrailsLog4jLoggerAdapter.error(GrailsLog4jLoggerAdapter.java:167)
    at org.apache.commons.logging.impl.SLF4JLog.error(SLF4JLog.java:213)
    at org.codehaus.groovy.grails.exceptions.DefaultStackTraceFilterer.filter(DefaultStackTraceFilterer.java:106)
    at org.codehaus.groovy.grails.exceptions.DefaultStackTraceFilterer.filter(DefaultStackTraceFilterer.java:85)
    at org.slf4j.impl.GrailsLog4jLoggerAdapter.cleanIfException(GrailsLog4jLoggerAdapter.java:178)
    at org.slf4j.impl.GrailsLog4jLoggerAdapter.logMessage(GrailsLog4jLoggerAdapter.java:191)
    at org.slf4j.impl.GrailsLog4jLoggerAdapter.error(GrailsLog4jLoggerAdapter.java:167)
    at com.netflix.ice.common.Poller.doWork(Poller.java:63)
    at com.netflix.ice.common.Poller.access$000(Poller.java:28)
    at com.netflix.ice.common.Poller$1.run(Poller.java:88)
    at java.lang.Thread.run(Thread.java:745)
2015-05-02 04:02:40,823 [com.netflix.ice.basic.BasicReservationService] INFO  basic.BasicReservationService  - start polling reservation prices. it might take a while...
2015-05-02 04:02:42,887 [com.netflix.ice.basic.BasicReservationService] INFO  basic.BasicReservationService  - Setting RI price for ap-southeast-2 HEAVY c1.medium 669.0 0.049
2015-05-02 04:02:42,887 [com.netflix.ice.basic.BasicReservationService] INFO  basic.BasicReservationService  - Setting RI price for ap-southeast-2 HEAVY c1.medium.rhel 669.0 0.109

This is the same NullPointerException we saw in the us-east-1 logs. Rather than nothing occurring after the NPE as it happened in us-east-1, it appears as if ice-processor continues to run the RI polling in us-west-2. Running jstack on the ice-processor process doesn’t reveal anything significant: US-EAST-1:

sudo jstack 1924
2015-05-06 11:41:50
Full thread dump OpenJDK 64-Bit Server VM (24.75-b04 mixed mode):

"Attach Listener" daemon prio=10 tid=0x00007f2490002000 nid=0x9b6 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"DestroyJavaVM" prio=10 tid=0x00007f24b8008800 nid=0x78a waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"http-nio-1234-Acceptor-0" daemon prio=10 tid=0x00007f24b819c000 nid=0x88e runnable [0x00007f2496749000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
    - locked <0x00000007841677a8> (a java.lang.Object)
    at org.apache.tomcat.util.net.NioEndpoint$Acceptor.run(NioEndpoint.java:688)
    at java.lang.Thread.run(Thread.java:745)

"http-nio-1234-ClientPoller-1" daemon prio=10 tid=0x00007f24b819a000 nid=0x88d runnable [0x00007f249684a000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    - locked <0x0000000782493f80> (a sun.nio.ch.Util$2)
    - locked <0x0000000782493f70> (a java.util.Collections$UnmodifiableSet)
    - locked <0x0000000782493e38> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1050)
    at java.lang.Thread.run(Thread.java:745)

"http-nio-1234-ClientPoller-0" daemon prio=10 tid=0x00007f24b8197800 nid=0x88c runnable [0x00007f249694b000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    - locked <0x0000000782495650> (a sun.nio.ch.Util$2)
    - locked <0x0000000782495640> (a java.util.Collections$UnmodifiableSet)
    - locked <0x0000000782495508> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1050)
    at java.lang.Thread.run(Thread.java:745)

"ContainerBackgroundProcessor[StandardEngine[Tomcat]]" daemon prio=10 tid=0x00007f24b8281800 nid=0x88b waiting on condition [0x00007f2496a4c000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1345)
    at java.lang.Thread.run(Thread.java:745)

"com.netflix.ice.basic.BasicReservationService" daemon prio=10 tid=0x00007f24749f0800 nid=0x889 waiting on condition [0x00007f2497082000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at com.netflix.ice.common.Poller.doWork(Poller.java:47)
    at com.netflix.ice.common.Poller.access$000(Poller.java:28)
    at com.netflix.ice.common.Poller$1.run(Poller.java:88)
    at java.lang.Thread.run(Thread.java:745)

"java-sdk-http-connection-reaper" daemon prio=10 tid=0x00007f24749a1000 nid=0x887 waiting on condition [0x00007f249718e000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at com.amazonaws.http.IdleConnectionReaper.run(IdleConnectionReaper.java:112)

"NioBlockingSelector.BlockPoller-1" daemon prio=10 tid=0x00007f24b8288000 nid=0x878 runnable [0x00007f24974c8000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    - locked <0x0000000780e6ff10> (a sun.nio.ch.Util$2)
    - locked <0x0000000780e6ff00> (a java.util.Collections$UnmodifiableSet)
    - locked <0x0000000780e6fde8> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    at org.apache.tomcat.util.net.NioBlockingSelector$BlockPoller.run(NioBlockingSelector.java:342)

"Thread-2" prio=10 tid=0x00007f24b822a800 nid=0x877 runnable [0x00007f2497bfa000]
   java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
    at java.net.ServerSocket.implAccept(ServerSocket.java:530)
    at java.net.ServerSocket.accept(ServerSocket.java:498)
    at grails.plugin.standalone.Launcher$1.run(Launcher.java:191)

"Service Thread" daemon prio=10 tid=0x00007f24b812e000 nid=0x86c runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007f24b812c000 nid=0x86b waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007f24b8129000 nid=0x86a waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007f24b8127000 nid=0x869 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007f24b80fd000 nid=0x7c1 in Object.wait() [0x00007f24ac2f5000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x0000000780507b80> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
    - locked <0x0000000780507b80> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

"Reference Handler" daemon prio=10 tid=0x00007f24b80fb000 nid=0x7bc in Object.wait() [0x00007f24ac3f6000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x0000000780507910> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:503)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
    - locked <0x0000000780507910> (a java.lang.ref.Reference$Lock)

"VM Thread" prio=10 tid=0x00007f24b80f7000 nid=0x7ba runnable 

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f24b801e800 nid=0x78b runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f24b8020800 nid=0x78c runnable 

"VM Periodic Task Thread" prio=10 tid=0x00007f24b8139000 nid=0x86d waiting on condition 

JNI global references: 169

US-WEST-2:

sudo jstack 1924
2015-05-06 13:11:42
Full thread dump OpenJDK 64-Bit Server VM (24.75-b04 mixed mode):

"Attach Listener" daemon prio=10 tid=0x00007f2340002000 nid=0x6eb1 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"DestroyJavaVM" prio=10 tid=0x00007f2364008800 nid=0x793 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"http-nio-1234-Acceptor-0" daemon prio=10 tid=0x00007f2364281800 nid=0x891 runnable [0x00007f235a9a1000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
    - locked <0x0000000783f0ccd0> (a java.lang.Object)
    at org.apache.tomcat.util.net.NioEndpoint$Acceptor.run(NioEndpoint.java:688)
    at java.lang.Thread.run(Thread.java:745)

"http-nio-1234-ClientPoller-1" daemon prio=10 tid=0x00007f236427f800 nid=0x890 runnable [0x00007f235aaa2000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    - locked <0x00000007827e8638> (a sun.nio.ch.Util$2)
    - locked <0x00000007827e8628> (a java.util.Collections$UnmodifiableSet)
    - locked <0x00000007827e84f0> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1050)
    at java.lang.Thread.run(Thread.java:745)

"http-nio-1234-ClientPoller-0" daemon prio=10 tid=0x00007f236427e000 nid=0x88f runnable [0x00007f235aba3000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    - locked <0x00000007827e9d08> (a sun.nio.ch.Util$2)
    - locked <0x00000007827e9cf8> (a java.util.Collections$UnmodifiableSet)
    - locked <0x00000007827e9bc0> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1050)
    at java.lang.Thread.run(Thread.java:745)

"ContainerBackgroundProcessor[StandardEngine[Tomcat]]" daemon prio=10 tid=0x00007f23642c5000 nid=0x88e waiting on condition [0x00007f235aca4000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1345)
    at java.lang.Thread.run(Thread.java:745)

"java-sdk-http-connection-reaper" daemon prio=10 tid=0x00007f2320d3c800 nid=0x88a waiting on condition [0x00007f235b4c8000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at com.amazonaws.http.IdleConnectionReaper.run(IdleConnectionReaper.java:112)

"NioBlockingSelector.BlockPoller-1" daemon prio=10 tid=0x00007f23642b1800 nid=0x876 runnable [0x00007f235b7cb000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    - locked <0x000000078002dcf8> (a sun.nio.ch.Util$2)
    - locked <0x000000078002dce8> (a java.util.Collections$UnmodifiableSet)
    - locked <0x000000078002dbb0> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    at org.apache.tomcat.util.net.NioBlockingSelector$BlockPoller.run(NioBlockingSelector.java:342)

"Thread-2" prio=10 tid=0x00007f2364223800 nid=0x875 runnable [0x00007f235befd000]
   java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
    at java.net.ServerSocket.implAccept(ServerSocket.java:530)
    at java.net.ServerSocket.accept(ServerSocket.java:498)
    at grails.plugin.standalone.Launcher$1.run(Launcher.java:191)

"Service Thread" daemon prio=10 tid=0x00007f2364130000 nid=0x86e runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007f236412e000 nid=0x86b waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007f236412b000 nid=0x864 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007f2364129000 nid=0x861 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007f23640fd000 nid=0x7c0 in Object.wait() [0x00007f23605f4000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000007800169f8> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
    - locked <0x00000007800169f8> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

"Reference Handler" daemon prio=10 tid=0x00007f23640fb000 nid=0x7bf in Object.wait() [0x00007f23606f5000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x0000000780016788> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:503)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
    - locked <0x0000000780016788> (a java.lang.ref.Reference$Lock)

"VM Thread" prio=10 tid=0x00007f23640f7000 nid=0x7b1 runnable 

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f236401e800 nid=0x794 runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f2364020800 nid=0x795 runnable 

"VM Periodic Task Thread" prio=10 tid=0x00007f236413b000 nid=0x86f waiting on condition 

JNI global references: 270

Because the decommissioned (but still running) production servers and the new production servers share an s3 bucket, once the new servers started processing the Amazon DBR, the decommissioned servers process ice-reader was able to pick up the newly processed files out of S3 and display accurate data in the UI to the end-user even though the ice-reader processes were still in a funk state. See screenshot below of us-east-1 decommissioned server which before deploying new prod servers did not show May data is now showing May data: fixed2

At this point it simply looks like the ice-processor has some sort of internal error which we have no good way to account for other than attempting to monitor the data itself, see section Long term solution thoughts/ideas.

2015/05/06: Met with Cameron Stokes, outcome of meeting below:

ckelner commented 9 years ago

See official issue opened with Netflix: https://github.com/Netflix/ice/issues/171